Figure 4: Results of numerical simulations, showing mean ISI and CV as a function of inhibitory firing frequency λ_i for a series of excitatory frequencies: λ_e = 50 Hz (dotted line), λ_e = 75 Hz (dash-dotted line), λ_e = 100 Hz (dashed line), λ_e = 125 Hz (solid line), λ_e = 150 Hz (thick solid line). Circles denote the points where ⟨U∞⟩ = U_t, crosses denote the points where ⟨U∞⟩ = U_t + σ(U∞), and asterisks denote the points where ⟨U∞⟩ = U_t − σ(U∞) (see also Figure 3). Averages are based on 10,000 spikes.
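The two statistics plotted here can be computed directly from simulated spike times. A minimal sketch (NumPy; `spike_times` is a hypothetical array of output spike times in ms, not the authors' code):

```python
import numpy as np

def isi_stats(spike_times):
    """Mean interspike interval and its coefficient of variation."""
    isi = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    mean_isi = isi.mean()
    cv = isi.std() / mean_isi  # CV = sigma(ISI) / <ISI>
    return mean_isi, cv

# Perfectly regular firing with ISI = 8 ms -> CV of 0.
mean_isi, cv = isi_stats(np.arange(0.0, 100.0, 8.0))
```

A Poisson spike train of the same rate would instead give a CV near 1, the reference value for irregular firing used throughout the article.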
3.2 Numerical Simulations. Now we proceed from the analytical results to simulation results of the conductance-based IF model and thereby link the function λ_i^t(k, λ_e) to the spiking behavior of the neuron. Figure 4 shows the mean interspike interval ⟨ISI⟩ and its coefficient of variation CV as a function of the excitatory and inhibitory firing rates. For increasing inhibitory frequency, ⟨ISI⟩ → ∞ and CV → 1. For low inhibitory firing rates, ⟨ISI⟩ is small and CV is relatively small, indicating that firing of the output neuron is frequent and regular. The incline of the graphs becomes less steep for higher excitatory firing rates. This is related to the increase in the intermediate frequency regime λ_i^t(k, λ_e) ≤ λ_i ≤ λ_i^t(−k, λ_e) for large λ_e, shown in Figure 3. The stars, circles, and crosses denote the approximate mean ISI and CV for which ⟨U∞⟩ = U_t + kσ(U∞) with k ∈ {−1, 0, 1}, respectively. The CV for a particular value of k is approximately constant, with CV(k = 1) ≈ 0.23, CV(k = 0) ≈ 0.45, and CV(k = −1) ≈ 0.90. However, the mean ISI for constant k decreases for larger λ_e. This can, at least partly, be explained by the fact that the mean effective time constant ⟨τ_m⟩ decreases for increasing excitatory and inhibitory firing rates (see Figure 2). A small time constant implies a short rise time of the membrane potential toward the threshold potential. Figure 5 shows the prespike averages of the total synaptic conductances, the membrane potential, the steady-state potential, and the time constant for an excitatory firing rate λ_e = 100 Hz and for two inhibitory firing rates: λ_i = λ_i^t(1) = 29.6 Hz (implying ⟨U∞⟩ = U_t + σ(U∞)) and λ_i = λ_i^t(−1) = 88.0 Hz (implying ⟨U∞⟩ = U_t − σ(U∞)). In addition to the mean value plus
Sybert Stroeve and Stan Gielen
Figure 5: Mean and standard deviation of prespike conductances G_e and G_i, steady-state potential U∞, time constant τ_m, and membrane potential U for uncorrelated inputs (N_e = N_i = 120, λ_e = 100 Hz) and two inhibitory firing rates: λ_i = λ_i^t(1, λ_e) = 29.6 Hz (top row) and λ_i = λ_i^t(−1, λ_e) = 88.0 Hz (bottom row). The solid and dashed lines represent the prespike mean ± SD, averaged over 10,000 spikes in numerical simulations. The dash-dotted and dotted lines represent the mean ± SD of approximated conductances, steady-state potential, and time constant calculated by equations 3.6 through 3.11. The threshold potential is U_t = −55 mV.
or minus the standard deviation of the simulation results (solid and dashed lines), the graphs show the mean plus or minus the standard deviation (dash-dotted and dotted lines) of the approximated statistics according to equations 3.6 through 3.11. For large enough prespike times (t ≤ −5 ms), the simulated and approximated values correspond well. For λ_i = 29.6 Hz, the above-threshold value of ⟨U∞⟩ leads to a constant increase in the membrane potential U and to regular and fast firing: ⟨ISI⟩ = 8.0 ms, CV = 0.23 (see the crosses on the dashed lines in Figure 4). This mean ISI compares well with ⟨ISI⟩ = 8.2 ms obtained with equation 3.14. Spiking is on average preceded by a small increase in U∞. For λ_i = 88.0 Hz, the steady-state potential ⟨U∞⟩ is below threshold, output spiking is rare and irregular (⟨ISI⟩ = 116 ms, CV = 0.89, as is indicated by the asterisks on the dashed lines in Figure 4), and the temporary conductance changes are necessary to induce spiking. Just before spiking, the excitatory conductance exceeds its nonspiking mean value (i.e., the mean value for t ≈ −10 ms) by about twice its standard deviation, and the inhibitory conductance undershoots its nonspiking mean value by more than the standard deviation. For both inhibitory firing rates, the time constant τ_m before an action potential reveals only small variations. This can
Synchronous Firing Due to Common Input
be explained by the fact that, on average, firing is associated with both an excitatory conductance increase and an inhibitory conductance decrease.

4 Single Neuron with Correlated Inputs
For our analysis of the effects of input correlation on the firing characteristics of a single neuron, we choose a single excitatory firing rate λ_e = 100 Hz and vary:

1. The inhibitory firing rate λ_i ∈ {0.0, 29.6, 56.7, 88.0} Hz, representing the cases no inhibition, ⟨U∞⟩ = U_t + σ(U∞), ⟨U∞⟩ = U_t, and ⟨U∞⟩ = U_t − σ(U∞), respectively.

2. The cluster correlation K_e,i ∈ {0.05, 0.10, 0.20, 0.40}, indicating that the firing rate of neurons in a synchronization cluster is λ^s_e,i = λ_e,i K_e,i and that the background firing rate is λ^b_e,i = λ_e,i (1 − K_e,i).
3. The cluster size M ∈ {15, 30, 60, 120}, indicating that the size of a synchronization cluster can be 12.5%, 25%, 50%, or 100% of the total number of excitatory or inhibitory inputs.

The firing statistics were simulated for these 64 cases (4 inhibitory firing rates × 4 cluster correlations × 4 cluster sizes) with correlation in both excitatory and inhibitory clusters (see Figure 6), correlation only in excitatory clusters (see Figure 7), and correlation only in inhibitory clusters (see Figure 8). A number of characteristic features can be observed in Figures 6, 7, and 8. The upper panels show that in all cases, ⟨ISI⟩ is remarkably insensitive to changes in both the cluster correlation K and the cluster size M, except for the high inhibitory firing rate λ_i = λ_i^t(−1) = 88 Hz, where ⟨ISI⟩ decreases for larger values of the cluster correlation K. The decrease in Figures 6 and 7 for λ_i = 88 Hz can be explained by the fact that long-lasting fluctuations of the membrane potential near ⟨U∞⟩ (remember ⟨U∞⟩ < U_t) are prevented by the synchronous firing of the excitatory clusters, and more frequent firing of those clusters thus reduces ⟨ISI⟩. If ⟨U∞⟩ ≥ U_t (for λ_i ∈ {0.0, 29.6, 56.7} Hz), presynaptic excitatory cluster firing is also likely to induce threshold passing, but since regular firing is already caused by the background firing and since the total excitatory firing rate λ_e = λ_e^s + λ_e^b is constant, the mean ISI is hardly reduced in this case. The decrease in ⟨ISI⟩ as a function of the inhibitory correlation in Figure 8 can be understood by noting that an increase in the cluster correlation is associated with a decrease in the inhibitory background activity. Thus, inhibitory activity is most effective in reducing the firing rate of the postsynaptic neuron if it is independent. In contrast to the firing rate, CV clearly depends on the input correlation for all inhibitory firing rates, except if only the inhibitory inputs are correlated (Figure 8).
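The input model described by items 1 through 3 can be sketched as follows. This is a hedged illustration, not the authors' code; the time step, seed, and return format are arbitrary choices. Cluster members receive shared synchronous events at rate Kλ plus independent background spikes at (1 − K)λ, so every input fires at total rate λ:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_spike_counts(rate_hz, K, M, N, T, dt=1e-4):
    """Spike counts for N Poisson inputs over [0, T] s in bins of dt s.
    The first M inputs form a synchronization cluster: they share
    synchronous events at rate K*rate and fire independent background
    spikes at (1-K)*rate; the remaining inputs fire independently at rate.
    Returns an (N, n_bins) integer array."""
    n_bins = int(T / dt)
    spikes = np.zeros((N, n_bins), dtype=int)
    spikes[:M] = rng.random((M, n_bins)) < (1 - K) * rate_hz * dt
    spikes[M:] = rng.random((N - M, n_bins)) < rate_hz * dt
    cluster_events = rng.random(n_bins) < K * rate_hz * dt
    spikes[:M] += cluster_events.astype(int)  # shared synchronous events
    return spikes

s = cluster_spike_counts(rate_hz=100.0, K=0.1, M=60, N=120, T=10.0)
```

Every row has the same expected rate (here 100 Hz); only the correlation structure between the first M rows changes with K.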
Figures 6 and 7 show a very similar dependency of CV on the input correlation: larger cluster sizes and larger cluster correlation tend to increase CV, except for high inhibitory firing rates, where a rise in
Figure 6: Firing characteristics of a single neuron with excitatory correlation and inhibitory correlation. ⟨ISI⟩ and CV as a function of cluster correlation K_e and K_i, for excitatory firing λ_e = 100 Hz and a series of inhibitory firing frequencies λ_i ∈ {0, 29.6, 56.7, 88.0} Hz (columns 1–4). The cluster size M varies from 12.5% to 100% of the number of excitatory and inhibitory inputs (N_e = N_i = 120): M = 15 (dotted line, circles), M = 30 (dash-dotted line, pluses), M = 60 (dashed line, crosses), M = 120 (solid line, asterisks). Both excitatory and inhibitory clusters are correlated, but excitatory and inhibitory inputs are independent. Averages are over 10,000 spikes.
the cluster correlation does not always lead to an increase in CV. In these simulations, large cluster sizes are required to achieve CV > 1, but the inhibitory firing rates may be relatively low. If only the inhibitory inputs are correlated (see Figure 8), CV is mostly independent of the input correlation, except for λ_i = λ_i^t(−1), where a slight decrease in CV as a function of cluster size and correlation can be observed. This decrease is associated with the decrease in the effect of the independent inhibitory input.

5 Uncoupled Neurons with Common Input
We now consider the correlation between the firing of two neurons with partially common excitatory and/or inhibitory input sources, without correlation in the input streams of each single neuron (cluster correlation K = 0). Figure 9 shows the cross-correlation of neural activity of two neurons with both common excitatory and inhibitory sources for various ratios of common input, both for a low (λ_i = 30 Hz) and a high (λ_i = 90 Hz) inhibitory frequency. These frequencies are close to the firing rates for which ⟨U∞⟩ = U_t ± σ(U∞). The cross-correlation functions reveal three characteristics.

Figure 7: Firing characteristics of a single neuron with excitatory correlation. ⟨ISI⟩ and CV as a function of cluster correlation K_e, for excitatory firing λ_e = 100 Hz and a series of inhibitory firing frequencies λ_i ∈ {0, 29.6, 56.7, 88.0} Hz (columns 1–4). The cluster size M varies from 12.5% to 100% of the number of excitatory inputs (N_e = N_i = 120): M = 15 (dotted line, circles), M = 30 (dash-dotted line, pluses), M = 60 (dashed line, crosses), M = 120 (solid line, asterisks). Only excitatory inputs are correlated. Averages are over 10,000 spikes.

First, the cross-correlation is oscillatory for λ_i = 30 Hz, whereas it has only a single central peak for λ_i = 90 Hz. The oscillatory cross-correlation for λ_i = 30 Hz is due to the small coefficient of variation (CV ≈ 0.23; see Figure 4) of the neurons with low inhibitory input. This implies that after synchronous spiking of neurons X and Y due to the common input, there is a large probability that both neurons will fire again after a time ⟨ISI⟩. The oscillation frequency of K_xy(τ) thus equals 1/⟨ISI⟩. For λ_i = 90 Hz the CV is much higher (near one; see Figure 4), and therefore the cross-correlation is not oscillatory. Second, the zero-time cross-correlation K_xy(0) increases with the proportion of common input (as expected). Third, for the same common contribution, K_xy(0) is larger for increased inhibitory firing. The more general picture of this phenomenon becomes clear from the graphs in Figure 10. These show the zero-time cross-correlation K_xy(0) as a function of the inhibitory firing frequency and the amount of common input for pairs of neurons with only common excitatory input (A), only common inhibitory input (C), and both common excitatory and inhibitory inputs (B). Figure 10A shows that K_xy(0) increases for larger amounts of common excitatory contributions. It also increases for small inhibitory firing rates, reaching an optimum at about λ_i ≈ 75 Hz. The maximum correlation,
Figure 8: Firing characteristics of a single neuron with inhibitory correlation. ⟨ISI⟩ and CV as a function of cluster correlation K_i, for excitatory firing λ_e = 100 Hz and a series of inhibitory firing frequencies λ_i ∈ {0, 29.6, 56.7, 88.0} Hz (columns 1–4). The cluster size M varies from 12.5% to 100% of the number of inhibitory inputs (N_e = N_i = 120): M = 15 (dotted line, circles), M = 30 (dash-dotted line, pluses), M = 60 (dashed line, crosses), M = 120 (solid line, asterisks). Only inhibitory inputs are correlated. Averages are over 10,000 spikes.
Figure 9: Cross-correlation K_xy(τ) between output spikes of two conductance-based IF neurons X and Y with partially common excitatory (λ_e = 100 Hz) and inhibitory input (upper row: λ_i = 30 Hz, bottom row: λ_i = 90 Hz). The columns represent 10%, 30%, 50%, 70%, and 90% common excitatory and inhibitory input. Averages are over 16,500 to 25,000 spikes; bin size is 0.5 ms.
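Correlograms of this kind can be computed in outline from two output spike trains. A sketch under stated assumptions: the article's exact normalization of K_xy is defined earlier in the paper and is not reproduced here; this illustration uses the Pearson correlation of the 0.5 ms binned trains at each lag, and the circular boundary handling of `np.roll` is a simplification:

```python
import numpy as np

def cross_correlogram(tx, ty, bin_ms=0.5, max_lag_ms=20.0, T_ms=None):
    """Pearson correlation of two binned spike trains as a function of lag.
    tx, ty: spike times in ms. Normalization is illustrative only."""
    if T_ms is None:
        T_ms = max(tx.max(), ty.max()) + bin_ms
    edges = np.arange(0.0, T_ms + bin_ms, bin_ms)
    x, _ = np.histogram(tx, edges)
    y, _ = np.histogram(ty, edges)
    n = int(max_lag_ms / bin_ms)
    lags = np.arange(-n, n + 1)
    corr = np.array([np.corrcoef(x, np.roll(y, k))[0, 1] for k in lags])
    return lags * bin_ms, corr

# Two identical trains give a central peak of 1 at zero lag.
t = np.sort(np.random.default_rng(1).uniform(0, 1000, 200))
lag_ms, c = cross_correlogram(t, t)
```

For partially common input, the same routine applied to the two neurons' output trains would show the central peak growing with the shared fraction, as in the figure.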
Figure 10: Cross-correlation K_xy(0) at time zero between the output spikes of two conductance-based IF neurons with partially common excitatory (λ_e = 100 Hz) and inhibitory input as a function of the common contribution and the inhibitory firing rate λ_i. (A) Only common excitatory input. (B) Both common excitatory and inhibitory input. (C) Only common inhibitory input. Notice that the z-axes differ. Averages are over 16,500 to 25,000 spikes; bin size is 0.5 ms.
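The common-input configuration underlying Figures 9 and 10 can be sketched by drawing a shared pool of spike trains for a fraction ζ of each neuron's inputs and independent pools for the rest (illustrative code, not the authors'; the time step and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def poisson_trains(n, rate_hz, T, dt=1e-4):
    """n independent Poisson spike trains as a boolean (n, bins) array."""
    return rng.random((n, int(T / dt))) < rate_hz * dt

def shared_input(N, zeta, rate_hz, T=1.0):
    """Input trains for two neurons sharing a fraction zeta of N inputs."""
    n_common = int(round(zeta * N))
    common = poisson_trains(n_common, rate_hz, T)
    unique_x = poisson_trains(N - n_common, rate_hz, T)
    unique_y = poisson_trains(N - n_common, rate_hz, T)
    inputs_x = np.vstack([common, unique_x])  # neuron X's N inputs
    inputs_y = np.vstack([common, unique_y])  # neuron Y's N inputs
    return inputs_x, inputs_y

ix, iy = shared_input(N=120, zeta=0.5, rate_hz=100.0)
```

Feeding `inputs_x` and `inputs_y` into two independent model neurons and cross-correlating their output spikes yields surfaces like those in the figure as ζ and λ_i are varied.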
which is attained for 100% common excitatory input and near λ_i = 75 Hz, is K_xy(0) = 0.092. Notice that this value is much lower than the maximum value (K_xy^max = 1) due to the completely independent inhibitory inputs. How can we understand the initial incline of K_xy(0) with increasing λ_i and the decrease of K_xy(0) for large λ_i? We observed in section 3 that an increase in the (independent) inhibitory input induces an increase in ⟨ISI⟩ and CV, reflecting that firing requires the coincidence of a momentary increase in excitatory input and/or a decrease in inhibitory input. The timing of excitatory input pulses is thus more important under the influence of inhibitory firing. The effect of coincidental excitatory synchronous firing of common inputs becomes more effective for larger inhibitory firing, leading to the initial increase in the cross-correlation. For large inhibitory firing rates, postsynaptic firing requires both an increase in the excitation and a decrease in the inhibition. Since the inhibitory firing series for the two neurons were chosen to be independent for the simulations shown in Figure 10A, this implies that the cross-correlation decreases for large inhibitory firing rates. Figure 10B shows that common inhibition strongly amplifies the effect of common excitation; that is, K_xy(0) is larger if both common excitatory and inhibitory signals are present. This can be understood by the notion that common inhibition tends to create an equal baseline level from which common excitation has an equal depolarizing effect on both neurons. Clearly, the increase is largest if ζ_e = ζ_i tends to 100%, but a considerable increase of 77% is already attained for ζ_e = ζ_i = 50%. The results in Figure 10C show that common inhibition alone induces only a small correlation between the firing events, giving a peak cross-correlation K_xy(0) = 0.026. Although an equal baseline is set by the common inhibition, the independent excitatory input is most important to pass the threshold potential. Common inhibition is thus effective only in increasing the cross-correlation in the presence of common excitation.

Figure 11: Cross-correlation K_xy(0) at time zero between output spikes of two conductance-based IF neurons with excitatory and inhibitory input (λ_e = 100 Hz, λ_i = 60 Hz) and partially common input as a function of the common contributions ζ_e,i and the cluster correlations K_e,i. Each of the six input groups represents a cluster. (A) Only excitatory common input. The input correlation leads to a large increase in K_xy(0) (compare with Figure 10A). (B) Both excitatory and inhibitory common input. Averages are over 25,000 spikes; bin size is 0.5 ms.

How is the correlation between the firing of a pair of neurons affected by synchronization clusters in the input streams? Figure 11 shows the correlation between the spiking of two uncoupled neurons with partially common input and correlation between the neurons in a cluster. The inhibitory firing rate λ_i is set to 60 Hz, with N_e = N_i = 120 and λ_e = 100 Hz. The cluster sizes are maximal; all six groups of inputs (excitatory/inhibitory, 2× unique, 1× common) form a cluster. This implies that the cluster sizes vary as a function of the amount of common input. Comparison of the results in Figure 11 and Figures 10A and 10B (for λ_i = 60 Hz) leads to the conclusion that the cross-correlation is strongly enhanced by the input correlation, which now comprises both a spatial component (the common inputs) and a temporal component (the cluster synchronization). For instance, 20% common input leads to a correlation peak of 0.013 without synchronous inputs, whereas a correlation peak of 0.06, about five times larger, is attained for input correlation K = 0.1.

6 Discussion

6.1 Pros and Cons of the Neuron Model. The conductance-based IF model is an extension of the leaky current-based IF model, which uses conductance pulses instead of current pulses to drive the membrane potential. It is more in line with experimental data and adds the feature of variation of the effective time constant, which may be important in the study of timing effects of the neuron's input. For both the conductance-based and the current-based leaky IF model, one has to rely on simulations to characterize their firing behavior. Nevertheless, relations for the first and second moments of the membrane potential for the current-based IF model (Tuckwell, 1988) and for the moments of the steady-state potential and the effective time constant of the conductance-based IF model (this article) provide an analytical link between pre- and postsynaptic firing. Only for the nonleaky IF model are analytical expressions of its firing behavior known (Tuckwell, 1988). However, having infinite memory, the nonleaky IF model is not useful for a study of the effects of synchronous input (Kempter, Gerstner, van Hemmen, & Wagner, 1998) or for a study of correlation due to common input, since even in the case of 100% common input, synchronous firing is achieved only if the initial conditions are identical in the absence of noise. The statistical moments of the steady-state potential and the time constant were derived by assuming that a spike induces a constant and finite increase in the conductance for a short time. Such a conductance pulse represents a further abstraction from more biologically plausible conductance functions like the α function (Koch, 1999). However, the total conductance induced by a summation of independent conductance pulses tends to a gaussian distribution for a large number of input spikes, irrespective of the precise form of the conductance pulses. Therefore, the specific shape of the conductance function becomes of minor relevance for the case of a large number of inputs; only its effective amplitude and duration are relevant.
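The gaussian limit invoked here is the shot-noise central limit theorem: for many independent inputs, only the first two moments of the summed conductance matter, and by Campbell's theorem the mean of a rectangular-pulse shot process is Nλĝτ_s. A numerical check with illustrative parameter values (not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)

# Total conductance from N independent Poisson inputs, each spike opening
# a rectangular conductance pulse of amplitude g_hat for duration tau_s.
N, rate, g_hat, tau_s, dt, T = 120, 100.0, 1.0e-9, 2.0e-3, 1.0e-4, 5.0
spikes = rng.random((N, int(T / dt))) < rate * dt
kernel = np.ones(int(tau_s / dt))  # rectangular pulse, tau_s long
G = g_hat * np.convolve(spikes.sum(axis=0).astype(float), kernel)[: int(T / dt)]

# Campbell's theorem for a rectangular pulse:
mean_pred = N * rate * g_hat * tau_s           # <G>  = N*lambda*g*tau
std_pred = (N * rate * g_hat**2 * tau_s) ** 0.5  # var(G) = N*lambda*g^2*tau
```

Replacing the rectangular `kernel` with an α-function of the same area and effective duration leaves `mean_pred` unchanged and `std_pred` nearly so, which is the sense in which only effective amplitude and duration matter.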
For correlated inputs, the relevance of the specific conductance function depends on the size of the synchronization clusters with respect to the total number of inputs. Further abstraction of the conductance function to a delta peak leads to Stein's model with reversal potentials (Tuckwell, 1988), which was used in related studies of Feng and Brown (1999; Brown & Feng, 2000),

    dU(t) = −(U(t) − E_r)/τ_m dt + a_e (E_e − U(t)) dn_e(t) + a_i (E_i − U(t)) dn_i(t),    (6.1)

with n_e(t) and n_i(t) the number of excitatory and inhibitory input spikes, a_e and a_i the weights of excitatory and inhibitory inputs, and the meaning of the other variables as in section 2.1. Stein's model with reversal potentials is a first-order approximation of the conductance-based IF model, which holds well if the pulse duration of the synaptic conductance is much shorter than the effective membrane time constant. However, in the case of massive synaptic input, the effective time constant of the conductance-based IF model can be strongly reduced, such that the change in membrane potential
caused by an input spike is reduced in comparison with Stein's model with reversal potentials.

The conductance-based IF model has no spatial variation (it is a single-compartment model) and no synaptic noise. Inclusion of such mechanisms would induce extra jitter in the output spiking (Tuckwell, 1988) and would reduce the cross-correlation between the firing of neurons with common input. Furthermore, it assumes a constant voltage threshold-passing mechanism for spike generation. In general, the dynamics of the stimulus determine whether this is a good model (Koch, Bernander, & Douglas, 1995). Reasonable agreement between the firing of a voltage-threshold leaky IF model and a compartmental model of a pyramidal cell was attained by Marsalek, Koch, and Maunsell (1997). In conclusion, we consider the conductance-based IF model a suitable compromise between the numerical complexity of a compartmental neuron model with active gates (and the loads of parameters to be chosen) and the analytical tractability of the (oversimplified) nonleaky IF model.

6.2 Single Neuron Firing Characteristics. In this study, we did not make assumptions regarding the balance of excitatory and inhibitory effects in cortical neurons (as in the study by Shadlen & Newsome, 1998), but rather included a considerable variation of inhibitory versus excitatory contributions. Balance roughly depends on (λ_i/λ_e) × (N_i/N_e) × (ĝ_i/ĝ_e) × (τ_i/τ_e) × (|E_i − E_t|/|E_e − E_t|) (see Troyer & Miller, 1997). This range of excitation versus inhibition was achieved by varying the excitatory and inhibitory firing rates λ_e,i, while the respective population sizes N_e,i, as well as the other parameters of the neuron model, were constant. For independent input spiking, the output firing depends on the effective firing rates λ_e,i N_e,i, and thus variations of λ_e,i or N_e,i are equivalent.
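The balance expression quoted above is a simple product of ratios. A sketch, with all parameter values below chosen as illustrative placeholders rather than the values used in the article:

```python
# Rough excitation-inhibition balance index (after Troyer & Miller, 1997):
# (li/le) * (Ni/Ne) * (gi/ge) * (ti/te) * (|Ei-Et| / |Ee-Et|).
def balance(li, le, Ni, Ne, gi, ge, ti, te, Ei, Ee, Et):
    return (li / le) * (Ni / Ne) * (gi / ge) * (ti / te) \
        * abs(Ei - Et) / abs(Ee - Et)

# Hypothetical values: rates in Hz, potentials in mV, equal peak
# conductances and pulse durations for excitation and inhibition.
b = balance(li=60.0, le=100.0, Ni=120, Ne=120, gi=1.0, ge=1.0,
            ti=5.0, te=5.0, Ei=-70.0, Ee=0.0, Et=-55.0)
# b = 0.6 * (15/55), i.e. inhibition weaker than excitation here
```

Values above 1 indicate inhibition-dominated input; varying only `li` and `le`, as in the article, sweeps this index while leaving the population sizes fixed.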
In line with related modeling studies (Shadlen & Newsome, 1998; Feng & Brown, 1999), we chose N_e = N_i, since this allowed us to use equal sizes of the synchronization clusters and the common input for excitatory and inhibitory synapses. However, it should be realized that anatomical studies show that the ratio of excitatory to inhibitory inputs of cortical neurons is typically about 9:1 (Braitenberg & Schüz, 1998). Notice that it would be easy to attain equal input and output firing rates in the model by choosing appropriate numbers of excitatory and inhibitory inputs. For example, since ⟨ISI⟩ = 10 ms is attained for λ_e = 100 Hz, N_e = 120, λ_i ≈ 33 Hz, and N_i = 120 (as is shown in Figure 4), choosing λ_i = 100 Hz and N_i = 40 results in equal input and output firing rates. The derived expressions for the moments of the steady-state potential U∞ and (to a lesser extent) the effective time constant τ_m provide a valuable link between the parameters of the conductance-based IF model and the input firing rates on the one hand and the output firing characteristics on the other hand. In particular, we found that the CV was about constant for a range of excitatory and inhibitory presynaptic firing rates if the excitation-
inhibition distribution was such that ⟨U∞⟩ = U_t + kσ(U∞) (see Figure 4). For k = 0, we observed CV ≈ 0.45, similar to results of Feng and Brown (1999) for the current-based IF model and Stein's model with reversal potentials. Furthermore, we found that for constant k, the induced firing rate increases as a function of the presynaptic firing rate, especially for small k. The analysis of the effect of synchronization clusters was performed for an excitatory firing rate λ_e = 100 Hz and inhibitory firing rates λ_i = 0 and λ_i such that ⟨U∞⟩ = U_t + kσ(U∞) for k ∈ {−1, 0, 1}. The choice of the excitatory firing rate hardly affects CV, since CV was shown to be approximately constant for constant k. This choice affects the output firing rate, but we expect the observed trends to be valid for other excitatory firing rates as well. A clear influence of the inhibitory firing rate was found on the dependence of the postsynaptic firing rate as a function of the input correlation. An increase in the firing rate of excitatory synchronous clusters for low to medium inhibitory input firing rates (leading to CV < 0.5) did not affect the output firing rate, but induced a clear increase in the output firing rate for high inhibitory input frequencies. Furthermore, the postsynaptic firing rate was found to be almost independent of the cluster size, especially for low to medium inhibitory firing rates. Input correlation thus affects the postsynaptic firing rate only if the inhibitory firing rate is sufficiently high. In contrast to the effectuated firing rate, the variation of the interspike intervals increased with the input correlation for low to medium inhibitory firing rates and was often constant for large inhibitory firing rates. In addition to the cluster correlation K (Brown & Feng, 2000), the cluster size also appears to be an important determinant of the variation in the spiking times.
To achieve CV > 1, both independent excitation and inhibition, as well as sufficiently large synchronous excitatory clusters, were essential, although the inhibitory firing rate could still be relatively low. In this case, the impact of the variations in the membrane potential due to the independent excitatory and inhibitory input, as well as due to the synchronization clusters, was sufficiently large to exceed the CV of a Poisson process. The effects of the cluster correlation and the cluster size depend on the independent excitatory and inhibitory background spiking. Given constant effective frequencies λ^b_e,i N_e,i, the impact of a synchronization cluster on the membrane potential (and thus on postsynaptic firing) does not depend on the population sizes. We considered correlation between excitatory inputs and/or between inhibitory inputs, but not between excitatory and inhibitory inputs. We expect that a positive correlation between excitatory and inhibitory inputs reduces the effective excitatory and inhibitory firing rates, since excitation and inhibition now tend to cancel each other. A negative correlation is expected to amplify the effect of excitatory input, since it implies that the probability that inhibitory input cancels the effect of excitatory input is reduced, and thus the threshold potential can be reached earlier.
In this study, we assumed delta correlation between the firing times in a synchronization cluster, which was motivated by the notion that we did not want to introduce yet another (dispersion) parameter. Figures 6 through 8 show that the postsynaptic firing rate is not greatly affected by the sizes of the synchronization clusters (sizes from 12.5% to 100% of all inputs). In other words, it does not make a large difference whether all inputs receive a spike simultaneously or whether different synchronization clusters are used. Therefore, we do not expect that the "no dispersion" assumption affects our conclusions with respect to the postsynaptic firing rate to a large extent. Only for large inhibitory contributions may some effect of dispersion on the postsynaptic firing rate be expected. In contrast, dispersion of the firing times in a synchronization cluster is expected to have a significant impact on the CV. Therefore, the size of an excitatory synchronization cluster needed to induce a particular CV, as we found in this study, represents a lower bound of the required cluster size if dispersion of the input spikes is considered. Given the delta correlation, the results of this study already show that large synchronization clusters are likely to be necessary to achieve CV > 1. Direct experimental evidence is needed to determine the size of synchronization clusters in vivo.

6.3 Cross-Correlation Due to Common and Synchronous Input. The cross-correlation between the firing of uncoupled neurons with common input depends on several parameters, such as the excitatory and inhibitory firing rates, the fraction of common input, and the presence of synchronization clusters in the input spike series. Interestingly, we found that common input may induce oscillatory cross-correlation patterns, which resemble the cross-correlation patterns obtained in measurements on neurons in the visual system encoding the same object (Singer & Gray, 1995).
However, in our model, these oscillatory cross-correlation patterns were found only for small coefficients of variation (CV < 0.5) in the output spiking, whereas CV ≈ 1 is usually measured for cortical neurons (Shadlen & Newsome, 1998). We found that even in the artificial case of completely identical excitatory inputs, the maximum cross-correlation is strongly reduced by independent inhibitory firing to about K_xy(0) = 0.09. For 50% common excitatory input, K_xy(0) = 0.03, and for presumably even more realistic ratios of less than 20% common input, K_xy(0) falls below 0.01. Interestingly, we found that the cross-correlation initially increases with increasing inhibitory firing rate but decreases again for large inhibitory firing rates. The incline of the cross-correlation is due to the increase in the importance of coincidental (common) excitatory firing to achieve postsynaptic firing for low and medium inhibitory firing rates. Its decline illustrates the rising importance of both a coincidental increase in (common) excitatory firing and a decrease in (independent) inhibitory firing for high inhibitory firing rates. If, in addition to the excitatory input, the inhibitory input also is partly
common, the zero-time cross-correlation is increased to K_xy(0) = 0.05 for 50% common input and K_xy(0) = 0.013 for 20% common input. These values are likely to be optimistic, though, since noisy synaptic transmission will further diminish the cross-correlation. Much higher cross-correlation can be achieved if (large) synchronization clusters exist in the input streams. For instance, if presynaptic spikes are part of a synchronization cluster 1 out of 10 times (K = 0.1), we found in our simulations K_xy(0) ≈ 0.06 for 20% common input and K_xy(0) ≈ 0.20 for 50% common input. Even if these correlations were attenuated by noisy synaptic transmission, they would still represent significant coincidental firing, similar to experimentally obtained values of about K_xy(0) ≈ 0.08–0.10 by Vaadia et al. (1995). Thus, if synchronization clusters are part of the common input, the output correlation can be sufficient to maintain synchronization clusters from one neural layer to the next. This conclusion is in line with the results of Diesmann et al. (1999), who furthermore reported on the required relation between the size of synchronization clusters and their dispersion such that synchrony can be maintained. In conclusion, large fractions of common input would be required to induce significant correlations of output spikes of pairs of neurons for input streams without synchronization clusters, but only modest fractions would be required if synchronization clusters are part of the input spike streams, provided there is no correlation between the excitatory and inhibitory inputs. Concerning the ontogenesis of synchronization clusters, this implies that they are unlikely to be generated by common input alone, but that lateral interactions are essential. Given the existence of sufficiently large synchronization clusters, significant postsynaptic synchronization may be maintained by common input.

Acknowledgments
We thank Martijn Leisink for his mathematical advice. We also acknowledge the useful comments of the reviewers, which led to an improvement of this article.

References

Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity. Berlin: Springer-Verlag.
Brown, D., & Feng, J. (2000). Impact of correlated inputs on the output of the integrate-and-fire models. Neural Computation, 12, 711–732.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. Journal of Computational Neuroscience, 3, 91–110.
Sybert Stroeve and Stan Gielen
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proceedings of the National Academy of Sciences USA, 95, 1259–1264.
Ernst, U., Pawelzik, K., & Geisel, T. (1998). Delay-induced multistable synchronization of biological oscillators. Physical Review E, 57, 2150–2162.
Feng, J., & Brown, D. (1999). Coefficient of variation of interspike intervals greater than 0.5. How and when? Biological Cybernetics, 80, 291–297.
Feng, J., Brown, D., & Li, G. (In press). Synchronization due to common pulsed input in Stein's model. Physical Review E.
Gerstner, W. (1998). Populations of spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 261–295). Cambridge, MA: MIT Press.
Juergens, E., & Eckhorn, R. (1997). Parallel processing by a homogeneous group of coupled model neurons can enhance, reduce and generate signal correlations. Biological Cybernetics, 76, 217–227.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Computation, 10, 1987–2017.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Koch, C., Bernander, Ö., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? Journal of Computational Neuroscience, 2, 63–82.
Marsalek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proceedings of the National Academy of Sciences USA, 94, 735–740.
Mason, A., Nicoll, A., & Stratford, K. (1991).
Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. Journal of Neuroscience, 11(1), 72–84.
Matsumura, M., Chen, D., Sawaguchi, T., Kubota, K., & Fetz, E. E. (1996). Synaptic interactions between primate precentral cortex neurons revealed by spike-triggered averaging of intracellular membrane potentials in vivo. Journal of Neuroscience, 16(23), 7757–7767.
Phillips, W. A., & Singer, W. (1997). In search of common foundations for cortical computation. Behavioral and Brain Sciences, 20, 657–722.
Protopapas, A. D., Vanier, M., & Bower, J. M. (1998). Simulating large networks of neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870–3896.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interaction in the monkey cortex in relation to behavioral events. Nature, 373(9), 515–518.
Vreeswijk, C. van (1996). Partial synchronization in populations of pulse-coupled oscillators. Physical Review E, 54, 5522–5537.

Received February 2, 2000; accepted October 10, 2000.
LETTER
Communicated by Peter Latham
Attention Modulation of Neural Tuning Through Peak and Base Rate

Hiroyuki Nakahara
Si Wu
Shun-ichi Amari
RIKEN Brain Science Institute, Wako, Saitama, 351-0198, Japan

This study investigates the influence of attention modulation on neural tuning functions. Experiments have shown that attention modulation alters neural tuning curves. Although the exact functions of attention are still under debate, attention is considered at least to help resolve limited processing capacity and to increase sensitivity to the attended stimulus. Inspired by recent experimental results on attention modulation, we investigate the influence of changes in the height and base rate of the tuning curve on the encoding accuracy, using the Fisher information. Under an assumption of stimulus-conditional independence of neural responses, we derive explicit conditions that determine when the height and base rate should be increased or decreased to improve encoding accuracy. Notably, a decrease in the tuning height and base rate can improve the encoding accuracy in some cases. Our theoretical results can predict the effective size of attention modulation on the neural population with respect to encoding accuracy. We discuss how our method can be used quantitatively to evaluate different aspects of attention function.

1 Introduction
One of the central issues in neuroscience is to understand how an ensemble of neurons encodes the external world. Many theoretical studies have addressed the accuracy of neural population coding with respect to tuning functions (Paradiso, 1988; Lehky & Sejnowski, 1988; Seung & Sompolinsky, 1993; Salinas & Abbott, 1994; Brunel & Nadal, 1998; Zohary, 1992; Sanger, 1998; Zemel, Dayan, & Pouget, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998; Oram, Földiák, Perrett, & Sengpiel, 1998). The approaches taken by these studies often use the Fisher information in terms of tuning functions (Paradiso, 1988; Seung & Sompolinsky, 1993; Snippe, 1996; Pouget, Zhang, Deneve, & Latham, 1998; Zhang & Sejnowski, 1999; Deneve, Latham, & Pouget, 1999; Abbott & Dayan, 1999). This is because the inverse of the Fisher information gives the Cramér-Rao lower bound for decoding errors. In particular, the effect of the tuning width as a parameter of the tuning function has been investigated (Seung & Sompolinsky, 1993; Brunel & Nadal, 1998;
Neural Computation 13, 2031–2047 (2001) © 2001 Massachusetts Institute of Technology
Pouget, Deneve, Ducom, & Latham, 1999; Zhang & Sejnowski, 1999). Using this theoretical framework, this study investigates the influence of attention modulation on encoding accuracy through the heights and base rates of neural tuning functions. Our investigation is inspired by recent experimental findings (McAdams & Maunsell, 1999a; Treue & Martínez-Trujillo, 1999). Experimental studies have shown that attention alters the properties of tuning functions (Moran & Desimone, 1985; Motter, 1993; Treue & Maunsell, 1996; Luck, Chelazzi, Hillyard, & Desimone, 1997; Reynolds, Chelazzi, & Desimone, 1999). Some experimental studies have suggested that attention induces systematic narrowing of tuning widths (Spitzer, Desimone, & Moran, 1988; Haenny & Schiller, 1988), whereas other studies have failed to show evidence of narrowing tuning widths (Vogels & Orban, 1990). Recently McAdams and Maunsell (1999a) investigated this question of whether attention modulation leads to a narrowing of neural tuning widths. Interestingly, they have shown for neurons in V4 that attention modulation does not change the tuning width but rather increases the height, measured from the base rate, of the tuning function, while attention modulation also increases the base rate. Treue and Martínez-Trujillo (1999) observed similar phenomena for neurons in MT. It is then important to ask how attention modulation via the height and base rate, rather than the width, of the tuning function affects the encoding accuracy of a neural population. This article answers this question by assuming a Poisson spike distribution with stimulus-conditional independence. The study shows, using the Fisher information as a measure, that encoding accuracy is improved by an increase in both the height and base rate of neurons whose tuning-function centers are close to the given stimulus.
At the same time, we show that the accuracy can be improved not by an increase but by a decrease in the height and base rate when the centers are not so close to the given stimulus. A mathematical analysis provides explicit conditions that determine when the height and base rate should be increased or decreased together to improve encoding accuracy. With this analysis, we show the optimal size of an attention region, inside which both the tuning heights and base rates of a neural population should be increased, and outside which both should be decreased, in order to optimize the overall accuracy. The analysis treats only the case of stimulus-conditionally independent neural responses, which is the simplest case to begin with. One very important case is correlated firing, which can decrease the total Fisher information (Yoon & Sompolinsky, 1999). For some correlational structures, some of the important results derived in this study still hold qualitatively in the correlated case with attention, provided both the central limit theorem and the law of large numbers remain valid. Further discussion is provided later, although a detailed analysis of such cases remains for future work. Finally, a brief discussion is given in relation to a competitive attention model (Desimone & Duncan, 1995).
2 Analysis

2.1 Formulation. Let us first consider the situation where a population of M neurons (m = 1, …, M) encodes a stimulus with respect to its one-dimensional feature, denoted by x. For example, x can be an orientation selectivity scaled by [−90°, 90°]. Given x, we suppose that the mth neuron emits n_m spikes with probability p(n_m; y_m(x)), where y_m stands for the tuning function of the mth neuron. For example, if p(n_m; y_m(x)) is the Poisson distribution, y_m(x) is the parameter indicating the mean firing rate given the stimulus x. The entire Fisher information I, defined below, provides a good measure for the decoding accuracy of x. This is because the inverse of I sets the Cramér-Rao lower bound on the mean squared decoding error ε = x̂ − x, where x̂ is an estimate of x, decoded from the activities of all the neurons. This is written as

E[ε²] ≥ 1/I,

where E denotes expectation. The bound can be achieved by using the maximum likelihood method, asymptotically in general, and exactly when the probability distribution is of the exponential family and the expectation parameters are used (Lehmann, 1983). It has been suggested that biologically plausible neural models may almost achieve this bound (Pouget et al., 1998; Deneve et al., 1999). For simplicity, we assume that the joint probability of spikes of a neural population is uncorrelated given a stimulus x, that is, the stimulus-conditional independence assumption; this assumption is similar to many previous studies (Paradiso, 1988; Seung & Sompolinsky, 1993; Pouget et al., 1998; Zhang & Sejnowski, 1999). In this case, we can write

I(x) = Σ_{m=1}^{M} I_m(x),    (2.1)

where I_m(x) is the Fisher information of the mth neuron given x and is defined by

I_m(x) ≡ E_m[(d log p(n_m; y_m(x))/dx)²],    (2.2)

where E_m denotes expectation with respect to the probability p(n_m; y_m(x)). In this study, we suppose that the spike probability distribution p(n_m; y_m(x)) is Poisson:

p(n_m; y_m(x)) = {y_m(x)Δt}^n e^{−y_m(x)Δt} / n!,    (2.3)

where Δt indicates the unit time period used to count the number of spikes. To simplify the derivations, we take the time period as one unit in the following (Δt = 1).
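For the Poisson model, equation 2.2 reduces to I_m(x) = y_m'(x)²/y_m(x): the score is d log p/dx = (n_m/y_m(x) − 1) y_m'(x), and Var(n_m) = y_m(x). The short numerical check below (ours, not from the article; the Gaussian tuning parameters are the illustrative ones used later in Figure 1) evaluates the expectation in equation 2.2 directly over the Poisson pmf and compares it with the closed form:

```python
import math

def y(x, a=30.0, b=3.0, c=0.0, s=30.0):
    """Gaussian tuning curve y(x) = a*exp(-(x - c)^2 / s^2) + b (cf. eq. 2.4)."""
    return a * math.exp(-((x - c) ** 2) / s ** 2) + b

def dy(x, h=1e-6):
    """Numerical derivative y'(x) by central differences."""
    return (y(x + h) - y(x - h)) / (2 * h)

def fisher_closed_form(x):
    """For Poisson spiking, I(x) = y'(x)^2 / y(x)."""
    return dy(x) ** 2 / y(x)

def fisher_by_expectation(x, nmax=400):
    """Evaluate E[(d log p(n; y(x))/dx)^2] of eq. 2.2 directly:
    for Poisson, d log p/dx = (n / y(x) - 1) * y'(x)."""
    lam, d = y(x), dy(x)
    p = math.exp(-lam)                      # p(0); higher terms built iteratively
    total = p * ((0.0 / lam - 1.0) * d) ** 2
    for n in range(1, nmax):
        p *= lam / n                        # p(n) = p(n-1) * lam / n, avoids overflow
        total += p * ((n / lam - 1.0) * d) ** 2
    return total

x0 = 20.0
print(fisher_closed_form(x0), fisher_by_expectation(x0))
```

The two numbers agree because the variance of a Poisson count equals its mean, so the expectation collapses to y'(x)²/y(x).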
2.2 Tuning Function. We set the tuning function y_m(x) of the mth neuron as

y_m(x) = y(x; a_m, b_m, c_m, σ_m) = a_m φ_m(x; c_m, σ_m) + b_m,    (2.4)

where b_m stands for the background noise, or the spontaneous firing rate, which we call the base rate in this article, and c_m and σ_m are the center and the tuning width of the tuning function, respectively; φ_m(x) = φ_m(x; c_m, σ_m) determines the shape of the tuning function, normalized so that φ_m(x) = 1 at x = c_m. It is usually radially symmetric and monotonically decreasing in |x − c_m|, satisfying φ_m(x) → 0 as |x − c_m| → ∞. A typical example is given by φ_m(x) = exp(−(x − c_m)²/σ_m²). To be concrete, this form will be used in the following analysis, although the specific form of φ(x) plays a minor role in our analysis. The height a_m of the tuning function is measured from the base rate b_m. Figure 1 (top) shows an example of the tuning function.

2.3 Fisher Information of a Single Neuron. Let us first evaluate the Fisher information of a single neuron. The subscript m is dropped in this section when no confusion occurs. Since the inverse of the Fisher information I (= I_m) sets the lower bound of the encoding error, a larger value of I implies a smaller encoding error, that is, an increased sensitivity. For attention modulation, then, it is natural to expect that I(x) will be increased by attention for stimulus x, so that stimulus x is encoded more accurately. The Fisher information is computed as

I = I(x; a, b, c, σ)
  = E_p[(∂ log p(n; y(x))/∂x)²]
  = (1/y(x)) (∂y(x)/∂x)²
  = (4(x − c)² φ²(x)/σ⁴) · a²/(aφ(x) + b).    (2.5)
Figure 1 (middle) shows I(x), given the tuning function in Figure 1 (top). Note that the maximum of I(x) is not at the center of the tuning function. This is because the first derivative of the tuning function with respect to x becomes zero at the center. Given the form of I(x) in equation 2.5, it is clear that an increase in the base rate b leads to a decrease in I(x). By checking ∂I/∂a, it is easy to see that an increase in the height a leads to an increase in I for any x. Hence, if possible, the optimal strategy for altering the tuning function with respect to a and b is to increase a and decrease b at the same time, so that I(x) is increased for any
Figure 1: (Top) An example of a tuning function: y_m(x) = aφ_m(x; c_m, σ_m) + b_m, where φ(x) = exp(−(x − c_m)²/σ_m²) and a = 30, b = 3, σ = 30, c = 0. (Middle) Fisher information I(x) given the above tuning function. (Bottom) The distance l* is indicated as a vertical line measured from the center c = 0, where (δa, δb) = (±4, ±2) is used. With these parameters, l* = 32.92. The tuning function and the Fisher information are also superimposed, with scaling, to illustrate their relationship with l* in this example.
x. In contrast to this optimal strategy, the experimental results (McAdams & Maunsell, 1999a; Treue & Martínez-Trujillo, 1999) indicate that under attention modulation the height and base rate (a, b) increase or decrease together. This may be caused by physiological constraints on the neurons. In the rest of the article, therefore, we will be concerned only with the two cases: a and b increase or decrease together.
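As a sanity check of equation 2.5 (ours, not the authors'; Figure 1's parameters a = 30, b = 3, σ = 30, c = 0 are assumed), the closed form can be compared against the direct expression I(x) = y'(x)²/y(x), and I(c) = 0 confirms that the peak of I(x) is off-center:

```python
import math

a, b, c, s = 30.0, 3.0, 0.0, 30.0   # Figure 1 parameters

def phi(x):
    """Gaussian tuning shape of eq. 2.4."""
    return math.exp(-((x - c) ** 2) / s ** 2)

def y(x):
    return a * phi(x) + b

def fisher_closed(x):
    """Right-hand side of eq. 2.5."""
    return (4 * (x - c) ** 2 * phi(x) ** 2 / s ** 4) * a ** 2 / (a * phi(x) + b)

def fisher_numeric(x, h=1e-6):
    """I(x) = (1/y)(dy/dx)^2 with a central-difference derivative."""
    dydx = (y(x + h) - y(x - h)) / (2 * h)
    return dydx ** 2 / y(x)

print(fisher_closed(25.0), fisher_numeric(25.0))   # the two expressions agree
print(fisher_closed(0.0))                          # zero at the center c
```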
Let us inspect the variational change of I, denoted by δI, with respect to small changes in the height and base rate, denoted by δa and δb:

δI = (∂I/∂a) δa + (∂I/∂b) δb,    (2.6)

from which we obtain

δI = (4a²(x − c)² φ²(x) / (σ⁴ y²(x))) {(φ(x) + 2b/a) δa − δb}.    (2.7)
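Equation 2.7 can be verified against finite-difference partials ∂I/∂a and ∂I/∂b of equation 2.5 (a quick check of ours, with Figure 1's illustrative parameters; the evaluation point x = 20 is arbitrary):

```python
import math

a, b, c, s = 30.0, 3.0, 0.0, 30.0   # Figure 1 parameters (illustrative)

def phi(x):
    return math.exp(-((x - c) ** 2) / s ** 2)

def I(x, a_, b_):
    """Equation 2.5 with the height and base rate as explicit arguments."""
    return (4 * (x - c) ** 2 * phi(x) ** 2 / s ** 4) * a_ ** 2 / (a_ * phi(x) + b_)

def dI_closed(x, da, db):
    """Equation 2.7: prefactor times {(phi + 2b/a) da - db}."""
    yx = a * phi(x) + b
    pref = 4 * a ** 2 * (x - c) ** 2 * phi(x) ** 2 / (s ** 4 * yx ** 2)
    return pref * ((phi(x) + 2 * b / a) * da - db)

def dI_numeric(x, da, db, h=1e-6):
    """First-order change built from finite-difference partials of eq. 2.5."""
    dIa = (I(x, a + h, b) - I(x, a - h, b)) / (2 * h)
    dIb = (I(x, a, b + h) - I(x, a, b - h)) / (2 * h)
    return dIa * da + dIb * db

print(dI_closed(20.0, 4.0, 2.0), dI_numeric(20.0, 4.0, 2.0))
```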
When does δI increase? Let us denote a specific value of an attended stimulus by x* for clarity. Attention modulation gives changes in a and b, that is, δa and δb, which are, for example, positive in a neighborhood of x* and negative outside it. That is, these changes may depend on the relative position (c) of the neuron and the attended stimulus x*, that is, on ‖x* − c‖. Here, we search for the condition under which I(x) is enhanced by the attended stimulus x*. In order to see whether attention to the stimulus x* enhances the estimation of x, we analyze the case when the sensory stimulus x is close to it (x ≈ x*), that is, when attention is approximately correct. Provided φ(x) = exp(−(x − c)²/σ²), an inspection of equation 2.7 leads to the following condition:

when l* > ‖x* − c‖,  δI(x*) > 0  for δa, δb > 0;
when l* < ‖x* − c‖,  δI(x*) > 0  for δa, δb < 0,    (2.8)
where l* is defined by

l* = σ √(−log(δb/δa − 2b/a)),   if δb/δa − 2b/a > 0;
l* = ∞,                          if δb/δa − 2b/a ≤ 0,    (2.9)
where δb/δa < 1 is presumed, based on an experimental finding (McAdams & Maunsell, 1999a), which implies δb/δa − 2b/a < 1. Note that a simultaneous decrease in the height and base rate can increase the Fisher information when an attended stimulus x*, and hence x, which is supposed to be in a small neighborhood of x*, is relatively far from the center of the tuning function (i.e., l* < ‖x* − c‖), provided l* ≠ ∞. This result is illuminating. We tend to interpret a decrease in the neural firing, or a decrease in the height and base rate of the neural tuning curve, as an indication that the neuron is modulated so as not to contribute to further information processing. In contrast, this result suggests that a decrease in the height and base rate of a neuron under some conditions actually helps the neuron encode the attended stimulus more accurately. In Figure 1 (bottom), as an example, l* is plotted with respect to the tuning function and the Fisher
information in Figure 1 (top) and (middle), respectively, with a specific value of δa and δb. In this example, the peak of the Fisher information is located inside l*. The above analysis is done by inspecting the variational change of I(x). It indicates that the Fisher information I(x) increases through a decrease in the height and base rate for neurons whose tuning-function center c satisfies l* < ‖x* − c‖. Since the magnitude of I(x) decreases as ‖x* − c‖ becomes larger after attaining the maximum of I(x), one may wonder whether the region

R⁻ = {c | l* < ‖x* − c‖}
is significant. For example, if I(x*) ≈ 0 for neurons in the region x* ∈ R⁻, the increase in I(x*) by δa, δb < 0 does not seem so important. In fact, however, R⁻ is not trivial, at least in some cases. To demonstrate this, let us define

x_max = arg max_x I(x, c),    l_max = ‖x_max − c‖,

where the distance l_max is given by solving dI(x, c)/dx = 0, or

l_max = σ √((aφ(x_max) + b) / (aφ(x_max) + 2b)).    (2.10)
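Equations 2.9 and 2.10 can be evaluated numerically with Figure 1's parameters (a = 30, b = 3, σ = 30) and (δa, δb) = (4, 2); this sketch (ours) reproduces l* = 32.92 and locates x_max by brute force to confirm the fixed-point relation 2.10. For these particular parameters, l_max turns out to be smaller than l*:

```python
import math

a, b, c, s = 30.0, 3.0, 0.0, 30.0   # Figure 1 parameters
da, db = 4.0, 2.0                   # modulation steps used in Figure 1

def phi(x):
    return math.exp(-((x - c) ** 2) / s ** 2)

def I(x):
    """Equation 2.5."""
    return (4 * (x - c) ** 2 * phi(x) ** 2 / s ** 4) * a ** 2 / (a * phi(x) + b)

# Equation 2.9: the critical distance l*
r = db / da - 2 * b / a
lstar = s * math.sqrt(-math.log(r)) if r > 0 else float("inf")
print("l* =", lstar)                # 32.92 for these parameters

# Equation 2.10: locate x_max by brute-force maximization of I(x) on a grid
xmax = max((i * 0.001 for i in range(1, 90001)), key=I)
lmax = abs(xmax - c)
rhs = s * math.sqrt((a * phi(xmax) + b) / (a * phi(xmax) + 2 * b))
print("l_max =", lmax, "rhs of eq. 2.10 =", rhs)
```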
The relationship between l* and l_max (i.e., whether l* > l_max or l_max > l*) can vary, depending on the parameters. In other words, there are cases in which, even for x_max, which gives the maximum Fisher information, not an increase but a decrease in the height and base rate yields an increase in the Fisher information. It is only when δb/δa − 2b/a > 0 that a decrease in the height and base rate may induce an improvement in the encoding accuracy for x* with l* < ‖x* − c‖. The condition δb/δa − 2b/a > 0, or equivalently (1/2)(δb/b) > δa/a, may hence be useful in investigating the property of each neuron in experimental data. The distance l* is a useful measure for determining whether to increase or decrease a and b for neuron c so as to attain δI(x*) > 0 for an attended stimulus x*, given the tuning-function parameters of the neuron.

Examples. Figure 2 shows different examples of tuning functions with respect to l*. Figure 2A shows a tuning function and its l*, using the same parameters as in Figure 1 (top). Figure 2B shows a tuning function that has the same parameters as the function in Figure 2A except for a smaller width σ. Figures 2C and 2D show two cases where l* = ∞, so that I(x) always increases for δa, δb > 0 for any x. Figure 2E indicates an example where δb/δa ≈ 1. The plot in Figure 2F is inspired by the observations of McAdams and Maunsell (1999a). This tuning function has a high base rate and a low height in the
normal (unattended) condition and can have a relatively larger height increase in the attended condition. These two characteristics, together or either one alone, can easily lead to l* = ∞ because (1/2)(δb/b) < δa/a.

2.4 Population Coding. Let us now consider a population of M neurons (m = 1, …, M), where the heights, base rates, and widths of all neurons are the same (a_m = a, b_m = b, σ_m = σ) in a normal or unattended condition. We search for the condition under which, in this neural population, attention modulation leads to an improvement in the encoding accuracy for the attended stimulus x*, where the sensory stimulus is assumed to be close to the attended stimulus (x ≈ x*). In other words, the Fisher information I(x*) = Σ_{m=1}^{M} I(x*, c_m) (by equation 2.1) should increase. The analysis in the previous section leads us to conclude that the optimal attention modulation is, for each neuron m,
δa_m, δb_m > 0,  when ‖x* − c_m‖ < l*;
δa_m, δb_m < 0,  when ‖x* − c_m‖ > l*.
Now let us assume δa_m = ±δa and δb_m = ±δb (δa, δb > 0) for simplicity. Then

(δa_m, δb_m) = (δa, δb),    when c_m ∈ R⁺(x*);
(δa_m, δb_m) = (−δa, −δb),  when c_m ∈ R⁻(x*),

where

R⁺(x*) = {c | ‖x* − c‖ < l*},    l* = σ √(−log(δb/δa − 2b/a)).
Thus, to increase the Fisher information, the attention modulatory effect should be positive (i.e., δa, δb > 0) on the neurons whose center c_m is within the region R⁺(x*), that is, c_m ∈ R⁺(x*), whereas it should be negative in R⁻(x*). In this manner, the distance l* is also useful in determining an effective attention region for x*. Figure 3 (top) shows an example of a neural population that uses the same parameters as in Figure 1, with different centers. Provided the attended stimulus is x* = 0, we obtain R⁺(x*) = {c | ‖c‖ < l*}, where l* = σ √(−log(ν − 2b/a)) = 32.92 and ν = ν_m = δb/δa = 1/2 (see Figure 3, bottom).
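The modulation rule above can be checked end to end on the Figure 3 configuration (our sketch; centers and step sizes taken from the caption): applying (+δa, +δb) to neurons inside R⁺(x*) and (−δa, −δb) to those outside increases the total Fisher information at x* = 0.

```python
import math

a, b, s = 30.0, 3.0, 30.0            # shared tuning parameters (Figure 1)
da, db = 4.0, 2.0                    # modulation step sizes from the caption
centers = [0.0, 15.0, -15.0, 30.0, -30.0, 45.0, -45.0, 60.0, -60.0]
xstar = 0.0                          # attended stimulus

lstar = s * math.sqrt(-math.log(db / da - 2 * b / a))   # eq. 2.9: 32.92

def fisher(x, a_, b_, c):
    """Single-neuron Fisher information, eq. 2.5."""
    ph = math.exp(-((x - c) ** 2) / s ** 2)
    return (4 * (x - c) ** 2 * ph ** 2 / s ** 4) * a_ ** 2 / (a_ * ph + b_)

before = sum(fisher(xstar, a, b, cm) for cm in centers)

after = 0.0
for cm in centers:
    sign = 1.0 if abs(xstar - cm) < lstar else -1.0   # increase inside R+, decrease outside
    after += fisher(xstar, a + sign * da, b + sign * db, cm)

print(before, after)   # the modulated population carries more Fisher information at x*
```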
2.5 Experimental Testing. This section briefly describes how our analysis can be used to design experiments for future studies. First, x in our analysis represents any one-dimensional stimulus parameter falling in the receptive field of a neuron, such as the orientation of a grating, the direction of random-dot motion, or the location of a stimulus. Second, we acknowledge two typical experimental paradigms with attention (Moran
Figure 2: Examples of different tuning functions with respect to the distance l*. Vertical solid lines indicate the distance l* in each graph, measured from the center of the tuning function at x = 0. The tuning function in the normal condition is plotted with a solid line, and a possible tuning function modulated by attention is plotted with a dotted line. The vertical dotted line indicates the base rate in the normal condition. When the Fisher information always increases for δa, δb > 0 for any x (that is, l* = ∞), the distance l* is put at x = ±90 for purposes of presentation.
Figure 3: An example of population coding. (Top) The tuning functions of the neural population are plotted, using the same parameters as in Figure 1. The centers of the neurons are located at c = 0, ±15, ±30, ±45, ±60. The horizontal dotted line indicates the base rate b. (Bottom) Using the same parameters (δa, δb) = (±4, ±2) as in Figure 1, the tuning heights and base rates have been changed to increase the Fisher information I = Σ_m I(x*, c_m) for the attended stimulus x* = 0, according to the distance l* = σ √(−log(ν − 2b/a)) = 32.92, where ν = ν_m = δb_m/δa_m = 1/2. If c_m < l*, δa = 4 (and δb = 2), and δa = −4 otherwise. The vertical dotted line indicates l* = 32.92.
& Desimone, 1985; Vogels & Orban, 1990; Motter, 1993; Treue & Maunsell, 1996, 1999; Luck et al., 1997; McAdams & Maunsell, 1999a; Reynolds et al., 1999). In these paradigms, subjects are presented with identical stimuli (or sets of stimuli) in the receptive field of a neuron across trials, but attention may shift between two different conditions across trials. In one paradigm, attention shifts to different values of x. Typically, attended and unattended
conditions are those in which the most preferred and least preferred values of x, respectively, are attended by subjects. In the other paradigm, trials that require attention to any x value are interleaved with trials that require attention to a different modality or feature, unrelated to x. Here, the former and the latter are the attended and unattended conditions, respectively. Our analysis is formulated to test how the peak and base rate of the tuning curve, plotted in the normal condition where no specific x value is attended by subjects, should be changed to improve the accuracy of x in the attended condition, where a specific value of x is attended. It is interesting to check the effective size l* of attention by experimentation. Strictly speaking, however, there is no such normal condition in the above experimental paradigms, because it is difficult to define the normal condition in experimental manipulations. In the second paradigm, if attention paid to a different modality does not alter the tuning function, the unattended condition can be treated as the normal condition, so our analysis is directly applicable. For example, McAdams and Maunsell (1999a, Fig. 4) provide the values of the peaks and base rates, averaged over the neural population in V4, in the attended and unattended conditions in this paradigm. From their data, we obtain (a, b) = (0.551, 0.295) in the unattended condition and (δa, δb) = (0.674, 0.333) − (0.551, 0.295) = (0.123, 0.038), and hence δa/a > δb/(2b). According to our analysis, the encoding accuracy of this tuning curve then increases for any stimulus x by increasing the peak and base rate, which matches the experimental result in Figure 4 of McAdams and Maunsell (1999a). We must mention caveats in applying our analysis to their data: attention paid to a different modality is assumed not to alter the tuning curve.
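The arithmetic on the McAdams and Maunsell (1999a) population averages quoted above is simple to reproduce (our check; variable names are ours): δb/δa − 2b/a comes out negative, so by equation 2.9 l* = ∞ and increasing both peak and base rate improves accuracy for any x.

```python
a, b = 0.551, 0.295              # unattended peak height and base rate
a_att, b_att = 0.674, 0.333      # attended values
da, db = a_att - a, b_att - b    # (0.123, 0.038)

r = db / da - 2 * b / a          # the sign of r decides l* in eq. 2.9
print(round(da, 3), round(db, 3), round(r, 3))
```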
Their data are averages over a neural population in which not all neurons have identical parameters, whereas our analysis treated their data as if sampled from the same tuning curve. Our analysis is for the case where attention is approximately correct (x ≈ x*) and rests on two assumptions: the stimulus-conditional independence assumption and the assumption that attention modulation does not modify the correlational structures, which seems to be the case (McAdams & Maunsell, 1999b). The analysis is not directly applicable to the first paradigm. Since the analysis takes a variational approach, however, it can treat a relative change between the two conditions. For example, suppose that the magnitude of the changes is the same between the two conditions ((δa, δb) = (±α, ±β), where α, β > 0). Then the analysis implies that the relative change in the peak should always be nonnegative, 2α or 0.

3 Remarks

3.1 Higher-Dimension Inputs. The analysis so far has been performed with respect to a one-dimensional stimulus parameter x. It is straightforward to carry out the same analysis for higher-dimensional parameters. Let us
denote the n-dimensional input vector and the center of the tuning function by x and c, respectively. Let the tuning function be

y(x) = a φ(x; c, Σ) + b,

where we assume φ(x) = exp(−(x − c)ᵀ Σ⁻¹ (x − c)) and ᵀ stands for transpose. Given the Poisson distribution for spikes, we obtain

I(x) = (a² / (aφ(x) + b)) (∂φ(x)/∂x)(∂φ(x)/∂x)ᵀ.
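For illustration (our sketch; the diagonal Σ and the evaluation point are arbitrary assumptions), the matrix I(x) above is a scaled outer product of the gradient of φ, hence symmetric, rank one, and positive semidefinite:

```python
import math

a, b = 30.0, 3.0
Sigma_inv = [[1.0 / 900.0, 0.0], [0.0, 1.0 / 400.0]]   # illustrative diagonal Sigma^{-1}
c = [0.0, 0.0]

def phi(x):
    """2-D Gaussian tuning shape exp(-(x - c)^T Sigma^{-1} (x - c)), diagonal case."""
    q = sum(Sigma_inv[i][i] * (x[i] - c[i]) ** 2 for i in range(2))
    return math.exp(-q)

def grad_phi(x):
    """Gradient of phi; for diagonal Sigma^{-1}, component i is -2 s_ii (x_i - c_i) phi."""
    p = phi(x)
    return [-2.0 * Sigma_inv[i][i] * (x[i] - c[i]) * p for i in range(2)]

def fisher_matrix(x):
    """I(x) = a^2/(a phi + b) * grad_phi grad_phi^T: a rank-one outer product."""
    g = grad_phi(x)
    coef = a ** 2 / (a * phi(x) + b)
    return [[coef * g[i] * g[j] for j in range(2)] for i in range(2)]

M = fisher_matrix([10.0, 5.0])
print(M)
```

Because the matrix is rank one, its determinant vanishes, which is why the text suggests interpreting δI(x) > 0 in the positive-definiteness sense over the summed population matrix rather than neuron by neuron.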
An analysis of this function I(x) can be performed in the same manner as in the previous section. It should be noted that since I is a matrix, δI(x) > 0 should be interpreted in the sense of positive definiteness. Another way is to use √det(I), which is the information volume element of the parameter space. It is obvious that with this I(x), the results will be similar to those in the one-dimensional input case. If we make an additive assumption (McAdams & Maunsell, 1999a; Treue & Martínez-Trujillo, 1999) on attention modulation for different stimulus parameters, a variational change in the Fisher information is given as

δI = (∂I/∂a) δa* + (∂I/∂b) δb*,  where δa* = Σ_{i∈S} δa_i,  δb* = Σ_{i∈S} δb_i.    (3.1)
In this study, the effects of the height and base rate on the Fisher information are investigated as parameters, while the tuning width is fixed. We note the results of Zhang and Sejnowski (1999). They have shown that sharpening the tuning width does not improve the encoding accuracy if the input dimension is higher than two, while the height is fixed and the base rate is zero.

3.2 Gaussian Spike Distribution. The Poisson distribution is assumed for neural spikes in this study. Another popular assumption is gaussian. If the spike count n ~ N(y(x), σ_n²) is assumed, where σ_n² is independent of the mean firing rate y(x) (and the stimulus-conditional independence is also assumed), then, interestingly enough, the Fisher information does not depend on the base rate. In this case, the conclusions can be different. It is known, however, that the variance of the spikes usually depends on the mean firing rate (Dean, 1981; Tolhurst, Movshon, & Dean, 1983; Britten, Shadlen, Newsome, & Movshon, 1992; Lee, Port, Kruse, & Georgopoulos, 1998; McAdams & Maunsell, 1999b). For example, if we fit the spike distribution with n ~ N(y(x), k y(x)^p), where k is a constant, the parameter p is measured experimentally as 1 < p < 2, approximately around 1.5. In this region of p, we can find an l* that shows a similar tendency as in the case of the Poisson distribution (Nakahara & Amari, submitted).
4 Discussion
This study investigates the effect of the height and base rate, as parameters of the neural tuning curve, on the encoding accuracy in a framework of population coding. Using the Fisher information as a measure of encoding accuracy, we derived explicit conditions that determine whether the tuning height and base rate should be increased or decreased together to improve the encoding accuracy for an attended stimulus, where the sensory stimulus is close to the attended stimulus. Thus, the analysis indicates how a population of neural activities should be changed by attention modulation to improve encoding accuracy. Notably, encoding accuracy can be improved by decreasing the tuning height and base rate for some neurons (that is, in the case ‖x* − c‖ > l*). On the surface, this result may seem to contradict the "biased competition model" of attention (Desimone & Duncan, 1995), or, in more general terms, the idea of attention serving to select neurons for further information processing (Posner & Petersen, 1990), for example, the "spotlight" hypothesis (Crick, 1984; Koch & Ullman, 1985; Olshausen, Anderson, & Van Essen, 1993). Our analysis, however, treats the encoding accuracy of population coding under attention, whereas the competition model concerns the decoding in population coding. Hence, our results can be considered complementary to the biased competition model. Our analysis is testable in experiments. Since the analysis is applicable to the tuning curve of each neuron, it may help to reveal different roles of attention modulation among neurons with different tuning parameters. Any discrepancy in l* between experimental observations and the predictions of the analysis quantitatively indicates roles of attention other than improving encoding accuracy. Let us briefly discuss the limitations and future goals of this study. First, the analysis is developed for the case where there is only one stimulus x in the receptive field of a neuron.
It has been experimentally shown that attention modulation is more prominent with two stimuli in the receptive field, either of which subjects pay attention to in different trials (Moran & Desimone, 1985; Treue & Maunsell, 1996; Luck et al., 1997; Reynolds et al., 1999). This fact is considered to indicate biased competition (Desimone & Duncan, 1995; Luck et al., 1997; Reynolds et al., 1999). There does not seem to be general agreement on a parametric form of a neural tuning curve in the two-stimuli case. Once the form of the tuning curve is determined, it would be very interesting to apply our approach to it (Zohary, 1992; Anderson & Van Essen, 1994; Luck et al., 1997; Zemel et al., 1998; Reynolds et al., 1999). Second, in the future we need to investigate the interaction of the two tuning curve parameters of this study, the height and base rate, with other parameters such as the width and even the center of tuning functions (Moran & Desimone, 1985; Spitzer et al., 1988; Connor, Gallant, Preddie, & Van Essen, 1996; Connor, Preddie, Gallant, & Van Essen, 1997; Lee, Itti, Koch, & Braun, 1999). Third, our analysis is performed under the stimulus-conditional independence assumption.
H. Nakahara, S. Wu, and S. Amari
When there is correlation over neural activities, the total Fisher information is not simply the sum of the component Fisher informations. This discrepancy has been shown to be large when some correlation structures are assumed but small under other structures (Zohary, Shadlen, & Newsome, 1994; Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999). In general, when the correlation decreases rapidly as the distance between two neurons increases, we can apply the standard asymptotic technique of statistics. In such a case, our results still hold qualitatively. However, when attention increases correlation in addition to changing the tuning curves, we should be careful in carrying out a quantitative evaluation of $l^*$. Whether the nature of $l^*$ would change, given some correlation structures, is an important topic for future study (Nakahara & Amari, submitted). Finally, our analysis is limited to encoding accuracy. We are mostly interested in how attention-modulated neural responses are decoded to reach decision making (Britten et al., 1992; Shadlen, Britten, Newsome, & Movshon, 1996; Thompson, Hanes, Bichot, & Schall, 1996; Zemel et al., 1998; Pouget et al., 1998; Kim & Shadlen, 1999; Wu, Nakahara, Murata, & Amari, 2000). In this respect, it is necessary to consider what happens with neural activities when only some dimensional feature values are relevant (Lehky & Sejnowski, 1988, 1999; Zohary, 1992).
Acknowledgments
H. N. thanks W. Suzuki, K. Doya, O. Hikosaka, A. Robert, M. Shadlen, J. A. Movshon, and N. Katsuyama for their helpful discussions. H. N. is supported by Grants-in-Aid 11780589 and 12210159 from the Ministry of Education, Science, Sports and Culture.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Anderson, C. H., & Van Essen, D. C. (1994). Neurobiological computational systems. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational intelligence imitating life (pp. 213–222). New York: IEEE Press.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12, 4745–4765.
Brown, E. N., Frank, L. M., Tang, D., Quirk, C. M., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18(18), 7411–7425.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Attention Modulation of Neural Tuning
Connor, C. E., Gallant, J. L., Preddie, D. C., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. Journal of Neurophysiology, 75, 1306–1308.
Connor, C. E., Preddie, D. C., Gallant, J. L., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. Journal of Neuroscience, 17, 3201–3214.
Crick, F. (1984). Function of the thalamic reticular complex: The searchlight hypothesis. Proceedings of the National Academy of Sciences, USA, 81, 4586–4590.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research, 44, 437–440.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Haenny, P. E., & Schiller, P. H. (1988). State dependent activity in monkey visual cortex. I. Single cell activity in V1 and V4 on visual tasks. Experimental Brain Research, 69, 225–244.
Kim, J. N., & Shadlen, M. N. (1999). Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nature Neuroscience, 2, 176–185.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Lee, D. K., Itti, L., Koch, C., & Braun, J. (1999). Attention activates winner-take-all competition among visual filters. Nature Neuroscience, 2(4), 375–381.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18(3), 1161–1170.
Lehky, S. R., & Sejnowski, T. J. (1988). Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature, 333(6172), 452–454.
Lehky, S. R., & Sejnowski, T. J. (1999). Seeing white: Qualia in the context of decoding population codes. Neural Computation, 11, 1261–1280.
Lehmann, E. L. (1983). Theory of point estimation. London: Chapman & Hall.
Luck, S. J., Chelazzi, L., Hillyard, S. A., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2 and V4 of macaque visual cortex. Journal of Neurophysiology, 77, 24–42.
McAdams, C. J., & Maunsell, J. H. R. (1999a). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. Journal of Neuroscience, 19(1), 431–441.
McAdams, C. J., & Maunsell, J. H. R. (1999b). Effects of attention on the reliability of individual neurons in monkey visual cortex. Neuron, 23, 765–773.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. Journal of Neurophysiology, 70, 909–919.
Nakahara, H., & Amari, S. (submitted). Attention modulation of neural tuning through peak and base rate in accelerated firing.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11), 4700–4719.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends in Neuroscience, 21(6), 259–265.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Posner, M. I., & Petersen, S. E. (1990). The attention system of the human brain. Annual Review of Neuroscience, 13, 25–42.
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. E. (1999). Narrow versus wide tuning curves: What's best for a population code? Neural Computation, 11, 85–90.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19, 1736–1753.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Sanger, T. D. (1998). Probability density methods for smooth function approximation and learning in populations of tuned spiking neurons. Neural Computation, 10(6), 1567–1586.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Shadlen, M. N., Britten, K. H., Newsome, W. T., & Movshon, J. A. (1996). A computational analysis of the relationship between neuronal and behavioral responses to visual motion. Journal of Neuroscience, 16, 1486–1510.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Spitzer, H., Desimone, R., & Moran, J. (1988). Increased attention enhances both behavioral and neuronal performance. Science, 240, 338–340.
Thompson, K. G., Hanes, D. P., Bichot, N. P., & Schall, J. D. (1996). Perceptual and motor processing stages identified in the activity of macaque frontal eye field neurons during visual search. Journal of Neurophysiology, 76(6), 4040–4055.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research, 23(8), 775–785.
Treue, S., & Martínez-Trujillo, J. C. (1999). Feature-based attention influences motion processing gain in macaque visual cortex. Nature, 399, 575–578.
Treue, S., & Maunsell, J. H. R. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
Treue, S., & Maunsell, J. H. R. (1999). Effects of attention on the processing of motion in macaque middle temporal and medial superior temporal visual cortical areas. Journal of Neuroscience, 19(17), 7591–7602.
Vogels, R., & Orban, G. A. (1990). How well do response changes of striate neurons signal differences in orientation: A study in the discriminating monkey. Journal of Neuroscience, 10(11), 3543–3558.
Wu, S., Nakahara, H., Murata, N., & Amari, S. (2000). Population decoding based on an unfaithful model. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10, 403–430.
Zhang, K., & Sejnowski, T. J. (1999). Neural tuning: To sharpen or broaden? Neural Computation, 11, 75–84.
Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66(3), 265–272.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.
Received September 8, 1999; accepted October 31, 2000.
LETTER
Communicated by Steven Nowlan
Democratic Integration: Self-Organized Integration of Adaptive Cues

Jochen Triesch
Department of Computer Science, University of Rochester, Rochester, NY 14627, U.S.A.

Christoph von der Malsburg
Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany, and Laboratory for Computational and Biological Vision, University of Southern California, Los Angeles, CA, U.S.A.

Sensory integration or sensor fusion—the integration of information from different modalities, cues, or sensors—is among the most fundamental problems of perception in biological and artificial systems. We propose a new architecture for adaptively integrating different cues in a self-organized manner. In Democratic Integration, different cues agree on a result, and each cue adapts toward the result agreed on. In particular, discordant cues are quickly suppressed and recalibrated, while cues that have been consistent with the result in the recent past are given a higher weight in the future. The architecture is tested in a face tracking scenario. Experiments show its robustness with respect to sudden changes in the environment as long as the changes disrupt only a minority of cues at the same time, although all cues may be disrupted at one time or another.

1 Introduction
Many perceptual tasks can be solved robustly and precisely only by integrating or fusing different sources of information. Correspondingly, the integration of different sensors, cues, or modalities has been studied extensively in the biological and engineering literature (Murphy, 1996; Luo & Kay, 1989; Hall & Llinas, 1997).¹ For agents (autonomous robots or biological organisms) operating in complex environments, one of the foremost problems of perception is that particular cues work only in particular situations. The agent has to adapt constantly to changing environmental conditions (lighting, background,

¹ Unfortunately, researchers in different communities use different terms for this general problem. Sensory integration and cue integration are used in the biological and cognitive literature, and sensor fusion, data fusion, and information fusion are favored in the engineering literature. We use the term sensory integration throughout this article.

Neural Computation 13, 2049–2074 (2001) © 2001 Massachusetts Institute of Technology
noise, and so on), which affect different cues in a different manner. A cue that is usually reliable for computing a particular piece of information may become useless or even harmful in certain situations and is better suppressed. Studies with humans suggest that suppression and recalibration of discordant sensory modalities do occur (Bower, 1974; Murphy, 1996); when there is disagreement among cues, the brain reacts by adapting its integration strategy. Importantly, no external teacher tells the brain which cues have to be suppressed or recalibrated. The adaptation works in a self-organized manner. This phenomenon has not been explained satisfactorily yet. In the engineering literature, little work has been done on self-organized adaptive sensory integration either, although the importance of adaptation has recently been stressed by several authors (Murphy, 1996; Dasarathy, 1997; Kam, Zhu, & Kalata, 1997). Recently we have proposed the principle of Democratic Integration for adaptively integrating several cues in a self-organized manner (Triesch, 1999, 2000; Triesch & von der Malsburg, 2000). In a nutshell, the idea is that the cues agree on a result, and each cue adapts toward the result agreed on. A system with this property will try to maintain maximal agreement among its different cues. The usefulness of this scheme rests on two underlying assumptions. First, concordances among the different cues must prevail in the environment. If there are no statistical dependencies among the cues, there is no point in trying to integrate them. Second, the environment must exhibit a certain temporal continuity. Without temporal continuity, any adaptation is useless. In this article, we test the idea of Democratic Integration in a face tracking application, where a single stationary camera is used to detect and track people in a room. Face tracking technology is important for many multimedia and security applications. 
Sections 2 and 3 describe our architecture and the cues used for face tracking. Sections 4 and 5 present experiments with different versions of the system in a broad range of situations involving sudden lighting changes. Finally, section 6 discusses our results from a broader perspective. 2 Method
Our system for detecting and tracking faces integrates five different visual cues (see Figure 1). The cues operate on the basis of two-dimensional saliency maps. This shared data format makes integration of different cues particularly simple. However, we expect the principle of Democratic Integration also to be useful in cases where the cues operate on fundamentally different representations (e.g., visual and auditory). In a more neural style of formulation, the saliency maps would correspond to activity distributions of neurons in visual cortical areas. Consider the five cues from Figure 1. Each cue (indexed by $i = 1, \ldots, N$) computes a two-dimensional saliency map $A_i(x, t)$ expressing how similar
Figure 1: The architecture for face tracking adaptively integrates five cues: motion detection, color, position prediction, shape, and contrast. The result, a weighted average of saliency maps derived by the different cues, is fed back to the cues and serves as the basis for two types of adaptation. First, the weights of the cues are adapted according to their agreement with the overall result. Second, cues are allowed to adapt their internal parameters in order to have their output better match the overall result.
the image region around pixel $x$ is to the tracked face. We assume $0 \le A_i(x, t) \le 1$. To this end, each cue may compare a prototype $P_i$, which describes the appearance of the face with respect to this cue, to the current image. The $P_i$ shall be fixed for the moment. We denote the current image by $I(x, t)$; subregions of size $d$ centered at $x$ are denoted by $I_d(x, t)$:

$A_i(x, t) = S_i(P_i, I_d(x, t))$.  (2.1)
The functions $S_i$ measure the similarity of an image region $I_d(x, t)$ centered at point $x$ to the prototype $P_i$ of the cue. For the combination of different cues, we introduce their reliabilities $r_i(t)$, with $\sum_i r_i(t) = 1$. The use of the term reliability will be justified later. The saliency maps of the cues are combined to the result $R(x, t)$ by computing a weighted sum with the reliabilities acting as weights:

$R(x, t) = \sum_{i=1}^{N} r_i(t) \, A_i(x, t)$.  (2.2)
The estimated target position $\hat{x}(t)$ is the point yielding the highest response in the result:

$\hat{x}(t) = \arg\max_x \{ R(x, t) \}$.  (2.3)
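Equations 2.2 and 2.3 together form the read-out stage of the architecture. A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def combine_and_locate(saliency_maps, reliabilities):
    """Weighted combination of cue saliency maps (equation 2.2) and
    argmax localization of the target (equation 2.3)."""
    # R(x, t) = sum_i r_i(t) * A_i(x, t)
    result = sum(r * a for r, a in zip(reliabilities, saliency_maps))
    # x_hat(t) = argmax_x R(x, t), returned as (row, col) indices
    x_hat = np.unravel_index(np.argmax(result), result.shape)
    return result, x_hat
```

The argmax over a 2-D map is resolved via `np.unravel_index`, so the estimate comes back as pixel coordinates.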
Now a quality $\tilde{q}_i(t)$ is defined for each cue, which measures how successfully the cue predicts the result, or how much it agrees with it. We require $0 \le \tilde{q}_i(t) \le 1$, with values close to zero indicating little agreement and values close to one indicating good agreement. There are many ways that such
a quality measure can be defined, and we compare several measures in section 5. To give an example, one approach that we experimented with compares a cue's saliency at $\hat{x}$ with its average saliency:

$\tilde{q}_i(t) = R\bigl( A_i(\hat{x}(t), t) - \langle A_i(x, t) \rangle \bigr)$,  (2.4)

where $\langle \cdots \rangle$ denotes an average over all positions $x$, and $R$ is the ramp function:

$R(x) = \begin{cases} 0 & : x \le 0 \\ x & : x > 0 \end{cases}$  (2.5)

In words, the response of the cue at the estimated position of the face is compared to its response averaged over the whole image. If the response at the estimated position is above average, then the quality is given by the distance to that average; otherwise the quality is zero. Normalized qualities $q_i(t)$ are defined by

$q_i(t) = \dfrac{\tilde{q}_i(t)}{\sum_{j=1}^{N} \tilde{q}_j(t)}$.  (2.6)
This definition ensures that the sum of the normalized qualities of all cues equals one:

$\sum_{i=1}^{N} q_i(t) = 1$.  (2.7)
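Equations 2.4 through 2.6 can be sketched as follows. The uniform fallback for the case where all raw qualities are zero is our assumption; the paper does not specify this case:

```python
import numpy as np

def normalized_qualities(saliency_maps, x_hat):
    """Raw qualities via the ramp of (saliency at target minus mean saliency),
    equations 2.4-2.5, then normalization to unit sum, equation 2.6."""
    q_tilde = np.array([max(0.0, a[x_hat] - a.mean()) for a in saliency_maps])
    total = q_tilde.sum()
    if total == 0.0:
        # assumption: fall back to uniform qualities when no cue responds
        return np.full(len(saliency_maps), 1.0 / len(saliency_maps))
    return q_tilde / total
```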
The system is made adaptive by defining dynamics for the reliabilities, which couple them to the normalized qualities:

$\tau \dot{r}_i(t) = q_i(t) - r_i(t)$.  (2.8)
The $r_i(t)$ compute a running average of the $q_i(t)$. They express how reliably a cue could predict the overall result in the recent past. Equation 2.8 is often referred to as a leaky integrator. Due to the normalization of the $q_i(t)$, the $r_i(t)$ are also normalized. The sum of the reliabilities converges to one, as can be easily seen by summing equation 2.8 over $i$ and using equation 2.7:

$\tau \sum_i \dot{r}_i(t) = \tau \dfrac{d}{dt} \Bigl( \sum_i r_i(t) \Bigr) = \sum_i q_i(t) - \sum_i r_i(t) = 1 - \sum_i r_i(t)$.  (2.9)
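In discrete time (the paper later uses the Euler method with a step of 1/8 second), the reliability dynamics of equation 2.8 reduce to a one-line update. The normalization argument of equation 2.9 is visible in the fact that the update preserves the sum of the reliabilities whenever the qualities sum to one:

```python
import numpy as np

def update_reliabilities(r, q, tau=0.5, dt=0.125):
    """One Euler step of tau * dr_i/dt = q_i - r_i (equation 2.8).
    Parameter values are the ones used later in the paper."""
    r = np.asarray(r, dtype=float)
    q = np.asarray(q, dtype=float)
    return r + (dt / tau) * (q - r)
```

A cue whose normalized quality stays at zero is driven exponentially toward zero reliability, that is, it is suppressed.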
If the reliabilities are initialized such that their sum is one, the sum will remain one throughout. Thus, the reliabilities can indeed be regarded as weights. A cue with a normalized quality higher than its current reliability will tend to increase its reliability, and a cue with a normalized quality lower than its current reliability will have its reliability lowered. In particular, a cue whose quality remains zero will end up with a reliability of zero; it is completely suppressed. The time constant $\tau$ is considered to be large enough so that the dynamics of the $r_i(t)$ are not spoiled by high-frequency noise and small enough to allow for quick adaptation to new situations. The $\tilde{q}_i(t)$, and hence the $q_i(t)$, will depend on the shapes of the results $A_i(x, t)$, which are defined by the particular problem at hand, and on the position $\hat{x}$ of the maximum response in $R(x, t)$. As $R(x, t)$ is a superposition of the results $A_i(x, t)$ with the weights $r_i(t)$, the qualities $\tilde{q}_i(t)$ are indirectly influenced by the reliabilities $r_i(t)$. This indirect or hidden influence is a function of the particular problem and usually unknown:

$\tilde{q}_i(t) = \tilde{q}_i(r_1(t), \ldots, r_N(t))$.  (2.10)
In the appendix we discuss how this unknown relation can give rise to different system characteristics. In particular, we ask under what circumstances a single cue can take over the whole system.

2.1 Relation to Nonadaptive Case. The use of weighted averages (see equation 2.2) is very common when integrating a number of cues. In the classical nonadaptive case, one considers $N$ cues that estimate a scalar quantity $u$. Often one assumes that the estimates of the cues are corrupted by additive, zero-mean gaussian noise with standard deviation $\sigma_i$:

$x_i = u + \nu_i, \quad \nu_i \sim N(0, \sigma_i)$.  (2.11)
Furthermore, one often assumes that the noise in different cues is independent. The task usually is to compute an estimate $\hat{x}$ of $u$ that is optimal with respect to a certain criterion. We assume a weighted average:

$\hat{x} = \sum_i w_i x_i, \quad \sum_i w_i = 1$.  (2.12)
The problem then is to find the set of optimal weights $w_i$. One criterion is to minimize the expected squared estimation error $E[(\hat{x} - u)^2]$. Another criterion is maximizing the probability of $\hat{x}$ equaling $u$ under the assumption that a priori all values of $u$ are equally likely. Both approaches lead to

$w_i = \dfrac{1 / \sigma_i^2}{\sum_j 1 / \sigma_j^2}$.  (2.13)
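Equation 2.13 is the familiar inverse-variance weighting; a small sketch (function name is ours):

```python
import numpy as np

def inverse_variance_weights(sigmas):
    """Static optimal weights w_i = (1/sigma_i^2) / sum_j (1/sigma_j^2)
    of equation 2.13."""
    inv_var = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    return inv_var / inv_var.sum()
```

For example, a cue whose noise standard deviation is twice as large receives one quarter of the other cue's weight.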
The weights are chosen inversely proportional to the variances of the cues' noise. This is identical to equation 2.6 if we identify the qualities $\tilde{q}_i$ with the inverse variances $1/\sigma_i^2$. Note that the classical approach requires the $\sigma_i$ to be known in advance and does not address systematic errors of the sensors. In our approach, the weights are subject to dynamics that push them toward a quality value that is proportional to their current agreement or compatibility with the agreed-on estimate. Unlike the static case, which can handle only a priori variances, our method can suppress cues that disagree with the majority due to intermittent high noise or due to systematic errors caused by sensor failure. Below, we describe how to allow for recalibration of cues by adapting internal parameters of the cues.

2.2 Adaptation of Cue Parameters. The reliability dynamics, equation 2.8, allow the system to suppress a discordant cue. Now we introduce additional dynamics for the parameters $P_i(t)$ of the cues, which in the previous discussion were assumed to be fixed. The cues shall change their parameters in such a way that their saliency maps match the result $R(x, t)$ better. This leads to recalibration of discordant cues. Consider a function $f_i(I_d(x, t))$ extracting a feature vector $P_i(x, t)$ from the image region around $x$:
$P_i(x, t) = f_i(I_d(x, t))$.  (2.14)
The feature vector extracted at the current target position is defined by

$\hat{P}_i(t) = P_i(\hat{x}, t)$.  (2.15)
The prototype dynamics adapt a cue's prototype toward the target's current appearance:

$\tau_i \dot{P}_i(t) = \hat{P}_i(t) - P_i(t)$.  (2.16)
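Discretized with an Euler step, equation 2.16 becomes the following update; for a stationary target appearance it converges exponentially, as the paper states. This is a sketch with a scalar prototype standing in for the feature vector; the time constants are the values used later in the paper:

```python
def adapt_prototype(P, P_hat, tau_i=0.5, dt=0.125):
    """One Euler step of tau_i * dP_i/dt = P_hat_i - P_i (equation 2.16)."""
    return P + (dt / tau_i) * (P_hat - P)

# For a stationary appearance P_hat, P converges toward P_hat with
# time constant tau_i:
P = 0.0
for _ in range(16):  # 2 seconds of simulated time at dt = 0.125 s
    P = adapt_prototype(P, 1.0)
```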
Note that these dynamics are formally equivalent to the reliability dynamics, equation 2.8. The time constants $\tau_i$, however, are not to be mistaken for the time constant $\tau$ describing the adaptation of the reliabilities. The prototype moves toward the feature vector $\hat{P}_i(t)$ extracted at the estimated target position in the current image. If the scene is stationary, that is, $\hat{P}_i(t)$ is constant, then $P_i(t)$ will converge to $\hat{P}_i(t)$ with the time constant $\tau_i$. The $\tau_i$ thus determine how quickly the prototype of a cue adapts to changes in the object appearance with respect to this cue. In general, a tracked object may change its appearance quickly with respect to one cue and slowly with respect to another. If these circumstances are known, the $\tau_i$ can be chosen accordingly. The saliency map of a cue is computed by comparing the current prototype to all positions in the current image. To this end, we assume there are similarity functions $S_i$ for comparing a cue's prototype to an image region $I_d(x, t)$ with the property

$0 \le S_i(P_i, I_d(x, t)) \le 1$.  (2.17)
Now, the computation of a cue's saliency map can be rewritten as

$A_i(x, t) = S_i(P_i(t), I_d(x, t))$.  (2.18)
In special cases, this will take the form

$A_i(x, t) = S_i(P_i(t), P_i(x, t))$,  (2.19)
that is, the prototype is compared to a homologous feature vector extracted at each image position.

3 Description of Cues
Our system combines five different cues extracted from the camera images for tracking faces. Typical results of the cues are depicted in Figure 2. We deliberately chose to employ very simple cues, as our emphasis is on our method of integrating them in a democratic manner. The red-green-blue (RGB) color images are transformed to HSI (hue, saturation, intensity) color space (Gonzales & Woods, 1992). We denote the full color image by $I(x, t)$, while the HSI components are given by $h(x, t)$, $s(x, t)$, and $i(x, t)$, respectively.

3.1 Motion Detection Cue. This cue tries to detect moving objects by looking for temporal changes in the image's intensity component. It is the only cue that is not adaptive. It first computes the thresholded difference of the intensity components $i(x, t)$ of subsequent images,

$M(x, t) = H(|i(x, t) - i(x, t-1)| - C)$,  (3.1)
where $C = 5$ is a fixed threshold corresponding to the camera noise, and $H$ is the Heaviside function:

$H(s) = \begin{cases} 0 & : s \le 0 \\ 1 & : s > 0 \end{cases}$  (3.2)

The thresholded difference image is convolved with a $6 \times 6$-pixel normalized binomial filter $B_6$ in order to smooth the result and detect larger blobs of motion:

$A_{\mathrm{motion}}(x, t) = B_6(M(x, t))$.  (3.3)
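A sketch of equations 3.1 through 3.3 in NumPy. The separable implementation of the binomial filter and the boundary handling are our assumptions; the paper only specifies a 6x6 normalized binomial kernel:

```python
import numpy as np

def binomial_1d(n):
    """Normalized 1-D binomial kernel of length n (row of Pascal's triangle)."""
    k = np.array([1.0])
    for _ in range(n - 1):
        k = np.convolve(k, [1.0, 1.0])
    return k / k.sum()

def motion_cue(i_t, i_prev, threshold=5.0, size=6):
    """Thresholded intensity difference (equations 3.1-3.2), smoothed with
    a separable binomial filter (equation 3.3)."""
    m = (np.abs(i_t - i_prev) > threshold).astype(float)  # H(|di| - C)
    k = binomial_1d(size)
    m = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, m)
    m = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, m)
    return m
```

A static scene yields an all-zero map, so this cue contributes nothing when neither the target nor the background moves.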
3.2 Color Cue. The color cue is computed by comparing the color of each pixel in the current image to a prototype color $h_0(t), s_0(t), i_0(t)$ describing the color of the tracked face. If a pixel's color does not deviate from the prototype too much, the result is one; otherwise, it is zero:

$C(x, t) = \begin{cases} 1 & : |h(x, t) - h_0(t)| < \sigma_h, \; |s(x, t) - s_0(t)| < \sigma_s, \; |i(x, t) - i_0(t)| < \sigma_i \\ 0 & : \text{otherwise} \end{cases}$  (3.4)
Figure 2: Overview of the cues. (Top row) Original image with circle marking the estimated face position. The large and small white squares inside the circle define the image area where we extract the information for updating the prototypes of the shape and color cue, respectively. The $6 \times 6$-pixel shape pattern is the current prototype used by the shape cue. (Center and bottom rows) Cues' saliency maps $A_i(x)$ and the averaged result $R(x)$. At the depicted moment, the weights of the cues were all between 0.1 and 0.3. The circle indicates the position of the global maximum $\hat{x}$.
The allowed deviations $\sigma_h, \sigma_s, \sigma_i$ have been predefined. This cue's saliency map is obtained by smoothing $C(x, t)$ with a $6 \times 6$-pixel binomial filter:

$A_{\mathrm{color}}(x, t) = B_6(C(x, t))$.  (3.5)
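The binary map of equation 3.4 can be sketched as follows; the subsequent binomial smoothing of equation 3.5 is omitted here. Note that a full implementation would treat the hue difference circularly; the paper's equation is reproduced literally:

```python
import numpy as np

def color_match(h, s, i, proto, dev):
    """Binary color map C(x, t) of equation 3.4: one where every HSI
    component lies within its allowed deviation of the prototype color,
    zero elsewhere. proto = (h0, s0, i0), dev = (sigma_h, sigma_s, sigma_i)."""
    h0, s0, i0 = proto
    sh, ss, si = dev
    match = ((np.abs(h - h0) < sh) &
             (np.abs(s - s0) < ss) &
             (np.abs(i - i0) < si))
    return match.astype(float)
```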
The prototype color is adapted toward a color average taken from a $3 \times 3$-pixel region around the estimated position of the target $\hat{x}$. The time constant is $\tau_{\mathrm{color}} = 0.5$ second. If the standard deviation of the colors in the $3 \times 3$-pixel region exceeds a certain threshold, indicating that there is no homogeneous color around the estimated target position, the prototype is not adapted.
3.3 Position Prediction Cue. The position prediction produces an estimate of the current position of the target based solely on its positions in the preceding two frames:

$\hat{X}(t) = \hat{x}(t-1) + \bigl( \hat{x}(t-1) - \hat{x}(t-2) \bigr)$.  (3.6)

Its output $A_{\mathrm{position}}(x, t)$ is given by

$A_{\mathrm{position}}(x, t) = \exp \Bigl( -\dfrac{(x - \hat{X}(t))^2}{2\sigma^2} \Bigr)$,  (3.7)
(3.7)
with s D 5. This cue’s adaptation lies in the linear prediction of the head’s position rather than in the adaptation of a prototype as for the other cues. 3.4 Shape Cue. The shape cue computes the correlation of a 6 £ 6–pixel gray-level template of the target to same-size patches id ( x , t) of the image’s intensity component. High correlations indicate a high likelihood of the target’s being at this particular position,
Ashape ( x , t ) D Sshape ( Pshape ( t) , id ( x, t) ) D
¢ 1 ¡ C Pshape ( t) , id ( x , t) C 1, 2
(3.8)
where $C$ is the correlation operator:

$C(P, i_d(x)) = \dfrac{\sum_{x'} \bigl( i(x + x') - \bar{i}_d(x) \bigr) \bigl( P(x') - \bar{P} \bigr)}{\Delta i_d(x) \, \Delta P}$.  (3.9)
$\bar{i}_d(x)$ and $\bar{P}$ are the averages of the gray-level values of a $6 \times 6$-pixel image region around $x$ and the average gray level of the $6 \times 6$-pixel shape pattern $P$, respectively. $\Delta i_d(x)$ and $\Delta P$ are the corresponding standard deviations. The sum runs over a $6 \times 6$-pixel region. The shape template $P_{\mathrm{shape}}(t)$ is adapted to the gray-level distribution at the estimated target position according to equation 2.16 with the time constant $\tau_{\mathrm{shape}} = 0.5$ second.

3.5 Contrast Cue. The contrast cue compares the contrast of patches of the image's intensity component centered at position $x$ to a contrast prototype $P_{\mathrm{contrast}}(t)$, where contrast is defined as the standard deviation of the gray-level values of the pixel distribution of a $6 \times 6$-pixel image region:
$A_{\mathrm{contrast}}(x, t) = |P_{\mathrm{contrast}}(t) - \Delta i_d(x, t)|$.  (3.10)
This formulation has the advantage that the contrast cue is computed as a by-product of the shape cue. The prototypical contrast of the target is also adapted to the image contrast at the current position estimate, $\Delta i_d(\hat{x}, t)$, according to equation 2.16 with the time constant $\tau_{\mathrm{contrast}} = 0.5$ second.
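Equations 3.8 through 3.10 amount to a normalized cross-correlation and a contrast comparison per patch. A sketch for a single patch (the loop over all patch positions is omitted; function names are ours):

```python
import numpy as np

def shape_similarity(template, patch):
    """S_shape of equation 3.8: normalized cross-correlation (equation 3.9)
    mapped from [-1, 1] to [0, 1]."""
    t = template - template.mean()
    p = patch - patch.mean()
    denom = template.size * template.std() * patch.std()
    corr = (t * p).sum() / denom  # Pearson correlation in [-1, 1]
    return 0.5 * (corr + 1.0)

def contrast_cue_value(proto_contrast, patch):
    """A_contrast of equation 3.10: absolute difference between the
    prototype contrast and the patch's gray-level standard deviation."""
    return abs(proto_contrast - patch.std())
```

An identical template and patch give a similarity of one; a perfectly anti-correlated patch gives zero.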
4 Experiments

4.1 Detection and Tracking. The system's dynamic equations from section 2 were put in discrete form with the Euler method ($\Delta t = 1/8$ sec), giving a set of difference equations suited for computer implementation. The system is initialized with the reliabilities of the color and motion detection cues set to 0.5 and all other reliabilities set to zero. This choice reflects the limited a priori knowledge about the target. The person's face is assumed to be a moving, skin-colored object, but the place of its appearance, its shape, and its contrast are unknown. Correspondingly, the prototypes for the shape and contrast cues are set to canonical values. A detection threshold $T$ was defined for deciding whether a person is in the scene. If the maximum result $R(\hat{x})$ is greater than $T$, then a person is assumed to be present. After some experiments, the threshold was set to $T = 0.7$. In the beginning, the system is looking for a moving skin-colored object of sufficient size. Sometimes a subject's hand enters the scene before his face and is detected. When the face enters the scene shortly after the hand, however, the tracking result always jumps to the face, which gives a much stronger response due to its size. When a person is detected, the reliabilities $r_i(t)$ and the prototypes $P_i(t)$ are adapted to the current qualities and the current image as described in the previous sections. The time constant for the reliability adaptation, $\tau$, was set to 0.5 second. If the result $R(\hat{x})$ is smaller than the detection threshold, the reliabilities and prototypes are adapted back toward the default values with a slower time constant of 5 seconds. This ensures that after a person has left the scene, the system will return to its unbiased default expectations after a while.
Thus, the system can rather quickly learn about and adapt to new situations (a new person, a new lighting situation) but forgets about them somewhat more slowly, which is useful if the face is temporarily lost.

4.2 Image Sequences. We recorded 84 image sequences of people crossing a room. The original resolution of 768 × 568 pixels was downsampled with a rectangular filter of 8 × 8 pixels, yielding images 96 × 71 pixels in size. The frame rate was 8 Hz, and sequences consisted of 100 frames, resulting in 12.5 seconds of recording time. The sequences are evenly distributed among the six classes summarized in Table 1; that is, there were 14 sequences belonging to each of the following classes:
Normal: A person crosses the room starting from either side with little to normal speed. The person moves in front of different backgrounds; in some places the background is predominantly light, and in others it is predominantly dark.

Lighting: Same as Normal, but when the person reaches the middle of the room, the lighting is abruptly changed to green. For this purpose,
Democratic Integration
Table 1: Six Classes of Image Sequences.

Class        Description
Normal       Person crosses room from one side to the other
Lighting     As Normal, but lighting abruptly changed to green
Turning      Person goes up and down in the room, turning 1 to 3 times
Occlusion    As Normal, but person occluded by other person
Light&Turn.  As Turning, but lighting changed as in Lighting
Light&Occl.  As Occlusion, but lighting changed at moment of occlusion
Figure 3: Example of a sequence of the Lighting condition. The images summarize a 5 second stretch from the middle of the sequence. Individual images are 1.25 seconds apart (10 frames). The circles mark the estimated face positions.
a transparent green piece of plastic was held in front of a 500 W lamp illuminating the scene.

Turning: The person goes up and down in the room, turning one to three times. Although people were asked to turn both facing the camera and facing the wall, the first was observed more often.

Occlusion: Same as Normal, but shortly after the first person enters the room, a second person enters from the opposite side. The people meet in the middle of the room, shake hands, and then continue on their course. The tracked person is occluded by the second person while passing.

Light&Turn. (Lighting&Turning): Same as Turning, but in the middle of the sequence, the illumination is abruptly changed as in the Lighting condition.

Light&Occl. (Lighting&Occlusion): Same as Occlusion, but at the moment of occlusion, the illumination is changed as in the Lighting condition.

4.3 An Example. An example of successful tracking is given in Figures 3 and 4. (More examples can be viewed online at http://www.cs.rochester.edu/u/triesch/trackingDemos/overview.html.) Figure 3 shows a 5-second stretch from the middle of a sequence of the Lighting class. Two significant changes occur during this stretch. Between the second and third image in Figure 3, the lighting is changed as described above, and between the fourth and fifth image, the person steps in front of a different background. Figure 4
[Figure 4 appears here: six panels labeled Motion Detection, Position Prediction, Color, Shape, Contrast, and Result, each plotting quality and reliability values in [0, 1] against time over the 100-frame sequence.]
Figure 4: Graphs showing the cues' qualities (solid line) and reliabilities (dash-dotted line) as a function of time for the sequence of Figure 3. The bar on the ordinate marks the 5-second stretch displayed in Figure 3. The graph labeled "Result" shows the result R(x̂) (solid line) and the binary detection signal (dotted line) as a function of time. The dashed line marks the detection threshold.
shows graphs of the cues' qualities q̃_i (solid) and reliabilities r_i (dash-dotted) as a function of time. The black bar on the ordinate marks the part of the sequence corresponding to Figure 3. In the beginning, the system relies on the motion detection and color cues only. The reliabilities of the other cues are zero. As soon as the person is detected, the reliability dynamics set in, and responsibility starts being shared among all cues. The lighting change and the background change lead to temporarily diminished qualities for some of the cues, whose reliabilities are decreased in response. As both changes affect only a minority of cues, tracking remains stable. If both changes had occurred at the same time, they could have posed a serious threat to the system. The graph labeled "Result" in Figure 4 shows R(x̂) (solid) and the binary detection signal (dotted) as a function of time. The dashed horizontal line marks the detection threshold. Note that the fluctuations in R(x̂) are much smaller than those seen in the qualities of the individual cues in Figure 4.
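The integration and detection scheme referred to here (from section 2) amounts to a reliability-weighted sum of the cue saliency maps followed by a maximum search; a minimal one-dimensional sketch with made-up numbers:

```python
# Democratic integration of cue saliency maps: the total result is the
# reliability-weighted sum R(x) = sum_i r_i * A_i(x); the position
# estimate x_hat is the argmax, and detection requires R(x_hat) > T.

T_DETECT = 0.7

def integrate(saliency_maps, reliabilities):
    """Combine per-cue saliency maps (lists over positions) into R(x)."""
    n_pos = len(saliency_maps[0])
    return [sum(r * A[x] for r, A in zip(reliabilities, saliency_maps))
            for x in range(n_pos)]

def estimate(saliency_maps, reliabilities):
    """Return the agreed position, its result value, and detection flag."""
    R = integrate(saliency_maps, reliabilities)
    x_hat = max(range(len(R)), key=R.__getitem__)
    return x_hat, R[x_hat], R[x_hat] > T_DETECT

# Two toy cues over five positions; reliabilities sum to one.
A1 = [0.1, 0.2, 0.9, 0.3, 0.1]
A2 = [0.2, 0.1, 0.8, 0.9, 0.0]
x_hat, value, detected = estimate([A1, A2], [0.6, 0.4])
```

Here both cues peak near position 2, so the weighted result exceeds the threshold there and the target counts as detected.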
Table 2: Tracking Results.

Class        Total  Adaptation  No Adaptation
Normal        14    12 (86%)    10 (71%)
Lighting      14    13 (93%)     0 (0%)
Turning       14    10 (71%)     6 (43%)
Occlusion     14     4 (29%)     0 (0%)
Light&Turn.   14     8 (57%)     0 (0%)
Light&Occl.   14     2 (14%)     0 (0%)
Total         84    49 (58%)    16 (19%)
4.4 Results. The performance of the system was carefully analyzed for the 84 sequences. Two results for a sequence are distinguished:

1. Good tracking. The person's face is correctly tracked during the whole sequence. In fewer than five frames the face remains undetected or the position estimate is poorly placed.

2. Poor tracking. Continuous tracking errors occur. For more than five frames, the face is undetected; other body parts of the person, the other person's face, or parts of the background are mistaken for the face; or the person's exit from the scene is not detected.

The results according to this categorization are summarized in Table 2. It compares the system's performance to a nonadaptive version, in which the dynamics of the reliabilities and prototypes had been switched off. As a consequence, the nonadaptive system had to rely on the color and motion cues alone. The column labeled "Total" gives the number of sequences of a particular class. The column labeled "Adaptation" gives the absolute number and percentage of sequences with good tracking as a function of sequence class for the adaptive system. The column labeled "No Adaptation" shows the same data for the nonadaptive system. Not surprisingly, the adaptive version outperforms the nonadaptive version for all sequence classes. For the adaptive system, the results indicate that it can cope fairly well with sudden changes in the illumination or with the person turning around. On the other hand, occlusion by another person is extremely harmful. However, it turned out that tracking errors could have unexpected causes. For example, in the only "poor" sequence for the Lighting condition, the error was produced by a background change rather than the change of illumination. Therefore, a more careful analysis of the results was in order. For all erroneous sequences, an analysis was made of which changes in the sequence were involved in the error. Four changes were identified as contributing to errors: background changes, turns, lighting changes, and occlusions (see Table 3). Each change affects a particular
Table 3: Changes in the Sequences Identified as Error Sources.

Change      Cues Affected                   Number  Contribute  Responsible
Background  Shape, contrast                 ≈ 84     5 (5%)      2 (2%)
Lighting    Color                           ≈ 42    13 (31%)     0 (0%)
Turn        Motion, color, shape, contrast  ≈ 28    16 (57%)     6 (21%)
Occlusion   Motion, color, shape, contrast  ≈ 28    22 (79%)     7 (25%)

Notes: Each change type can influence a particular set of cues. The Number column estimates the number of sequences where a change type was present, based on the classes of the sequences. However, there are some sequences that do not contain a background change, and there are sequences where changes occur that would not be expected from the sequence class. For example, in some sequences, the person unexpectedly turns enough to hide her face. The Contribute column displays the number of erroneous sequences where the change contributed to the tracking error. The Responsible column displays the number of erroneous sequences where the change was the only cause for the error.
set of cues. For instance, a change in the lighting primarily disrupts the color cue. Table 3 summarizes how far the changes contribute to tracking errors. Altogether, 56 changes contributed to 34 tracking errors, which gives an average of 1.65 changes per error. (Note that a single change typically disrupts several cues.) Changes affecting only a small number of cues, such as background changes and lighting changes, produce only a few errors, while changes affecting many cues simultaneously produce many errors. Although turns and occlusions can possibly influence the same set of cues, occlusions were more harmful, since they introduce a competing target (the other person), which "attracts" the system. Differences between the two persons are usually seen only by the shape cue and the position prediction.

The most important parameters of the system are the time constants τ and τ_i for the adaptation of reliabilities and prototypes, on the one hand, and the detection threshold T, on the other hand. If the system adapts too fast, it has no memory and will be easily disturbed by high temporal frequency noise. If adaptation is too slow, the system has problems with harmless changes occurring in quick succession. If there are two subsequent changes in the scene, as in Figure 3, both of which would be tolerated by the system if occurring in isolation, their appearance in quick succession poses a serious threat to the system. If T is too high, relatively small changes in the scene can result in missed detections. If T is too low, the system will often continue tracking the background when the person leaves the room. Interestingly, a good choice of the parameters seems to depend on the types of changes occurring in the scene. By using a higher detection threshold and slower adaptation, performance increased for the Occlusion sequences but declined for the Turning sequences.
Table 4: Tracking Results for Different Cue Quality Measures.

Class        Total  No Adaptation  Uniform    Raw Saliency  Normalized  Distance to Average  Correlation
Normal        14    10 (71%)       14 (100%)  11 (79%)      11 (79%)    12 (86%)             12 (86%)
Lighting      14     0 (0%)        13 (93%)    5 (36%)       0 (0%)     13 (93%)             11 (79%)
Turning       14     6 (43%)       12 (86%)   12 (86%)       8 (57%)    10 (71%)             11 (79%)
Occlusion     14     0 (0%)         4 (29%)    7 (50%)       3 (21%)     4 (29%)              6 (43%)
Light&Turn.   14     0 (0%)         7 (50%)    3 (21%)       1 (7%)      8 (57%)              7 (50%)
Light&Occl.   14     0 (0%)         5 (36%)    4 (29%)       1 (7%)      2 (14%)              3 (21%)
Total         84    16 (19%)       55 (66%)   42 (50%)      24 (29%)    49 (58%)             50 (60%)

Note: The number of sequences with good tracking is given for different versions of the system employing different quality measures.
5 Alternative Quality Measures
The qualities q̃_i govern the adaptation of the reliabilities r_i. They are supposed to measure a cue's agreement with the overall result R(x, t). Generally, there are many ways of defining such a measure. In order to get a sense of the differences among several choices, we tested the system with different definitions of the q̃_i. All other parameters of the system were left unchanged. Table 4 summarizes the results. For the different quality definitions described below, the absolute number and percentage of sequences with good tracking are given. As a comparison, the results for the nonadaptive system are given in the column "No Adaptation."

5.1 Quality Definitions.
5.1.1 Uniform Qualities. The simplest definition of the cues' qualities is to assign equal quality to all cues:

q̃_i(t) = 1/N,  (5.1)
N being the number of cues. With this definition, the r_i will quickly converge to 1/N once a target has been detected. Thus, the system initially relies on the intensity change and color cues to detect the target, but it relies on all five cues equally in the following. With this quality measure, the system tracks the person properly in 55 of the 84 sequences (compare Table 4).

5.1.2 Raw Saliency.

q̃_i(t) = A_i(x̂(t), t).  (5.2)
A cue's quality is the saliency that it attributes to the image location agreed on. Using this quality measure, the system achieves good tracking on 42 of the 84 sequences (50%). Compared to the case of uniform qualities, the system is much worse for lighting changes and appears somewhat better for occlusions.

5.1.3 Normalized Saliency. With the previous method, a (rather dumb) cue that gave all image locations a high saliency would be assigned a high quality, although it could not differentiate among positions at all. Therefore, some normalization seems appropriate. Here, we normalize the saliency maps before computing the quality by requiring their elements to sum to one:

q̃_i(t) = A′_i(x̂(t), t),  (5.3)

with

A′_i(x, t) = A_i(x, t) / Σ_x A_i(x, t).  (5.4)
The sum runs over the entire image. Note that the normalized saliency maps may be regarded as pseudo-probability distributions. With this quality measure, however, the results are quite bad (although still much better than for the nonadaptive system). For only 24 of the sequences is tracking good (29%). The reason is that the shape and contrast range cues remain virtually unused by the system. These cues typically assign intermediate saliencies to many image locations (compare Figure 2), leading to a very small q̃_i(t) for these cues, since the correct position does not stand out. Our results demonstrate that not using these cues diminishes performance drastically.

5.1.4 Distance to Average Saliency. This is the approach described in section 2. It compares the saliency at x̂ with the average saliency of that cue:

q̃_i(t) = R( A_i(x̂(t)) − ⟨A_i(x, t)⟩ ).  (5.5)

This definition gives good tracking on 49 sequences (58%), an improvement over the raw saliency and normalized saliency measures but still inferior to the uniform quality approach. In contrast to the previous normalization technique, the shape and contrast range cues are assigned higher qualities, albeit usually still smaller ones than those of the color, intensity change, and motion continuity cues.

5.1.5 Correlation. Finally, as a measure of the agreement between an individual cue and the overall result, we compute the coefficient of correlation between a cue's saliency map A_i(x, t) and the overall result R(x, t) and rescale it linearly to the interval [0, 1]:

q̃_i(t) = ( C(A_i(x, t), R(x, t)) + 1 ) / 2,  (5.6)

where C is the operator computing the coefficient of correlation,

C(A_i(x, t), R(x, t)) = Σ_{x′} ( A_i(x′, t) − ⟨A_i(x, t)⟩ )( R(x′, t) − ⟨R(x, t)⟩ ) / ( D_{A_i(x,t)} D_{R(x,t)} ),  (5.7)

with D denoting the standard deviation over the image. The sum over x′ runs across the whole image. For this quality measure, tracking is good on 50 of the 84 sequences (60%). Note that this quality definition does not assign a special role to the estimated target position x̂. All image locations contribute equally. We also experimented with applying the ramp function to C(A_i(x, t), R(x, t)) to obtain a quality measure in [0, 1], but the results were not as good.

5.2 Comparison of the Quality Measures. A result that is common to all versions of the system is that performance tends to be much better for the set of Normal, Lighting, and Turning sequences than for the set of Occlusion, Light&Turn., and Light&Occl. sequences. This is not surprising, since the sequences of the latter set introduce difficult occlusions or combinations of the changes present in the first set. A second result is that performance for the nonadaptive system is worse than for any of the adaptive systems. The changes in the environment clearly require adaptation. Comparing the different adaptive approaches, there does not seem to be a clear winner, since no quality measure performed best for all sequence classes. Averaged over all classes, the different quality measures work about equally well, with the exception of the normalized saliency, which turned out to be inferior due to virtually not using two of the cues at all. Furthermore, there seems to be a trend favoring quality definitions that attribute similar qualities to all cues. In fact, the simple system assigning equal qualities to all cues performed best when results are averaged over all sequence classes. This is due to the many sudden changes in our data set. If all cues can be subject to sudden problems, it is best not to rely on any of them too much. It will be interesting to explore this issue with other data sets and in other application domains.

6 Discussion
Autonomous agents, biological or artificial, often rely on many cues (sensors, modalities) for extracting a particular piece of information from the environment. In complex environments, the usefulness of different cues depends on the perceptual task and the current situation and can be subject to
sudden changes. It is generally not feasible to foresee all situations in which an agent might find itself and to gather comprehensive statistics about how the agent should integrate its cues in particular situations for achieving a particular perceptual goal. We have presented an architecture for integrating adaptive cues in a self-organized manner. In Democratic Integration, all cues agree on a result, and each cue adapts toward the result agreed on. The system tries to reach maximal coherence among its cues. We have tested the idea in a face tracking system integrating five simple visual cues. Our experiments showed that the system can handle environmental changes as long as they affect only a minority of the currently used cues at the same time, although all cues may be disrupted at one time or another. The system partly relies on cues for which no a priori knowledge about the task was available. These cues were bootstrapped by the others and could then keep the system on the right track when the original cues failed. We compared a nonadaptive version of the system with adaptive versions employing different quality measures for estimating the agreement between a cue and the overall result. Every adaptive version outperformed the nonadaptive one. Among the adaptive versions, no quality measure showed superior performance for all sequence classes. However, there seemed to be a trend favoring quality measures that give the cues about equal weights. We suggest that this is due to the many sudden changes in our data set. When all cues are likely to fail suddenly, it is best not to trust any one of them too much.

6.1 Related Work. Many behavioral and neurophysiological studies (which cannot be reviewed here) have asked how the brain integrates different modalities or cues for a particular task. This research effort has shown that the answer usually depends on the species under consideration, the task to be solved, and the cues involved.
For example, in the domain of human depth perception, Fine and Jacobs (1999) recently summarized the state of the literature: "In summary, the current state of the literature suggests that the degree of interaction between cues may depend on the cues, the experimental conditions, and the task." Despite this frustrating plethora of observations, we feel that there must exist underlying principles with general applicability, creating order in the diverse findings reported. We believe that Democratic Integration is a good candidate for such an underlying principle of the integration of sensory information. The simple system presented here must be understood as merely a metaphor in this context. In previous work (Triesch, 1999, 2000), we have argued how the adaptation phenomena postulated by Democratic Integration might be neurally implemented by short-term synaptic plasticity (von der Malsburg, 1981) in the hierarchical cortical architecture. Synapses between feature detectors representing different cues in different modalities can quickly become associated or disassociated, giving rise to transient assemblies. In this framework, the formally equivalent dynamics of the reliabilities and the prototypes become
two sides of the same coin: a reversible Hebbian plasticity. The hypothesis of Democratic Integration should be tested in behavioral and neurophysiological experiments. Of particular importance will be studies requiring subjects to constantly re-adapt their cue integration strategy. Very recently, two groups have addressed how far haptic feedback can influence the integration of visual cues (Ernst, Banks, & Bülthoff, 2000; Atkins, Fiser, & Jacobs, 2001). In these studies, visual cues could be either consistent or inconsistent with information present in a haptic cue. Both groups found that subjects altered the weighting of visual cues such that visual cues consistent with the haptic cue were given a higher weight, while inconsistent cues were given a lower weight. These effects required a considerable amount of training, and it would be interesting to see whether similar effects can also be observed on a faster timescale.

Sensory integration plays an important role in many fields of technology, and diverse techniques have been employed for fusing different sensors. Among them are weighted averages, Kalman filtering, Bayesian approaches, statistical decision theory, Dempster-Shafer evidential reasoning, fuzzy logic, and production rules with confidence factors (Luo & Kay, 1989). More recently, neural networks (Sundareshan & Amoozegar, 1997; Tiean, Shaikh, Azimi-Sadjadi, Haar, & Reinke, 1999; Schnittger & Handmann, 1998) or dynamical systems (Vorbrüggen, 1995; Steinhage, Winkel, & Gorontzi, 1999) have been used. Also, there is a recent tendency to combine several approaches in one architecture. Despite the varied approaches to the problem, little work has explicitly dealt with the need for adaptation. Some exceptions will be discussed in the following.
In terms of adaptation, we are interested only in systems with self-organized or unsupervised adaptive capabilities, because biological and technical autonomous systems cannot afford to rely on an external teacher (Williams, Pachowicz, & Ronk, 1998) for adapting to quick changes in the environment. Also, it is usually not feasible to attempt to estimate a proper weighting for the different cues from the input scene directly, say in a mixture-of-experts architecture (Jacobs, Jordan, Nowlan, & Hinton, 1991). It seems too formidable a task to foresee all possible situations that an agent living in a complex environment might find itself in and to train a system for all of them in advance in a supervised fashion.

Murphy's (1996) SFX architecture stresses the importance of adapting the fusion strategies to the currently perceived object or current perceptual task, thus suggesting the use of different fusion strategies for perceiving different objects. The fusion strategies to employ for a particular task are constructed manually by the designer. Adaptations to environmental changes are handled by a separate "exception-handling mechanism" that observes the fusion process. It uses a set of preprogrammed "state failure conditions" to detect discordances between the sensors and decides whether to recalibrate or suppress offending sensors. In Democratic Integration, recalibration and suppression are an emergent property of the simple dynamics of the cues' reliabilities and prototypes, not requiring prespecified rules for the detection and resolution of discordances. Also, SFX does not attempt to predict a sensor's future reliability on the basis of its performance in the recent past.

Dasarathy (1997) considers a decision fusion architecture, while noting that his idea may be applicable to other levels of sensor fusion as well. He envisions (without implementing it) an architecture in which feedback of the fused decision result is used to fine-tune the decision boundaries of individual decision processes, leading to a "self-improvement" of the overall system. Democratic Integration goes beyond this scheme in the sense that it also allows for drastic recalibrations and even complete suppression of sensors that have been found to be discordant with the overall result in the recent past, thus assuming a certain temporal continuity in the qualities of the different sensors.

Kalman filters have been used for sensor fusion in a number of studies. Some of the problems with classical Kalman filters are that they consider only linear systems, handle only noise that is distributed according to a gaussian distribution, and assume a fixed noise model, which makes them ill suited for environments where the noise statistics of the sensors can vary significantly in different situations and systematic errors may occur. Correspondingly, many efforts have focused on creating adaptive Kalman filters, few of which, however, are related to sensor fusion. An exception is the work of Jetto, Longhi, and Venturini (1999), who propose an adaptive Kalman filtering method for fusing different sensors for mobile robot localization. In their approach, the input- and model-noise covariance matrices are adapted online in order to cope with changing noise characteristics. Their method relies on a nearly stationary assumption for input and model noise. In an alternating fashion, one of them is estimated while the other is assumed to remain constant.

De Sa (1994; de Sa & Ballard, 1998) considers a classification problem.
She introduces the notion of self-supervised learning of two classifiers, where output labels assigned by one classifier are used to train the second one, and vice versa. Her results show that this type of learning outperforms unsupervised learning of the individual classifiers and approaches the quality of supervised learning. The main difference from Democratic Integration is that in her architecture, no attempt is made to combine or integrate the two classifiers to arrive at an even better classifier; hence, another individual modality provides the learning information rather than an agreed-on result, as in Democratic Integration.

In its current guise, the system is not specific to the task of face tracking, its only a priori assumption being to look for a moving object of a particular color. It should be regarded as a first step toward a full-fledged system for detecting and tracking human faces. There are a number of other systems (Steffens, Elagin, & Neven, 1998; Feyrer & Zell, 1998; Boehme et al., 1998) that are more specific to the task of face tracking, since they introduce more a priori knowledge about the problem. Often the characteristic shape of the head-shoulder contour is used (Feyrer & Zell, 1998; Boehme et al., 1998). However, these systems lack an adaptive component. The system by McKenna, Gong, and Raja (1998) is adaptive, but it relies on a single color cue. It can adapt only to very slow, continuous changes. In contrast, the system presented here can also cope with sudden changes as long as they affect only a minority of cues.

6.2 Future Work. For future work, two avenues of research seem most promising. First, we would like to study Democratic Integration in more complex architectures composed of a hierarchy of modules. Second, we would like to consider adaptation on several timescales, endowing the system with the capability of storing information over longer timescales. In this way, naive cues without a priori knowledge about the problem could accumulate knowledge about the task, thereby removing the need to be bootstrapped from scratch by other cues every time.

Appendix: Democracy versus Despotism
The cues' qualities are influenced by their reliabilities, because the latter influence the result R(x, t):

q̃_i = q̃_i( r_1(t), . . . , r_N(t) ).  (A.1)
The relation between qualities and reliabilities is usually unknown. It is of interest to study under what circumstances the system can be "taken over" by a single cue a ∈ {1, . . . , N}. In that situation, we have r_a(t) = 1 and r_j(t) = 0 for all j ≠ a. We call this situation despotic. It is a fixed point of the dynamics if

ṙ_a = 0 and ṙ_i = 0 ∀ i ≠ a.  (A.2)
Inserting this into equation 2.8 gives

q_a = 1 and q_i = 0 ∀ i ≠ a.  (A.3)
Using equation 2.6 leads to

q̃_a > 0 and q̃_i = 0 ∀ i ≠ a.  (A.4)
Thus, a despotic solution is a fixed point of the dynamics only if the raw qualities q̃_i of all the suppressed cues vanish. In order to be able to say anything more, particular relations for equation A.1 must be assumed. We consider a linear relation:

q̃_i(t) = a_i + b_i r_i(t) with a_i ≥ 0, b_i ≥ 0 ∀ i.  (A.5)
Note that we can locally interpret this linear relation as the first terms of a Taylor series expansion in the vicinity of the despotic state. Depending on the coefficients a_i and b_i, several cases can be distinguished, which give rise to very different system behaviors.

Case a_i ≥ 0, b_i = 0. In this case, a cue's quality is independent of its reliability:

q̃_i(t) = a_i = const.  (A.6)
(A special case of this is the uniform quality measure from section 5.) Substituting into equation 2.6 gives

q_i = a_i / Σ_j a_j = const.  (A.7)

Inserting this into equation 2.8 yields

τ ṙ_i(t) = a_i / Σ_j a_j − r_i(t).  (A.8)

The solution is simple. The r_i(t) converge exponentially to their normalized qualities q_i = a_i / Σ_j a_j with the time constant τ:

r_i(t) = q_i + ( r_i(0) − q_i ) exp(−t/τ).  (A.9)

In particular, a despotic solution requires that all a_i vanish for i ≠ a.
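The exponential relaxation of equation A.9 is easy to verify numerically; a sketch with arbitrary coefficients a_i of our own choosing:

```python
# Case b_i = 0: with constant qualities q_i = a_i / sum_j a_j, the
# Euler-discretized dynamics tau * dr_i/dt = q_i - r_i relax each
# reliability exponentially toward its normalized quality (eq. A.9).
import math

TAU = 0.5
DT = 0.01

a = [1.0, 3.0, 6.0]                      # arbitrary coefficients a_i
q = [ai / sum(a) for ai in a]            # fixed points q_i
r0 = [1.0, 0.0, 0.0]                     # initial reliabilities

r = list(r0)
t = 0.0
for _ in range(500):                     # integrate for 5 s = 10 tau
    r = [ri + (DT / TAU) * (qi - ri) for ri, qi in zip(r, q)]
    t += DT

# Closed-form solution r_i(t) = q_i + (r_i(0) - q_i) * exp(-t/tau):
closed = [qi + (ri0 - qi) * math.exp(-t / TAU) for qi, ri0 in zip(q, r0)]
```

After ten time constants, the Euler trajectory and the closed-form solution agree to a few parts in a million, and both have essentially reached the normalized qualities.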
Case a_i = 0, b_i ≥ 0. In this case we find a proportional dependence between qualities and reliabilities:

q̃_i(t) = b_i r_i(t).  (A.10)

In particular, a cue whose reliability is zero has a quality of zero. In practice, this situation should occur only if the cues are in total disagreement, and it should be temporary due to the adaptation of the cues' prototypes. Substituting into equation 2.6 gives

q_i(t) = b_i r_i(t) / Σ_j b_j r_j(t).  (A.11)

Inserting this expression into equation 2.8 yields

τ ṙ_i(t) = b_i r_i(t) / Σ_j b_j r_j(t) − r_i(t) = ( b_i / Σ_j b_j r_j(t) − 1 ) r_i(t).  (A.12)
This equation is similar (although not equivalent) to Eigen's evolution equation² and represents a winner-take-all competition between the r_i, with their fitness given by b_i. An r_i will grow if its b_i is greater than the current average fitness Σ_j b_j r_j(t) and shrink otherwise. Eventually only the r_i with the highest b_i will survive, with all other r_i going to zero. Thus, under this assumption, the system will always assume a despotic solution. A situation where r_a = 1 but b_a is not the biggest of the b_i represents an unstable fixed point of the dynamics.

Case a_i ≥ 0 and b_i ≥ 0.
In this case, equation 2.6 becomes

q_i(t) = ( b_i r_i(t) + a_i ) / Σ_j ( b_j r_j(t) + a_j ) = ( b_i r_i(t) + a_i ) / c(r(t)),

with

c(r(t)) ≡ Σ_j ( b_j r_j(t) + a_j ).  (A.13)
Inserting this expression into equation 2.8 yields

τ ṙ_i(t) = ( b_i r_i(t) + a_i ) / c(r(t)) − r_i(t).  (A.14)
Let us consider a despotic solution and see when it is a fixed point of the dynamics. In this case, c(t) is given by

c(t) = b_a + Σ_j a_j.  (A.15)
For a suppressed cue (r_i = 0), we find that ṙ_i is proportional to a_i / c(r(t)):

ṙ_i ∝ a_i / c(r(t)).  (A.16)
Assuming a_i > 0, the reliability of the suppressed cue will increase to a finite value. Conversely, we find

ṙ_a ∝ ( b_a + a_a ) / ( b_a + Σ_i a_i ) − 1,  (A.17)

² The original evolution equation has the form ẋ_i = F_i x_i − (x_i / N) Σ_j F_j x_j. The qualitative behavior of a winner-take-all competition, however, is identical.
which is smaller than zero as long as there is at least one a_i > 0 for i ≠ a. Thus, despotism is generally not a fixed point of the dynamics. In summary, despotism can occur only if the zeroth-order coefficients of the Taylor expansion vanish for all but the suppressing cue at the same time. This situation would correspond to total disagreement of the suppressed cues with the suppressing cue, which should be largely ruled out by the cues' ability to adapt their prototypes. Indeed, we never observed such behavior in our experiments.
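The instability of the despotic state in the mixed case can likewise be illustrated numerically (again our sketch, with arbitrary coefficients):

```python
# Mixed case a_i > 0, b_i > 0: start from the despotic state
# r = (1, 0, 0) and integrate tau * dr_i/dt = (b_i r_i + a_i)/c(r) - r_i
# with c(r) = sum_j (b_j r_j + a_j). The suppressed cues recover, so
# despotism is not a fixed point (eqs. A.14-A.17).

TAU = 0.5
DT = 0.001

a = [0.1, 0.1, 0.1]          # nonzero zeroth-order coefficients
b = [3.0, 1.0, 1.0]          # cue 1 is still the "fittest"
r = [1.0, 0.0, 0.0]          # despotic initial condition

for _ in range(20000):       # integrate for 20 s
    c = sum(bj * rj + aj for bj, rj, aj in zip(b, r, a))
    r = [ri + (DT / TAU) * ((bi * ri + ai) / c - ri)
         for ri, bi, ai in zip(r, b, a)]
```

The dominant cue keeps the largest reliability, but each suppressed cue relaxes to the finite value a_i / (c − b_i) > 0 rather than staying at zero.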
Acknowledgments
We thank the Army Research Office (Adaptive Optoelectronic Eyes: Hybrid Sensor/Processor Architectures MURI) and the German BMBF (NEUROS project) for support. J. T. is currently supported by NIH/PHS research grant P41 RR09283. We express our gratitude to the developers of the FLAVOR software environment, which served as the platform for this research. We are grateful to Alexander Heinrichs for help in recording the image sequences. We thank Dana H. Ballard, Robert A. Jacobs, and two anonymous reviewers for helpful comments on the manuscript.
Received May 31, 2000; accepted November 14, 2000.
LETTER
Communicated by Gary Cottrell
An Autoassociative Neural Network Model of Paired-Associate Learning

Daniel S. Rizzuto
Michael J. Kahana
Volen Center for Complex Systems, Brandeis University, Waltham, MA 02454, U.S.A.

Hebbian heteroassociative learning is inherently asymmetric. Storing a forward association, from item A to item B, enables recall of B (given A) but does not permit recall of A (given B). Recurrent networks can solve this problem by associating A to B and B back to A. In these recurrent networks, the forward and backward associations can be differentially weighted to account for asymmetries in recall performance. In the special case of equal-strength forward and backward weights, these recurrent networks can be modeled as a single autoassociative network where A and B are two parts of a single, stored pattern. We analyze a general, recurrent neural network model of associative memory and examine its ability to fit a rich set of experimental data on human associative learning. The model fits the data significantly better when the forward and backward storage strengths are highly correlated than when they are less correlated. This network-based analysis of associative learning supports the view that associations between symbolic elements are better conceptualized as a blending of two ideas into a single unit than as separately modifiable forward and backward associations linking representations in memory.

1 Introduction
To account for performance in standard memory tasks, formal mathematical models of human memory typically employ both autoassociative and heteroassociative mechanisms (Brown, Dalloz, & Hulme, 1995; Humphreys, Bain, & Pike, 1989; Metcalfe, 1991; Murdock, 1993, 1997). Autoassociative information supports the processes of recognition and pattern completion (Metcalfe, 1991; Weber & Murdock, 1989), whereas heteroassociative information supports the processes of paired-associate learning and sequence generation (Chance & Kahana, 1997; Murdock, 1993). The mathematics of matrix memories (Anderson, 1972), often coupled with nonlinear retrieval dynamics (Buhmann, Divko, & Schulten, 1989; Carpenter & Grossberg, 1993; Hopfield, 1982), provides a mechanistic foundation for these models of human associative memory. This article presents an attractor neural network model of paired-associate learning and uses a model-based analysis of experimental data to shed light on some basic unresolved questions concerning the nature of associations in human memory.

Neural Computation 13, 2075–2092 (2001) © 2001 Massachusetts Institute of Technology

The paired-associate learning task is one of the standard assays of human episodic memory. Typically subjects are presented with randomly paired items (e.g., words, letter strings, pictures) and asked to remember each A-B pair for a subsequent memory test. At test, the A items are presented as cues, and subjects attempt to recall the appropriate B items. Attractor network models and classic linear associators provide simple accounts of associative learning. Storing an autoassociation enables pattern completion, and storing a heteroassociation enables recall of paired-stimulus representations. However, if the representations of the to-be-learned items (A and B) are combined into a single composite representation (by summation or concatenation), it is possible to use an autoassociative network to accomplish heteroassociation. In a simple matrix model, where the representation of the A-B pair is given by the sum of the constituent item vectors (a + b), the storage equation for an autoassociative model would be

\[
W_t = W_{t-1} + (a + b)(a + b)^T
    = W_{t-1} + a a^T + a b^T + b a^T + b b^T.
\]

Alternatively, one could construct a recurrent heteroassociative matrix memory in which A is associated with B and B is associated back to A. In this model, the forward association would be stored in the matrix W_t^{A→B} = W_{t-1}^{A→B} + b a^T, and the backward association would be stored in the matrix W_t^{B→A} = W_{t-1}^{B→A} + a b^T. Although both versions support heteroassociative recall, the autoassociative version assumes symmetric forward and backward associations, whereas the heteroassociative version allows for asymmetric, and possibly independent, forward and backward associations. This distinction between symmetric and asymmetric associations has a long history in the experimental psychology of human memory and learning.

2 Associative Symmetry Versus Independent Associations
According to the classic associationist view, the strength of an association is sensitive to the temporal order of encoding (Ebbinghaus, 1885/1913; Robinson, 1932). If A and B are encoded successively, the forward association, A → B, is hypothesized to be stronger than the backward association, B → A. The strengths of forward and backward associations are further hypothesized to be independent (Wolford, 1971). We refer to the strong version of this view as the independent associations hypothesis (IAH). In contrast to this position, representatives of Gestalt psychology (Asch & Ebenholtz, 1962; Köhler, 1947) viewed symbolic associations as composite representations, incorporating elements of each to-be-learned item into a new entity. We refer to this view as the associative symmetry hypothesis (ASH). According to this position, the strengths of forward and backward associations are approximately equal and highly correlated.

In support of the associative symmetry view, early studies of paired-associate learning found approximate equivalence between forward and
backward recall probabilities (see Ekstrand, 1966, for a review). That is, order of study did not seem to influence the strength of forward and backward associations. The equivalence of forward and backward recall was particularly striking when highly imageable words were used as stimuli (Murdock, 1962, 1965, 1966). Data from these experiments showed nearly identical forward and backward recall across variations in presentation rate, serial position, delay between study and test, list length, and number of repetitions of the study pairs. More recent studies have shown that the classic symmetry results are readily replicated (see Kahana, 2001, for a review). This surprising result, supporting Gestalt and cognitive ideas, inspired significant debate for nearly two decades. The problem was not resolved, and it resurfaced when mathematical models of associative memory appeared on the scene (Murdock, 1985; Pike, 1984).

Although retrieval in paired associates is approximately symmetric (with respect to order of study), retrieval in free recall and serial recall shows marked asymmetries. For example, when subjects are asked to name the letter that precedes or follows a probe letter in the alphabet, backward retrieval is typically 40 to 60% slower than forward retrieval (Klahr, Chase, & Lovelace, 1983; Scharroo, Leeuwenberg, Stalmeier, & Vos, 1994). In the free recall task, subjects are presented with a series of items and asked to recall them in any order. Analysis of output order reveals that forward transitions are significantly more frequent than backward transitions (Kahana, 1996). This is true of immediate, delayed, and continuous-distractor free recall (Howard & Kahana, 1999). Although many studies have found symmetry between forward and backward recall, there are still conditions under which symmetric retrieval is violated. For example, in studies where pairs of items are chosen from different linguistic classes, strong asymmetry is often observed.
Lockhart (1969) showed that cued recall of noun-adjective pairs was superior when cued with the noun than when cued with the adjective, independent of the order of study. However, when subjects study noun-noun pairs, associative symmetry is observed. Wolford (1971) examined both digit-word and word-digit pairs and found an advantage when using words as the retrieval cue, irrespective of the order of presentation.

Kahana (2001) presented an analysis of the implications of associative symmetry for linear matrix and convolution models of human associative memory. In his treatment, he showed that models assuming symmetry (Metcalfe, 1991; Murdock, 1997) can still account for empirical asymmetries in overall forward and backward recall performance. Consider, for example, items with different numbers of preexperimental associates. If item A has many more preexperimental associates than item B, then probing memory with item A will result in significant associative interference and impair subjects' performance. This will occur even if the A-B association is stored symmetrically. Although models that assume symmetry can account for findings of retrieval asymmetries, models that assume independent associations can also
account for the equivalence of forward and backward recall performance. In this case, if associations are formed independently but with equal strength on average, symmetric retrieval probabilities would be realized. It is thus unlikely that forward and backward recall probabilities will discriminate among competing theories. To address the question of the dependence or independence of forward and backward associations, Kahana (2001) employed the method of successive tests, which involves successively testing the stored associations in both the forward and reverse directions. With this method, one can directly estimate the correlation between forward and backward recall of a given studied pair. Kahana reported experimental evidence consistent with the ASH.

3 A Neural Network Model of Paired-Associate Learning
We developed an autoassociative neural network model to simulate forward and backward associative recall and to model data on the correlation between successive recall tests by human subjects (Metcalfe, 1991; Metcalfe, Cottrell, & Mencl, 1993). Although autoassociation is generally used to encode a single representation, if the representation being stored is the combination of two items, it is possible to store both autoassociative and heteroassociative information within a single memory matrix. Assuming that the representations of the A and B items are concatenated, the general form of the storage equation for a list of L pairs is given by

\[
W = \sum_{\nu=1}^{L} (a^{\nu} \oplus b^{\nu})(a^{\nu} \oplus b^{\nu})^T,
\tag{3.1}
\]

where a^ν and b^ν are binary (±1), N-element vectors representing the items to be associated, a^ν ⊕ b^ν denotes the concatenation of a^ν and b^ν, and W is the 2N × 2N weight matrix. This outer product produces a memory matrix with four quadrants, as shown in Figure 1. Quadrants 1 and 3 of the matrix contain autoassociative information, and quadrants 2 and 4 contain heteroassociative information.

We use a probabilistic encoding¹ algorithm that enables the network to account for the effect of repetition on performance. Each weight in the matrix has some probability of being stored correctly, depending on the

¹ In an analysis of different learning rules for distributed memory models, Murdock (1989) found that probabilistic encoding of the individual "features" of item vectors provided a good fit to data on paired-associate learning. In this framework, each feature is encoded with some probability on each presentation of the item. This results in an increase in the number of features encoded with each presentation. For more recent models using probabilistic learning in autoassociative networks, see Amit and Brunel (1995), Amit and Fusi (1992), and Brunel, Carusi, and Fusi (1998).
Figure 1: Graphical representation of the autoassociative network architecture used to simulate the experimental paradigm. Light rectangles indicate heteroassociative quadrants of the weight matrix, and the darker rectangles indicate autoassociative quadrants. Triangles represent the nodes in the network.
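The quadrant structure produced by equation 3.1 can be illustrated with a short numpy sketch. Variable names are ours; N = 70 and L = 12 follow the simulations described later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 70, 12  # units per item and list length, as in the paper's simulations

# Random binary (+/-1) item vectors for L pairs.
A = rng.choice([-1, 1], size=(L, N))
B = rng.choice([-1, 1], size=(L, N))

# Equation 3.1: sum of outer products of the concatenated pair vectors.
W = np.zeros((2 * N, 2 * N), dtype=int)
for a, b in zip(A, B):
    s = np.concatenate([a, b])      # a ⊕ b
    W += np.outer(s, s)

# The diagonal blocks hold autoassociative information (sums of a a^T and
# b b^T); the off-diagonal blocks hold heteroassociative information
# (sums of b a^T and a b^T).
auto_A = W[:N, :N]                  # autoassociative block for the A items
fwd = W[N:, :N]                     # block mediating forward (A -> B) recall
assert np.array_equal(W, W.T)       # composite storage is symmetric
```

Because every pair contributes a symmetric outer product, the full matrix satisfies W = Wᵀ, which is the formal sense in which this composite storage scheme builds symmetry in from the start.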
quadrant of the matrix in which it resides and the number of repetitions. The heteroassociative quadrants of the matrix drive the associative recall process, and we introduce two random variables, γ_f and γ_b, which control learning in the forward and the backward directions, respectively. For a pair in the list, the rule for storing each heteroassociative weight in the quadrant that mediates forward recall (quadrant 2 in Figure 1) is given by

\[
W_{ij} =
\begin{cases}
s_i^{\nu} s_j^{\nu} & \text{with probability } \gamma_f \\
0 & \text{with probability } 1 - \gamma_f,
\end{cases}
\tag{3.2}
\]

where s^ν = (a^ν ⊕ b^ν) and γ_f ∼ N(μ, σ). Similarly, the probability of storing each heteroassociative weight that mediates backward recall is given by γ_b ∼ N(μ, σ). The parameter μ represents the mean probability of encoding, while the parameter σ represents the variability in encoding across pairs.
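Equation 3.2 amounts to an independent coin flip for every weight in the forward heteroassociative quadrant. A minimal sketch (variable names are ours, and γ_f is fixed at 0.5 purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 70
a = rng.choice([-1, 1], size=N)
b = rng.choice([-1, 1], size=N)

gamma_f = 0.5                      # forward encoding probability (illustrative)

# Equation 3.2: each weight of the forward quadrant (the b a^T block) is
# set to s_i s_j with probability gamma_f, and left at 0 otherwise.
target = np.outer(b, a)            # fully encoded forward quadrant
mask = rng.random((N, N)) < gamma_f
W_fwd = np.where(mask, target, 0)

frac_stored = np.mean(W_fwd != 0)  # fraction of weights actually encoded
print(frac_stored)                 # close to gamma_f
```

Over the N² = 4900 weights of the quadrant, the fraction of encoded weights fluctuates binomially around γ_f; this per-trial fluctuation is exactly the "binomial variability" the authors return to when discussing the σ → 0 limit.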
If γ_f and γ_b are perfectly correlated, the model implements the ASH. If γ_f and γ_b are independent of one another, the model implements the IAH. Rather than pitting the two hypotheses against each other, we can allow the model to determine the correlation between γ_f and γ_b, denoted ρ(γ_f, γ_b), that best fits the data. The power of this parameter lies in its ability to change the behavior of the model from that approximating the IAH when ρ approaches zero, to that approximating the ASH when ρ approaches unity. For the correlation parameter to be meaningful, the values of γ_f and γ_b must vary across pairs. This reflects the empirical finding that some pairs are easier to learn than others. The correlation represents the degree to which variation in encoding is at the level of pairs within a list versus the level of forward and backward associations within a pair. Despite the variation in ρ, the mean level of encoding is the same for both the forward and backward associations. This means that average forward recall and backward recall will be equivalent even when ρ approaches zero. Thus, for each A-B pair, the learning of the forward and backward association is determined by the distributions of γ_f and γ_b, as well as the correlation, ρ, between these two parameters.

Because associative recall is driven by the heteroassociative quadrants of the matrix, we make the simplifying assumption of no variability in autoassociative encoding across pairs. Thus, the weights in quadrants 1 and 3 of the matrix are stored equally well for every item, with the probability of storage equal to μ. Sampling the learning probabilities γ_f and γ_b from a normal distribution of mean μ and variance σ² occasionally produces probabilities that fall outside the range of 0 through 1. These probabilities are meaningless, and when this occurs, both variables are redrawn from the distribution, thus producing a truncated normal distribution.
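The redraw-both truncation can be sketched as follows. The parameter values (μ = 0.8, σ = 0.25, ρ = 0.9) are illustrative assumptions, not the fitted values, and the sketch also exhibits the shift in effective mean that truncation introduces:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, rho = 0.8, 0.25, 0.9   # hypothetical parameter values
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

def draw_gammas(n):
    """Draw (gamma_f, gamma_b) pairs from a bivariate normal with
    correlation rho, redrawing both members of a pair whenever either
    falls outside [0, 1] -- a truncated bivariate normal."""
    out = np.empty((0, 2))
    while len(out) < n:
        g = rng.multivariate_normal([mu, mu], cov, size=n)
        ok = ((g >= 0.0) & (g <= 1.0)).all(axis=1)   # accept only valid pairs
        out = np.vstack([out, g[ok]])
    return out[:n]

g = draw_gammas(20000)
# Truncation at 1 clips the upper tail, so the effective mean sits below mu,
# and the sample correlation stays high but below the nominal rho.
print(g.mean(axis=0), np.corrcoef(g[:, 0], g[:, 1])[0, 1])
```

With μ this close to 1, most of the probability mass lost to truncation lies above 1, which is why the effective mean ends up noticeably below the nominal μ.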
Using this method produces an effective mean and variance that are somewhat different from the parameters used to generate the distribution.

Retrieval in this model follows Hopfield (1982). Recall of pair ν proceeds asynchronously, with a random node being updated on each iteration, t, and its activation given by

\[
s_i^{\nu}(t + 1) = \operatorname{sgn}\!\left( \sum_j W_{ij}\, s_j^{\nu}(t) \right).
\tag{3.3}
\]
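A minimal sketch of this asynchronous retrieval rule for a single stored pair, with the cue half of the state held fixed (the clamping and stopping criterion are described in the text that follows); variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 70                                # number of units per item, as in the paper
a = rng.choice([-1, 1], size=N)
b = rng.choice([-1, 1], size=N)
s_store = np.concatenate([a, b])
W = np.outer(s_store, s_store)        # equation 3.1 with a single pair (L = 1)

# Forward test: the cue half is a; the target half starts as random noise k.
k = rng.choice([-1, 1], size=N)
s = np.concatenate([a, k])

for _ in range(800):                  # I_max = 800 asynchronous updates
    i = rng.integers(N, 2 * N)        # pick a random target node; the cue
                                      # nodes stay clamped to a throughout
    s[i] = np.sign(W[i] @ s)          # equation 3.3

cosine = (s[N:] @ b) / N              # similarity between state and target item
print(cosine)                         # near 1: the target b is recovered
```

With a single stored pair, each updated target node is driven directly to the corresponding element of b, so the cosine similarity climbs toward 1 and comfortably clears the criterion θ = 0.99 used in the paper.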
The initial state of the network, s^ν(0), is equal to (a^ν ⊕ k) for forward recall and (k ⊕ b^ν) for backward recall, where k is an N-dimensional, random, binary (±1) vector, uncorrelated from trial to trial. The state of the cue vector (a^ν, in the case of a forward test) is clamped throughout retrieval. After each iteration, the state of the network is compared to the target item. If the similarity between the network state and the target, as measured by the cosine of the angle between the two vectors, is greater than a constant criterion, θ, then the target is said to be recovered. If not, the process repeats until a maximum number of iterations has been reached, denoted I_max. At this time, if the network has not reached the criterion level of item recovery, that item is considered nonrecallable. In the work presented here, the parameters θ, I_max, and N were fixed to 0.99, 800, and 70, respectively; changing them had very little effect on the results of the simulations.

4 Applying the Model to a Paired-Associate Learning Data Set
We applied our model to data gathered from a paired-associate learning experiment reported in Kahana (2001). First, we briefly describe the experiment, and then we report model fits to the data. The experimental procedure is illustrated in Figure 2. Subjects studied a list of 12 unique word pairs. Each pair was presented visually for 2 seconds, and subjects were instructed to read the words aloud from left to right to ensure that they processed the two words in temporal succession. To assess learning, equal numbers of word pairs were presented either one, three, or five times in each list. The order of presentation was random, subject to the constraint that identical pairs were never repeated successively. After studying the list of word pairs, subjects performed a distractor task (pattern matching) aimed at minimizing the role of recency-sensitive retrieval processes (Howard & Kahana, 1999).

Each pair was tested for recall once in each of two successive test phases (test 1 and test 2). In test 1, subjects were tested on each of the studied pairs: half in the forward direction (A → ?) and half in the backward direction (? ← B). In test 2, half of the pairs that were first tested in the forward direction were tested in the backward direction, and the other half were again tested in the forward direction. The same was true of pairs that were tested in the backward direction on test 1. This produced a 2 × 2 factorial of test 1–test 2 possibilities (forward-forward, forward-backward, backward-forward, and backward-backward).
Additionally, one can measure the dependency between the two outcomes using a standard measure of association (e.g., Yule’s Q, Goodman-Kruskal’s c , or  2 ). We adopt Yule’s Q because of its extensive use in the study of memory (see Kahana, 1999, for a review) and for its desirable statistical properties (Bishop, Fienberg, & Holland, 1975). This dependency, or correlation, between the outcomes of the two tests provides valuable information that is not contained in the marginal probabilities of recall. The experimental results and model ts are shown in Tables 1 and 2. Table 1 reports recall probability and test 1–test 2 correlations (Yule’s Q) as
Figure 2: Depiction of the experimental procedure. The study session is followed by a distractor task (pattern matching), which is followed by test 1, another distractor task, and then test 2.
a function of the number of presentations, averaged across subjects. From the behavioral data, it can be seen that there were no significant differences between forward and backward recall probabilities within any presentation level of the experiment. Also, correlations between identical successive tests were very high (near one); correlations between reverse successive tests were also high, but not as high as for the identical successive tests.
Figure 3: Example of a 2 × 2 contingency table, illustrating the calculation of the Yule's Q measure of association between test 1 and test 2 performance. In the example shown, a = 0.25, b = 0.35, c = 0.10, d = 0.30, giving Q = (ad − bc)/(ad + bc) = 0.364.

Table 1: Observed Accuracy and Correlation Values for Each Presentation Level in the Experiment, Averaged Across Subjects.

              Probability of Recall                     Yule's Q
          Forward           Backward           Identical          Reversed
          Data    Model     Data    Model      Data    Model      Data    Model
    1p    0.35    0.35      0.36    0.35       0.97    0.99       0.84    0.85
    3p    0.65    0.65      0.65    0.65       0.96    0.99       0.81    0.83
    5p    0.75    0.75      0.73    0.75       0.96    0.99       0.84    0.82
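Yule's Q is straightforward to compute from the four cells of a contingency table; the sketch below reproduces the worked example of Figure 3:

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 contingency table with cells
    a (success on both tests), b (test 2 success only),
    c (test 1 success only), d (failure on both tests)."""
    return (a * d - b * c) / (a * d + b * c)

# The worked example of Figure 3: a = 0.25, b = 0.35, c = 0.10, d = 0.30.
q = yules_q(0.25, 0.35, 0.10, 0.30)
print(round(q, 3))   # 0.364
```

Q ranges from −1 (perfect negative association) through 0 (independence) to +1 (perfect positive association), which is what makes it a convenient summary of the dependency between the two test outcomes.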
In Table 2 we report the average of the individual subjects' contingency tables for the behavioral data and the model.² We fit the model to each individual subject's contingency table separately. In doing so, we allowed μ and σ to vary independently for each level of learning (μ₁, μ₃, μ₅, σ₁, σ₃, σ₅). This quantitatively fits the learning data without constraining the model to a particular learning mechanism. A single ρ is used to correlate forward and backward learning for all presentation levels. For parameter estimation, we minimize the model's root-mean-squared deviation (RMSD) from the behavioral data for each of the 15 subjects who took part in the experiment.³ We fit the model concurrently to the contingency tables for the identical and reversed successive tests for all three levels of learning, thereby fitting 24 data points with seven free parameters.

Table 2: Contingency Tables from the Behavioral Experiment Compared with Simulated Contingency Tables.

          Identical Direction               Reversed Directions
          Data            Model             Data            Model
    1p    0.319  0.006    0.321  0.032      0.293  0.049    0.262  0.097
          0.012  0.663    0.028  0.620      0.122  0.537    0.100  0.542
    3p    0.583  0.037    0.625  0.027      0.609  0.043    0.575  0.088
          0.012  0.368    0.026  0.323      0.074  0.273    0.084  0.254
    5p    0.729  0.012    0.723  0.019      0.681  0.037    0.665  0.078
          0.018  0.241    0.025  0.233      0.086  0.196    0.083  0.173

Figure 4 plots the best-fitting model parameters with 95% confidence intervals calculated across subjects. By fitting each (normalized) quadrant of the contingency tables from the experimental data, the model was able simultaneously to fit correlations and accuracy on tests 1 and 2 (RMSD = 0.07). These model fits to the experimental data are interesting in several respects. As expected, we see that the mean level of encoding increases with learning (from μ₁ to μ₃ to μ₅). We also see that significant variability in learning across pairs (σ) is required to fit the experimental data. Although this is consistent with psychological studies of the role of variability in learning (Hintzman & Hartry, 1990), this important facet of human performance has rarely been incorporated into neural network or even formal mathematical models of learning and memory. Finally, and most important, the fact that the best-fitting values of ρ are very high (near one) tells us that very strong correlations between forward and backward storage are necessary to fit the human behavioral data.

Once the best parameter set for each subject was found, we completed a comprehensive search of the local parameter space to look at the effects of correlations and variability in learning on the fitness landscape. In Figure 5 we plot model fitness as a function of ρ and σ₃; each of the remaining parameters was kept fixed. Fitness was calculated using only the contingency tables for three presentations. Underneath each of the simulated data points in the fitness landscape, we plot the p-value representing the significance of the difference between that parameter set and the best-fitting parameter set. Lower p-values reflect

² Although Tables 1 and 2 are computed from the same underlying behavioral data, computing Yule's Q values from Table 2 will not necessarily produce the values seen in Table 1. The reason for this lies in the fact that when calculating Yule's Q, it is customary to add 0.5 to each cell in the contingency table. This is due to the fact that Yule's Q is a ratio and is thus extremely sensitive to cells having values of zero or near zero. Table 2 contains raw contingency tables averaged across subjects without adding 0.5 to each cell. This is appropriate because no derived measure is being calculated. Thus, Table 2 represents the average contingency table and was not used when calculating the average value of Yule's Q.

³ Error minimization was accomplished via a genetic algorithm (Mitchell, 1996). Fifteen populations (one for each subject in the experiment) of 100 random points ("individuals") in parameter space were evolved for many generations until the average fitness for the population did not change from one generation to the next. An individual's fitness was calculated to be the negative of its RMSD compared to the behavioral data for a particular subject. Each individual was run for 300 trials of 12-pair lists in order to generate reliable values. At the end of every generation, each of the 50 least-fit individuals was completely regenerated from 2 of the best-fitting individuals by randomly drawing their new parameter values from each of their parents. Those 50 individuals with the best fitness were mutated by a single, gaussian parameter change.
Figure 4: Best-fitting parameter values (μ₁, μ₃, μ₅, σ₁, σ₃, σ₅, ρ) averaged across subjects. Error bars represent 95% confidence intervals.
an inability of that parameter set to fit the behavioral data as well as the best-fitting parameter set. Figure 5 shows that the best-fitting portion of the parameter space (ρ ≥ 0.8 and 0.15 ≤ σ₃ ≤ 0.25) fits significantly better than the rest of the space.

We used the same simulations to assess correlations between forward and backward recall. Figure 6 plots the correlation between simulated successive tests in opposite directions as a function of both ρ and σ₃. In this figure, it is shown that the only way of producing high correlations in the retrieval of forward and backward associations is to have extremely high correlations in the learning of those associations. This figure also shows the monotonic relationship between the correlations in learning the forward-backward associations and the correlation in recall between successive tests in opposite directions. The correlation between successive tests in the identical direction is always one.

It may be seen that Yule's Q decreases as encoding variability (σ) across items decreases. This occurs when σ is small compared to the binomial variability introduced during probabilistic encoding. Although there is no difference between the probabilities of encoding the forward and the backward associations when σ = 0, there are random fluctuations in the actual number of weights stored for each association (e.g., if there are 1000 weights in each quadrant of the matrix and γ_f = γ_b = 0.5, one trial might store 512
Figure 5: Model fitness (negative RMSD) plotted on the z-axis as a function of ρ and σ3. Below the fitness function, we plot the p-values corresponding to the statistical significance of the difference between the best-fitting parameter set and the rest of the parameter space.
weights in the forward weight matrix while storing only 487 weights in the backward weight matrix). As σ → 0, encoding probabilities are very similar for all pairs, and the uncorrelated noise associated with binomial variability leads to much lower correlations in recall.

5 Output Encoding
In the model just described, we assumed that the test trials did not affect information stored in memory. Although this assumption is common in neural network models, behavioral studies suggest that test trials do contribute to learning (Humphreys & Bowyer, 1980). If subjects are reencoding the pair during correct responses to the first test, this could increase the probability of responding correctly to the second test and artifactually increase the observed correlation.4 To explore this possibility, we augmented our basic model by introducing an output encoding parameter. This parameter, w, regulates the reencoding
4 Although it is possible that subjects will encode some representation on incorrect trials as well, subjects rarely make errors of commission in these experiments. They usually respond with “Pass” instead.
Figure 6: Simulated correlation (Yule's Q) between forward and backward recall as a function of σ3 and ρ.
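For reference, Yule's Q for a 2x2 contingency table [[a, b], [c, d]] is (ad - bc)/(ad + bc). The sketch below is an illustration, not the authors' simulation; the recall criterion and all parameter values are assumptions. It shows how uncorrelated binomial storage noise drives the simulated Q toward zero as across-pair variability shrinks, the effect discussed in the surrounding text.

```python
import random

def yules_q(a, b, c, d):
    """Yule's Q for the 2x2 contingency table [[a, b], [c, d]]:
    Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)

def simulated_q(n_pairs=500, n_weights=200, sigma=0.0, seed=0):
    """Illustrative simulation (not the authors' code): each pair draws
    an encoding probability p with across-pair spread sigma; forward
    and backward storage share the same p (perfect correlation in
    learning), but the number of weights actually stored still
    fluctuates binomially. Recall is scored as success when at least
    half the weights were stored (an arbitrary criterion)."""
    rng = random.Random(seed)
    table = [[0, 0], [0, 0]]
    for _ in range(n_pairs):
        p = min(max(rng.gauss(0.5, sigma), 0.0), 1.0)
        fwd = sum(rng.random() < p for _ in range(n_weights))
        bwd = sum(rng.random() < p for _ in range(n_weights))
        rf, rb = fwd >= n_weights // 2, bwd >= n_weights // 2
        table[0 if rf else 1][0 if rb else 1] += 1
    (a, b), (c, d) = table
    return yules_q(a, b, c, d)
```

With sigma = 0 the binomial noise is independent across directions and Q sits near zero; with appreciable spread (e.g., sigma = 0.2) the shared p induces a strong correlation between forward and backward recall.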
of the original association on correct trials. When a test item is correctly recalled, each weight in all four matrix quadrants given by equation 3.1 is probabilistically reencoded as

$$\Delta W_{ij} = \begin{cases} s_i s_j & \text{with probability } w \\ 0 & \text{with probability } 1 - w. \end{cases} \tag{5.1}$$
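A minimal sketch of this reencoding rule follows. The dictionary weight representation and the bipolar pattern s are illustrative assumptions, not the authors' implementation.

```python
import random

def reencode(weights, s, w, rng=random):
    """Sketch of the output-encoding rule of equation 5.1: after a
    correct recall, each weight W_ij is reset to s_i * s_j with
    probability w and left unchanged with probability 1 - w.
    `weights` is a dict keyed by (i, j); `s` is the stored pattern.
    This representation is an assumption made for the sketch."""
    for (i, j) in weights:
        if rng.random() < w:
            weights[(i, j)] = s[i] * s[j]
    return weights
```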
This implementation of output encoding does not include variability, and the probability of storing a weight is independent of the quadrant it resides in. The fact that we are reencoding both the forward and the backward associations after a correct recall in either direction should increase the simulated correlation between forward and backward recall. In order to examine the contribution of output encoding to the behavioral predictions of our model, we again fit all eight parameters of the model to the experimental data for each subject. Figure 7 plots the best-fitting values of μ, σ, ρ, and w averaged across individual subjects. As before, μ increases monotonically with the number of presentations of the studied pair, and σ stays relatively constant across the three learning levels, corroborating the simulations without output encoding. These simulations differ, however, with respect to ρ, which settles into a lower value with output encoding than without (compare Figure 7 with Figure 4). This difference was not
Figure 7: Best-fitting parameter values for the model that includes output encoding. Error bars represent 95% confidence intervals.
statistically significant (t(28) < 1, n.s.). Similarly, the augmented model does not fit the data significantly better than the model without output encoding (RMSD = 0.06; t(28) < 1, n.s.). It should be noted, however, that the best-fitting value of ρ was significantly less than one (t(14) = 3.15, p < 0.01), suggesting that although the ASH provides a much better description of behavior than the IAH, the true state of the world may fall shy of perfect symmetry.

6 Conclusion
The finding of very strong correlations between forward and backward recall of paired associates in human memory supports the view that both forward and backward associations are products of a single mechanism. The strength of our model lies in its ability to modulate its behavior accurately and flexibly between two extreme views of associative learning, the ASH and the IAH, via the correlational parameter, ρ. The fact that the average best-fitting parameter set (without output encoding) included a value of ρ close to one is strong evidence for symmetrical learning of associations in humans. These results lend support to models that employ inherently symmetric associative mechanisms, like the holographic models of Murdock (1982, 1997) and Metcalfe (1991; Metcalfe Eich, 1982). Although the neural plausibility of convolution-based models has been called into question (Pike, 1984, but see Murdock, 1985), they have been successfully applied to a wide range of experimental data. We have examined the possibility that output encoding during the first test is affecting the results of the second test, a hypothesis first explored by Humphreys and Bowyer (1980) in the context of successive recognition-recall tests. Although our model can fit the human behavioral data marginally better when it includes output encoding than when it does not, this difference is not statistically significant. Most important, the best-fitting value of ρ does not change significantly with the inclusion of an output encoding mechanism. This model also addresses the effects of variability in the learning of associations. The best-fitting parameter set at all levels of learning contained significant variability in the distribution of learning probabilities. Clearly there are differences in encoding at the level of individual word pairs; some items are easier to learn than others, and fluctuations in attention can provide an additional source of variability. The current results support the idea that variability in learning is an important factor that must be considered in order to model the human learning process accurately. The idea that an autoassociative mechanism may underlie both item retrieval (redintegration) and associative retrieval (cued recall) is not unreasonable given what we know about hippocampal physiology. Recurrent collaterals in the CA3 region of the hippocampus may provide a neurophysiological basis for autoassociative memory (Treves & Rolls, 1994). Each CA3 cell receives input from mossy fibers arising in the dentate gyrus and direct perforant path inputs from entorhinal cortex (EC). Yet by far the majority of the connections come from other CA3 cells (Ishizuka, Cowan, & Amaral, 1995).
If incoming sensory inputs arriving from association cortex via EC contain a composite representation of both A and B items, this strong recurrent connectivity could provide a symmetric memory storage mechanism. Since synaptic modifiability within this system occurs on such a short timescale (Hanse & Gustafsson, 1994), it would be well suited to forming cohesive, episodic representations. Like humans, rats can also learn associations in the forward direction and exhibit transfer of the same association tested in the backward direction. Bunsey and Eichenbaum (1996) trained rats to associate a stimulus odor with one of two response odors in an olfactory learning paradigm. They then assessed the rats' ability to choose the stimulus odor when presented with its paired response. Although transfer was not perfect, it was substantial (> 50%). Following hippocampal lesions, these animals still show significant retention of the forward association but exhibit no transfer when tested in the reverse order. On the basis of these results, they suggest that there may be two mechanisms involved in the formation of associations. A declarative mechanism, mediated by the hippocampal formation, allows for flexible usage of associative relations, thus supporting both forward and
backward associative learning. A second procedural mechanism supports the acquisition of stimulus-stimulus associations but not the inferential processing needed to retrieve backward associations. This second mechanism does not depend on an intact hippocampal formation. A convergence of anatomical, behavioral, and physiological evidence points to the hippocampus as an integrator of multimodal sensory information, crucial to memory processing. The fact that it provides the necessary infrastructure to support an autoassociative memory system also suggests the intriguing hypothesis that this type of architecture may indeed underlie both auto- and heteroassociative memory in humans. In summary, our autoassociative network model of heteroassociative memory implements a stochastic learning algorithm acting at the level of individual weights and allows us quantitatively to fit human accuracy data as well as the correlations between successive tests. Fitting these experimental data highlights the importance of variability in the learning process. In addition, data on the correlations between successive forward and backward recall tests support the notion that an autoassociative mechanism may underlie at least some forms of heteroassociative learning.

Acknowledgments
We acknowledge support from National Institutes of Health research grants MH55687 and AG15852. Correspondence concerning this article should be addressed to Michael Kahana, Volen Center for Complex Systems, MS 013, Brandeis University, Waltham, MA 02254-9110. Electronic mail may be sent to [email protected].

References

Amit, D. J., & Brunel, N. (1995). Learning internal representations in an attractor neural network with analogue neurons. Network: Computation in Neural Systems, 6, 359–388.
Amit, D. J., & Fusi, S. (1992). Constraints on learning in dynamic synapses. Network: Computation in Neural Systems, 3, 443–464.
Anderson, J. A. (1972). A simple neural network generating an interactive memory. Mathematical Biosciences, 14, 197–220.
Asch, S. E., & Ebenholtz, S. M. (1962). The principle of associative symmetry. Proceedings of the American Philosophical Society, 106, 135–163.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Brown, G. D. A., Dalloz, P., & Hulme, C. (1995). Mathematical and connectionist models of human memory: A comparison. Memory, 3, 113–145.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9, 123–152.
Buhmann, J., Divko, R., & Schulten, K. (1989). Associative memory with high information content. Physical Review A, 39, 2689–2692.
Bunsey, M., & Eichenbaum, H. (1996). Conservation of hippocampal memory function in rats and humans. Nature, 379, 255–257.
Carpenter, G., & Grossberg, S. (1993). Normal and amnesic learning, recognition and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, 16, 131–137.
Chance, F. S., & Kahana, M. J. (1997). Testing the role of associative interference and compound cues in sequence memory. In J. Bower (Ed.), The neurobiology of computation: Proceedings of the Annual Computational Neuroscience Meeting. Norwell, MA: Kluwer.
Ebbinghaus, H. (1885/1913). Memory: A contribution to experimental psychology. New York: Teachers College, Columbia University.
Ekstrand, B. R. (1966). Backward associations. Psychological Bulletin, 65, 50–64.
Hanse, E., & Gustafsson, B. (1994). Onset and stabilization of NMDA receptor-dependent hippocampal long-term potentiation. Neuroscience Research, 20, 15–25.
Hintzman, D., & Hartry, A. L. (1990). Item effects in recognition and fragment completion: Contingency relations vary for different subsets of words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 965–969.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Howard, M. W., & Kahana, M. J. (1999). Contextual variability and serial position effects in free recall. Journal of Experimental Psychology: Learning, Memory and Cognition, 25, 923–941.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208–233.
Humphreys, M. S., & Bowyer, P. A. (1980). Sequential testing effects and the relation between recognition and recognition failure. Memory and Cognition, 8, 271–277.
Ishizuka, N., Cowan, W. M., & Amaral, D. G. (1995). A quantitative analysis of the dendritic organization of pyramidal cells in the rat hippocampus. Journal of Comparative Neurology, 362, 17–45.
Kahana, M. J. (1996). Associative retrieval processes in free recall. Memory and Cognition, 24, 103–109.
Kahana, M. J. (1999). Contingency analyses of memory. In E. Tulving (Ed.), Oxford handbook of human memory (pp. 323–384). New York: Oxford Press.
Kahana, M. J. (2001). Associative symmetry and memory theory. Submitted.
Klahr, D., Chase, W. G., & Lovelace, E. A. (1983). Structure and process in alphabetic retrieval. Journal of Experimental Psychology: Learning, Memory, and Cognition, 9, 462–477.
Köhler, W. (1947). Gestalt psychology. New York: Liveright.
Lockhart, R. S. (1969). Retrieval asymmetry in the recall of adjectives and nouns. Journal of Experimental Psychology, 79, 12–17.
Metcalfe, J. (1991). Recognition failure and the composite memory trace in CHARM. Psychological Review, 98, 529–553.
Metcalfe, J., Cottrell, G. W., & Mencl, W. E. (1993). Cognitive binding: A computational-modeling analysis of a distinction between implicit and explicit memory. Journal of Cognitive Neuroscience, 4, 289–298.
Metcalfe Eich, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627–661.
Mitchell, M. (1996). An introduction to genetic algorithms. Cambridge, MA: MIT Press.
Murdock, B. B. (1962). Direction of recall in short term memory. Journal of Verbal Learning and Verbal Behavior, 1, 119–124.
Murdock, B. B. (1965). Associative symmetry and dichotic presentation. Journal of Verbal Learning and Verbal Behavior, 4, 222–226.
Murdock, B. B. (1966). Forward and backward associations in paired associates. Journal of Experimental Psychology, 71, 732–737.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626.
Murdock, B. B. (1985). Convolution and matrix systems: A reply to Pike. Psychological Review, 92, 130–132.
Murdock, B. B. (1989). Learning in a distributed memory model. In C. Izawa (Ed.), Current issues in cognitive processes: The Floweree Symposium on Cognition (pp. 69–106). Hillsdale, NJ: Erlbaum.
Murdock, B. B. (1993). TODAM2: A model for the storage and retrieval of item, associative, and serial-order information. Psychological Review, 100, 183–203.
Murdock, B. B. (1997). Context and mediators in a theory of distributed associative memory (TODAM2). Psychological Review, 104, 839–862.
Pike, R. (1984). Comparison of convolution and matrix distributed memory systems for associative recall and recognition. Psychological Review, 91, 281–294.
Robinson, E. S. (1932). Association theory today: An essay in systematic psychology. New York: Century Co.
Scharroo, J., Leeuwenberg, E., Stalmeier, P. F. M., & Vos, P. G. (1994). Alphabetic search: Comment on Klahr, Chase, and Lovelace (1983). Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 234–244.
Treves, A., & Rolls, E. T. (1994). Computational analysis of the role of the hippocampus in memory. Hippocampus, 4, 374–391.
Weber, E. U., & Murdock, B. B. (1989). Priming in a distributed memory system: Implications for models of implicit memory. In S. Lewandowsky, J. Dunn, & K. Kirsner (Eds.), Implicit memory: Theoretical issues (pp. 87–89). Hillsdale, NJ: Erlbaum.
Wolford, G. (1971). Function of distinct associations for paired-associate performance. Psychological Review, 78, 303–313.

Received April 22, 1999; accepted November 2, 2000.
LETTER
Communicated by C. Lee Giles
Simple Recurrent Networks Learn Context-Free and Context-Sensitive Languages by Counting

Paul Rodriguez
Department of Cognitive Science, University of California at San Diego, La Jolla, CA 92093, U.S.A.

It has been shown that if a recurrent neural network (RNN) learns to process a regular language, one can extract a finite-state machine (FSM) by treating regions of phase-space as FSM states. However, it has also been shown that one can construct an RNN to implement Turing machines by using RNN dynamics as counters. But how does a network learn languages that require counting? Rodriguez, Wiles, and Elman (1999) showed that a simple recurrent network (SRN) can learn to process a simple context-free language (CFL) by counting up and down. This article extends that work to show a range of language tasks in which an SRN develops solutions that not only count but also copy and store counting information. In one case, the network stores information like an explicit storage mechanism. In other cases, the network stores information more indirectly in trajectories that are sensitive to slight displacements that depend on context. In this sense, an SRN can learn analog computation as a set of interdependent counters. This demonstrates how SRNs may be an alternative psychological model of language or sequence processing.

1 Introduction

Empirical and theoretical analysis has shown that if a recurrent neural network (RNN) learns to process a regular language, one can extract a finite-state machine (FSM) from its hidden unit activations by treating regions of phase-space as FSM states (Cleeremans, Servan-Schreiber, & McClelland, 1989; Giles, Sun, Chen, Lee, & Chen, 1992; Casey, 1996). However, others have pointed out that mathematically, an RNN is an infinite state machine, and one should treat it as a continuous-space dynamical system (Pollack, 1991; Kolen, 1994; Tani & Fukumura, 1995; Casey, 1996; Hoelldobler, Kalinke, & Lehmann, 1997; Blair & Pollack, 1997; Tino, Horne, Giles, & Collingwood, 1998; Tabor & Tanenhaus, 1999; Rodriguez, Wiles, & Elman, 1999). In fact, in the setting of discrete-time continuous-space dynamical systems, several researchers have derived results in analog computation theory (Pollack, 1987; Blum, Shub, & Smale, 1989; Siegelmann & Sontag, 1995; Casey, 1996; Maass & Orponen, 1997; Moore, 1998). For example, one can construct an RNN to implement a Turing machine by using

Neural Computation 13, 2093–2118 (2001) © 2001 Massachusetts Institute of Technology
2094
Paul Rodriguez
RNN dynamics as counters (Pollack, 1987; Siegelmann, 1993; Moore, 1998). A basic counter in analog computation theory uses real-valued precision. For example, a one-dimensional up-down counter for two symbols {a, b} is the pair of equations f_a(x) = .5x + .5, f_b(x) = 2x - 1.0, where x is the state variable, a is the input variable to count up (push), and b is the variable to count down (pop). Implementation in an RNN would have one hidden unit, x, and two input units, one for a and one for b. Using multiplicative weights (second-order connections) from a and b to the recurrent weight, one can construct the system as the following equation:

$$x_t = \left( \begin{bmatrix} .5 & 2 \end{bmatrix} I_t \right) x_{t-1} + \begin{bmatrix} .5 & -1 \end{bmatrix} I_t,$$

where I_t is the input column vector at time t: a = [1, 0], b = [0, 1]. A sequence of input aaabbb has state values (starting at 0) .5, .75, .875, .75, .5, and 0. More generally, a dynamical recognizer is a discrete-time continuous dynamical system with a given initial starting point and a finite set of Boolean output decision functions (Pollack, 1991; Moore, 1998; see also Siegelmann, 1993). The dynamical system is composed of a space,
Simple Recurrent Networks Learn Languages by Counting
2095
somewhere between regular languages and context-sensitive languages, one would also like to consider how an SRN learns context-free languages (CFLs) and context-sensitive languages (CSLs). This article examines how an SRN learns CFLs and CSLs. By using graphical analysis and evaluating fixed points, a global view of dynamics is pieced together. In contrast to automatically extracting FSMs when the language is unknown, prior knowledge of the language is used to evaluate network dynamics and reconstruct the network solution. Not surprisingly, the solutions are based on counters, but they are not typically the same as one might construct by hand. Whereas hand-constructed solutions are easily built by duplicating the explicit symbol manipulation of discrete automata, network solutions are often symbol sensitive in the nonobvious way that trajectories can depend on previous input. A variety of case studies are demonstrated. First, two CFL examples that were reported previously are reviewed: a simple up-down counting task and a simple palindrome task. Then four additional cases, showing a fuller range of counting functionality with stack-like properties, are examined: a more complicated palindrome, which requires a fractal solution; a simple CSL, which was previously suggested to be unlearnable; a buffer CFL, which is important in computational linguistics; and a jump CFL, which has a solution more like explicit symbol manipulation. The main contribution of this article is to demonstrate that counter solutions for some CFLs and CSLs are reachable with sufficient training. Although a finite-dimensional RNN cannot process CFLs robustly (Casey, 1996; Maass & Orponen, 1997), this work shows that an SRN can develop the kinds of trajectories that represent a language processing solution. Much of the analysis focuses on the best networks, because in the context of cognitive science, the nature of the solution indicates how a system is viewed with respect to its competence capability. On average, network performance is limited by precision and the complexity of the dynamics, as one would expect for a psychological model.

2 Training SRNs with Deterministic Context-Free Languages

2.1 Prediction Task. The SRNs examined in this article were trained to predict the next letter in artificial languages that are simple deterministic CFLs or CSLs. Several examples in connectionist modeling of language have used this task (Elman, 1990, 1991; Cleeremans, 1993; Tabor, Juliano, & Tanenhaus, 1996; Christiansen & Chater, 1999), motivated, in part, by the evidence that humans use prediction and anticipation in language processing (Kutas & Hillyard, 1984; McClelland, 1987).1 It is easy to observe that
1 Although RAAM models have had some success with CFLs (Pollack, 1990; Kwasny & Kalman, 1995), they require an external controller to rerun the pop network, which is one reason they have not been important to psychological applications.
2096
Paul Rodriguez
using a prediction task with an SRN is a real-time transducer version of a dynamical recognizer, in that the system must produce output at each time step, except that instead of an accepting region, there are prediction regions. The network will always be presented with positive examples, so the possibility of rejection will not explicitly arise. One cannot therefore claim that a network solution is a counterexample to Gold's theorem, which states that it is impossible to learn a grammar from positive examples only (Angluin, 1982). However, one could construct a dynamical recognizer by adding an extra output layer that accepts strings when a network makes all the right predictions possible and rejects strings on prediction mismatches.

2.2 Training Regimes. The training strategy was to use a simple artificial deterministic CFL or CSL; train a set of networks, each with different initial weights, in a prediction task; take the best networks; and evaluate the dynamics. All the networks used the SRN architecture with backpropagation-through-time (BPTT) training (Rumelhart, Hinton, & Williams, 1986); each network had 12 copy units and random initial weights between ±.3. Training sets were constructed with geometrically more numerous short strings than long strings and a finite maximum length so that the network would learn the end-of-string transitions. Strings were presented in random order. The length of training time was not fixed but determined by checking when a network was performing most correctly. A threshold criterion of .5 was used for an output node to indicate which output was predicted. If a network made all the right predictions possible for some input string, then it was correctly processing that string.

2.3 The Simplest CFL.
Previous work showed that an RNN can learn to process a simple context-free language, a^n b^n, by organizing the trajectories into an up-down counter that is similar to hand-coded dynamical recognizers but also exhibits some novelties (Rodriguez et al., 1999). The network developed an attracting and a repelling fixed point for the F_a and F_b systems, respectively, with proportional contracting and expanding rates (inversely proportional eigenvalues).2 Unlike hand-coded solutions, the contraction and expansion axes were distributed among the hidden units, and the F_b repelling fixed point was actually a saddle-node point that used the stable eigenvector effectively to copy the last a count value to the region of phase-space near the saddle point. This allowed the system to count in different regions of space. The network generalized to strings of length n ≤ 16, although they were trained only for n ≤ 11.

2 F_x, where x is a symbol in the alphabet of the language, will be used to indicate the discrete-time dynamical system equations with the input vector determined by the symbol.
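The qualitative solution just described, an attracting point for counting a up, a repelling saddle point for counting b down, and a copy of the a count into the b counter, can be illustrated with a hand-built two-unit sketch. The specific weights below are illustrative choices with inversely proportional rates; they are not the trained network's distributed weights.

```python
def clip(net):
    """Piecewise linear activation: identity on [0, 1], clipped outside."""
    return max(0.0, min(1.0, net))

def two_counter_step(x1, x2, symbol):
    """x1 counts a's up toward an attracting point at 1 (rate .5);
    x2 counts b's down from a repelling point at 1 (rate 2), seeded by
    copying the last a count (the 2*x1 term). The -5 input weights shut
    the inactive counter off. Both units update simultaneously, as in
    an SRN. All weight values here are illustrative assumptions."""
    if symbol == "a":
        return clip(0.5 * x1 + 0.5), clip(2.0 * x1 + 2.0 * x2 - 5.0)
    return clip(0.5 * x1 - 5.0), clip(2.0 * x1 + 2.0 * x2 - 1.0)

def accepts(string):
    """Accept a^n b^n: the down counter must return exactly to 0, and a
    b must never arrive with both counters already empty (clipping at
    zero would otherwise hide an over-pop)."""
    x1 = x2 = 0.0
    ok = True
    for symbol in string:
        if symbol == "b" and x1 == 0.0 and x2 == 0.0:
            ok = False  # more b's than a's
        x1, x2 = two_counter_step(x1, x2, symbol)
    return ok and x1 == 0.0 and x2 == 0.0
```

The counting values are dyadic rationals, so the comparisons against 0 are exact; the explicit over-pop check stands in for the refined accept region an idealized dynamical recognizer would need.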
2.4 Reconstructing the Network Solution. After training an SRN, it is useful to reconstruct the network dynamics to understand how the trajectories represent the task. In particular, one can replicate and simplify the dynamics without losing sight of the important features that make it work: the type of fixed points, their location with respect to each other, and some reasonable initial state. For example, it is easier to observe counting trajectories when they take place along different dimensions than when they are distributed among hidden units. Also, a network trajectory is functioning as an up or down counter if it is used to keep track of counting information, irrespective of whether the hidden unit values are increasing or decreasing or whether the trajectories are oscillating near the fixed points (in discrete time, trajectories can "jump" over fixed points as they converge or diverge). But in a reconstruction, one can always use increasing values to count up and remove the oscillation without loss of generality. Furthermore, one can construct an idealized version of the network by using infinite precision. For example, as shown below, processing strings of unlimited length is achieved if counting up approaches an attracting limit point and counting down moves away from a repelling limit point at the exact same rate, so that trajectory steps can match up to infinity without changing prediction regions.3 The idealized reconstruction, with appropriate accept-reject regions included as in a dynamical recognizer, can properly be said to process the language, whereas the trained network correctly processes finite-length strings in the language. The number of weights that require precision gives a simple measure of how complex the dynamics have to be to process the language. Using the piecewise linear function f(net) = net for net ∈ [0, 1], 0 for net < 0, 1 for net > 1, the network solution for the simple CFL can be represented by the following:

$$X_t = f\left( \begin{bmatrix} .5 & 0 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix} X_{t-1} + \begin{bmatrix} .5 & -5 & -5 \\ -5 & -1 & -5 \\ \# & +5 & -5 \end{bmatrix} I_t \right),$$

where f is applied to each net element, t indexes time, X_t is the three-dimensional state vector, I_t is the three-dimensional input vector using 1-in-k encoding of a = [1, 0, 0], b = [0, 1, 0], E = [0, 0, 1], where E is an end-of-string marker, and # indicates a weight parameter that is left unspecified, since for legal strings it may not adversely affect processing. The intuition behind the constructed solution is to assign one dimension to each input symbol, use inverse values for recurrent weights in counting up versus counting down, and set the input weights according to whether a state variable should be turned on or off (±5) or left unspecified. For

3 This is one obvious difference from building chains of quasi-attractors in a Hopfield net for counting (Amit, 1988), although that work is intended to be more biological.
example, reading down the first column of W_HI (the input weight matrix) in the equation above indicates that an a input vector will increment x1, turn off x2, and do nothing to x3. Note that the third column of W_HI is for the E input to reset all state values to 0. Also, the W_HI(3, 1) weight could be left unspecified, since for all legal strings, x3 does not have to be turned off by a. A sample trajectory is presented in appendix A. The constructed system has fixed points and dynamics based on what was found in the network, except that they are simplified by using separate dimensions. The expansion and contraction rates are inversely proportional as in the network, but not necessarily the same. The constructed system is in some sense idealized, since it has eigenvalues that are exact inverses and the F_a fixed point is located exactly on the stable manifold of the F_b saddle node. Importantly, this task requires three weights in the recurrent connections to be coordinated with precision so that the two counters are interdependent: the weight to count up, W_HH(1, 1) (in the hidden unit weight matrix), the weight to count down, W_HH(2, 2), and the weight to copy over the a count, W_HH(2, 1). Occasionally a network that learns a^n b^n will spuriously overgeneralize to some strings from the balanced parenthesis language, for example, aaabbaabbabb. One can extend the idealized solution above by setting W_HH(1, 2) = .5 (so that the F_a system copies back the last b count value), in which case four recurrent weights are coordinated with precision. Also, simulations reveal that for the simple CSL a^n b^n c^n, the system divides the task into two simple CFLs by counting up and down for input b on different dimensions.

2.5 A Restricted Palindrome. A palindrome language (mirror language) consists of a set of strings, S, such that each string, s ∈ S, s = ww^r, is a concatenation of a substring, w, and its reverse, w^r.
A mechanism cannot use a simple counter to process such a string but must use the functional equivalent of a stack that enables it to match the symbols in the second half of the string with the first half. Rodriguez and Wiles (1998) investigated a deterministic palindrome language that uses only two symbols for w and two other symbols for w^r, such that the second half of the string is fully predictable once the change in symbols occurs.4 The language is restricted in that one symbol is always present and precedes the other. Formally, the language is: w = a^n b^m, w^r = B^m A^n E, that is, aaaabbbBBBAAAAE, where n > 0, m ≥ 0, and E is the end-of-string marker. The embedded subsequence b^m B^m is just the simple CFL. One way to keep track of the number of as is to copy explicitly and store the a count
4 One should not confuse the term deterministic, which refers to the nature of state transitions in automata, and predictable, which refers to whether a symbol in a string can be determined from the left context.
across the b . . . B subsequences. For example, the system could use an extra state variable to latch onto the value of the variable that counts a during the b . . . B input. With a slight abuse of input vector notation, the equation could be x_{5,t} = 1 · x_{5,t-1} + 1 · x_{1,t-1} - 5a + 0b + 0B - 5A, where x5 is an extra state variable and x1 is the a count variable. Using five hidden units, the network developed a counter for b . . . B and a counter for a . . . A, just as in the simple CFL case. More interestingly, after processing a^n, the activation values are not directly latched onto by any hidden units. Instead, the last a input displaces the dynamics for b . . . B such that the last B state values are clustered as a function of the number of as. Furthermore, as indicated by the stable eigenvectors, the saddle-node point for the F_A system contracted the different coordinate points, one for a^n and one for a^n b^m B^m, to nearly the same location in phase-space, treating those points as having the same information. In other words, the displacement carries enough information so that the network can count the A input and predict the end of the string. In a solution that replicates the network, rather than add an extra dimension for latching, one would include recurrent weights that copy and scale the a count into the b . . . B counter (Rodriguez & Wiles, 1998).

3 An Unrestricted Palindrome
The unrestricted version of the deterministic palindrome language is the set of strings where w = (a ∨ b)^n and w^r = the reverse of w in which b = B, a = A, for example, babbaABBABE, where E is the end-of-string marker. Since there is a change in symbols in the middle of the string, the language is still deterministic, and the second half is fully predictable. Making all the right predictions requires that a counting mechanism not only store each count but also keep track of subsequences. For example, a possible solution is to interweave the counting for a and b so that, for example, F_a both counts a up and counts b down. Note that for n ≥ 1, w ranges over the set of all possible a, b sequences. In the framework of symbolic dynamics (Devaney, 1986; Barnsley, 1993), if an iterated function system uniquely encodes all possible sequences, then the trajectories consist of a dense set of points (or unique symbolic addresses) in the attractor. Therefore, one might expect that trajectories in an RNN will have a fractal encoding (see Barnsley, 1993, esp. pp. 167 and 382; for an application to neural networks, see Pollack, 1991; Kolen, 1994).

3.1 Training Set. A network that is trained on the unrestricted palindrome would have a hard time learning that the order of A and B is not random. Therefore, more strings were included in the training set that had unmixed subsequences of a or b, as in the restricted palindrome case, where unmixed is defined as cases in which a only precedes b, or b only precedes a. The training set consisted of unmixed sequences: w = a^n b^m, w^r = B^m A^n,
and w = b^n a^m, w^r = A^m B^n, for n ≤ 7, m ≤ 7, and n + m ≤ 10.5 The training set also included one copy each of all possible strings w = (a ∨ b)^k, w^r = the obvious reverse with a = A, b = B, where k ≤ 6. An end-of-string marker, E, was included in every string.

3.2 Results. Using five hidden units, performance on strings from the training set was only about 28% correct ± 11 with a learning parameter of .001. However, in a sample of 10 networks, performance could be further improved in 4 networks to about 60% correct with a learning parameter of .0001. These networks correctly processed mixed and unmixed strings of various lengths, whereas other networks seemed to learn only shorter mixed and unmixed strings. The best solution that was found out of 250 networks learned to process about 78% of the training set correctly.6 More revealing is that the network generalized to 39 out of 114 mixed sequences of length k = 7, and 45 out of 240 mixed sequences of length k = 8. Although this is far from perfect, one can still extract an idealized solution from the network.

3.3 Analysis. A time-series plot gives some insight into how processing is divided among hidden units. Figure 1 shows the time series of activation values for two mixed sequence strings that were correctly processed. Note that hidden units 2, 3, and 4 are nearly identical for both strings. In particular, hidden unit 3 is counting up in the same manner for both a and b input. In fact, the weights from the input nodes that code a and b are nearly identical. Also, hidden units 2 and 4 are counting down for the A and B input, and again the weights from the input nodes that code A and B are nearly identical. This implies that hidden units 1 and 5 are holding onto the information about the subsequences within the strings. Hidden unit 5 turns on for a input and turns off for b input, but that does not show how the mixed sequences are tracked. The processing of a and b is analyzed first.
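The expectation of a fractal encoding can be illustrated with a toy iterated function system, separate from the trained network itself: one affine contraction per symbol assigns every finite a, b sequence its own address in [0, 1], and the set of addresses becomes dense as the length grows.

```python
# Toy iterated function system (IFS) illustrating a fractal encoding of
# a/b sequences: each symbol applies one affine contraction, so every
# distinct sequence lands at a distinct address in [0, 1].
from itertools import product

MAPS = {"a": lambda x: 0.5 * x,        # F_a contracts toward 0
        "b": lambda x: 0.5 * x + 0.5}  # F_b contracts toward 1

def encode(seq, x0=0.0):
    """Apply the contraction for each symbol in turn."""
    x = x0
    for sym in seq:
        x = MAPS[sym](x)
    return x

# All 2^k sequences of length k get 2^k distinct addresses.
k = 7
addresses = {encode(s) for s in product("ab", repeat=k)}
assert len(addresses) == 2 ** k
```

Longer sequences fill in ever finer layers of addresses, which is the sense in which the coordinate points in a network solution can form a fractal encoding.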
Figure 2A shows the trajectories for a^7 and b^7 in hidden units 1 and 3. Hidden unit 3 is counting up, whereas hidden unit 1 indicates a or b. The eigenvalues around the fixed points for both F_a and F_b are approximately .5 and −.5, with corresponding eigenvectors that are in the directions of x1 and x3. In other words, hidden units 1 and 3 are keeping track of the a ∨ b subsequences. Figure 2B shows the set of coordinate points for all the trajectories w = (a ∨ b)^k for k ≤ 7, in which the last input was a. The points form layers according to the length of
5 The string frequencies were biased to include more strings with m = 0: 48 copies of strings of length n = 1, m = 0; 32 copies of n = 2, m = 0; 16 strings of n = 3, m = 0; 2 copies of n = 4, m = 0; and 1 copy each for 5 ≤ n ≤ 7, m = 0. For cases with m > 1, there was one copy each for 3 ≤ n + m ≤ 10 and 8 strings of n = 1, m = 1. 6 The network was trained for 2.3 million sweeps (each sweep is one letter) at a learning parameter of .001, then 1.0 million with .0001, then 1.3 million with .00001. In a different set of 50 networks, at least one network achieved about 76% and had similar dynamics.
[Figure 1 graphic: panels A and B, activation time series (steps 1–14) for hidden units HU1–HU5.]
Figure 1: A time-series plot gives some insight into how the processing of strings from the unrestricted palindrome language is divided among hidden units. Hidden unit activation values for two different trajectories are shown as a time-series plot: (A) aabbabbBBABBAA and (B) bbaabaaAABAABB. Notice that units 2, 3, and 4 are nearly identical for both strings.
the string, and each layer has 2^{k−1} points. As k increases, the points form a fractal encoding of all possible sequences, as suggested above. In Figure 2C the points are labeled for sequences of length k ≤ 4. The sequences that end with a b have a similar picture, except that hidden unit 5 is turned off.

[Figure 2 graphic: panels A–C, trajectories and coordinate points plotted against hidden units 1 and 3 (axes HU1, HU3), with the panel C points labeled by input sequence.]

Figure 2: For the unrestricted palindrome, the network solution keeps track of all w = (a ∨ b)^n sequences with a fractal encoding. (A) The trajectory in hidden units 1 and 3 for a^7 and b^7. Each trajectory increases in hidden unit 3 but approaches different fixed points that depend on hidden unit 1. (B) All coordinate points for all strings of length n ≤ 7 in which a is the last input. Notice that there is a fractal encoding, and the points are layered according to the number of inputs, where the layer of coordinate points for strings of length n = 7 is higher in hidden unit 3. (C) A subset of points in B for n ≤ 4, with points labeled according to the input sequence. As the number of inputs increases, the points are organized according to the first symbol input. Also, because of the oscillation as F_a and F_b approach fixed points, another symbol in a sequence sometimes indicates a coordinate point label to the left or the right of the previous label. For example, the point labeled baa is to the right of aa (higher in hidden unit 1), and the point labeled bbaa is to the left of baa.

Figure 3 shows that the processing for A and B is analogous to the processing for a and b. Figure 3A shows the trajectories for A^7 and B^7 in hidden units 2 and 5. Comparable to Figure 2A, hidden unit 2 is functionally counting A or B down (and decreasing in value), and hidden unit 5 indicates A or B. Figure 3B shows the coordinate points for all the trajectories w^r = (A ∨ B)^k for k ≤ 7, in which the last input was A. Again, there is a fractal encoding, and it is organized according to output predictions. Figure 3C labels some of the points.

Based on the analysis, one can replicate the network configuration of fixed points and contraction and expansion rates. As before, it is an idealization because the rates are exactly inversely proportional and the attracting fixed points are aligned with the saddle node's stable manifold.7 If one assigns dimensions 1 and 2 to count a or b and dimensions 3 and 4 to count A or B, the network solution can be represented by the following:

X_t = f( W_HH X_{t−1} + W_HI I_t ),  with

        [ .5   0   0   0 ]          [  .5   .5   −5    −5 ]
W_HH =  [  0  .5   0   0 ],  W_HI = [  .4   .1   −5    −5 ]
        [  2   0   2   0 ]          [ −5   −5    −1    −1 ]
        [  0   2   0   2 ]          [ −5   −5   −.8   −.2 ]
where I_t is coded as a = [1, 0, 0, 0]; b = [0, 1, 0, 0]; A = [0, 0, 1, 0]; and B = [0, 0, 0, 1]. A dimension for processing E is unspecified since it would be handled as in the simple CFL case. Notice that x1 will count up either a or b, but x2 is set up to have a different value at the fixed points for F_a and F_b (.8 and .2, respectively). Also, the saddle-node points for F_A and F_B are coordinated to have different values of x4 that correspond to the x2 fixed-point values. A sample trajectory is shown in appendix B. In the idealized system, the recurrent weights are coordinated in four counting weights and two copy weights in the hidden unit weight matrix. This may not seem especially hard in four dimensions, but note that there are no unspecified weights in W_HI. There are four input weights that are used to count the a or b combinations, W_HI(i, j), where i, j = 1, 2. If one added more symbols to the problem, then by induction the matrix would require O(l^2) coordinated weights, where l is the number of symbols in w, although the system would need only O(2l) dimensions.8 The weight complexity is problematic only to the extent that one demands perfect competence. Training an SRN with fewer hidden units, more English-like data, and without BPTT training produces a limited counting version of the idealized solution found here (Rodriguez & Elman, 1999).
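The idealized four-dimensional system can be checked numerically. The sketch below assumes the weight matrices as reconstructed in this section and the piecewise linear activation f(x) = min(1, max(0, x)); the initial state follows the appendix B trajectory, with x2 starting at .5. It reproduces the Table 2 trajectory for aaabbaABBAAA.

```python
# Idealized 4-d piecewise linear system for the unrestricted palindrome.
# Weights follow the reconstruction in the text; f(x) = min(1, max(0, x)).
W_HH = [[0.5, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.0],
        [2.0, 0.0, 2.0, 0.0],
        [0.0, 2.0, 0.0, 2.0]]
# Input columns ordered a, b, A, B.
W_HI = [[ 0.5,  0.5, -5.0, -5.0],
        [ 0.4,  0.1, -5.0, -5.0],
        [-5.0, -5.0, -1.0, -1.0],
        [-5.0, -5.0, -0.8, -0.2]]
SYMS = "abAB"

def f(x):
    return min(1.0, max(0.0, x))

def run(string, x0=(0.0, 0.5, 0.0, 0.0)):
    x = list(x0)
    for sym in string:
        j = SYMS.index(sym)
        x = [f(sum(W_HH[i][k] * x[k] for k in range(4)) + W_HI[i][j])
             for i in range(4)]
    return x

# After the first half, x1 and x2 carry the counts and the a/b context;
# after the full palindrome, x3 has counted down to 0 (end of string).
x = run("aaabbaABBAAA")
assert abs(x[2] - 0.0) < 1e-9 and abs(x[3] - 0.5) < 1e-9
```

Each intermediate state matches the corresponding column of Table 2 in appendix B, which is a useful consistency check on the reconstruction.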
7 The network solution did not actually have saddle-node points for F_A and F_B, but it had small trajectory steps for the first few A or B inputs, as if it were near a saddle node, as in the restricted palindrome solution. Therefore, saddle-node fixed points were added to the idealized solution to make the system work for unlimited-length strings, as discussed in section 2.3. 8 A hand-constructed solution based on my previous suggestion of interweaving counters would also have the same weight complexity.
[Figure 3 graphic: panels A–C, trajectories and coordinate points plotted against hidden units 2 and 5 (axes HU2, HU5), with the panel C points labeled by input sequence.]
Figure 3: The network solution keeps track of the second half of strings, w^r, analogous to Figure 2. (A) The trajectory in hidden units 2 and 5 for a^7 A^7 and b^7 B^7. Each trajectory decreases in hidden unit 2 but approaches different fixed points that depend on hidden unit 5. (B) All coordinate points for all w^r strings of length ≤ 7 in which A is the last input. There is a fractal encoding layered according to the number of inputs remaining, where the lowest values in hidden unit 2 indicate that the end of string is expected. The layers are not as well defined as in Figure 2B because the trajectories of w^r do not perfectly overlap (e.g., aaAA and aA have two different a values but three different A values). (C) A subset of the points from B for length ≤ 4, with some points labeled. Within the graph, the column of points on the right is reached after the first A, and the column on the left is reached after the last A.
Simple Recurrent Networks Learn Languages by Counting
2105
4 A Simple Cross-Serial Dependency CSL
Steijvers and Grunwald (1996) examined the CSL (ba^n)^m, n ≥ 1, m > 0, with an RNN. In this language, the end of the string is not predictable, but after the first set of (ba . . . a), the rest of the string is predictable. Abstractly, this language has some cross-serial dependency in that the number of a inputs at the beginning of the string must be matched across the ba . . . a subsequences. This dependency resembles a syntactic construction in Dutch, which is one of the few phenomena in any natural language in the class of CSLs (Christiansen & Chater, 1999). Although Steijvers and Grunwald did not report on training a network, they constructed a solution using second-order connections. They suggested that a first-order RNN could not learn this task because it seemed that the system needed multiplicative weights to reset the a counter after the first ba . . . a cycle. However, one can construct a first-order solution similar to their second-order solution by counting the first set of a inputs in one dimension, latching onto it in another dimension, and then altering the next set of a input trajectories.
10 It did not generalize to cases of n > 7; however, it will be easy to see how to construct an idealized solution that does generalize based on the network dynamics. 4.3 Analysis. The network solution turns out to be similar to that suggested by Steijvers and Grunwald (1996) in that after the rst ba . . . a, the network enters into a limit cycle. Figure 4 shows the limit cycles for n D 1, 2, 4, 7. Notice how the location of the limit cycle changes depending on n, though there is no latching onto the a count in any state variable. How does the network represent the a count so that a b can be predicted 9 More specically there was one copy each of string of length n D 1, 2; m · 10, n D 3, m · 7, n D 4, m · 6, n D 5, m · 5, n D 6, m · 4, n D 7, m · 4. 10 The network was trained for 1.29 million sweeps with a learning parameter of .001. The better-performing networks had similar dynamics.
2106
Paul Rodriguez A.
HU2
1
HU3
HU2
HU3 0
a2 0 HU1
1 4 4
1
(b a )
10
D.
1 7 4
(b a )
1
HU2 10
b2,3,4
HU3 0
b2,4
a1
0 HU1
C.
(b a )
10
b2
b1 0
2 4
1
HU2
b3,4
a2
10
B.
(b a1 ) 4
a3,4
b2,3,4
HU3 0
0
a4 HU1
1
0
a7 HU1
1
Figure 4: In the solution for the simple CSL there is a different limit cycle for each string. In phase-space the ba . . . a trajectories slowly settle into a limit cycle whose period and location change as a function of how many a are input. (A) Trajectory for ( ba ) 4 . (B) Trajectory for ( ba2 ) 4 . (C) Trajectory for ( ba4 ) 4 . (D) Trajectory for ( ba7 ) 4 . One must be careful not to call this a bifurcation since it is not necessarily the change of a parameter that changes the period of the limit cycle, but rather a different composite function. For example, the Fba function has a limit cycle of period 2, whereas Fbaa is a different function that happens to have a period 3 limit cycle.
after the first ba . . . a subsequence? Instead of latching, the network counts a up and down in each ba . . . a limit cycle. The a count-up value is used to regenerate the initial value for the a count-down counter. Figure 5A shows the set of coordinate points for the last trajectory step of the strings ba^n, for n = 1, . . . , 7. Hidden units 1 and 3 are both involved in counting a. Figure 5B shows that when a b is input, the a count information is still maintained by hidden unit 2. Figure 5C shows the set of last coordinate points for the continuation of each string, ba^n ba^n for n = 1, . . . , 7. Notice that hidden unit 3 still has information about how many a's are counted, but hidden unit 1
[Figure 5 graphic: panels A–C, last coordinate points for ba^n, ba^n b, and ba^n ba^n plotted against hidden units 1, 2, and 3 (axes HU1, HU2, HU3).]
Figure 5: The solution for the simple CSL uses two counters for a: one that counts up and one that counts down. (A) The last coordinate point of each trajectory for the string ba^n, for n = 1, . . . , 7. Hidden units 1 and 3 count a input. (B) The last coordinate point of each trajectory for the string ba^n b, for n = 1, . . . , 7. Hidden unit 2 copies the a count value. But in Figure 4 one can see that hidden unit 2 does not latch onto that value. (C) The last coordinate point of each trajectory for the string ba^n ba^n, for n = 1, . . . , 7. Notice that the hidden unit 1 value does not differentiate the points, suggesting that the system counted down to predict the next b. But hidden unit 3 still has the information about how many a were input.
values are all in the same region, less than .5, which enables a b prediction. In other words, the network dynamics always use hidden unit 3 functionally to count up the a input, and hidden unit 1 will count down the a input after the first set of ba^n. Based on the dynamics found in the network, one can code a piecewise linear system solution that counts a up and down in different dimensions for every ba . . . a subsequence. An extra state variable is used to copy the a count-up value for one time step upon receiving a b input. In the representation of the network solution below, x1 counts a up, x2 copies the a count, and x3 counts a down:

X_t = f( W_HH X_{t−1} + W_HI I_t ),  with

        [ .5   0   0 ]          [  .5   −5 ]
W_HH =  [  1   0   0 ],  W_HI = [ −5     0 ]
        [  0   2   2 ]          [ −1    −5 ]

where I_t is coded as a = [1, 0] and b = [0, 1].
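This idealized system can be simulated directly. The sketch below uses the weight values as reconstructed above (the damaged typesetting leaves a few entries uncertain, so they are a best guess), with the piecewise linear activation f(x) = min(1, max(0, x)):

```python
# Idealized 3-d system for the simple CSL (ba^n)^m: x1 counts a up,
# x2 copies the count on a b input, x3 regenerates a count-down.
# Weights follow the reconstruction above; f(x) = min(1, max(0, x)).
W_HH = [[0.5, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 2.0, 2.0]]
W_HI = [[ 0.5, -5.0],   # input columns ordered a, b
        [-5.0,  0.0],
        [-1.0, -5.0]]
SYMS = "ab"

def f(x):
    return min(1.0, max(0.0, x))

def run(string, x0=(0.0, 0.0, 0.0)):
    x = list(x0)
    for sym in string:
        j = SYMS.index(sym)
        x = [f(sum(W_HH[i][k] * x[k] for k in range(3)) + W_HI[i][j])
             for i in range(3)]
    return x

# The b input moves the a count from x1 into x2; x3 then counts the next
# run of a input back down to 0, at which point a b can be predicted.
assert run("baaab")[1] == 0.875
assert run("baaabaaa")[2] == 0.0
```

Running it shows the copy-and-regenerate behavior without any persistent latching: x2 holds the count for only as long as the next a run needs it.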
Notice that the weights from the b input (column 2 in W_HI) will turn off x1 and x3 but allow x2 to copy the x1 value. The solution requires four weights coded with precision in W_HH and three in W_HI, which is less than that required for the unrestricted palindrome. A sample trajectory is presented in appendix C. It is worth pointing out that in languages with cross-serial dependencies that have combinations of symbols, one would need counters that are more queue-like (Moore, 1998) so that new input is stored in the least significant digits.

5 A Buffer Language
An important language in computational linguistics and compiler theory is one that requires a small buffer to determine the phrase structure rule (Marcus, 1980; Fischer & LeBlanc, 1988). For example, the language {a^n b^n ∪ a^n (bbbc)^n} requires a buffer of three for a push-down automaton (PDA) to choose deterministically whether to unstack the a's while processing b input. A counting solution for the language could store the a or c count across the bbb subsequence.

5.1 Training Set. The strings in the training set consisted of a maximum of n = 8 for a^n (bbbc)^n E, a maximum of n = 10 for a^n b^n E, and no cases of a^n b^n for n ≤ 3.11

5.2 Results. Using five hidden units, the average performance was only 20% (about 3 out of 15) ± 16 for the training strings a^n b^n, n ≥ 4, and a^n (bbbc)^n,

11 More specifically, for strings a^n (bbbc)^n, there was 1 copy each of lengths n = 4, . . . , 8; 2 copies of n = 3; 8 copies of n = 2; and 16 copies of n = 1. For strings a^n b^n, there were 3 copies each of lengths n = 7, . . . , 10; 6 copies of n = 6; 8 copies of n = 5; and 16 copies of n = 4.
n ≤ 8, using a learning parameter of .001. In a sample of 10 networks, with further training using .0001, performance could be improved to 30% to 50% on the training set for five networks. For example, one network predicted the end of string correctly only for the a^n b^n strings. The best network in a sample of 50 that learned some of both string types processed 6 out of 15 strings correctly and made small errors only in predicting the end character (e.g., > .4 but < .5) on three other strings.12 The network correctly processed both a^4 b^4 and a^4 (bbbc)^4. Although the network did not generalize beyond the training set, it is still possible to evaluate the counting mechanism for correctly processed strings.

5.3 Analysis. The network solution uses a displacement of the b trajectory to carry information about the a count. Figure 6 shows the trajectories for correctly processed strings a^n (bbbc)^n for n = 1, 2, 4, 8. The system is almost in a limit cycle for the bbbc input, except that hidden unit 5 is slowly changing. Notice especially in Figure 6D that the c trajectory steps are functionally counting down in hidden unit 5. Inspecting the hidden unit values also shows that b is actually being counted down in some dimensions and counted up to 3 in another. The presence of two counters suggests that rather than storing the a count, the network counts b down to match the a input and counts b up to 3 for the (bbbc) subsequences. The value of the a count displaces the b count-up trajectory so that the coordinate point of b^3 can carry information about the number of a's. In turn, the F_c system can use the information in the coordinate point of b^3 to count down. This solution is similar to the restricted palindrome case in that the system is sensitive to the displacement of a counting trajectory.

6 Jump Context-Free Language
There are some deterministic CFLs that require a jump PDA in order to be processed in real time (Rozenberg & Salomaa, 1997). For example, the language that consists of the set of strings {a^n b* A^n ∪ a* b^n B^n} requires that the a count match the A count or the b count match the B count. But the b input is irrelevant for the a . . . A sequence. A stack machine must have the ability to pop (jump over) or erase all the b items if it turns out that it needs to count down A. However, this is easily coded in a piecewise linear system by using some input weight to turn off (set to 0) the state variable that has the b count information.
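The erase operation needs only one coordinated weight. The sketch below is illustrative (the weights are hypothetical, not the trained network's): a b up-counter whose A input weight of −5 drives it to 0 in a single step, which is the piecewise linear analogue of popping all the b items at once.

```python
# Minimal illustration of the erase ("jump") operation in a piecewise
# linear system: a -5 input weight sets the b-count variable to 0 in one
# step. The weights are illustrative, not the trained network's.
def f(x):
    return min(1.0, max(0.0, x))

def step(x_b, sym):
    # x_b: the b up-counter. A b input counts up toward 1; the first A
    # input carries a large negative weight that erases the count.
    w_in = {"b": 0.5, "A": -5.0}[sym]
    return f(0.5 * x_b + w_in)

x = 0.0
for s in "bbbb":          # store a b count
    x = step(x, s)
assert x == 0.9375
x = step(x, "A")          # the A input jumps over all the b items
assert x == 0.0
```

Whatever b count has accumulated, the saturating nonlinearity guarantees a single A input resets it exactly to 0, so arbitrarily many b's can be "popped" in one step.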
12 The network was trained for 1.34 million sweeps with a learning parameter of .001. Other networks that processed some a^n (bbbc)^n strings had similar dynamics.
[Figure 6 graphic: panels A–D, trajectories for a^n (bbbc)^n plotted against hidden units 1 and 5 (axes HU1, HU5), with a, b, and c trajectory steps labeled.]
Figure 6: The solution for the buffer CFL uses a displacement of the bbbc trajectory. The graphs show the trajectories for a^n (bbbc)^n, n = 1, 2, 4, 8, for which the network made all correct predictions. By inspection, hidden unit 1 counts a, and the network activations are almost periodic for bbbc except that hidden unit 5 changes across cycles. Therefore, the trajectory is plotted in units 1 and 5. (A) Trajectory for a^n (bbbc)^n, n = 1. (B) Trajectory for a^n (bbbc)^n, n = 2. (C) Trajectory for a^n (bbbc)^n, n = 4. Only the last a trajectory step and the c trajectory step are marked. (D) Trajectory for a^n (bbbc)^n, n = 8. Notice that the entire cycle is slowly shifting such that the last c trajectory step ends near the same region of space, which allows the network to predict an end of string.
6.1 Training Set. The training set consisted of strings from the jump CFL = {a^n b^m A^n E ∪ a^n b^m B^m E}, where n, m > 0, n ≤ 7, m ≤ 7, and n + m ≤ 11.13 13 More specifically, not all strings of n + m ≤ 11 were presented. For strings a^n b^m A^n, there was one copy of each string, except for 0 copies of n = 7, m ≥ 2; n = 6, m ≥ 4; n = 5, m = 6; n ≤ 3, m ≥ 8; 3 copies of n = 4, m = 1; 5 copies of n = 3, m = 1; 10 copies of n = 2, m = 1; 5 copies of n = 1, m = 2; and 10 copies of n = 1, m = 1. For strings a^n b^m B^m, there was one copy of each string, except 0 copies of n = 6, m = 5; n = 5, m = 6; n = 4, m ≥ 6; n = 3, m ≥ 7; n = 1, m ≥ 8; 5 copies of n = 2, m = 1; 3 copies of n = 1, m = 4; 5 copies of n = 1, m = 3; 10 copies of n = 1, m = 2; and 10 copies of n = 1, m = 1.
Also included was the set of strings a^n A^n, b^m B^m, duplicated from the simplest CFL case, which does not change the fact that the language is a jump CFL.

6.2 Results. Using five hidden units, the average performance was 48% ± 17 for the 70 training strings where n, m > 0 and 44% ± 22 for the 22 strings where n = 0 or m = 0. Most networks performed better on strings of the form
a^n b^m B^m. In a sample of 10 networks, 3 had about 65% performance, which included more strings of the form a^n b^m A^n. The best network in a sample of 50 correctly processed 88% (62 out of 70) of training strings where n, m > 0, and 55% (12 out of 22) for n = 0 or m = 0.14 It generalized to strings such as a^n b^m A^n for n = 6, m = 5, and a^n b^m B^m for n = 1, m = 8, 9 and n = 4, m = 8.

6.3 Analysis. In contrast to the previous language tasks, the solution for the jump CFL implements an explicit latching mechanism. Figure 7A shows the last coordinate points for the set of sequences a^n b^m, n = 1, . . . , 6, m = 1, . . . , 6. Notice that the points a^1 b^1 to a^1 b^6 are changing along hidden unit 1, and the points a^1 b^1 to a^6 b^1 are changing along hidden unit 4. In other words, hidden unit 1 is counting the number of b inputs, whereas hidden unit 4 is storing the information about the number of a inputs. One can also see that the bands of information are overlapping for n = 4, 5, 6, which is where most of the errors occur. Figure 7B includes an A input. It shows the last coordinate points for the set of sequences a^n b^m A^1, n = 1, . . . , 6, m = 1, . . . , 6. Notice that the points are grouped according to n and that information about the number of b inputs has been collapsed. Figure 7C includes a B input. It shows the last coordinate points for the set of sequences a^n b^m B^1, n = 1, . . . , 6, m = 1, . . . , 6. Notice that the points are grouped according to m and that information about the number of a inputs has been collapsed. How does the network achieve the more explicit storage mechanism for the a count in hidden unit 4? Figure 8 shows how hidden unit 4 latches onto activation values. The graph shows activation changes in one time step, holding all other state variables constant. The location where the graph crosses the line x_{4,t+1} = x_{4,t} gives the fixed point at that one time step only. Notice that the location of the fixed point depends on the number of a inputs.
The system is not a perfect latching mechanism. In fact, there is only one fixed point for the F_b system, which means all the bands in Figure 7A eventually converge.
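The one-step fixed-point picture of Figure 8 can be reproduced with a single sigmoid unit: freeze the rest of the state at two different contexts, which amounts to two different net-input biases, and iterate the unit's update. The self-weight and biases below are illustrative, not values from the trained network.

```python
# A single sigmoid unit latches by settling onto a context-dependent
# fixed point: x <- sigmoid(w*x + bias), where the bias summarizes the
# frozen contribution of the other (held-constant) state variables.
# The weight and biases here are illustrative, not the trained network's.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fixed_point(self_w, bias, x=0.5, iters=200):
    """Iterate the one-step map to (approximate) convergence."""
    for _ in range(iters):
        x = sigmoid(self_w * x + bias)
    return x

# Two frozen contexts (by analogy, the states after a^2 b^3 versus
# a^6 b^3) give two different biases, hence two different fixed points.
low = fixed_point(self_w=4.0, bias=-3.5)   # context with fewer a's
high = fixed_point(self_w=4.0, bias=-1.0)  # context with more a's
assert low < high
```

The latch is only as good as the frozen context: if the other variables drift (as they do under repeated b input), the bias changes and the fixed point moves, which is why the bands in Figure 7A eventually converge.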
14 The network was trained for 1.4 million sweeps with a learning parameter of .001. Several of the better-performing networks turned out to have similar dynamics.
[Figure 7 graphic: panels A–C, last coordinate points plotted in pairs of hidden units (axes HU1, HU2, HU3, HU4), with corner points labeled a^1 b^1, a^6 b^1, a^1 b^6, and a^6 b^6.]
Figure 7: The solution for the jump CFL is a latching mechanism in hidden unit 4. (A) The last coordinate points of the trajectories for the set of sequences a^n b^m, n = 1, . . . , 6, m = 1, . . . , 6. The points a^1 b^1 to a^1 b^6 are changing along hidden unit 1, and the points a^1 b^1 to a^6 b^1 are changing along hidden unit 4. (B) The last coordinate points of the trajectories for the set of sequences a^n b^m A^1, n = 1, . . . , 6, m = 1, . . . , 6. Notice that the points are grouped according to n and that information about the number of b inputs is collapsed in space. (C) The set of last coordinate points of the trajectories for the set of sequences a^n b^m B^1, n = 1, . . . , 6, m = 1, . . . , 6. The points are grouped according to m, and information about the number of a inputs is collapsed.
[Figure 8 graphic: panels A and B, the one-step map of hidden unit 4 (x_{4,t} versus x_{4,t−1}) with the identity line drawn; the crossing is near .14 in panel A and near .67 in panel B.]
Figure 8: The latching mechanism is a fixed point determined by the a count. The sigmoid activation function gives x_{4,t} (hidden unit 4) as a function of x_{4,t−1}, with the line x_{4,t} = x_{4,t−1} drawn. All other hidden unit values were set according to the state value after a^n b^3, n = 2 or n = 6. (A) Using the state value after a^2 b^3, the next value of x4 is almost fixed at .14. (B) Using the state value after a^6 b^3, the next value of x4 is also almost fixed, at .67, which is a different value than in A.
7 Conclusion
The work here has shown that if an SRN is trained with deterministic CFLs or CSLs in a prediction task and sufficient training is provided, it can learn many strings from the training set and to some degree generalize to longer strings. Average performance can vary widely within a language, but it can be improved far above average in many networks by continued training with a lower learning parameter. Prior knowledge of the language was used to guide the evaluation and help reconstruct the solutions in piecewise linear system equations. The solutions are an implementation of interdependent counters. For one language, the solution used a symbol-like latching mechanism to keep track of dependencies across counters. For several languages, the solution used symbol-sensitive dynamics that are systematically altered as a function of context. The alterations, although slight, have real
effects in producing output at appropriate time steps. Here is a summary of the findings:

For the simplest CFL, a^n b^n, the solution was an up counter for a and a corresponding down counter for b.

For a restricted palindrome, a^n b^m B^m A^n, the solution was an embedded counter for b^m B^m that stored information about the number of a's as a systematic displacement of the trajectory. The counter for A^n used that displacement to count down.

In the unrestricted palindrome case, w w^r, where w = (a ∨ b)^n, the solution used a fractal encoding that counted up a ∨ b in one dimension while also counting a up and b down in another dimension. Counting the second half of the string used a corresponding fractal encoding.

For a simple CSL, (ba^n)^m, the solution was an up and down counter for a input so that the a count could be regenerated for each ba . . . a subsequence.

For a buffer CFL, a^n b^n ∪ a^n (bbbc)^n, the solution was an up counter for a, a corresponding down counter for b, and a cycle for bbbc that was displaced by the a count so that the system could count down (bbbc)^n.

For a jump CFL, a^n b^m A^n ∪ a^n b^m B^m, the solution latched onto the a count in one of the hidden units.

For an RNN, how training conditions determine the nature of a solution is an open issue. But it seems that the reachable types of solutions, at least for the CFLs and CSLs studied here, are typically built on top of counting trajectories. As the network learns context dependencies, it can build trajectories that count, and as a network builds trajectories that count, it is also learning context dependencies. In other words, if upon some input the network does not quickly move to a fixed point, then there is potentially some informational content within the trajectory. In fact, for all the network solutions, the input trajectory for the end symbol, E, moved to one relatively small region of space, since there were no dependencies between strings.
In conclusion, SRNs can learn to represent temporal dependencies of deterministic CFLs and CSLs by counting. But due to precision demands and the complexity of the network dynamics, performance is inherently limited. With respect to performance, an SRN can be seen as a limited version of interdependent counters and hence legitimately serve as an alternative model of computation for language processing. This point is sometimes missed by cognitive scientists who think of SRNs only as finite state machines (e.g., Steedman, 1999).
Table 1: Sequence of State Variables for aaabbbE in an Idealized Solution of a Trained Network.

Variable   t0     a      a      a      b     b    b    E
x1          0    .5    .75   .875     0     0    0    0
x2          0     0      0      0   .75    .5    0    0
x3          0     0      0      0     1     1    1    0
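The Table 1 trajectory can be generated by a three-dimensional piecewise linear system. The weights below are a reconstruction chosen here so that the update reproduces the table exactly; they are not necessarily the weights reported for the section 2 solution. As before, f(x) = min(1, max(0, x)): x1 counts a up, x2 counts b down from the a count, and x3 flags the second half of the string.

```python
# A 3-d piecewise linear system whose trajectory on aaabbbE matches
# Table 1. Weights are a reconstruction chosen to fit the table.
W_HH = [[0.5, 0.0, 0.0],
        [2.0, 2.0, 0.0],
        [0.0, 0.0, 2.0]]
W_HI = [[ 0.5, -5.0, -5.0],   # input columns ordered a, b, E
        [-5.0, -1.0, -5.0],
        [-5.0,  5.0, -5.0]]
SYMS = "abE"

def f(x):
    return min(1.0, max(0.0, x))

def run(string, x0=(0.0, 0.0, 0.0)):
    states = [list(x0)]
    x = list(x0)
    for sym in string:
        j = SYMS.index(sym)
        x = [f(sum(W_HH[i][k] * x[k] for k in range(3)) + W_HI[i][j])
             for i in range(3)]
        states.append(x)
    return states

# The state sequence reproduces Table 1 column by column.
table1 = [[0, 0, 0], [0.5, 0, 0], [0.75, 0, 0], [0.875, 0, 0],
          [0, 0.75, 1], [0, 0.5, 1], [0, 0, 1], [0, 0, 0]]
assert run("aaabbbE") == table1
```

All values here are dyadic rationals, so the comparison is exact in floating point.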
Table 2: Sequence of State Variables for aaabbaABBAAA in an Idealized Solution of a Trained Network.

    Variable   t0    a     a     a      b       b        a         A         B       B      A     A    A
    x1         0    .5    .75  .875   .9375   .96875  .984375      0         0       0      0     0    0
    x2         0    .5    .65  .725   .7625   .48125  .340625   .5703125     0       0      0     0    0
    x3         0     0     0     0     0       0        0          0       .96875  .9375  .875   .5    0
    x4         0     0     0     0     0       0        0       .340625   .48125  .7625   .725  .65   .2
Appendix A: Simplest CFL
Table 1 shows a sample trajectory of hidden units in the idealized network solution for the simple CFL a^n b^n E, where E is the end marker. Notice that x3 indicates the second half of the string. Also, one could add an output layer so that the system predicts a when x1 ≥ .5, predicts b when x2 ≥ .5, and predicts E when x2 < .5 and x3 ≥ 1. These are also easily coded in simple threshold equations for the output nodes. For example, the condition that x2 < .5 and x3 ≥ 1 leads to an equation such as x3 − x2 ≥ .5. With some scaling coefficients, the piecewise linear equation could be something like y = f(.75x3 − 1.25x2 − .5), where y is the output node for predicting an E.

Appendix B: Unrestricted Palindrome
Table 2 shows a sample trajectory of hidden units in the idealized network solution for the unrestricted palindrome CFL. The state variable for the end of string, E, is not included since it would be similar to the simple CFL case. Assuming there were an extra dimension that indicated the second half of the string, as in the simple CFL case, output nodes could easily be added that would make the correct predictions at the correct time step. The last half of the string is able to make the correct predictions if one stipulates that x4 < .5 is prediction B, x4 > .5 is prediction A, and x3 < .5 and x4 < .5 is prediction E.
Table 3: State Variables for the Input Sequence baaabaaabaaa in an Idealized Solution of a Trained Network.

    Variable   t0    b     a     a      a      b      a    a     a      b      a    a     a
    x1         0     0    .5    .75   .875     0     .5   .75  .875     0     .5   .75  .875
    x2         0     0     0     0      0    .875     0    0     0    .875     0    0     0
    x3         1     1     1     1      1      0    .75   .5     0      0    .75   .5     0
Appendix C: Simple CSL
Table 3 shows a sample trajectory of hidden units in the idealized network solution for the CSL case. This solution starts out with x3 = 1 as a repelling fixed point, which could be included as part of the end marker, E, reset function. More correctly, one could add another dimension to handle E processing, since an E not only resets the system but always predicts a b. In other words, another state variable could help distinguish between an x3 that counts down to predict b and a reset value that predicts b. The advantage of the repelling fixed point is that the system does not try to count down until after the first set of ba...a is counted up; otherwise, it would make unwarranted b predictions in the first ba...a subsequence. In this solution, it is much easier to draw a decision plane for the predict-b output unit than in the hand-constructed solution of Steijvers and Grunwald (1996), since the last a gives a bigger trajectory step in the x3 variable (e.g., predict b if x1 > .5 and x3 < .5).

Acknowledgments
I thank Jeff Elman, Janet Wiles, and Gary Cottrell for helpful discussions and three anonymous reviewers for helpful comments. The research was supported by a UC President's Dissertation Fellowship.

References

Amit, D. (1988). Neural networks counting chimes. Proceedings of the National Academy of Sciences, USA, 85, 2141–2145.
Angluin, D. (1982). Inference of reversible languages. Journal of the ACM, 29(3), 741–765.
Barnsley, M. (1993). Fractals everywhere. Orlando, FL: Academic Press.
Blair, A., & Pollack, J. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5), 1127–1142.
Blum, L., Shub, M., & Smale, S. (1989). On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions, and universal machines. Bulletin of the American Mathematical Society, 21(1).
Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Christiansen, M., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2), 157–205.
Cleeremans, A. (1993). Mechanisms of implicit learning. Cambridge, MA: MIT Press.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.
Devaney, R. (1986). Introduction to chaotic dynamical systems. Menlo Park, CA: Benjamin/Cummings.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–225.
Fischer, C. N., & LeBlanc, R. J. (1988). Crafting a compiler. Menlo Park, CA: Benjamin/Cummings.
Giles, C., Sun, G., Chen, H., Lee, Y., & Chen, D. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing, 4. San Mateo, CA: Morgan Kaufmann.
Hoelldobler, S., Kalinke, Y., & Lehmann, H. (1997). Designing a counter: Another case study of dynamics and activation landscapes in recurrent neural networks. In Proceedings of the KI 97 Advances in Artificial Intelligence (pp. 313–324). Berlin: Springer-Verlag.
Kilian, J., & Siegelmann, H. (1996). The dynamic universality of sigmoidal neural networks. Information and Computation, 128(1), 48–56.
Kolen, J. (1994). Exploring the computational capabilities of recurrent neural networks. Unpublished doctoral dissertation, Ohio State University.
Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307, 161–163.
Kwasny, S., & Kalman, B. (1995). Tail-recursive distributed representations and simple recurrent networks. Connection Science, 7, 61–80.
Maas, W., & Orponen, P. (1997). On the effect of analog noise in discrete-time analog computations. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Marcus, M. (1980). A theory of syntactic recognition for natural language. Cambridge, MA: MIT Press.
McClelland, J. (1987). The case for interactionism in language processing. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading. Hillsdale, NJ: Erlbaum.
Moore, C. (1998). Dynamical recognizers: Real-time language recognition by analog computation. Theoretical Computer Science, 201(1), 99–136.
Pollack, J. (1987). On connectionist models of natural language processing. Unpublished doctoral dissertation, University of Illinois, Urbana.
Pollack, J. (1990). Recursive auto associative memory. Artificial Intelligence, 46, 77–105.
Pollack, J. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.
Rodriguez, P., & Elman, J. (1999). Watching the transients: Viewing a simple recurrent network as a limited counter. Behaviormetrika (Special Issue), 26(1), 51–74.
Rodriguez, P., & Wiles, J. (1998). A recurrent neural network can learn to implement symbol-sensitive counting. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing, 10. Cambridge, MA: MIT Press.
Rodriguez, P., Wiles, J., & Elman, J. (1999). A recurrent neural network that learns to count. Connection Science, 11(1).
Rozenberg, G., & Salomaa, A. (1997). Handbook of formal language theory. Berlin: Springer-Verlag.
Rumelhart, D. E., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 2). Cambridge, MA: MIT Press.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, State University of New Jersey, New Brunswick, NJ.
Siegelmann, H., & Sontag, E. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50(1), 132–150.
Steedman, M. (1999). Connectionist sentence processing in perspective. Cognitive Science, 23(4), 615–634.
Steijvers, M., & Grunwald, P. (1996). A recurrent neural network that performs a context-sensitive prediction task. In Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Tabor, W., Juliano, C., & Tanenhaus, M. (1996). A dynamical system for language processing. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Tabor, W., & Tanenhaus, M. (1999). Dynamical models of sentence processing. Cognitive Science, 23(4).
Tani, J., & Fukumura, N. (1995). Embedding a grammatical description in deterministic chaos: An experiment in recurrent neural learning. Biological Cybernetics, 72(4), 365–370.
Tino, P., Horne, B., Giles, C., & Collingwood, P. (1998). Finite state machines and recurrent neural networks—automata and dynamical systems approaches. In J. Dayhoff & O. Omidvar (Eds.), Neural networks and pattern recognition. San Diego, CA: Academic Press.

Received September 23, 1999; accepted November 3, 2000.
LETTER
Communicated by Bernhard Schölkopf
Training ν-Support Vector Classifiers: Theory and Algorithms

Chih-Chung Chang
Chih-Jen Lin
Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

The ν-support vector machine (ν-SVM) for classification proposed by Schölkopf, Smola, Williamson, and Bartlett (2000) has the advantage of using a parameter ν to control the number of support vectors. In this article, we investigate the relation between ν-SVM and C-SVM in detail. We show that in general they are two different problems with the same optimal solution set. Hence, we may expect that many numerical aspects of solving them are similar. However, compared to regular C-SVM, the formulation of ν-SVM is more complicated, so up to now there have been no effective methods for solving large-scale ν-SVM. We propose a decomposition method for ν-SVM that is competitive with existing methods for C-SVM. We also discuss the behavior of ν-SVM by some numerical experiments.

1 Introduction
The ν-support vector classification (Schölkopf, Smola, & Williamson, 1999; Schölkopf, Smola, Williamson, & Bartlett, 2000) is a new class of support vector machines (SVM). Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes and a vector y ∈ R^l such that y_i ∈ {1, −1}, they consider the following primal problem:

    (P_ν)   min  (1/2) w^T w − νρ + (1/l) Σ_{i=1}^{l} ξ_i
            s.t. y_i (w^T φ(x_i) + b) ≥ ρ − ξ_i,
                 ξ_i ≥ 0, i = 1, ..., l,  ρ ≥ 0.                  (1.1)

Here 0 ≤ ν ≤ 1, and the training vectors x_i are mapped into a higher- (maybe infinite) dimensional space by the function φ. This formulation is different from the original C-SVM (Vapnik, 1998):

    (P_C)   min  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
            s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
                 ξ_i ≥ 0, i = 1, ..., l.                          (1.2)
Neural Computation 13, 2119–2147 (2001)  © 2001 Massachusetts Institute of Technology
In equation 1.2, a parameter C is used to penalize the variables ξ_i. As it is difficult to select an appropriate C, in P_ν Schölkopf et al. (2000) introduce a new parameter ν, which lets one control the number of support vectors and errors. To be more precise, they proved that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. In addition, with probability 1, asymptotically, ν equals both fractions. Although P_ν has such an advantage, its dual is more complicated than the dual of P_C:

    (D_ν)   min  (1/2) α^T Q α
            s.t. y^T α = 0,  e^T α ≥ ν,
                 0 ≤ α_i ≤ 1/l, i = 1, ..., l,                    (1.3)

where e is the vector of all ones, Q is a positive semidefinite matrix, Q_ij ≡ y_i y_j K(x_i, x_j), and K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is the kernel. Remember that the dual of P_C is as follows:

    (D_C)   min  (1/2) α^T Q α − e^T α
            s.t. y^T α = 0,
                 0 ≤ α_i ≤ C, i = 1, ..., l.

Therefore, it can be clearly seen that D_ν has one more inequality constraint. We are interested in the relation between D_ν and D_C. Though this issue has been studied in Schölkopf et al. (2000, Proposition 13), we investigate the relation in more detail in section 2. The main result, theorem 5, shows that solving them is like solving two different problems with the same optimal solution set. In addition, the increase of C in C-SVM is like the decrease of ν in ν-SVM. Based on the work in section 2, in section 3 we derive the formulation of ν as a decreasing function of C. Due to the density of Q, traditional optimization algorithms such as Newton and quasi-Newton cannot be directly applied to solve D_C or D_ν. Currently major methods of solving large D_C (for example, decomposition methods (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998; Saunders et al., 1998) and the method of nearest points (Keerthi, Shevade, & Murthy, 2000)) use the simple structure of the constraints. Because of the additional inequality, these methods cannot be directly used for solving D_ν. Up to now, there have been no implementation methods for large-scale ν-SVM. In section 4, we propose a decomposition method similar to the software SVMlight (Joachims, 1998) for C-SVM. Section 5 presents numerical results. Experiments indicate that several numerical properties of solving D_C and D_ν are similar. A timing comparison shows that the proposed method for ν-SVM is competitive with existing methods for C-SVM. Finally, section 6 gives a discussion and conclusion.
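The claim that increasing C acts like decreasing ν can be previewed numerically. The sketch below solves the scaled C-SVM dual for several values of C and checks that e^T α_C does not increase with C. It is an illustration only: random 2-D data, a linear kernel, and SciPy's general-purpose SLSQP solver, not the decomposition method proposed in section 4.

```python
# Toy numerical check (linear kernel): e^T alpha_C from the scaled dual
#   min (1/2) a^T Q a - e^T a / (C l)   s.t. y^T a = 0, 0 <= a_i <= 1/l
# is non-increasing in C.  Data and solver choice are illustrative only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
l = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j K(x_i, x_j)

def solve_scaled_dual(C):
    fun = lambda a: 0.5 * a @ Q @ a - a.sum() / (C * l)
    cons = [{"type": "eq", "fun": lambda a: y @ a}]
    res = minimize(fun, np.full(l, 0.5 / l), bounds=[(0.0, 1.0 / l)] * l,
                   constraints=cons)
    return res.x

vals = [solve_scaled_dual(C).sum() for C in (0.01, 0.1, 1.0, 10.0)]
print(vals)  # e^T alpha_C for increasing C: non-increasing
assert all(vals[i] >= vals[i + 1] - 1e-4 for i in range(len(vals) - 1))
```

Each value of e^T α_C is a ν that the corresponding ν-SVM would reproduce, which is exactly the correspondence developed in section 2.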
2 The Relation Between ν-SVM and C-SVM
In this section we construct a relationship between D_ν and D_C; the main result is in theorem 5. The relation between D_C and D_ν has been discussed by Schölkopf et al. (2000, Proposition 13), who show that if P_ν leads to ρ > 0, then P_C with C = 1/(ρl) leads to the same decision function. Here we provide a more complete investigation. In this section we first try to simplify D_ν by showing that the inequality e^T α ≥ ν can be treated as an equality:

Theorem 1.  Let 0 ≤ ν ≤ 1. If D_ν is feasible, there is at least one optimal solution of D_ν that satisfies e^T α = ν. In addition, if the objective value of D_ν is not zero, all optimal solutions of D_ν satisfy e^T α = ν.
Proof.  Since the feasible region of D_ν is bounded, if it is feasible, D_ν has at least one optimal solution. Assume D_ν has an optimal solution α such that e^T α > ν. Since e^T α > ν ≥ 0, by defining

    ᾱ ≡ (ν / (e^T α)) α,

ᾱ is feasible to D_ν and e^T ᾱ = ν. Since α is an optimal solution of D_ν, with e^T α > ν,

    α^T Q α ≤ ᾱ^T Q ᾱ = (ν / (e^T α))^2 α^T Q α ≤ α^T Q α.    (2.1)
Thus ᾱ is an optimal solution of D_ν, and α^T Q α = 0. This also implies that if the objective value of D_ν is not zero, all optimal solutions of D_ν satisfy e^T α = ν.

Therefore, in general e^T α ≥ ν in D_ν can be written as e^T α = ν. Schölkopf et al. (2000) noted that practically one can alternatively work with e^T α ≥ ν as an equality constraint. From the primal side, it was first shown by Crisp and Burges (1999) that ρ ≥ 0 in P_ν is redundant. Without ρ ≥ 0, the dual becomes:

    min  (1/2) α^T Q α
    s.t. y^T α = 0,  e^T α = ν,
         0 ≤ α_i ≤ 1/l, i = 1, ..., l.                    (2.2)
Therefore, the equality is naturally obtained. Note that this is an example where two problems have the same optimal solution set but are associated with two duals that have different optimal solution sets. Here the primal problem,
which has more restrictions, is related to a dual with a larger feasible region. For our later analysis, we keep on using D_ν and not equation 2.2. Interestingly, we will see that the exceptional situation where D_ν has optimal solutions such that e^T α > ν happens only for those ν that we are not interested in.

Due to the additional inequality, the feasibility of D_ν and D_C is different. For D_C, 0 is a trivial feasible point, but D_ν may be infeasible. An example where P_ν is unbounded below and D_ν is infeasible is as follows: Given three training data with y_1 = y_2 = 1 and y_3 = −1, if ν = 0.9, there is no α in D_ν that satisfies 0 ≤ α_i ≤ 1/3, [1, 1, −1] α = 0, and e^T α ≥ 0.9. Hence D_ν is infeasible. When this happens, we can choose w = 0, ξ_1 = ξ_2 = 0, b = ρ, ξ_3 = 2ρ as a feasible solution of P_ν. Then the objective value is −0.9ρ + 2ρ/3, which goes to −∞ as ρ → ∞. Therefore, P_ν is unbounded. We then describe a lemma that was first proved in Crisp and Burges (1999).

Lemma 1.  D_ν is feasible if and only if ν ≤ ν_max, where

    ν_max ≡ 2 min(#y_i = 1, #y_i = −1) / l,

and (#y_i = 1) and (#y_i = −1) denote the number of elements in the first and second classes, respectively.

Proof.  Since 0 ≤ α_i ≤ 1/l, i = 1, ..., l, with y^T α = 0, for any α feasible to D_ν we have e^T α ≤ ν_max. Therefore, if D_ν is feasible, ν ≤ ν_max. On the other hand, if 0 < ν ≤ ν_max, then min(#y_i = 1, #y_i = −1) > 0, so we can define a feasible solution of D_ν:
    α_j = ν / (2 (#y_i = 1))     if y_j = 1,
    α_j = ν / (2 (#y_i = −1))    if y_j = −1.

This α satisfies 0 ≤ α_i ≤ 1/l, i = 1, ..., l, and y^T α = 0. If ν = 0, clearly α = 0 is a feasible solution of D_ν.

Note that the size of ν_max depends on how balanced the training set is. If the numbers of positive and negative examples match, then ν_max = 1. We then note that if C > 0, by dividing each variable by Cl, D_C is equivalent to the following problem:

    (D'_C)  min  (1/2) α^T Q α − e^T α / (Cl)
            s.t. y^T α = 0,
                 0 ≤ α_i ≤ 1/l, i = 1, ..., l.
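Before comparing D'_C with D_ν, lemma 1's feasibility threshold is worth stating in code: it depends only on the class balance of the labels. A minimal sketch (label convention y_i ∈ {1, −1} assumed):

```python
# Lemma 1 in code: D_nu is feasible iff nu <= nu_max, where nu_max depends
# only on how balanced the two classes are.

def nu_max(y):
    """Return 2 * min(#positive, #negative) / l for labels y in {1, -1}."""
    pos = sum(1 for yi in y if yi == 1)
    neg = len(y) - pos
    return 2 * min(pos, neg) / len(y)

print(nu_max([1, 1, -1]))        # 2/3: so nu = 0.9 above is infeasible
print(nu_max([1, -1, 1, -1]))    # 1.0: balanced classes give nu_max = 1
```

The first call reproduces the three-point infeasibility example given above: with two positive and one negative label, any ν > 2/3 makes D_ν infeasible.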
It can be clearly seen that D'_C and D_ν are very similar. We prove the following lemma about D'_C:

Lemma 2.  If D'_C has two different optimal solutions α_1 and α_2, then e^T α_1 = e^T α_2 and α_1^T Q α_1 = α_2^T Q α_2. Therefore, we can define two functions e^T α_C and α_C^T Q α_C on C, where α_C is any optimal solution of D'_C.
Proof.  Since D'_C is a convex problem, if α_1 ≠ α_2 are both optimal solutions, then for all 0 ≤ λ ≤ 1,

    (1/2)(λα_1 + (1−λ)α_2)^T Q (λα_1 + (1−λ)α_2) − e^T (λα_1 + (1−λ)α_2) / (Cl)
      = λ ((1/2) α_1^T Q α_1 − e^T α_1 / (Cl)) + (1−λ) ((1/2) α_2^T Q α_2 − e^T α_2 / (Cl)).

This implies

    α_1^T Q α_2 = (1/2) α_1^T Q α_1 + (1/2) α_2^T Q α_2.    (2.3)

Since Q is positive semidefinite, Q = L^T L, so equation 2.3 implies ||L α_1 − L α_2|| = 0. Thus, α_2^T Q α_2 = α_1^T Q α_1. Therefore, e^T α_1 = e^T α_2, and the proof is complete.

Next we prove a theorem on optimal solutions of D'_C and D_ν:

Theorem 2.  If D'_C and D_ν share one optimal solution α* with e^T α* = ν, their optimal solution sets are the same.
Proof.  From lemma 2, any other optimal solution α of D'_C also satisfies e^T α = ν, so α is feasible to D_ν. Since α^T Q α = (α*)^T Q α* from lemma 2, all of D'_C's optimal solutions are also optimal solutions of D_ν. On the other hand, if α is any optimal solution of D_ν, it is feasible for D'_C. With the constraint e^T α ≥ ν = e^T α* and α^T Q α = (α*)^T Q α*,

    (1/2) α^T Q α − e^T α / (Cl) ≤ (1/2) (α*)^T Q α* − e^T α* / (Cl).

Therefore, all optimal solutions of D_ν are also optimal for D'_C. Hence their optimal solution sets are the same.

If α is an optimal solution of D'_C, it satisfies the following Karush-Kuhn-Tucker (KKT) condition:

    Q α − e/(Cl) + b y = λ − ξ,
    λ^T α = 0,  ξ^T (e/l − α) = 0,  y^T α = 0,
    λ_i ≥ 0,  ξ_i ≥ 0,  0 ≤ α_i ≤ 1/l,  i = 1, ..., l.    (2.4)
By setting ρ ≡ 1/(Cl) and ν ≡ e^T α, α also satisfies the KKT condition of D_ν:

    Q α − ρ e + b y = λ − ξ,
    λ^T α = 0,  ξ^T (e/l − α) = 0,
    y^T α = 0,  e^T α ≥ ν,  ρ (e^T α − ν) = 0,
    λ_i ≥ 0,  ξ_i ≥ 0,  ρ ≥ 0,  0 ≤ α_i ≤ 1/l,  i = 1, ..., l.    (2.5)
From theorem 2, this implies that for each D'_C, its optimal solution set is the same as that of D_ν, where ν = e^T α. For each D'_C, such a D_ν is unique since, from theorem 1, if ν_1 ≠ ν_2, then D_{ν_1} and D_{ν_2} have different optimal solution sets. Therefore, we have the following theorem:

Theorem 3.  For each D'_C, C > 0, its optimal solution set is the same as that of one (and only one) D_ν, where ν = e^T α and α is any optimal solution of D'_C.

Similarly, we have:

Theorem 4.  If D_ν, ν > 0, has a nonempty feasible set and its objective value is not zero, D_ν's optimal solution set is the same as that of at least one D'_C.
Proof.  If the objective value of D_ν is not zero, from the KKT condition 2.5,

    α^T Q α − ρ e^T α = − Σ_{i=1}^{l} ξ_i / l.

Then α^T Q α > 0 and equation 2.5 imply

    ρ e^T α = α^T Q α + Σ_{i=1}^{l} ξ_i / l > 0,

so ρ > 0 and e^T α = ν. By choosing a C > 0 such that ρ = 1/(Cl), α is a KKT point of D'_C. Hence from theorem 2, the optimal solution set of this D'_C is the same as that of D_ν.

Next we prove two useful lemmas. The first one deals with the special situation when the objective value of D_ν is zero.
Lemma 3.  If the objective value of D_ν, ν ≥ 0, is zero and there is a D'_C, C > 0, such that any of its optimal solutions α_C satisfies e^T α_C = ν, then ν = ν_max and all D'_C, C > 0, have the same optimal solution set as that of D_ν.

Proof.  For this D_ν, we can set ρ = 1/(Cl), so α_C is a KKT point of D_ν. Therefore, since the objective value of D_ν is zero, α_C^T Q α_C = 0. Furthermore, we have Q α_C = 0. In this case, equation 2.4 of D'_C's KKT condition becomes

    [ b e_I ; −b e_J ] − e/(Cl) = λ − ξ,    (2.6)

where λ_i, ξ_i ≥ 0, and I and J are the indices of the two different classes. If b e_I ≥ 0, there are three situations of equation 2.6:

    [ > 0 ; < 0 ],   [ < 0 ; < 0 ],   [ = 0 ; < 0 ].

The first case implies (α_C)_I = 0 and (α_C)_J = e_J / l. Hence if J is nonempty, y^T α_C ≠ 0 causes a contradiction. Hence all data are in the same class. Therefore, D_ν and all D'_C, C > 0, have the unique optimal solution zero due to the constraints y^T α = 0 and α ≥ 0. Furthermore, e^T α = ν = ν_max = 0.

The second case happens only when α_C = e/l. Then y^T α = 0 and y_i = 1 or −1 imply that (#y_i = 1) = (#y_i = −1) and e^T α_C = 1 = ν = ν_max. We then show that e/l is also an optimal solution of any other D'_C. Since 0 ≤ α_i ≤ 1/l, i = 1, ..., l, for any feasible α of D'_C, the objective function satisfies

    (1/2) α^T Q α − e^T α / (Cl) ≥ − e^T α / (Cl) ≥ − 1/(Cl).    (2.7)

Now (#y_i = 1) = (#y_i = −1), so e/l is feasible. When α = e/l, the inequality of equation 2.7 becomes an equality. Thus e/l is actually an optimal solution of all D'_C, C > 0. Therefore, D_ν and all D'_C, C > 0, have the same unique optimal solution e/l.

For the third case, b = 1/(Cl), (α_C)_J = e_J / l, ν = e^T α_C = 2 e_J^T (α_C)_J = ν_max, and J contains the elements of the class that has fewer elements. Because there exist such a C and b, for any other C, b can be adjusted accordingly so that the KKT condition is still satisfied. Therefore, from theorem 3, all D'_C, C > 0, have the same optimal solution set as that of D_ν. The situation when b e_I ≤ 0 is similar.

Lemma 4.  Assume α_C is any optimal solution of D'_C. Then e^T α_C is a continuous decreasing function of C on (0, ∞).
Proof.  If C_1 < C_2 and α_1 and α_2 are optimal solutions of D'_{C_1} and D'_{C_2}, respectively, we have

    (1/2) α_1^T Q α_1 − e^T α_1 / (C_1 l) ≤ (1/2) α_2^T Q α_2 − e^T α_2 / (C_1 l)    (2.8)

and

    (1/2) α_2^T Q α_2 − e^T α_2 / (C_2 l) ≤ (1/2) α_1^T Q α_1 − e^T α_1 / (C_2 l).    (2.9)

Hence

    e^T α_1 / (C_2 l) − e^T α_2 / (C_2 l) ≤ (1/2) α_1^T Q α_1 − (1/2) α_2^T Q α_2
                                          ≤ e^T α_1 / (C_1 l) − e^T α_2 / (C_1 l).    (2.10)

Since C_2 > C_1 > 0, equation 2.10 implies e^T α_1 − e^T α_2 ≥ 0. Therefore, e^T α_C is a decreasing function on (0, ∞). From this result, we know that for any C* ∈ (0, ∞), lim_{C→(C*)+} e^T α_C and lim_{C→(C*)−} e^T α_C exist, and

    lim_{C→(C*)+} e^T α_C ≤ e^T α_{C*} ≤ lim_{C→(C*)−} e^T α_C.

To prove the continuity of e^T α_C, it is sufficient to prove lim_{C→C*} e^T α_C = e^T α_{C*} for all C* ∈ (0, ∞). If lim_{C→(C*)+} e^T α_C < e^T α_{C*}, there is a ν̄ such that

    0 ≤ lim_{C→(C*)+} e^T α_C < ν̄ < e^T α_{C*}.    (2.11)

Hence ν̄ > 0. If D_ν̄'s objective value is not zero, from theorem 4 and the fact that e^T α_C is a decreasing function, there exists a C > C* such that α_C satisfies e^T α_C = ν̄. This contradicts equation 2.11, where lim_{C→(C*)+} e^T α_C < ν̄. Therefore, the objective value of D_ν̄ is zero. Since for all D_ν, ν ≤ ν̄, their feasible regions include that of D_ν̄, their objective values are also zero. From theorem 3, the fact that e^T α_C is a decreasing function, and lim_{C→(C*)+} e^T α_C < ν̄, each D'_C, C > C*, has the same optimal solution set as that of one D_ν, where e^T α_C = ν < ν̄. Hence by lemma 3, e^T α_C = ν_max for all C. This contradicts equation 2.11. Therefore, lim_{C→(C*)+} e^T α_C = e^T α_{C*}. Similarly, lim_{C→(C*)−} e^T α_C = e^T α_{C*}. Thus,

    lim_{C→C*} e^T α_C = e^T α_{C*}.
Using the above lemmas, we are now ready to prove the main theorem:
Theorem 5.  We can define

    lim_{C→∞} e^T α_C = ν_* ≥ 0  and  lim_{C→0} e^T α_C = ν* ≤ 1,

where α_C is any optimal solution of D'_C. Then ν* = ν_max. For any ν > ν*, D_ν is infeasible. For any ν ∈ (ν_*, ν*], the optimal solution set of D_ν is the same as that of either one D'_C, C > 0, or several D'_C, where C is any number in an interval. In addition, the optimal objective value of D_ν is strictly positive. For any 0 ≤ ν ≤ ν_*, D_ν is feasible with zero optimal objective value.

Proof.  First, from lemma 4 and the fact that 0 ≤ e^T α ≤ 1, we know ν_* and ν* can be defined without problems. We then prove ν* = ν_max by showing that once C is small enough, all of D'_C's optimal solutions α_C satisfy e^T α_C = ν_max. Assume I includes the elements of the class that has fewer elements and J includes the elements of the other class. If α_C is an optimal solution of D'_C, it satisfies the following KKT condition:

    [ Q_II  Q_IJ ; Q_JI  Q_JJ ] [ (α_C)_I ; (α_C)_J ] − e/(Cl) + b_C [ y_I ; y_J ]
        = [ (λ_C)_I − (ξ_C)_I ; (λ_C)_J − (ξ_C)_J ],

where λ_C ≥ 0, ξ_C ≥ 0, α_C^T λ_C = 0, and ξ_C^T (e/l − α_C) = 0. When C is small enough, b_C y_J > 0 must hold. Otherwise, since Q_JI (α_C)_I + Q_JJ (α_C)_J is bounded, Q_JI (α_C)_I + Q_JJ (α_C)_J − e_J/(Cl) + b_C y_J < 0 implies (α_C)_J = e_J/l, which violates the constraint y^T α = 0 if (#y_i = 1) ≠ (#y_i = −1). Therefore, b_C y_J > 0, so b_C y_I < 0. This implies that (α_C)_I = e_I/l when C is sufficiently small. Hence e^T α_C = ν_max = ν*. If (#y_i = 1) = (#y_i = −1), we can let α_C = e/l and b_C = 0. When C is small enough, this will be a KKT point. Therefore, e^T α_C = ν_max = ν* = 1.

From lemma 1 we immediately know that D_ν is infeasible if ν > ν*. From lemma 4, where e^T α_C is a continuous function, for any ν ∈ (ν_*, ν*], there is a D'_C such that e^T α_C = ν. Then from theorem 3, D'_C and D_ν have the same optimal solution set. If D_ν has the same optimal solution set as that of D'_{C_1} and D'_{C_2}, where C_1 < C_2, then since e^T α_C is a decreasing function, for any C ∈ [C_1, C_2], its optimal solutions satisfy e^T α = ν. From theorem 3, its optimal solution set is the same as that of D_ν. Thus, such Cs form an interval.

If ν < ν_*, D_ν must be feasible from lemma 1. It cannot have a nonzero objective value due to theorem 4 and the definition of ν_*. For D_{ν_*}, if ν_* = 0, the objective value of D_{ν_*} is zero, as α = 0 is a feasible solution. If ν_* > 0, since the feasible regions of D_ν are bounded by 0 ≤ α_i ≤ 1/l, i = 1, ..., l, with theorem 1 there is a sequence {α_{ν_i}}, ν_1 ≤ ν_2 ≤ ··· < ν_*, such that α̂ ≡ lim_{ν_i→ν_*} α_{ν_i} exists, where α_{ν_i} is an optimal solution of D_{ν_i} with e^T α_{ν_i} = ν_i.
Since e^T α_{ν_i} = ν_i, e^T α̂ = lim_{ν_i→ν_*} e^T α_{ν_i} = ν_*. We also have 0 ≤ α̂_i ≤ 1/l and y^T α̂ = lim_{ν_i→ν_*} y^T α_{ν_i} = 0, so α̂ is feasible to D_{ν_*}. However, α̂^T Q α̂ = lim_{ν_i→ν_*} α_{ν_i}^T Q α_{ν_i} = 0, as α_{ν_i}^T Q α_{ν_i} = 0 for all ν_i. Therefore, the objective value of D_{ν_*} is always zero. Next we prove that the objective value of D_ν is zero if and only if ν ≤ ν_*. From the above discussion, if ν ≤ ν_*, the objective value of D_ν is zero. If the objective value of D_ν is zero but ν > ν_*, theorem 3 implies ν = ν_max = ν* = ν_*, which causes a contradiction. Hence the proof is complete. Note that when the objective value of D_ν is zero, the optimal solution w of the primal problem P_ν is zero. Crisp and Burges (1999, sec. 4) considered such a P_ν a trivial problem. Next we present a corollary:
From (Lin 2001, theorem 3.3), if data are separable, there is a C¤ such that for all C ¸ C¤ , an optimal solution ®C¤ of DC¤ is also optimal for DC . Therefore, for D0C , an optimal solution becomes ®C¤ / ( Cl) and e T ®C¤ / ( Cl) ! 0 as C ! 1. Thus, º¤ D 0. On the other hand, if data are nonseparable, no matter how large C is, there are components of optimal solutions at the upper bound. Therefore, e T ®C ¸ 1 / l > 0 for all C. Hence, º¤ ¸ 1 / l. If Q is positive denite, the unconstrained problem, Proof.
1 min ®T Q ® ¡ e T ®, 2
(2.12)
has a unique solution at ® D Q ¡1 e . If we add constraints to equation 2.12, min
1 T ® Q ® ¡ eT ® 2 yT ® D 0, ai ¸ 0, i D 1, . . . , l,
(2.13)
is a problem with a smaller feasible region. Thus the objective value of equation 2.13 is bounded. From corollary 27.3.1 of Rockafellar (1970) any bounded nite dimensional space quadratic convex function over a polyhedral attains at least an optimal solution. Therefore, equation 2.13 is solvable. From Lin (2001, theorem 2), this implies the following primal problem is solvable: min
1 T w w 2 yi ( w T w ( x i ) C b) ¸ 1, i D 1, . . . , l.
Hence training data are separable.
In many situations, Q is positive definite. For example, from Micchelli (1986), if the radial basis function (RBF) kernel is used and x_i ≠ x_j for i ≠ j, Q is positive definite.

We illustrate the above results by some examples. Given three nonseparable training points x_1 = 0, x_2 = 1, and x_3 = 2 with y = [1, −1, 1]^T, we will show that this is an example of lemma 3. Note that this is a nonseparable problem. For all C > 0, the optimal solution of D'_C is α = [1/6, 1/3, 1/6]^T. Therefore, in this case, ν* = ν_* = 2/3. For D_ν, ν ≤ 2/3, an optimal solution is α = (3ν/2)[1/6, 1/3, 1/6]^T, with the objective value

    (3ν/2)^2 [1/6, 1/3, 1/6] [ 0  0  0 ; 0  1  −2 ; 0  −2  4 ] [1/6 ; 1/3 ; 1/6] = 0.
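This zero-objective computation is easy to verify numerically. The sketch below (NumPy, with the linear kernel implied by the example) checks feasibility, the objective value, and the common value ν_* = ν* = 2/3:

```python
# Check of the three-point example: x = (0, 1, 2), y = (1, -1, 1), linear
# kernel.  alpha = [1/6, 1/3, 1/6] is feasible and has zero curvature.
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, -1.0, 1.0])
Q = np.outer(y, y) * np.outer(x, x)       # Q_ij = y_i y_j x_i x_j
alpha = np.array([1 / 6, 1 / 3, 1 / 6])

print(Q)                   # [[0,0,0],[0,1,-2],[0,-2,4]] as in the text
print(y @ alpha)           # 0.0  (equality constraint holds)
print(alpha @ Q @ alpha)   # 0.0  (zero objective value)
print(alpha.sum())         # 2/3 = nu_* = nu^*
```

Since Q alpha is the zero vector here, alpha is optimal for every D'_C, which is exactly the degenerate situation lemma 3 describes.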
Another example shows that we may have the same value of e^T α_C for all C in an interval, where α_C is any optimal solution of D'_C. Given x_1 = [−1, 0]^T, x_2 = [1, 1]^T, x_3 = [0, −1]^T, and x_4 = [0, 0]^T, with y = [1, −1, 1, −1]^T, part of the KKT condition of D'_C is

    [ 1  1  0  0 ] [ α_1 ]             [ 1 ]       [  1 ]
    [ 1  2  1  0 ] [ α_2 ]  −  1/(4C)  [ 1 ]  + b  [ −1 ]  =  λ − ξ.
    [ 0  1  1  0 ] [ α_3 ]             [ 1 ]       [  1 ]
    [ 0  0  0  0 ] [ α_4 ]             [ 1 ]       [ −1 ]
Then one optimal solution of D'_C is:

    α_C = [1/4, 1/4, 1/4, 1/4]^T,                    b ∈ [1 − 1/(4C), 1/(4C) − 1/2],   if 0 < C ≤ 1/3,
    α_C = (1/36)[3 + 2/C, −3 + 4/C, 3 + 2/C, 9]^T,   b = 1/(12C),                      if 1/3 ≤ C ≤ 4/3,
    α_C = [1/8, 0, 1/8, 1/4]^T,                      b = 1/(4C) − 1/8,                 if 4/3 ≤ C ≤ 4,
    α_C = [1/(2C), 0, 1/(2C), 1/C]^T,                b = −1/(4C),                      if C ≥ 4.
This is a separable problem. We have ν* = 1, ν_* = 0, and

    e^T α_C = 1,               if 0 < C ≤ 1/3,
    e^T α_C = 1/3 + 2/(9C),    if 1/3 ≤ C ≤ 4/3,
    e^T α_C = 1/2,             if 4/3 ≤ C ≤ 4,
    e^T α_C = 2/C,             if C ≥ 4.        (2.14)

In summary, this section shows: The increase of C in C-SVM is like the decrease of ν in ν-SVM.
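As a sanity check, equation 2.14 can be evaluated with exact rational arithmetic. The function below is a direct transcription of the four cases; it verifies that the pieces (each of the form A + B/C) agree at the breakpoints and that the whole function is non-increasing in C:

```python
# Equation 2.14 for the four-point example: piecewise A + B/C in C,
# continuous at C = 1/3, 4/3, 4 and non-increasing, as lemma 4 requires.
from fractions import Fraction as F

def e_alpha(C):
    C = F(C)
    if C <= F(1, 3):
        return F(1)
    if C <= F(4, 3):
        return F(1, 3) + F(2, 9) / C
    if C <= 4:
        return F(1, 2)
    return F(2) / C

# continuity at the breakpoints
assert F(1, 3) + F(2, 9) / F(1, 3) == F(1)       # matches the first piece
assert F(1, 3) + F(2, 9) / F(4, 3) == F(1, 2)    # matches the third piece
assert F(2) / 4 == F(1, 2)                       # matches the fourth piece
# non-increasing on a grid of C values
grid = [F(k, 12) for k in range(1, 120)]
vals = [e_alpha(C) for C in grid]
assert all(u >= v for u, v in zip(vals, vals[1:]))
print("equation 2.14 is continuous and non-increasing")
```

The exact arithmetic avoids any floating-point ambiguity at the interval endpoints, where the example deliberately has several C values sharing one ν.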
Solving D_ν and D'_C is just like solving two different problems with the same optimal solution set. We may expect that many numerical aspects of solving them are similar. However, they are still two different problems, so we cannot obtain C without solving D_ν. Similarly, without solving D_C, we cannot find ν.

3 The Relation Between ν and C
A formula like equation 2.14 motivates us to conjecture that all ν = e^T α_C have a similar form. That is, in each interval of C, e^T α_C = A + B/C, where A and B are constants independent of C. The formulation of e^T α_C will be the main topic of this section. We note that in equation 2.14, in each interval of C, the α_C are at the same face. Here we say two vectors are at the same face if they have the same free, lower-bounded, and upper-bounded components. The following lemma deals with the situation when the α_C are at the same face:

Lemma 5.  If C̲ < C̄ and there are α_{C̲} and α_{C̄} at the same face, then for each C ∈ [C̲, C̄], there is at least one optimal solution α_C of D'_C at the same face as α_{C̲} and α_{C̄}. Furthermore,

    e^T α_C = D_1 + D_2 / C,   C̲ ≤ C ≤ C̄,

where D_1 and D_2 are constants independent of C. In addition, D_2 ≥ 0.

Proof.  If {1, ..., l} is separated into two sets A and F, where A corresponds to the bounded variables and F corresponds to the free variables of α_{C̲} (or α_{C̄}, as they are at the same face), the KKT condition shows
$$\begin{bmatrix} Q_{FF} & Q_{FA}\\ Q_{AF} & Q_{AA} \end{bmatrix}\begin{bmatrix}\alpha_F\\ \alpha_A\end{bmatrix} + b\begin{bmatrix}y_F\\ y_A\end{bmatrix} - \frac{e}{Cl} = \begin{bmatrix}0\\ \lambda_A - \xi_A\end{bmatrix}, \tag{3.1}$$

$$y_F^T\alpha_F + y_A^T\alpha_A = 0, \tag{3.2}$$

$$\lambda_i \ge 0, \quad \xi_i \ge 0, \quad i \in A. \tag{3.3}$$
Equations 3.1 and 3.2 can be rewritten as

$$\begin{bmatrix} Q_{FF} & Q_{FA} & y_F\\ Q_{AF} & Q_{AA} & y_A\\ y_F^T & y_A^T & 0 \end{bmatrix}\begin{bmatrix}\alpha_F\\ \alpha_A\\ b\end{bmatrix} - \begin{bmatrix}e_F/(Cl)\\ e_A/(Cl)\\ 0\end{bmatrix} = \begin{bmatrix}0\\ \lambda_A - \xi_A\\ 0\end{bmatrix}.$$

If $Q_{FF}$ is positive definite,

$$\alpha_F = Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A - b\,y_F\right). \tag{3.4}$$
Training ν-Support Vector Classifiers
Thus,

$$y_F^T\alpha_F + y_A^T\alpha_A = y_F^T Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A - b\,y_F\right) + y_A^T\alpha_A = 0$$

implies

$$b = \frac{y_A^T\alpha_A + y_F^T Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A\right)}{y_F^T Q_{FF}^{-1} y_F}.$$

Therefore,

$$\alpha_F = Q_{FF}^{-1}\left(\frac{e_F}{Cl} - Q_{FA}\alpha_A - \frac{y_A^T\alpha_A + y_F^T Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A\right)}{y_F^T Q_{FF}^{-1} y_F}\, y_F\right). \tag{3.5}$$
We note that for $\underline{C} \le C \le \bar{C}$, if $(\alpha_C)_F$ is defined by equation 3.5 and $(\alpha_C)_A \equiv (\alpha_{\underline{C}})_A$ (or $(\alpha_{\bar{C}})_A$), then $(\alpha_C)_i \ge 0$, $i = 1, \ldots, l$. In addition, $\alpha_C$ satisfies the first part of equation 3.1 (the part with right-hand side zero). The sign of the second part is not changed, and equation 3.2 is also valid. Thus, we have constructed an optimal solution $\alpha_C$ of $D'_C$ that is at the same face as $\alpha_{\underline{C}}$ and $\alpha_{\bar{C}}$. Then, following from equation 3.5 and the fact that $\alpha_A$ is a constant vector for all $\underline{C} \le C \le \bar{C}$,

$$\begin{aligned}
e^T\alpha_C &= e_F^T Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A - b\,y_F\right) + e_A^T\alpha_A\\
&= e_F^T Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A\right) - \frac{y_A^T\alpha_A + y_F^T Q_{FF}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A\right)}{y_F^T Q_{FF}^{-1} y_F}\, e_F^T Q_{FF}^{-1} y_F + e_A^T\alpha_A\\
&= \left(\frac{e_F^T Q_{FF}^{-1} e_F}{l} - \frac{\left(e_F^T Q_{FF}^{-1} y_F\right)\left(y_F^T Q_{FF}^{-1} e_F/l\right)}{y_F^T Q_{FF}^{-1} y_F}\right)\Big/ C + D_1\\
&= \left(\frac{e_F^T Q_{FF}^{-1} e_F \left(y_F^T Q_{FF}^{-1} y_F\right) - \left(e_F^T Q_{FF}^{-1} y_F\right)^2}{\left(y_F^T Q_{FF}^{-1} y_F\right) l}\right)\Big/ C + D_1\\
&= D_2/C + D_1.
\end{aligned}$$
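As a numerical sanity check of this derivation, the following sketch (with made-up numbers, not data from the paper) builds $\alpha_F$ through equations 3.4 and 3.5 on a fixed face and verifies that $e^T\alpha_C$ is an affine function of $1/C$ with $D_2 \ge 0$:

```python
# Sketch with illustrative numbers: on a fixed face, e^T alpha_C computed
# through equations 3.4-3.5 equals D1 + D2/C with D2 >= 0.

l = 4
Q_FF = [[2.0, 1.0], [1.0, 2.0]]          # positive definite 2x2 free block
QFA_aA = [0.3, 0.1]                       # Q_FA alpha_A, fixed on this face
yF = [1.0, -1.0]
yA_aA, eA_aA = 0.2, 0.5                   # y_A^T alpha_A and e_A^T alpha_A

def solve2(M, r):                         # 2x2 linear solve (Cramer's rule)
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - r[1] * M[0][1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def e_alpha(C):
    rhs = [1.0 / (C * l) - g for g in QFA_aA]   # e_F/(C l) - Q_FA alpha_A
    u = solve2(Q_FF, rhs)                        # Q_FF^{-1} rhs
    w = solve2(Q_FF, yF)                         # Q_FF^{-1} y_F
    b = (yA_aA + sum(y * x for y, x in zip(yF, u))) / \
        sum(y * x for y, x in zip(yF, w))        # formula for b
    aF = [ui - b * wi for ui, wi in zip(u, w)]   # equation 3.5
    return sum(aF) + eA_aA

# Fit D1, D2 from two values of C and check a third one:
f1, f2 = e_alpha(1.0), e_alpha(2.0)
D2 = (f1 - f2) / (1.0 - 0.5)
D1 = f1 - D2
assert abs(e_alpha(5.0) - (D1 + D2 / 5.0)) < 1e-10
assert D2 >= 0.0
```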
If $Q_{FF}$ is not invertible, it is positive semidefinite, so we can write $Q_{FF} = \hat{Q} D \hat{Q}^T$, where $\hat{Q}^{-1} = \hat{Q}^T$ is an orthonormal matrix. Without loss of generality we assume

$$D = \begin{bmatrix}\bar{D} & 0\\ 0 & 0\end{bmatrix}.$$

Then equation 3.4 can be modified to

$$D\hat{Q}^T\alpha_F = \hat{Q}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A - b\,y_F\right).$$

One solution of the above system is

$$\alpha_F = \hat{Q}^{-T}\begin{bmatrix}\bar{D}^{-1} & 0\\ 0 & 0\end{bmatrix}\hat{Q}^{-1}\left(e_F/(Cl) - Q_{FA}\alpha_A - b\,y_F\right).$$

Thus, a representation similar to equation 3.4 is obtained, and all arguments follow. Note that due to the positive semidefiniteness of $Q_{FF}$, $\alpha_F$ may have multiple solutions. From lemma 2, $e^T\alpha_C$ is a well-defined function of $C$, so the representation $D_1 + D_2/C$ is valid for all solutions. From lemma 4, $e^T\alpha_C$ is a decreasing function of $C$, so $D_2 \ge 0$.

The main result on the representation of $e^T\alpha_C$ is the following theorem:

Theorem 6.
There are $0 < C_1 < \cdots < C_s$ and constants $A_i, B_i$, $i = 1, \ldots, s$, such that

$$e^T\alpha_C = \begin{cases} \nu^* & C \le C_1,\\[2pt] A_i + \dfrac{B_i}{C} & C_i \le C \le C_{i+1},\ i = 1, \ldots, s-1,\\[2pt] A_s + \dfrac{B_s}{C} & C_s \le C, \end{cases}$$

where $\alpha_C$ is an optimal solution of $D'_C$. We also have

$$A_i + \frac{B_i}{C_{i+1}} = A_{i+1} + \frac{B_{i+1}}{C_{i+1}}, \quad i = 1, \ldots, s-1. \tag{3.6}$$
Proof. From theorem 5, we know that $e^T\alpha_C = \nu^*$ when $C$ is sufficiently small. From lemma 4, if we gradually increase $C$, we will reach a $C_1$ such that $e^T\alpha_C < \nu^*$ whenever $C > C_1$. If for all $C \ge C_1$ the $\alpha_C$ are at the same face, then from lemma 5 we have $e^T\alpha_C = A_1 + B_1/C$ for all $C \ge C_1$. Otherwise, starting from this $C_1$, we can increase $C$ to a $C_2$ such that for every $\epsilon > 0$ the interval $(C_2, C_2 + \epsilon)$ contains a $C$ whose $\alpha_C$ is not at the same face as $\alpha_{C_1}$ and $\alpha_{C_2}$. Then from lemma 5, for $C_1 \le C \le C_2$, there are $A_1$ and $B_1$ such that

$$e^T\alpha_C = A_1 + \frac{B_1}{C}.$$

We can continue this procedure. Since the number of possible faces is finite ($\le 3^l$), there are only finitely many $C_i$'s. Otherwise there would be $C_i$ and $C_j$, $j \ge i+2$,
such that there exist $\alpha_{C_i}$ and $\alpha_{C_j}$ at the same face. Then lemma 5 would imply that for all $C_i \le C \le C_j$, all $\alpha_C$ are at the same face as $\alpha_{C_i}$ and $\alpha_{C_j}$, contradicting the definition of $C_{i+1}$. From lemma 4, the continuity of $e^T\alpha_C$ immediately implies equation 3.6.

Finally, we provide Figure 1 to demonstrate the relation between ν and C. It clearly indicates that ν is a decreasing function of C. Information about the two test problems, australian and heart, is in section 5.

4 A Decomposition Method for ν-SVM
Based on existing decomposition methods for C-SVM, in this section we propose a decomposition method for ν-SVM. For solving $D_C$, existing decomposition methods separate the index set $\{1, \ldots, l\}$ of the training data into two sets $B$ and $N$, where $B$ is the working set and $\alpha$ is the current solution of the algorithm. If we denote by $\alpha_B$ and $\alpha_N$ the vectors containing the corresponding elements, the objective value of $D_C$ is equal to $\frac12\alpha_B^T Q_{BB}\alpha_B - (e_B - Q_{BN}\alpha_N)^T\alpha_B + \frac12\alpha_N^T Q_{NN}\alpha_N - e_N^T\alpha_N$. At each iteration, $\alpha_N$ is fixed, and the following problem with the variable $\alpha_B$ is solved:

$$\min\ \frac12\alpha_B^T Q_{BB}\alpha_B - (e_B - Q_{BN}\alpha_N)^T\alpha_B$$
$$y_B^T\alpha_B = -y_N^T\alpha_N, \tag{4.1}$$
$$0 \le (\alpha_B)_i \le C, \quad i = 1, \ldots, q,$$

where $\begin{bmatrix} Q_{BB} & Q_{BN}\\ Q_{NB} & Q_{NN} \end{bmatrix}$ is a permutation of the matrix $Q$ and $q$ is the size of $B$. The strict decrease of the objective function holds, and the theoretical convergence was studied in Chang, Hsu, and Lin (2000), Keerthi and Gilbert (2000), and Lin (2000).

An important process in the decomposition methods is the selection of the working set $B$. In the software SVMlight (Joachims, 1998), there is a systematic way to find the working set $B$. In each iteration the following problem is solved:

$$\min\ \nabla f(\alpha^k)^T d$$
$$y^T d = 0, \quad -1 \le d_i \le 1, \tag{4.2}$$
$$d_i \ge 0 \text{ if } (\alpha^k)_i = 0, \quad d_i \le 0 \text{ if } (\alpha^k)_i = C, \tag{4.3}$$
$$|\{d_i \mid d_i \ne 0\}| = q, \tag{4.4}$$
where $f(\alpha) \equiv \frac12\alpha^T Q\alpha - e^T\alpha$, $\alpha^k$ is the solution at the $k$th iteration, and $\nabla f(\alpha^k)$ is the gradient of $f(\alpha)$ at $\alpha^k$. Note that $|\{d_i \mid d_i \ne 0\}|$ is the number of nonzero components of $d$. The constraint 4.4 implies that a descent direction involving only $q$ variables is obtained. Then
Figure 1: Relation between ν and C.
components of $\alpha^k$ with nonzero $d_i$ are included in the working set $B$, which is used to construct the subproblem, equation 4.1. Note that $d$ is used only for identifying $B$, not as a search direction. If $q$ is an even number, Joachims (1998) showed a simple strategy for solving equations 4.2 through 4.4. First, he sorts $y_i\nabla f(\alpha^k)_i$, $i = 1, \ldots, l$, in
decreasing order. Then he successively picks the $q/2$ elements from the top of the sorted list for which $0 < (\alpha^k)_i < C$ or $d_i = -y_i$ obeys equation 4.3. Similarly, he picks the $q/2$ elements from the bottom of the list for which $0 < (\alpha^k)_i < C$ or $d_i = y_i$ obeys equation 4.3. Other elements of $d$ are set to zero. Thus, these $q$ nonzero elements compose the working set. A complete analysis of his procedure is in Lin (2000, sect. 2).

To modify the above strategy for $D_\nu$, we consider the following problem in each iteration:

$$\min\ \nabla f(\alpha^k)^T d$$
$$y^T d = 0, \quad e^T d = 0, \quad -1 \le d_i \le 1,$$
$$d_i \ge 0 \text{ if } (\alpha^k)_i = 0, \quad d_i \le 0 \text{ if } (\alpha^k)_i = 1/l,$$
$$|\{d_i \mid d_i \ne 0\}| \le q, \tag{4.5}$$

where $q$ is an even integer. Now $f(\alpha) \equiv \frac12\alpha^T Q\alpha$. Here we use $\le$ instead of $=$ because in theory $q$ nonzero elements may not always be available; this was first pointed out by Chang et al. (2000). Note that the subproblem, equation 4.1, becomes as follows if decomposition methods are used for solving $D_\nu$:

$$\min\ \frac12\alpha_B^T Q_{BB}\alpha_B + (Q_{BN}\alpha_N)^T\alpha_B$$
$$y_B^T\alpha_B = -y_N^T\alpha_N,$$
$$e_B^T\alpha_B = \nu - e_N^T\alpha_N, \tag{4.6}$$
$$0 \le (\alpha_B)_i \le 1/l, \quad i = 1, \ldots, q.$$
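For reference, SVMlight's sorting strategy for equations 4.2 through 4.4 can be sketched as follows. This is an illustrative sketch with our own variable names, not Joachims's code; a candidate direction is checked against the sign conditions of equation 4.3:

```python
# Sketch of SVMlight-style working set selection for C-SVM (equations
# 4.2-4.4): sort y_i * grad_i in decreasing order, take q/2 admissible
# indices from the top (candidate d_i = -y_i) and q/2 from the bottom
# (candidate d_i = y_i).

def select_working_set(grad, y, alpha, C, q):
    order = sorted(range(len(y)), key=lambda i: y[i] * grad[i], reverse=True)
    top, bottom = [], []
    for i in order:                      # from the top: candidate d_i = -y_i
        ok = 0 < alpha[i] < C or \
            (-y[i] > 0 and alpha[i] == 0) or (-y[i] < 0 and alpha[i] == C)
        if ok and len(top) < q // 2:
            top.append(i)
    for i in reversed(order):            # from the bottom: candidate d_i = y_i
        ok = 0 < alpha[i] < C or \
            (y[i] > 0 and alpha[i] == 0) or (y[i] < 0 and alpha[i] == C)
        if ok and i not in top and len(bottom) < q // 2:
            bottom.append(i)
    return top + bottom                  # indices forming the working set B
```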
Problem 4.5 is more complicated than 4.2, as there is the additional constraint $e^T d = 0$. The case $q = 2$ has been discussed in Keerthi and Gilbert (2000). We will describe a recursive procedure for solving equation 4.5. We consider the following problem:

$$\min\ \sum_{t \in S} \nabla f(\alpha^k)_t d_t$$
$$\sum_{t \in S} y_t d_t = 0, \quad \sum_{t \in S} d_t = 0, \quad -1 \le d_t \le 1,$$
$$d_t \ge 0 \text{ if } (\alpha^k)_t = 0, \quad d_t \le 0 \text{ if } (\alpha^k)_t = 1/l,$$
$$|\{d_t \mid d_t \ne 0,\ t \in S\}| \le q, \tag{4.7}$$

which is the same as equation 4.5 if $S = \{1, \ldots, l\}$. We denote the variables $\{d_t \mid t \in S\}$ as $d$ and the objective function $\sum_{t \in S} \nabla f(\alpha^k)_t d_t$ as $\mathrm{obj}(d)$.
Algorithm 1. If $q = 0$, the algorithm stops and outputs $d = 0$. Otherwise choose a pair of indices $i$ and $j$ from either

$$i = \arg\min_t\{\nabla f(\alpha^k)_t \mid y_t = 1,\ (\alpha^k)_t < 1/l,\ t \in S\}, \quad j = \arg\max_t\{\nabla f(\alpha^k)_t \mid y_t = 1,\ (\alpha^k)_t > 0,\ t \in S\} \tag{4.8}$$

or

$$i = \arg\min_t\{\nabla f(\alpha^k)_t \mid y_t = -1,\ (\alpha^k)_t < 1/l,\ t \in S\}, \quad j = \arg\max_t\{\nabla f(\alpha^k)_t \mid y_t = -1,\ (\alpha^k)_t > 0,\ t \in S\}, \tag{4.9}$$
depending on which one gives a smaller $\nabla f(\alpha^k)_i - \nabla f(\alpha^k)_j$. If there are no such $i$ and $j$, or $\nabla f(\alpha^k)_i - \nabla f(\alpha^k)_j \ge 0$, the algorithm stops and outputs the solution $d = 0$. Otherwise we assign $d_i = 1$, $d_j = -1$ and determine the values of the other variables by recursively solving a smaller problem of equation 4.7:

$$\min\ \sum_{t \in S'} \nabla f(\alpha^k)_t d_t$$
$$\sum_{t \in S'} y_t d_t = 0, \quad \sum_{t \in S'} d_t = 0, \quad -1 \le d_t \le 1,$$
$$d_t \ge 0 \text{ if } (\alpha^k)_t = 0, \quad d_t \le 0 \text{ if } (\alpha^k)_t = 1/l,$$
$$|\{d_t \mid d_t \ne 0,\ t \in S'\}| \le q', \tag{4.10}$$
where $S' = S \setminus \{i, j\}$ and $q' = q - 2$. Algorithm 1 assigns nonzero values to at most $q/2$ pairs. The indices of the nonzero elements in the solution $d$ are used as $B$ in the subproblem 4.6. Note that algorithm 1 can be implemented as an iterative procedure that selects $q/2$ pairs sequentially; the computational complexity is then similar to that of Joachims's strategy. Here, for convenience in writing proofs, we describe it recursively.

Next we prove that algorithm 1 solves equation 4.5.

Lemma 6. If there is an optimal solution $d$ of equation 4.7, there exists an optimal integer solution $d^*$ with $d_t^* \in \{-1, 0, 1\}$ for all $t \in S$.
Proof. Because $\sum_{t \in S} d_t = 0$, if there are noninteger elements in $d$, there must be at least two. Furthermore, from the linear constraints

$$\sum_{t \in S} y_t d_t = 0 \quad\text{and}\quad \sum_{t \in S} d_t = 0,$$

we have

$$\sum_{t \in S,\, y_t = 1} d_t = 0 \quad\text{and}\quad \sum_{t \in S,\, y_t = -1} d_t = 0. \tag{4.11}$$
Thus, if there are only two noninteger elements $d_i$ and $d_j$, they must satisfy $y_i = y_j$. Therefore, if $d$ contains noninteger elements, there must be two of them, $d_i$ and $d_j$, with $y_i = y_j$. If $d_i + d_j = c$,

$$\nabla f(\alpha^k)_i d_i + \nabla f(\alpha^k)_j d_j = \left(\nabla f(\alpha^k)_i - \nabla f(\alpha^k)_j\right) d_i + c\,\nabla f(\alpha^k)_j. \tag{4.12}$$

Since $d_i, d_j \notin \{-1, 0, 1\}$ and $-1 < d_i, d_j < 1$, if $\nabla f(\alpha^k)_i \ne \nabla f(\alpha^k)_j$, we can pick a sufficiently small $\epsilon > 0$ and shift $d_i$ and $d_j$ by $-\epsilon(\nabla f(\alpha^k)_i - \nabla f(\alpha^k)_j)$ and $\epsilon(\nabla f(\alpha^k)_i - \nabla f(\alpha^k)_j)$, respectively, without violating their feasibility. The resulting decrease of the objective value contradicts the assumption that $d$ is an optimal solution. Hence $\nabla f(\alpha^k)_i = \nabla f(\alpha^k)_j$. We can then eliminate at least one of the nonintegers by shifting $d_i$ and $d_j$ by $v$ and $-v$, where $v = \arg\min_v\{|v| : v \in \{d_i - \lfloor d_i\rfloor,\ \lceil d_i\rceil - d_i,\ d_j - \lfloor d_j\rfloor,\ \lceil d_j\rceil - d_j\}\}$. The objective value is unchanged because of equation 4.12 and $\nabla f(\alpha^k)_i = \nabla f(\alpha^k)_j$. We can repeat this process until an integer optimal solution $d^*$ is obtained.
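Stepping back from the proofs, the pair-selection loop of Algorithm 1 can be sketched as a short iterative procedure. This is our own illustrative sketch, not the LIBSVM implementation; `grad` holds $\nabla f(\alpha^k)$, and the stopping test doubles as the criterion that later appears in equation 4.17:

```python
# Iterative sketch of Algorithm 1 (equations 4.5/4.7): repeatedly pick the
# pair (i, j) from equation 4.8 or 4.9 with the most negative
# grad_i - grad_j, set d_i = 1, d_j = -1, and remove the pair from S; stop
# after q/2 pairs or when no improving pair exists.

def algorithm1(grad, y, alpha, l, q, S=None):
    S = set(range(len(y))) if S is None else set(S)
    d = [0] * len(y)
    for _ in range(q // 2):
        best = None
        for s in (1, -1):                 # equation 4.8 (s=1) / 4.9 (s=-1)
            up = [t for t in S if y[t] == s and alpha[t] < 1.0 / l]
            dn = [t for t in S if y[t] == s and alpha[t] > 0]
            if up and dn:
                i = min(up, key=lambda t: grad[t])
                j = max(dn, key=lambda t: grad[t])
                if best is None or grad[i] - grad[j] < best[0]:
                    best = (grad[i] - grad[j], i, j)
        if best is None or best[0] >= 0:  # no improving pair: stop
            break
        _, i, j = best
        d[i], d[j] = 1, -1
        S -= {i, j}
    return d
```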
Lemma 7. If there is an optimal integer solution $d$ of equation 4.7 that is not all zero and $(i, j)$ can be chosen from equation 4.8 or 4.9, then there is an optimal integer solution $d^*$ with $d_i^* = 1$ and $d_j^* = -1$.

Proof. Because $(i, j)$ can be chosen from equation 4.8 or 4.9, we know $(\alpha^k)_i < 1/l$ and $(\alpha^k)_j > 0$. We will show that if $d_i \ne 1$ and $d_j \ne -1$, we can construct an optimal integer solution $d^*$ from $d$ such that $d_i^* = 1$ and $d_j^* = -1$. We first note that for any nonzero integer element $d_{i'}$, by equation 4.11 there is a nonzero integer element $d_{j'}$ such that $d_{j'} = -d_{i'}$ and $y_{j'} = y_{i'}$. We define $p(i') \equiv j'$.

If $d_i = -1$, we can find $i' = p(i)$ such that $d_{i'} = 1$ and $y_i = y_{i'}$. Since $d_{i'} = 1$, $(\alpha^k)_{i'} < 1/l$. By the definition of $i$ and the fact that $(\alpha^k)_i < 1/l$, $\nabla f(\alpha^k)_i \le \nabla f(\alpha^k)_{i'}$. Let $d_i^* = 1$, $d_{i'}^* = -1$, and $d_t^* = d_t$ otherwise. Then $\mathrm{obj}(d^*) \le \mathrm{obj}(d)$, so $d^*$ is also an optimal solution. Similarly, if $d_j = 1$, we can obtain an optimal solution $d^*$ with $d_j^* = -1$. Therefore, after this transformation, only three cases are left: $(d_i, d_j) = (0, -1)$, $(1, 0)$, and $(0, 0)$. For the first case, we can find $i' = p(j)$ such that $d_{i'} = 1$ and $y_{i'} = y_i = y_j$. From the definition of $i$ and the fact that $(\alpha^k)_{i'} < 1/l$ and $(\alpha^k)_i < 1/l$, $\nabla f(\alpha^k)_i \le \nabla f(\alpha^k)_{i'}$. We can define $d_i^* = 1$, $d_{i'}^* = 0$, and $d_t^* = d_t$ otherwise. Then $\mathrm{obj}(d^*) \le \mathrm{obj}(d)$, so $d^*$ is also an optimal solution. If $(d_i, d_j) = (1, 0)$, the situation is similar. Finally, we check the case where $d_i$ and $d_j$ are both zero. Since $d$ is a nonzero integer vector, we can consider a $d_{i'} = 1$ and $j' = p(i')$. From equations 4.8
and 4.9, $\nabla f(\alpha^k)_i - \nabla f(\alpha^k)_j \le \nabla f(\alpha^k)_{i'} - \nabla f(\alpha^k)_{j'}$. Let $d_i^* = 1$, $d_j^* = -1$, $d_{i'}^* = d_{j'}^* = 0$, and $d_t^* = d_t$ otherwise. Then $d^*$ is feasible for equation 4.7 and $\mathrm{obj}(d^*) \le \mathrm{obj}(d)$. Thus, $d^*$ is an optimal solution.

Lemma 8. If there is an integer optimal solution of equation 4.7 and algorithm 1 outputs a zero vector $d$, then $d$ is already an optimal solution of equation 4.7.
Proof. If the result were wrong, there would be an integer optimal solution $d^*$ of equation 4.7 such that

$$\mathrm{obj}(d^*) = \sum_{t \in S} \nabla f(\alpha^k)_t d_t^* < 0.$$

Without loss of generality, we can consider only the case

$$\sum_{t \in S,\, y_t = 1} \nabla f(\alpha^k)_t d_t^* < 0. \tag{4.13}$$
From equation 4.11 and $d_t^* \in \{-1, 0, 1\}$, the number of indices satisfying $d_t^* = 1$, $y_t = 1$ is the same as the number satisfying $d_t^* = -1$, $y_t = 1$. Therefore, we must have

$$\min_{d_t^* = 1,\, y_t = 1} \nabla f(\alpha^k)_t - \max_{d_t^* = -1,\, y_t = 1} \nabla f(\alpha^k)_t < 0. \tag{4.14}$$

Otherwise,

$$\sum_{d_t^* = 1,\, y_t = 1} \nabla f(\alpha^k)_t - \sum_{d_t^* = -1,\, y_t = 1} \nabla f(\alpha^k)_t = \sum_{y_t = 1} \nabla f(\alpha^k)_t d_t^* \ge 0$$

contradicts equation 4.13. Then equation 4.14 implies that in algorithm 1, $i$ and $j$ can be chosen with $d_i = 1$ and $d_j = -1$. This contradicts the assumption that algorithm 1 outputs a zero vector.

Theorem 7.
Algorithm 1 solves equation 4.7.
Proof. First we note that the set of $d$ satisfying $|\{d_t \mid d_t \ne 0,\ t \in S\}| \le q$ can be considered as the union of finitely many closed sets of the form $\{d \mid d_{i_1} = 0, \ldots, d_{i_{l-q}} = 0\}$. Therefore, the feasible region of equation 4.7 is closed. With the bound constraints $-1 \le d_i \le 1$, $i = 1, \ldots, l$, the feasible region is compact, so there is at least one optimal solution. As $q$ is an even integer, we assume $q = 2k$. We then finish the proof by induction on $k$:
k = 0: Algorithm 1 correctly finds the solution zero.
k > 0: Suppose algorithm 1 outputs a vector $d$ with $d_i = 1$ and $d_j = -1$. In this situation the optimal solution of equation 4.7 cannot be zero: otherwise, the vector $\bar{d}$ with $\bar{d}_i = 1$, $\bar{d}_j = -1$, and $\bar{d}_t = 0$ for all $t \in S \setminus \{i, j\}$ satisfies $\mathrm{obj}(\bar{d}) < 0$, a smaller objective value than that of the zero vector. Thus, the assumptions of lemma 7 hold. Then, by the solvability of equation 4.7 and lemmas 6 and 7, there is an optimal solution $d^*$ of equation 4.5 with $d_i^* = 1$ and $d_j^* = -1$. By induction, $\{d_t,\ t \in S'\}$ is an optimal solution of equation 4.10. Since $\{d_t^*,\ t \in S'\}$ is also a feasible solution of equation 4.10, we have

$$\mathrm{obj}(d) = \nabla f(\alpha^k)_i d_i + \nabla f(\alpha^k)_j d_j + \sum_{t \in S'} \nabla f(\alpha^k)_t d_t \le \nabla f(\alpha^k)_i d_i^* + \nabla f(\alpha^k)_j d_j^* + \sum_{t \in S'} \nabla f(\alpha^k)_t d_t^* = \mathrm{obj}(d^*). \tag{4.15}$$
Thus $d$, the output of algorithm 1, is an optimal solution. If algorithm 1 does not output a vector $d$ with $d_i = 1$ and $d_j = -1$, then $d$ is the zero vector, and immediately from lemma 8, $d = 0$ is an optimal solution.

Since equation 4.5 is a special case of equation 4.7, theorem 7 implies that algorithm 1 solves it as well. After solving $D_\nu$, we want to calculate $\rho$ and $b$ in $P_\nu$. The KKT condition, equation 2.5, shows

$$(Q\alpha)_i - \rho + b y_i \begin{cases} = 0 & \text{if } 0 < \alpha_i < 1/l,\\ \ge 0 & \text{if } \alpha_i = 0,\\ \le 0 & \text{if } \alpha_i = 1/l. \end{cases}$$

Define $r_1 \equiv \rho - b$ and $r_2 \equiv \rho + b$. If $y_i = 1$, the KKT condition becomes

$$(Q\alpha)_i - r_1 \begin{cases} = 0 & \text{if } 0 < \alpha_i < 1/l,\\ \ge 0 & \text{if } \alpha_i = 0,\\ \le 0 & \text{if } \alpha_i = 1/l. \end{cases} \tag{4.16}$$

Therefore, if there are free $\alpha_i$ satisfying equation 4.16 with equality, $r_1 = (Q\alpha)_i$. In practice, to avoid numerical errors, we average them:

$$r_1 = \frac{\sum_{0 < \alpha_i < 1/l,\, y_i = 1} (Q\alpha)_i}{\sum_{0 < \alpha_i < 1/l,\, y_i = 1} 1}.$$
On the other hand, if there is no such $\alpha_i$, $r_1$ must satisfy

$$\max_{\alpha_i = 1/l,\, y_i = 1} (Q\alpha)_i \le r_1 \le \min_{\alpha_i = 0,\, y_i = 1} (Q\alpha)_i,$$

so we take $r_1$ to be the midpoint of this range. For $y_i = -1$, we calculate $r_2$ in a similar way. After $r_1$ and $r_2$ are obtained,

$$\rho = \frac{r_1 + r_2}{2} \quad\text{and}\quad -b = \frac{r_1 - r_2}{2}.$$
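The computation of $\rho$ and $b$ just described can be sketched as follows. This is our own sketch with illustrative names: `Qa` holds the values $(Q\alpha)_i$, and when no free $\alpha_i$ exists for a class we assume both bound sets are nonempty so the midpoint is defined:

```python
# Sketch: recover rho and b after solving D_nu. r1 comes from the
# y_i = +1 examples, r2 from the y_i = -1 examples; then
# rho = (r1 + r2)/2 and -b = (r1 - r2)/2.

def rho_and_b(Qa, y, alpha, l):
    def r_for(label):
        free = [Qa[i] for i in range(len(y))
                if y[i] == label and 0 < alpha[i] < 1.0 / l]
        if free:                                    # average over free alphas
            return sum(free) / len(free)
        lo = max(Qa[i] for i in range(len(y))       # alpha_i = 1/l side
                 if y[i] == label and alpha[i] == 1.0 / l)
        hi = min(Qa[i] for i in range(len(y))       # alpha_i = 0 side
                 if y[i] == label and alpha[i] == 0)
        return (lo + hi) / 2.0                      # midpoint of the range
    r1, r2 = r_for(1), r_for(-1)
    return (r1 + r2) / 2.0, -(r1 - r2) / 2.0        # (rho, b)
```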
Note that the KKT condition can be written as

$$\max_{\alpha_i > 0,\, y_i = 1} (Q\alpha)_i \le \min_{\alpha_i < 1/l,\, y_i = 1} (Q\alpha)_i \quad\text{and}\quad \max_{\alpha_i > 0,\, y_i = -1} (Q\alpha)_i \le \min_{\alpha_i < 1/l,\, y_i = -1} (Q\alpha)_i.$$

Hence, in practice we can use the following stopping criterion: the decomposition method stops if the solution $\alpha$ satisfies

$$-(Q\alpha)_i + (Q\alpha)_j < \epsilon, \tag{4.17}$$

where $\epsilon > 0$ is a chosen stopping tolerance and $i$ and $j$ are the first pair obtained from equation 4.8 or 4.9. In section 5, we conduct experiments on this new method.

5 Numerical Experiments
When $C$ is large, there may be more numerical difficulties in using decomposition methods for solving $D_C$ (see, for example, the discussion in Hsu & Lin, 1999). Since there is no $C$ in $D_\nu$, intuitively we might think that this difficulty no longer exists. In this section, we test the proposed decomposition method on examples with different $\nu$ and examine the required time and iterations. Since the constraints $0 \le \alpha_i \le 1/l$, $i = 1, \ldots, l$, imply that the $\alpha_i$ are small, the objective value of $D_\nu$ may be very close to zero. To avoid possible numerical inaccuracy, we consider the following scaled form of $D_\nu$:

$$\min\ \frac12\alpha^T Q\alpha$$
$$y^T\alpha = 0, \quad e^T\alpha = \nu l, \tag{5.1}$$
$$0 \le \alpha_i \le 1, \quad i = 1, \ldots, l.$$

The working set selection follows the discussion in section 4; here we implement the special case $q = 2$, so the working set in each iteration contains only two elements.
For the initial point $\alpha^1$, we assign the first elements with $y_i = 1$ the values $[1, \ldots, 1, \nu l/2 - \lfloor \nu l/2\rfloor]^T$, with $\lfloor \nu l/2\rfloor$ ones. Similarly, the same values are assigned to the first elements with $y_i = -1$. All other elements are set to zero. Unlike the decomposition method for $D_C$, where the zero vector is usually used as the initial solution so that $\nabla f(\alpha^1) = -e$, now $\alpha^1$ contains $\lceil\nu l\rceil$ nonzero components. In order to obtain $\nabla f(\alpha^1) = Q\alpha^1$ for equation 4.5, at the beginning of the decomposition procedure we must compute $\lceil\nu l\rceil$ columns of $Q$. This might be a disadvantage of using ν-SVM; further investigation is needed on this issue.

We test the RBF kernel with $Q_{ij} = y_i y_j e^{-\|x_i - x_j\|^2/n}$, where $n$ is the number of attributes of the training data. Our implementation is part of the software LIBSVM (version 2.03), an integrated package for SVM classification and regression. (LIBSVM is available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm.) We test problems from various collections. Problems australian to shuttle are from the Statlog collection (Michie, Spiegelhalter, & Taylor, 1994). Problems adult4 and web7 were compiled by Platt (1998) from the UCI Machine Learning Repository (Blake & Merz, 1998; Murphy & Aha, 1994). All problems from Statlog have real-valued attributes, so we scale them to [-1, 1]. Problems adult4 and web7 have binary representations, so we do not conduct any scaling. Some of these problems have more than two classes, so we treat all data not in the first class as being in the second class. As LIBSVM also implements a decomposition method with $q = 2$ for C-SVM (Chang & Lin, 2000), we conduct some comparisons between C-SVM and ν-SVM. The two codes are nearly the same except for the working set selections specific to $D_\nu$ and $D_C$. For each problem, we first solve its $D_C$ form using $C = 1$ and $C = 1000$. If $\alpha_C$ is an optimal solution of $D_C$, we then calculate $\nu$ by $e^T\alpha_C/(Cl)$ and solve $D_\nu$.
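The initial point and the RBF kernel entries described above can be sketched as follows. This is our own sketch, not LIBSVM code; the exact ordering of indices within each class is an assumption:

```python
# Sketch: the initial point alpha^1 and the RBF kernel entries Q_ij. For
# each class, the first floor(nu*l/2) examples get alpha_i = 1, the next
# one gets the fractional remainder nu*l/2 - floor(nu*l/2), the rest 0.
import math

def initial_alpha(y, nu):
    l = len(y)
    half = nu * l / 2.0
    alpha = [0.0] * l
    for label in (1, -1):
        idx = [i for i in range(l) if y[i] == label]
        for k, i in enumerate(idx):
            if k < math.floor(half):
                alpha[i] = 1.0
            elif k == math.floor(half):
                alpha[i] = half - math.floor(half)  # fractional remainder
    return alpha

def Q_entry(xi, xj, yi, yj):
    n = len(xi)                                     # number of attributes
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return yi * yj * math.exp(-sq / n)              # y_i y_j e^{-||.||^2/n}

# The scaled constraint e^T alpha = nu*l of equation 5.1 holds by construction:
y = [1, -1, 1, -1, 1, -1]
a = initial_alpha(y, 0.5)
assert abs(sum(a) - 0.5 * len(y)) < 1e-12
```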
The stopping tolerance $\epsilon$ for solving C-SVM is set to $10^{-3}$. As the $\alpha$ of equation 4.17 is like the $\alpha$ of $D_C$ divided by $C$ and the stopping criterion involves $Q\alpha$, for a fair comparison the tolerance (i.e., the $\epsilon$ of equation 4.17) for equation 5.1 is set to $10^{-3}/C$. The computational experiments for this section were done on a Pentium III-500 with 256 MB RAM using the gcc compiler. We used 100 MB as the cache size of LIBSVM for storing recently used $Q_{ij}$. Tables 1 and 2 report results for $C = 1$ and $C = 1000$, respectively. In each table, the corresponding $\nu$ is listed, and the number of iterations and time (in seconds) of both algorithms are compared. Note that for the same problem, fewer iterations do not always lead to less computational time. We think there are two possible reasons: first, the computation of the initial gradient for $D_\nu$ is more expensive; second, due to different contents of the cache (that is, different numbers of kernel evaluations), the cost of each iteration differs. We also present the number of support vectors (#SV column) as well as free support vectors (#FSV column). It can be clearly
2142
Chih-Chung Chang and Chih-Jen Lin
Figure 2: Training data and separating hyperplanes.
seen that the proposed method for $D_\nu$ performs very well. This comparison demonstrates the practical viability of using ν-SVM. From Schölkopf et al. (2000), we know that $\nu l$ is a lower bound on the number of support vectors and an upper bound on the number of bounded support vectors (and hence on the number of misclassified training data). It can be clearly seen from Tables 1 and 2 that $\lceil\nu l\rceil$ lies between the number of support vectors and the number of bounded support vectors. Furthermore, if $\nu$ becomes smaller, the total number of support vectors decreases. This is consistent with $D_C$, where increasing $C$ decreases the number of support vectors. We also observe that although the total number of support vectors decreases as $\nu$ becomes smaller, the number of free support vectors increases. When $\nu$ is decreased ($C$ is increased), the separating hyperplane tries to fit as many training data as possible; hence more points (that is, more free $\alpha_i$) tend to lie on the two planes $w^T\phi(x) + b = \pm\rho$. We illustrate this in Figures 2a and 2b, where $\nu = 0.5$ and $0.2$, respectively, are used on the same problem. Since the weakest part of the decomposition method is that it cannot consider all variables together in each iteration (only $q$ elements are selected), a larger number of free variables may cause more difficulty. This explains why many more iterations are required when $\nu$ is smaller. Therefore, we have given an example where, for solving $D_C$ and $D_\nu$, the decomposition method faces a similar difficulty.

6 Discussion and Conclusion
In an earlier version of this article, since we did not know how to design a decomposition method for $D_\nu$ with its two linear constraints, we tried to
Table 1: Solving C-SVM and ν-SVM: C = 1 (Time in Seconds).

Problem      l       ν         C Iter.  ν Iter.  C Time  ν Time   #SV   #FSV  ⌈νl⌉
australian   690     0.309619  1040     946      0.34    0.42     244   55    214
diabetes     768     0.574087  395      297      0.4     0.47     447   13    441
german       1000    0.556643  953      909      1.23    1.61     600   88    557
heart        270     0.43103   219      175      0.07    0.08     132   25    117
vehicle      846     0.501182  791      904      0.69    0.91     439   26    424
satimage     4435    0.083544  355      534      8.16    14.05    377   12    371
letter       15,000  0.036588  764      897      22.59   35.13    563   26    549
shuttle      43,500  0.141534  3267     6982     422.04  1058.0   6159  5     6157
adult4       4781    0.41394   1460     1464     21.14   28.86    2002  53    1980
web7         24,692  0.059718  1896     1721     74.51   102.99   1556  140   1475

Table 2: Solving C-SVM and ν-SVM: C = 1000 (Time in Seconds).

Problem      l       ν         C Iter.  ν Iter.  C Time   ν Time   #SV   #FSV  ⌈νl⌉
australian   690     0.147234  151,438  117,758  10.98    8.65     222   167   102
diabetes     768     0.421373  216,845  137,941  18.96    11.79    376   102   324
german       1000    0.069128  79,542   81,824   11.24    11.37    509   494   70
heart        270     0.033028  11,933   11,075   0.38     0.35     100   99    9
vehicle      846     0.262569  220,973  190,324  20.07    17.01    284   111   223
satimage     4435    0.015416  44,372   45,323   28.3     28.31    136   106   69
letter       15,000  0.005789  69,052   70,604   141.4    134.14   152   100   87
shuttle      43,500  0.033965  143,273  154,558  1215.8   1468.56  1487  17    1478
adult4       4781    0.263506  359,618  350,818  257.51   244.84   1760  837   1260
web7         24,692  0.023691  187,578  187,170  1262.15  1112.07  1112  696   585
remove one of them. For C-SVM, Friess, Cristianini, and Campbell (1998) and Mangasarian and Musicant (1999) added $b^2/2$ to the objective function so that the dual does not have the linear constraint $y^T\alpha = 0$. We used a similar approach for $P_\nu$ by considering the following new primal problem:

$$(\bar{P}_\nu)\quad \min\ \frac12 w^T w + \frac12 b^2 - \nu\rho + \frac1l\sum_{i=1}^l \xi_i$$
$$y_i\left(w^T\phi(x_i) + b\right) \ge \rho - \xi_i,$$
$$\xi_i \ge 0,\ i = 1, \ldots, l, \quad \rho \ge 0.$$

The dual of $\bar{P}_\nu$ is:

$$(\bar{D}_\nu)\quad \min\ \frac12\alpha^T\left(Q + yy^T\right)\alpha$$
$$e^T\alpha \ge \nu, \tag{6.1}$$
$$0 \le \alpha_i \le 1/l, \quad i = 1, \ldots, l. \tag{6.2}$$

Similar to theorem 1, we can solve $\bar{D}_\nu$ using only the equality $e^T\alpha = \nu$. Hence the new problem has only one simple equality constraint and can be solved using existing decomposition methods like SVMlight. To be more precise, the working set selection becomes:

$$\min\ \nabla f(\alpha^k)^T d$$
$$e^T d = 0, \quad -1 \le d_i \le 1,$$
$$d_i \ge 0 \text{ if } (\alpha^k)_i = 0, \quad d_i \le 0 \text{ if } (\alpha^k)_i = 1/l,$$
$$|\{d_i \mid d_i \ne 0\}| \le q, \tag{6.3}$$

where $f(\alpha) \equiv \frac12\alpha^T(Q + yy^T)\alpha$. Equation 6.3 can be considered a special case of equation 4.2, since $e$ in $e^T d = 0$ is a special case of $y$. Thus SVMlight's selection procedure can be used directly.

An earlier version of LIBSVM implemented this decomposition method for $\bar{D}_\nu$. However, we later found that its performance is much worse than that of the method for $D_\nu$. This can be seen in Tables 3 and 4, which present the same information as Tables 1 and 2 for solving $\bar{D}_\nu$. As the major difference is the working set selection, we suspect that the performance gap is similar to the situation for C-SVM: Hsu and Lin (1999) showed that by directly using SVMlight's strategy, the decomposition method for

$$(\bar{D}_C)\quad \min\ \frac12\alpha^T\left(Q + yy^T\right)\alpha - e^T\alpha$$
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \tag{6.4}$$

performs much worse than that for $D_C$. Note that the relation between $\bar{D}_C$ and $\bar{D}_\nu$ is very similar to that between $D_C$ and $D_\nu$ presented earlier. Thus we conjecture that there are some common shortcomings in using SVMlight's working
Table 3: Solving $\bar{D}_\nu$: A Comparison with Table 1.

Problem      l       ν         ν Iter.  ν Time   #SV   #FSV
australian   690     0.309619  4871     0.64     244   53
diabetes     768     0.574087  1816     0.58     447   13
german       1000    0.556643  1641     1.67     599   87
heart        270     0.43103   527      0.1      130   23
vehicle      846     0.501182  1402     1.04     437   26
satimage     4435    0.083544  3034     15.44    380   16
letter       15,000  0.036588  7200     54.6     562   28
shuttle      43,500  0.141534  17,893   1198.83  6161  8
adult4       4781    0.41394   7500     35.03    2002  54
web7         24,692  0.059718  3109     107.5    1563  149
Table 4: Solving $\bar{D}_\nu$: A Comparison with Table 2.

Problem      l       ν         ν Iter.     ν Time     #SV   #FSV
australian   690     0.147234  597,205     36.06      222   167
diabetes     768     0.421373  1,811,571   132.7      376   102
german       1000    0.069128  504,114     56.33      508   493
heart        270     0.033028  48,581      1.13       100   99
vehicle      846     0.262569  1,626,315   125.51     284   112
satimage     4435    0.015416  919,695     445.42     136   106
letter       15,000  0.005789  1,484,401   2544.23    150   97
shuttle      43,500  0.033965  8,364,010   59,286.83  1487  18
adult4       4781    0.263506  8,155,518   4905.67    1759  842
web7         24,692  0.023691  28,791,608  96,912.82  1245  830
set selection for $\bar{D}_C$ and $\bar{D}_\nu$. Further investigation is needed to understand whether the explanations in Hsu and Lin (1999) hold for $\bar{D}_\nu$.

In conclusion, this article discusses the relation between ν-SVM and C-SVM in detail. In particular, we show that solving them is just like solving two different problems with the same optimal solution set. We have also proposed a decomposition method for ν-SVM. Experiments show that this method is competitive with methods for C-SVM. Hence we have demonstrated the practical viability of ν-SVM.

Acknowledgments
This work was supported in part by the National Science Council of Taiwan, grant NSC 89-2213-E-002-013. C.-J. L. thanks Craig Saunders for bringing ν-SVM to his attention and thanks a referee of Lin (2001) whose comments led him to think about the infeasibility of $D_\nu$. We also thank Bernhard Schölkopf and two anonymous referees for helpful comments.
References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.

Chang, C.-C., Hsu, C.-W., & Lin, C.-J. (2000). The analysis of decomposition methods for support vector machines. IEEE Trans. Neural Networks, 11(4), 1003–1008.

Chang, C.-C., & Lin, C.-J. (2000). LIBSVM: Introduction and benchmarks (Tech. Rep.). Taipei, Taiwan: Department of Computer Science and Information Engineering, National Taiwan University.

Crisp, D. J., & Burges, C. J. C. (1999). A geometric interpretation of ν-SVM classifiers. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings of the 15th Intl. Conf. on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Hsu, C.-W., & Lin, C.-J. (1999). A simple decomposition method for support vector machines (Tech. Rep.). Taipei, Taiwan: Department of Computer Science and Information Engineering, National Taiwan University. To appear in Machine Learning.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Keerthi, S., & Gilbert, E. G. (2000). Convergence of a generalized SMO algorithm for SVM classifier design (Tech. Rep. CD-00-01). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2000). A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Trans. Neural Networks, 11(1), 124–136.

Lin, C.-J. (2000). On the convergence of the decomposition method for support vector machines (Tech. Rep.). Taipei, Taiwan: Department of Computer Science and Information Engineering, National Taiwan University.

Lin, C.-J. (2001). Formulations of support vector machines: A note from an optimization point of view. Neural Computation, 13(2), 307–317.

Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Trans. Neural Networks, 10(5), 1032–1037.

Micchelli, C. A. (1986). Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2, 11–22.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Englewood Cliffs, NJ: Prentice Hall. Data available online at anonymous ftp: ftp.ncc.up.pt/pub/statlog/.

Murphy, P. M., & Aha, D. W. (1994). UCI repository of machine learning databases (Tech. Rep.). Irvine, CA: University of California, Department of Information and Computer Science. Data available online at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press.

Saunders, C., Stitson, M. O., Weston, J., Bottou, L., Schölkopf, B., & Smola, A. (1998). Support vector machine reference manual (Tech. Rep. CSD-TR-98-03). Egham, UK: Royal Holloway, University of London.

Schölkopf, B., Smola, A. J., & Williamson, R. (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Received January 26, 2000; accepted November 3, 2000.
LETTER
Communicated by Michael Jordan
A Tighter Bound for Graphical Models

M. A. R. Leisink
H. J. Kappen
Department of Biophysics, University of Nijmegen, NL 6525 EZ Nijmegen, The Netherlands

We present a method to bound the partition function of a Boltzmann machine neural network with any odd-order polynomial. This is a direct extension of the mean-field bound, which is first order. We show that the third-order bound is strictly better than mean field. Additionally, we derive a third-order bound for the likelihood of sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor of two is easily reached in the region where expansion-based approximations are useful.

1 Introduction
Graphical models have the capability of modeling a large class of probability distributions. The neurons in these networks are the random variables, whereas the connections between them represent conditional independencies. Usually some of the nodes have a direct relation with the random variables in the problem and are called visibles. The other nodes, known as hiddens, are used to model more complex probability distributions.

Learning in graphical models, defined as maximizing the likelihood, can be done as long as the likelihood that the visibles correspond to a pattern in the data set is tractable to compute. For a lot of special structures (like trees or other sparse networks) this can be done efficiently, but in general the time it takes scales exponentially with the number of hidden neurons. For such architectures, one has no other choice than to use an approximation of the likelihood.

A well-known approximation technique from statistical mechanics, Gibbs sampling, was applied to graphical models in Pearl (1988). More recently, the mean-field approximation was derived for sigmoid belief networks (Saul, Jaakkola, & Jordan, 1996). For this type of graphical model, the parental dependency of a neuron is modeled by a nonlinear (sigmoidal) function of the weighted parent states (Neal, 1992). It turns out that the mean-field approximation bounds the likelihood from below. This is useful for learning, since a maximization of the bound increases either its accuracy or the likelihood for a pattern in the data set. Other work on bounds for neural networks can be found in Jaakkola, Saul, and Jordan (1996) and Neal and Hinton (1998).

Neural Computation 13, 2149-2171 (2001). © 2001 Massachusetts Institute of Technology.

In this article we show that it is possible to improve the mean-field approximation without losing the bounding properties. In section 2 we show the general theory for creating a new bound using an existing one. If we start with a polynomial bound of degree k (mean field corresponds to k = 1), it turns out that the new bound is of degree k + 2. The procedure leads, however, to quite complicated formulas for belief networks. Therefore, we first focus on Boltzmann machines in section 3. These networks are stochastic as well, but the connections are symmetric and not directed (Ackley, Hinton, & Sejnowski, 1985). A mean-field approximation for this type of neural network has been described in Peterson and Anderson (1987). An improvement of this approximation was found by Thouless, Anderson, and Palmer (1977), which was applied to Boltzmann machines in Kappen and Rodríguez (1999). Unfortunately, this so-called TAP approximation is not a bound. We apply our method to the mean-field approximation, which results in a third-order bound. We prove the latter is always tighter than the standard mean-field bound.

In section 4 the procedure is extended to sigmoid belief networks. In contrast to Boltzmann machines, we need an additional bound for this type of graphical model to make the final approximation tractable to compute. This is analogous to the mean-field case, which was described in Saul et al. (1996). For both sigmoid belief networks and Boltzmann machines, the combination of a lower and an upper bound is important for inference, since conditional probabilities (which are ratios of likelihoods) can be bounded as well. This article focuses solely on lower bounds, but more information about upper bounds can be found in Jaakkola and Jordan (1996a, 1996b). In section 5 we present some numerical results and compare the third-order bound with several existing approximation techniques. Finally, in section 6, we present our conclusions.

2 Higher-Order Bounds
This section shows the general procedure to obtain a (k+2)-order bound given a polynomial bound of order k and then applies this method to the known straight-line bound of the exponential function, which results in a third-order bound.

2.1 Upgrading a Bound. Suppose we have a function f_0(x) and a bound b_0(x) such that

\forall x: \quad f_0(x) \ge b_0(x). \qquad (2.1)
Figure 1: The three stages in deriving a new bound. In this case f_2(x) = -\log\cosh x and \nu = 0.
Let f_1(x) and b_1(x) be two primitive functions of f_0(x) and b_0(x):

f_1(x) = \int dx\, f_0(x) \quad \text{and} \quad b_1(x) = \int dx\, b_0(x). \qquad (2.2)

This defines the functions f_1(x) and b_1(x) up to a constant. We choose this constant such that for some \nu: f_1(\nu) = b_1(\nu). Since the surface under f_0(x) at the left as well as at the right of x = \nu is obviously greater than the surface under b_0(x) (which follows from equation 2.1) and the primitive functions are equal at x = \nu (by construction), we know

f_1(x) \le b_1(x) \quad \text{for } x \le \nu, \qquad f_1(x) \ge b_1(x) \quad \text{for } x \ge \nu, \qquad (2.3)

or, in shorthand notation, f_1(x) \lessgtr b_1(x). It is important to understand that the above result holds even if f_0(\nu) > b_0(\nu). Therefore, we are completely free to choose \nu.

If we repeat this and let f_2(x) and b_2(x) be two primitive functions of f_1(x) and b_1(x), again such that f_2(\nu) = b_2(\nu), one can easily verify that

\forall x: \quad f_2(x) \ge b_2(x). \qquad (2.4)

Note that by construction, \nu is the point where f_1(\nu) = b_1(\nu), which is necessary for the above result to hold. This procedure is illustrated in Figure 1. Thus, given a lower bound of f_0(x), we can create another lower bound. In case the given bound is a polynomial of degree k, the new bound is a polynomial of degree k + 2 with one additional free parameter. In the next section, we apply this procedure to the exponential function.
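The two integration steps can be checked numerically. The following Python sketch (NumPy; the grid and the starting pair f_0 = e^x, b_0 = 0 with \nu = 0 are our own illustrative choices, not from the text) builds the primitives on a grid, matches them at \nu, and verifies the sign pattern of equation 2.3 and the lower bound of equation 2.4.

```python
import numpy as np

def primitives(f_vals, b_vals, x, nu_idx):
    """Grid primitives of f and b, shifted so both vanish at x[nu_idx]."""
    dx = x[1] - x[0]
    F = np.cumsum(f_vals) * dx
    B = np.cumsum(b_vals) * dx
    return F - F[nu_idx], B - B[nu_idx]

x = np.linspace(-3.0, 3.0, 6001)
nu_idx = len(x) // 2                  # nu = 0
f0, b0 = np.exp(x), np.zeros_like(x)  # start: exp(x) >= 0
tol = 1e-2                            # slack for discretization error

f1, b1 = primitives(f0, b0, x, nu_idx)
# eq. 2.3: f1 <= b1 left of nu, f1 >= b1 right of nu
assert np.all(f1[:nu_idx] <= b1[:nu_idx] + tol)
assert np.all(f1[nu_idx:] >= b1[nu_idx:] - tol)

f2, b2 = primitives(f1, b1, x, nu_idx)
# eq. 2.4: the second primitive is again a global lower bound
assert np.all(f2 >= b2 - tol)
```

Because the starting bound here is the trivial one used in section 2.2, the resulting b_2 is, up to grid error, the tangent of the exponential function at \nu.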
2.2 The Exponential Function. Starting with the trivial bound e^x \ge 0 and applying the procedure of section 2.1, we derive, for all \nu,

f_0(x) = e^x \ge 0 = b_0(x) \qquad (2.5)
f_1(x) = e^x \lessgtr e^{\nu} = b_1(x) \qquad (2.6)
f_2(x) = e^x \ge e^{\nu}(1 + x - \nu) = b_2(x). \qquad (2.7)

One can verify that the conditions f_1(\nu) = b_1(\nu) and f_2(\nu) = b_2(\nu) are indeed met, and therefore we can conclude that b_2(x) is a lower bound on the exponential function. The function b_2(x) here is the tangent of e^x at the point x = \nu, and due to the convexity of the exponential function, that is indeed a lower bound. Nothing stops us, however, from applying the procedure again, this time to the lower bound f_2(x) \ge b_2(x), which yields

f_2(x) = e^x \ge e^{\nu}(1 + x - \nu) = b_2(x) \qquad (2.8)
f_3(x) = e^x \lessgtr e^{\mu} + e^{\nu}\left( (1 + \mu - \nu)(x - \mu) + \frac{1}{2}(x - \mu)^2 \right) = b_3(x) \qquad (2.9)
f_4(x) = e^x \ge e^{\mu}\left\{ 1 + x - \mu + e^{\nu - \mu}\left( \frac{1 - (\nu - \mu)}{2}(x - \mu)^2 + \frac{1}{6}(x - \mu)^3 \right) \right\} = b_4(x), \qquad (2.10)

or (substituting \lambda = \nu - \mu)

\forall x, \mu, \lambda: \quad f_4(x) = e^x \ge e^{\mu}\left\{ 1 + x - \mu + e^{\lambda}\left( \frac{1 - \lambda}{2}(x - \mu)^2 + \frac{1}{6}(x - \mu)^3 \right) \right\} = b_4(x). \qquad (2.11)
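As a sanity check of equation 2.11, the following sketch (the grid and the particular \mu, \lambda values are our own illustrative choices) evaluates b_4 and confirms that it stays below e^x everywhere and touches it at x = \mu:

```python
import numpy as np

def b4(x, mu, lam):
    """Third-order lower bound on exp(x), eq. 2.11."""
    t = x - mu
    return np.exp(mu) * (1.0 + t
                         + np.exp(lam) * (0.5 * (1.0 - lam) * t**2
                                          + t**3 / 6.0))

x = np.linspace(-6.0, 6.0, 1201)
for mu in (-2.0, 0.0, 1.5):
    for lam in (-1.0, 0.0, 1.0):
        # global lower bound, for every mu and lambda
        assert np.all(np.exp(x) >= b4(x, mu, lam) - 1e-9)
        # the bound touches exp(x) at x = mu
        assert np.isclose(b4(np.array(mu), mu, lam), np.exp(mu))
```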
In Figure 2, the derived bound is shown for some values of \mu and \lambda. The role of \mu is clearly to determine at which point the bound equals the exponential function. The role of \lambda can be seen as tightening the bound at the left of x = \mu for negative \lambda and at the right for positive \lambda. The price we pay, however, is a less accurate approximation at the opposite side.

3 Boltzmann Machines
In this section we derive a third-order lower bound on the partition function of a Boltzmann machine neural network using the results from the previous section. The probability of finding a Boltzmann machine in a state \vec{s} \in \{-1, +1\}^N is given by

P(\vec{s}) = \frac{1}{Z} \exp(-E(\vec{s})), \qquad (3.1)
Figure 2: Examples of the bound as in equation 2.11 for some values of \mu and \lambda.
where

-E(\vec{s}) = \frac{1}{2} h^{ij} s_i s_j + h^i s_i. \qquad (3.2)

There is an implicit summation over all repeated indices (Einstein's convention), unless stated otherwise. Z is the normalization constant known as the partition function,

Z = \sum_{\text{all } \vec{s}} \exp(-E(\vec{s})), \qquad (3.3)

which requires a sum over all, exponentially many, states. Therefore, this sum is intractable to compute even for rather small networks.

3.1 A Third-Order Approximation. To compute the partition function approximately, we use the third-order bound1 from equation 2.11. We obtain

Z = \sum_{\text{all } \vec{s}} \exp(-E(\vec{s})) \ge \sum_{\text{all } \vec{s}} e^{\mu(\vec{s})} \left\{ 1 - \Delta E + e^{\lambda(\vec{s})} \left( \frac{1 - \lambda(\vec{s})}{2} \Delta E^2 - \frac{1}{6} \Delta E^3 \right) \right\}, \qquad (3.4)

where \Delta E = \mu(\vec{s}) + E(\vec{s}). Note that the former constants \mu and \lambda are now functions of \vec{s}, since we may take different values of \mu and \lambda for each term in the sum. In principle, these functions can take any form. If we take,

1 If we use the first-order bound from equation 2.7, the result would be the well-known mean-field bound.
for instance, \mu(\vec{s}) = -E(\vec{s}), the approximation is exact. This would lead, however, to the same intractability as before, and therefore we must restrict our choice to those functions that make equation 3.4 tractable to compute. We choose \mu(\vec{s}) and \lambda(\vec{s}) to be linear with respect to the neuron states s_i:

\mu(\vec{s}) = \mu^i s_i + \mu^0 \qquad (3.5)
\lambda(\vec{s}) = \lambda^i s_i + \lambda^0. \qquad (3.6)

One may view \mu(\vec{s}) and \lambda(\vec{s}) as (the negative of) the energy functions for the Boltzmann distributions P \sim \exp(\mu(\vec{s})) and P \sim \exp(\lambda(\vec{s})). Therefore, we will sometimes speak of "the distribution \mu(\vec{s})." Since these linear energy functions correspond to factorized distributions,2 we can compute the right-hand side of equation 3.4 in a reasonable time (i.e., polynomially increasing with the network size). For instance,

Z_\mu = \sum_{\text{all } \vec{s}} e^{\mu(\vec{s})} = \sum_{\text{all } \vec{s}} \exp\left( \mu^i s_i + \mu^0 \right) = e^{\mu^0} \prod_i 2 \cosh \mu^i, \qquad (3.7)

or averages with respect to the distribution \mu(\vec{s}),

\langle s_i s_j \rangle = \frac{1}{Z_\mu} \sum_{\text{all } \vec{s}} e^{\mu(\vec{s})} s_i s_j = \tanh \mu^i \tanh \mu^j. \qquad (3.8)
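For a small network, the closed forms of equations 3.7 and 3.8 can be compared against brute-force enumeration of all 2^N states. A sketch (NumPy; the size N = 6 and the random \mu are illustrative choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 6
mu0 = rng.normal()
mu = rng.normal(size=N)          # linear energy mu(s) = mu^i s_i + mu^0

states = np.array(list(product([-1, 1], repeat=N)), dtype=float)
weights = np.exp(states @ mu + mu0)

# eq. 3.7: the 2^N-term sum has a closed form
assert np.isclose(weights.sum(), np.exp(mu0) * np.prod(2.0 * np.cosh(mu)))

# eq. 3.8: averages factorize under the distribution mu(s)
i, j = 1, 4
avg = (weights * states[:, i] * states[:, j]).sum() / weights.sum()
assert np.isclose(avg, np.tanh(mu[i]) * np.tanh(mu[j]))
```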
Since equation 3.4 is a lower bound, we may maximize it with respect to its variational parameters \mu^0, \mu^i, \lambda^0, and \lambda^i to obtain the tightest bound.

3.2 A Special Case of the Third-Order Bound. Although \lambda^i = 0 certainly does not correspond to a maximum of equation 3.4, we choose to set these parameters to zero, because numerical experiments (not presented in this article) indicate that for the real optimum of equation 3.4, the value of \lambda^i is close to zero for Boltzmann machines (given a gaussian distribution of the weights), and, more important, this simplifies equation 3.4 enormously. This enables us to compare the third-order bound with the mean-field bound. The reader should keep in mind, however, that the calculations in the rest of this article could be done for \lambda^i \ne 0, which would have tightened the bound even further. Given \lambda^i = 0, we can rewrite the bound as

Z \ge Z_\mu \left\{ 1 - \langle \Delta E \rangle + e^{\lambda^0} \left( \frac{1 - \lambda^0}{2} \langle \Delta E^2 \rangle - \frac{1}{6} \langle \Delta E^3 \rangle \right) \right\} = B(E, \mu, \lambda), \qquad (3.9)

where Z_\mu is the partition function of the distribution \mu(\vec{s}) as defined in equation 3.7 and \langle \cdot \rangle denotes an average over that (factorized) distribution as defined in equation 3.8.

2 This is the simplest choice. See, however, Barber and Wiegerinck (1998).
To find the tightest bound, we set all derivatives with respect to the variational parameters to zero. In appendix A this is done explicitly for \mu^0 and \lambda^0, which yields

\mu^0 = -\left\langle E + \mu^i s_i \right\rangle, \qquad \lambda^0 = -\frac{1}{3} \left\langle \Delta E^3 \right\rangle / \left\langle \Delta E^2 \right\rangle. \qquad (3.10)

Using this solution, the bound reduces to

\log Z \ge \log Z_\mu + \log\left\{ 1 + \frac{1}{2} e^{\lambda^0} \langle \Delta E^2 \rangle \right\}, \qquad (3.11)

where the term \langle \Delta E^2 \rangle corresponds to the variance (or second-order moment) of E + \mu^i s_i with respect to the distribution \mu(\vec{s}), since \mu^0 = -\langle E + \mu^i s_i \rangle. \lambda^0, on the other hand, is proportional to the third-order moment, according to equation 3.10. Explicit expressions for these moments can be found in appendix B.

We still have not set the variational parameters \mu^i to appropriate values. Taking the derivative with respect to these parameters yields

\frac{\partial B}{\partial \mu^i} = Z_\mu \left\{ -\langle \Delta E\, s_i \rangle + e^{\lambda^0} \left( (1 - \lambda^0) \langle \Delta E\, s_i \rangle - \frac{\lambda^0}{2} \langle \Delta E^2 s_i \rangle - \frac{1}{6} \langle \Delta E^3 s_i \rangle \right) \right\} = 0, \qquad (3.12)
which is an implicit equation for \mu^i, analogous to the standard mean-field equations. We can solve for \mu^i numerically by iteration. Wherever we speak of "fully optimized," we refer to the solution for \mu^i given by equation 3.12.

Although the fully optimized \mu^i give the tightest bound, we focus for a moment on the suboptimal case where the \mu^i correspond to the mean-field solution, given by

\forall i: \quad m_i \overset{\text{def}}{=} \tanh \mu^i = \tanh\left( h^i + h^{ij} m_j \right). \qquad (3.13)
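The chain mean-field bound \le third-order bound \le \log Z can be verified end to end on a tiny Boltzmann machine. The following sketch (the network size, weight scales, and damped iteration schedule are our own illustrative choices) solves equation 3.13, evaluates equations 3.10 and 3.11 with \lambda^i = 0, and compares against the exact partition function:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N = 6
U = np.triu(rng.normal(scale=0.5 / np.sqrt(N), size=(N, N)), 1)
W = U + U.T                                # symmetric h^{ij}, zero diagonal
h = rng.normal(scale=0.1, size=N)          # thresholds h^i

# mean-field fixed point, eq. 3.13, by damped iteration
m = np.zeros(N)
for _ in range(2000):
    m = 0.5 * m + 0.5 * np.tanh(h + W @ m)
mu = h + W @ m                             # mu^i such that tanh(mu^i) = m_i

states = np.array(list(product([-1, 1], repeat=N)), dtype=float)
minus_E = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ h
logZ = np.log(np.exp(minus_E).sum())       # exact log Z, eq. 3.3

p = np.exp(states @ mu)                    # factorized distribution mu(s)
p /= p.sum()
mu0 = p @ (minus_E - states @ mu)          # optimal mu^0 = -<E + mu^i s_i>
dE = mu0 + states @ mu - minus_E           # Delta E = mu(s) + E(s); <dE> = 0
m2, m3 = p @ dE**2, p @ dE**3
lam0 = -m3 / (3.0 * m2)                    # optimal lambda^0, eq. 3.10

logZ_mf = mu0 + np.log(2.0 * np.cosh(mu)).sum()           # log Z_mu, eq. 3.7
third = logZ_mf + np.log(1.0 + 0.5 * np.exp(lam0) * m2)   # eq. 3.11

assert logZ_mf <= third <= logZ + 1e-9
```

The last assertion is exactly the ordering argued in the text: the extra logarithmic term is nonnegative, and the whole expression is still a lower bound on \log Z.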
For this choice of \mu^i, the \log Z_\mu term in equation 3.11 is equal to the optimal mean-field bound on the partition function. Since the last term in equation 3.11 is always positive, we conclude that the third-order bound is always tighter than the mean-field bound. Additionally, a real optimization of \mu^i using equation 3.12 would tighten the third-order bound further.

The relation between TAP and the third-order bound is clear in the region of small weights. If we assume that terms of O\left( (h^{ij})^3 \right) are negligible, a small-weight expansion of equation 3.11 yields (see also appendix B)

\log Z \ge \log Z_\mu + \log\left\{ 1 + \frac{1}{2} e^{\lambda^0} \langle \Delta E^2 \rangle \right\} \approx \log Z_\mu + \frac{1}{4} \left( h^{ij} \right)^2 \left( 1 - m_i^2 \right)\left( 1 - m_j^2 \right), \qquad (3.14)
where the last term is equal to the TAP correction term (Kappen & Rodríguez, 1999). Thus, the third-order bound tends to the TAP approximation for small weights. For larger weights, however, the TAP approximation overestimates the partition function, whereas the third-order approximation is still a bound.

4 Sigmoid Belief Networks
In the previous section we saw how to derive a third-order bound on the partition function. For sigmoid belief networks, we can use the same strategy to obtain a third-order bound on the likelihood that the visible neurons of the network are in some particular state. We start with a short description of sigmoid belief networks (see also Neal, 1992). After that, we derive, analogous to the previous section, a third-order bound.

A sigmoid belief network has connections h^{ij} such that h^{ij} = 0 for i \le j. For i > j, it is zero if neuron j is not a parent of i. The probability of finding a neuron in state +1 given all its parents can be written as

P\left( s_i = +1 \mid \text{pa}(s_i) \right) = \sigma\left( h^{ij} s_j + h^i \right) = \frac{1}{1 + \exp\left( -2 h^{ij} s_j - 2 h^i \right)}, \qquad (4.1)

and hence

P\left( s_i \mid \text{pa}(s_i) \right) = \frac{\exp\left( h^{ij} s_i s_j + h^i s_i \right)}{2 \cosh\left( h^{ij} s_j + h^i \right)}, \qquad (4.2)

where there is no implicit summation over i. The joint probability is given by

P(\vec{s}) = \prod_i P\left( s_i \mid \text{pa}(s_i) \right) = \exp(-E(\vec{s})), \qquad (4.3)

with

-E(\vec{s}) = h^{ij} s_i s_j + h^i s_i - \sum_p \log 2 \cosh\left( h^{pi} s_i + h^p \right). \qquad (4.4)
The last term, known as the local normalization, does not appear in the Boltzmann machine energy function.
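Equations 4.3 and 4.4 can be verified directly for a network small enough to enumerate: the product of conditionals is a properly normalized joint distribution, and marginalizing the hidden units gives the likelihood of a visible pattern. A sketch (NumPy; layer sizes, weight scales, and the clamping pattern are our own illustrative choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n_hid, n_vis = 4, 3
n = n_hid + n_vis
# strictly lower-triangular weights: unit i may have parents j < i;
# the first n_hid units are hidden, the last n_vis are visible
H = np.tril(rng.normal(scale=0.5, size=(n, n)), -1)
th = rng.normal(scale=0.3, size=n)

def minus_E(s):
    """Negative energy, eq. 4.4: s_i f_i - sum_p log 2 cosh f_p."""
    f = H @ s + th
    return s @ f - np.sum(np.log(2.0 * np.cosh(f)))

# exp(-E) is a normalized joint over all 2^n states (eq. 4.3)
total = sum(np.exp(minus_E(np.array(s, dtype=float)))
            for s in product([-1, 1], repeat=n))
assert np.isclose(total, 1.0)

# likelihood of one visible pattern: sum over the hidden states only
visible = -np.ones(n_vis)
logL = np.log(sum(np.exp(minus_E(np.concatenate([np.array(hid, float),
                                                 visible])))
                  for hid in product([-1, 1], repeat=n_hid)))
assert logL < 0.0          # a log-probability
```

The hidden-state sum in the last step is the quantity the rest of this section bounds; it costs 2^{n_hid} terms here, which is exactly what becomes intractable for large networks.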
We have similar difficulties as in equation 3.3 if we want to compute the log-likelihood, given by

\log L = \log \sum_{\vec{s} \in H} P(\vec{s}) = \log \sum_{\vec{s} \in H} \exp(-E(\vec{s})). \qquad (4.5)

The sum is taken only over the hidden units; the visible units are clamped to some given pattern. Using Greek indices for the visible units, the clamped energy is given by

-E(\vec{s}) = h^{ij} s_i s_j + \left( h^i + \left( h^{i\alpha} + h^{\alpha i} \right) s_\alpha \right) s_i + h^{\alpha\beta} s_\alpha s_\beta + h^\alpha s_\alpha - \sum_p \log 2 \cosh\left( h^{pi} s_i + h^{p\alpha} s_\alpha + h^p \right), \qquad (4.6)
where the index p runs over both hidden and visible units.

4.1 The Problem of Local Normalization. This problem has certain similarities with the Boltzmann machine. However, it is well known that due to the nonlinear \log 2\cosh term in the sigmoid belief energy, the bound as in equation 3.4 is intractable for all choices of \mu(\vec{s}) and \lambda(\vec{s}). Therefore, it is necessary to derive an additional bound such that the approximated likelihood is tractable to compute. We make use of the concavity of the log function to find a straight-line upper bound,3 given by

\forall \xi: \quad \log x \le e^{\xi} x - \xi - 1. \qquad (4.7)

We use this inequality to bound the \log 2\cosh term in equation 4.6 for each p separately, where we choose \xi_p to be

\xi_p(\vec{s}) = \xi^{pi} s_i + \xi^p. \qquad (4.8)
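Equation 4.7 is easily checked numerically, including its application to the local-normalization term (the grid and \xi values are our own illustrative choices):

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)
two_cosh = np.exp(x) + np.exp(-x)
for xi in (-2.0, 0.0, 1.0):
    # eq. 4.7 with the argument t = 2 cosh x
    upper = np.exp(xi) * two_cosh - xi - 1.0
    assert np.all(np.log(two_cosh) <= upper + 1e-12)

# the straight line touches log t at t = exp(-xi)
t = np.exp(-1.5)
assert np.isclose(np.log(t), np.exp(1.5) * t - 1.5 - 1.0)
```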
We derive

-E(\vec{s}) \ge -\tilde{E}(\vec{s}) = h^{ij} s_i s_j + \tilde{h}^i s_i + \tilde{h} - \sum_p \left\{ \exp \xi_+^p(\vec{s}) + \exp \xi_-^p(\vec{s}) \right\} \qquad (4.9)

with

\tilde{h}^i = h^i + \left( h^{i\alpha} + h^{\alpha i} \right) s_\alpha + \sum_p \xi^{pi} \qquad (4.10)
\tilde{h} = h^{\alpha\beta} s_\alpha s_\beta + h^\alpha s_\alpha + \sum_p \xi^p + N \qquad (4.11)
\xi_+^p(\vec{s}) = \left( \xi^{pi} + h^{pi} \right) s_i + \xi^p + h^{p\alpha} s_\alpha + h^p \qquad (4.12)
\xi_-^p(\vec{s}) = \left( \xi^{pi} - h^{pi} \right) s_i + \xi^p - h^{p\alpha} s_\alpha - h^p. \qquad (4.13)

3 This bound is also derivable using the method from section 2, starting with -1/x^2 \le 0.
By introducing this second bound, we have rewritten the likelihood in an already solved form (see equation 3.4), since

L = \sum_{\vec{s} \in H} \exp(-E(\vec{s})) \ge \sum_{\vec{s} \in H} \exp(-\tilde{E}(\vec{s})) \ge B(\tilde{E}, \mu, \lambda), \qquad (4.14)

where the bound B(\tilde{E}, \mu, \lambda) is tractable to compute. For instance,

\left\langle -\tilde{E} \right\rangle = h^{ij} \langle s_i s_j \rangle + \tilde{h}^i \langle s_i \rangle + \tilde{h} - \sum_p \left\langle \exp \xi_+^p(\vec{s}) \right\rangle - \sum_p \left\langle \exp \xi_-^p(\vec{s}) \right\rangle. \qquad (4.15)

Since \xi_+^p(\vec{s}) and \xi_-^p(\vec{s}) correspond to factorized distributions, the last two terms are easily computed. Unfortunately, due to the summation over p, the complexity of the third-order bound increases from O(N^3) to O(N^4).4

In Saul et al. (1996) the bound on the \log 2\cosh term was derived in a different way. Their result is the special case

\xi^{pi} = a_p h^{pi}, \qquad \xi^p = a_p \left( h^{p\alpha} s_\alpha + h^p \right), \qquad (4.16)

where the a_p are N variational parameters. One might argue that the number of variational parameters in our approach (proportional to N^2) is too high for practical applications. Fortunately, we can reduce this number. It turns out that the optimal choice for \xi^{pi} is zero if the corresponding weight h^{pi} is zero, as one can easily verify by investigating the derivative of the bound with respect to those parameters. If the network does not have too many parents per node (which is often the case), this corresponds mathematically to a few nonzero weights for each node. Therefore, the number of variational parameters is linear rather than quadratic in N.

The attentive reader might have thought of taking the quadratic bound,5

\forall \nu: \quad \log 2\cosh x \le \frac{1}{2}(x - \nu)^2 + (x - \nu) \tanh \nu + \log 2\cosh \nu, \qquad (4.17)
instead of equation 4.7. Although this choice indeed leads to a tractable function that obeys the bounding property, we have experimental evidence that it generally gives a worse approximation, mainly for larger weights and thresholds.

4 With the assumption of not too many neurons in each layer, we can reduce the complexity to O(N^3). In fact, for a layered network, the order of the computational complexity is \max(N^3, N^2 P^2), where P is the number of neurons in the largest layer.

5 This can be derived using the method from section 2, starting with 1 - \tanh^2 x \le 1.
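The quadratic bound of equation 4.17 can be checked the same way (the grid and \nu values are our own illustrative choices):

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 1601)
log2cosh = np.log(2.0 * np.cosh(x))
for nu in (-1.0, 0.0, 2.0):
    # eq. 4.17: a parabola with unit curvature touching log 2 cosh at nu
    quad = (0.5 * (x - nu)**2 + (x - nu) * np.tanh(nu)
            + np.log(2.0 * np.cosh(nu)))
    assert np.all(log2cosh <= quad + 1e-12)
```

The bound holds because the second derivative of \log 2\cosh x, namely 1 - \tanh^2 x, never exceeds 1.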
5 Results
In this section we compare the third-order bound with the mean-field bound: in section 5.1 for Boltzmann machines and in section 5.2 for sigmoid belief networks.

5.1 Boltzmann Machines. In section 3 we derived the third-order bound for Boltzmann machines. We distinguish three bounds on the partition function: (1) the mean-field bound, B_mf; (2) the third-order bound using the (easy to obtain) mean-field solution (see equation 3.13) for \mu^i, B_tm; and (3) the fully optimized (see equation 3.12) third-order bound, B_to. The reason that we consider B_tm apart from B_to is that the first has a lower computational complexity, which is especially important for sigmoid belief networks. Note that B_mf \le B_tm \le B_to \le Z.

We created networks of N = 20 neurons with thresholds drawn from a gaussian N(m = 0, \sigma = \sigma_1) and weights drawn from N(m = 0, \sigma = \sigma_2 / \sqrt{N}) for various \sigma_1 and \sigma_2, which is a so-called SK model (Sherrington & Kirkpatrick, 1975). In Figure 3 the exact partition function versus \sigma_2 is shown for \sigma_1 = 0.1. In the same figure, the mean-field and fully optimized third-order bounds are shown together with the TAP approximation.

For large \sigma_2, the exact partition function is linear in \sigma_2, whereas this is not necessarily the case for small \sigma_2 (see Figure 3). In fact, in the absence of thresholds, the partition function is quadratic for small \sigma_2. Since TAP is based on a Taylor expansion in the weights up to second order, it is very accurate in the small-weight region. However, as soon as the size of the weights exceeds the radius of convergence of this expansion (this occurs approximately at \sigma_2 = 1), the approximation rapidly diverges from the true value (Leisink & Kappen, 1999; Plefka, 1981). Although the figure might suggest otherwise, the TAP approximation is neither an upper nor a lower bound. The mean-field and third-order approximations are both linear for large \sigma_2, which prevents them from crossing the true partition function and violating the bound.

For small weights (\sigma_2 < 1) we see that the third-order bound is much closer to the exact curved form than mean field is. Thus, if one wants to preserve the bounding property but finds mean field too poor to work with, the third-order approximation is worth considering.

We define the relative improvement from bound B_x to B_y by

\eta_{x \to y} = \frac{\log B_y - \log B_x}{\log Z - \log B_x}. \qquad (5.1)

This quantity takes on values from zero to one, for minimal to maximal improvement, respectively. We consider \eta_{mf \to tm} and \eta_{tm \to to}. For several values of \sigma_1 and \sigma_2, we computed the three bounds B_mf, B_tm, and B_to, and the exact partition function. In Figure 4 the relative improvements are shown.
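Equation 5.1 in code, evaluated on hypothetical bound values (not taken from the experiments):

```python
import numpy as np

def eta(logB_x, logB_y, logZ):
    """Relative improvement from bound B_x to B_y, eq. 5.1."""
    return (logB_y - logB_x) / (logZ - logB_x)

# an unchanged bound gives 0; a bound that becomes exact gives 1
assert eta(-10.0, -10.0, -8.0) == 0.0
assert np.isclose(eta(-10.0, -8.0, -8.0), 1.0)
assert 0.0 < eta(-10.0, -9.0, -8.0) < 1.0
```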
Figure 3: The exact partition function and three approximations: mean field, TAP, and fully optimized third order. The standard deviation of the thresholds is 0.1. Each point was averaged over 100 randomly generated networks of 20 neurons. The inner plot shows the behavior of the approximating functions for small weights.
[Figure 4 panels: \eta_{mf \to tm} (left) and \eta_{tm \to to} (right), plotted against the SD of the weights / \sqrt{N} (horizontal axis) and the SD of the thresholds (vertical axis).]
Figure 4: Relative improvement of the bound. (Left) Comparison between third order and mean field, both with the mean-field solution for \mu^i. (Right) Comparison between the third-order bound with the mean-field solution for \mu^i and the optimal solution. Zero is no improvement; one is maximal improvement. Each point was averaged over 20 randomly generated networks.
We conclude that a 30% to 100% improvement is due to the use of the third-order bound. Using the fully optimized \mu^i instead of the (easier to obtain) mean-field solution has only a minor effect (about 10%). We should mention,
[Figure 5 panels: both ratios plotted against the SD of the weights / \sqrt{N} (horizontal axis) and the SD of the thresholds (vertical axis).]
Figure 5: Ratio between the third-order and mean-field approximations. (Left) Ratio of the mean firing rates. (Right) Ratio of the correlations. The solid line indicates a mean-field error of 0.1, the dashed line a third-order error of 0.1.
however, that the full optimization becomes relatively more important for large \sigma_2. In this regime, however, any expansion-based approximation is too inaccurate for practical purposes.

Although the partition function approximation is more accurate, this is not necessarily the case for the mean firing rates and correlations in the system. For a Boltzmann machine, the mean firing rates are equal to \partial \log Z / \partial h^i, and, in the same way, the correlations are given by \partial \log Z / \partial h^{ij}. In Figure 5 we plotted the ratio of these statistics obtained by the third-order bound and by the mean-field approximation, so that a number smaller than one indicates that the third-order approximation is better. This is done for 25 networks of 10 neurons for all values of \sigma_1 and \sigma_2. If we define a sum squared error measure,

\text{Error}^2 = \frac{1}{n} \sum_i \left( \langle s_i \rangle_{\text{exact}} - \langle s_i \rangle_{\text{approx}} \right)^2, \qquad (5.2)
and similarly for the correlations, then the solid and dashed lines in the figures correspond to an error of 0.1 for mean field and third order, respectively. We conclude that the third-order method is better than mean field, at least in this region of weights and thresholds.

5.2 Sigmoid Belief Networks. As we saw in the previous section, we have two types of third-order bounds: one for which we fully optimize all variational parameters and one that we compute using the mean-field solution for \mu^i. We saw in Figure 4 that the major improvement is due to the third-order bound and not to the choice of optimization. Therefore, we propose to consider only the mean-field solution, since the computation of the mean-field parameters is considerably less complex than a full optimization. In the following we explore the computational quality of this bound.

Figure 6: A toy problem.
It is clear from the table that the gain by using the third-order bound is almost independent of the size of the thresholds. The size of the weights, however, plays an important role, and we see that the relative error decreases by more than a factor of 20 (the relative improvement is about 95%) for small weights to a factor of 4 for b D 2. Keep in mind that the third-order bound is always better (i.e., g > 0)
Figure 7: Typical example of the error made by the various bounds. The left two bars are mean field; the right two bars are third-order bounds. The word full refers to a full optimization of all \xi^{pi}, whereas the word partial stands for the special choice for \xi_p(\vec{s}) as in equation 4.16. The gray surface is the error due to the bound on the exponential function, whereas black refers to the bound on the \log 2\cosh term.
Figure 8: Histograms of the relative error for a = b = 1. The error of the third-order bound is roughly 10 times smaller than the error of the mean-field bound.
for large weights also. In Figure 8, we show the histograms of the relative error for the case a = b = 1.

Besides this toy problem, we address the quality of the approximation and the computation time for larger networks. Let us first define the cascade network as shown in Figure 9. This is a layered network with L layers.
Table 1: Relative Improvement (as Defined in Equation 5.1) of the Third-Order Bound with Full Optimization of \xi_p(\vec{s}) Compared to the Mean-Field Bound with Partial Optimization.

  a \ b    0.5      1.0      1.5      2.0
  0.5     0.968    0.903    0.837    0.767
  1.0     0.968    0.906    0.839    0.769
  1.5     0.969    0.907    0.840    0.771
  2.0     0.971    0.911    0.842    0.774

Note: Boldface type (the entry 0.906 at a = b = 1) denotes the toy problem as in Saul et al. (1996).
Figure 9: Cascade network. The lth layer has l neurons. Thresholds are initialized from a gaussian with zero mean and standard deviation \sigma_1; weights between layers l and l+1 are drawn from a gaussian with zero mean and standard deviation \sigma_2 / \sqrt{l}. The \sqrt{l} makes \sigma_1 and \sigma_2 comparable in magnitude.
The lth layer (l = 1, ..., L) has l neurons, which are fully connected to the previous and the next layer. A cascade network with L layers thus has N = L(L+1)/2 neurons. For each L up to 37, we initialized 10 networks with random weights and thresholds (\sigma_1 = \sigma_2 = 0.1) and computed the mean-field and third-order bounds on the log-likelihood. Since no units were clamped, the exact log-likelihood is zero. From Figure 10 we conclude that the third-order bound gives a significant error reduction, although computation time may be a drawback for large networks.

Although a better bound on the likelihood is nice, a more important aspect is whether it results in a better-trained network. First, we have to define what we mean by "a better-trained network." Our goal can be to maximize the likelihood of a given data set by setting all the weights and thresholds to appropriate values. In that case, we view exact, mean field, and third order just as three different models, each learning the data set as well as possible, allowing a totally different set of weights for each model. On the other hand, our goal can be to approximate the exact method as closely as possible, such that the weights, thresholds, and mean activities
Figure 10: Computation of the likelihood for the cascade network. (Left) Computation time on a double log scale for exact, mean field, and third order. (Right) Estimated log-likelihood per neuron for mean field and third order. Since no neurons are clamped, the exact log-likelihood is zero. Each point was averaged over 10 networks. The dotted lines are three standard deviations from the mean.
obtained with our approximation method closely resemble the exact values. An accurate approximation in this sense makes it possible to reveal the hidden structure of a particular data set.

We started with a cascade network (see Figure 9) with L = 5 layers and initialized the network with zero thresholds and gaussian distributed weights with standard deviation \sigma_2 / \sqrt{l}. Then we computed the exact probabilities, p(\vec{s}), for all 2^5 = 32 states of the bottom layer. Finally, we learned this probability distribution by maximizing the total likelihood,

\log L_{\text{total}} = \sum_{\vec{s}} p(\vec{s}) \log L(\vec{s}), \qquad (5.3)

with a standard gradient ascent procedure. The log-likelihood is given by equation 4.5 for the exact method and by equation 4.14 for the mean-field and third-order methods. The thresholds were initialized at zero and not adapted.

This results in three sets of weights: (1) the exact weights, h_ex (used to generate the probability distribution); (2) the mean-field weights, h_mf; and (3) the third-order weights, h_th. The initialization of the weights was random but identical for the three methods. In this way, we force them to learn the same hidden representation, which enables us to compare the weights directly. This is necessary, since any permutation of the hidden nodes would result in the same maximum likelihood. Additionally, we repeated the experiments initializing with the exact weights before maximization. The exact method stopped immediately, but mean field and third order adapted the weights slightly and came up with results comparable to those for the random initialization.

When our goal is to maximize the likelihood regardless of the underlying model, the three methods do not differ very much. For \sigma_2 \le 1, the relative
2166
M. A. R. Leisink and H. J. Kappen
Figure 11: Difference between the exact and approximated weights after training a cascade network with five layers, $\sigma_1 = 0$ and $\sigma_2 = 0.5$. The upper two histograms correspond to mean field, the lower two to third order. Left is the histogram of weights from a hidden node to another hidden node (upper layers); right is from a hidden node to a visible node (layer 4 to 5).
error for the approximation versus the exact method is usually smaller than 0.1%. When our goal, however, is to approximate the exact model and find the weights that explain the hidden structure of the data set, the third-order method turns out to be much better than mean field. We ran 20 learning problems starting with random initialization. In Figure 11, we show the histograms of the difference between the exact and approximated weights for mean field and third order. It is striking that mean field manages only to learn the weights between layers 4 and 5, whereas the third-order method is capable of learning all weights up to the top quite accurately. If we define the error in the weights by
\[ \mathrm{Error}^2 = \sum_{ij} \left( h_{ij}^{\mathrm{exact}} - h_{ij}^{\mathrm{approx}} \right)^2, \tag{5.4} \]
the average error for mean field was 0.566 and for third order 0.073. In all runs, the third-order error was less than the mean-field error.
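The error measure of equation 5.4 is a plain sum of squared weight differences; as a sketch (the matrix values below are made up for illustration):

```python
import numpy as np

def weight_error(h_exact, h_approx):
    """Equation 5.4: summed squared difference between two weight matrices."""
    return float(np.sum((h_exact - h_approx) ** 2))

h_ex = np.array([[0.0, 0.5], [0.5, 0.0]])  # hypothetical exact weights
h_mf = np.array([[0.0, 0.1], [0.1, 0.0]])  # hypothetical approximated weights
err2 = weight_error(h_ex, h_mf)            # two entries differ by 0.4 each
```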
A Tighter Bound for Graphical Models
2167
We conclude that both the mean-field and third-order methods are capable of finding a good log-likelihood, although third order is still slightly better. Mean field, viewed as an approximation of the exact method, however, fails to learn the hidden structure accurately.

6 Conclusion
We explained a procedure to find any odd-order polynomial bound for the exponential function. A $2k-1$ order polynomial bound has $k$ free parameters per binary variable. For the third-order bound, these are $\mu$ and $\lambda$. We can apply this bound to the exponential function to derive a bound on the partition function. In the simplest case, where the free parameters define the energy function of a factorized distribution, we have $(N+1)k$ free parameters. It is certainly possible to use other choices, as was done in Barber and Wiegerinck (1998). Since the approximating function is a bound, we may maximize it with respect to all its free parameters. In this article we restricted ourselves to the third-order bound, although an extension to any odd-order bound is possible. Third order is the next higher-order bound after naive mean field. We showed that this bound is strictly better than the mean-field bound and tends to the TAP approximation for small weights. For larger weights, however, the TAP approximation crosses the partition function and violates the bounding properties. We saw that the third-order bound makes an enormous improvement in the quality of the bound, which gradually declines in the region of large weights and small thresholds, where almost all expansion-based approximations are bad. We conclude that third-order bounds are helpful in general, since they are always tighter than the mean-field bound. In practice, however, third-order bounds are most useful for problems just outside the scope of the mean-field approximation. Besides the partition function itself, we compared the approximated mean firing rates and correlations. There we saw an improvement of the approximation in the whole range, especially for small weights. A promising direction for further research is to combine the third-order lower bound with upper bounds on Boltzmann machines and sigmoid belief networks to obtain better bounds on conditional probabilities.
Also in training, the third-order method is better than mean field. The differences are quite small when we are simply interested in maximizing the log-likelihood. However, the hidden structure found by mean field is totally different from that found by the exact method, especially for the weights between hidden nodes. Third order, on the other hand, manages to find a highly comparable structure. A full optimization of the bound is computationally expensive (especially for sigmoid belief networks). Therefore we suggest using the mean-field solution for $\mu_i$ instead of solving equation 3.12. This avoids an $O(N^4)$ computation for Boltzmann machines and a worst-case $O(N^6)$ computation for belief networks to
find the tightest bound, whereas the approximation is almost as good as in the fully optimized case. The computational complexity for Boltzmann machines is then $O(N^3)$ to optimize and compute the bound; for sigmoid belief networks, the worst-case complexity is $O(N^4)$, although $O(N^3)$ is more likely for layered networks.

Appendix A: Optimal Solution for $\mu^0$ and $\lambda^0$
The bound from equation 3.4 is valid for all values of $\mu^0$, $\lambda^0$, and $\mu_i$. To find the tightest bound, we set all the derivatives with respect to these variational parameters to zero. For $\mu^0$ and $\lambda^0$, it is possible to obtain explicit results. We derive (keep in mind that $\langle \Delta E \rangle$ does depend on $\mu^0$)
\[ \frac{\partial B}{\partial \mu^0} = Z_\mu \left\{ -\langle \Delta E \rangle + e^{\lambda^0} \left[ \left(1 - \lambda^0\right) \langle \Delta E \rangle - \frac{\lambda^0}{2} \left\langle \Delta E^2 \right\rangle - \frac{1}{6} \left\langle \Delta E^3 \right\rangle \right] \right\} = 0 \tag{A.1} \]
\[ \frac{\partial B}{\partial \lambda^0} = Z_\mu\, e^{\lambda^0} \left( -\frac{\lambda^0}{2} \left\langle \Delta E^2 \right\rangle - \frac{1}{6} \left\langle \Delta E^3 \right\rangle \right) = 0. \tag{A.2} \]
Given equation A.2, we can reduce equation A.1 to
\[ \langle \Delta E \rangle \left( 1 - e^{\lambda^0} \left( 1 - \lambda^0 \right) \right) = 0. \tag{A.3} \]
This leads to two possible solutions for $\mu^0$ and $\lambda^0$:
\[ \mu^0 = -\left\langle E + \textstyle\sum_i \mu_i s_i \right\rangle, \qquad \lambda^0 = -\frac{1}{3} \frac{\left\langle \Delta E^3 \right\rangle}{\left\langle \Delta E^2 \right\rangle} \tag{A.4a} \]
or
\[ \left\langle \left( E + \textstyle\sum_i \mu_i s_i + \mu^0 \right)^3 \right\rangle = 0, \qquad \lambda^0 = 0. \tag{A.4b} \]
The implicit equation at the top of equation A.4b can easily be made explicit, since it is cubic in $\mu^0$. We analyze the stability of both solutions by investigating the Hessian,
\[ H\!\left(\mu^0, \lambda^0\right) = \begin{pmatrix} \dfrac{\partial^2 B}{\partial {\mu^0}^2} & \dfrac{\partial^2 B}{\partial \mu^0 \partial \lambda^0} \\[2ex] \dfrac{\partial^2 B}{\partial \lambda^0 \partial \mu^0} & \dfrac{\partial^2 B}{\partial {\lambda^0}^2} \end{pmatrix}, \tag{A.5} \]
at both solution points. These can be written as
\[ H\!\left(\mu^0, \lambda^0\right) = -Z_\mu \begin{pmatrix} \frac{1}{2}\left\langle \Delta E^2 \right\rangle + X & \frac{1}{2}\left\langle \Delta E^2 \right\rangle \\[1ex] \frac{1}{2}\left\langle \Delta E^2 \right\rangle & \frac{1}{2}\left\langle \Delta E^2 \right\rangle \end{pmatrix}, \tag{A.6} \]
where
\[ X = 1 - e^{\lambda^0} \left( 1 - \lambda^0 \right), \tag{A.7} \]
which is zero for solution A.4b and positive for $\lambda^0 \neq 0$. Therefore, the Hessian is negative definite for solution A.4a. Solution A.4b, however, has a zero eigenvalue. It turns out that for small fluctuations in the direction corresponding to this eigenvalue, the bound varies only at third order in the fluctuation, and therefore this solution corresponds to a saddle point and should be discarded. Thus, the optimal choice for $\mu^0$ and $\lambda^0$ is given by equation A.4a.
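A quick numerical sanity check of the optimal $\lambda^0$ of equation A.4a: for any sample of fluctuations $\Delta E$, the closed form $\lambda^0 = -\langle\Delta E^3\rangle/(3\langle\Delta E^2\rangle)$ makes the bracket of the stationarity condition A.2 vanish. This sketch only checks that piece of algebra with randomly drawn values, not the full bound:

```python
import numpy as np

rng = np.random.default_rng(1)
dE = rng.standard_normal(10_000) ** 3   # arbitrary sample playing the role of Delta E
m2 = float(np.mean(dE ** 2))            # <Delta E^2>
m3 = float(np.mean(dE ** 3))            # <Delta E^3>

lambda0 = -m3 / (3.0 * m2)              # equation A.4a
# Bracket of equation A.2: -(lambda0/2) <dE^2> - (1/6) <dE^3>
residual = -(lambda0 / 2.0) * m2 - m3 / 6.0
```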
Appendix B: Explicit Expressions for $\langle \Delta E^2 \rangle$ and $\langle \Delta E^3 \rangle$
It is possible to compute the moments $\langle \Delta E^2 \rangle$ and $\langle \Delta E^3 \rangle$ under the factorized distribution $m(\vec{s})$. Defining
\[ m_i = \langle s_i \rangle = \tanh \mu_i, \tag{B.1} \]
we derived
\[ \frac{1}{2}\left\langle \Delta E^2 \right\rangle = \frac{1}{4} h_{ij}^2 \left(1 - m_i^2\right)\left(1 - m_j^2\right) + \frac{1}{2} a_i^2 \left(1 - m_i^2\right) \tag{B.2} \]
\begin{align}
-\frac{1}{6}\left\langle \Delta E^3 \right\rangle ={}& \frac{1}{6} h_{ij} h_{jk} h_{ki} \left(1 - m_i^2\right)\left(1 - m_j^2\right)\left(1 - m_k^2\right) \nonumber\\
&+ \frac{1}{3} h_{ij}^3\, m_i m_j \left(1 - m_i^2\right)\left(1 - m_j^2\right) \nonumber\\
&- \frac{1}{3} a_i^3 m_i \left(1 - m_i^2\right) + \frac{1}{2} a_i a_j h_{ij} \left(1 - m_i^2\right)\left(1 - m_j^2\right) \nonumber\\
&- \frac{1}{2} a_i h_{ij}^2\, m_i \left(1 - m_i^2\right)\left(1 - m_j^2\right), \tag{B.3}
\end{align}
where summation over repeated indices is implied and
\[ a_i = h_i + h_{ij} m_j - \mu_i. \tag{B.5} \]
Note that $a_i = 0$ is equivalent to the well-known mean-field equations. For the fully optimized third-order bound, however, $a_i$ generally differs from zero.
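Setting $a_i = 0$ and using $m_i = \tanh\mu_i$ gives the fixed-point equations $\mu_i = h_i + \sum_j h_{ij} m_j$, which can be solved by damped iteration. A minimal sketch, where the weights and thresholds are made-up example values:

```python
import numpy as np

def mean_field(h, theta, n_iter=200, damping=0.5):
    """Solve a_i = 0, i.e. m_i = tanh(theta_i + sum_j h_ij m_j), by damped iteration."""
    m = np.zeros(len(theta))
    for _ in range(n_iter):
        m = (1.0 - damping) * np.tanh(theta + h @ m) + damping * m
    return m

h = np.array([[0.0, 0.3, 0.1],
              [0.3, 0.0, 0.2],
              [0.1, 0.2, 0.0]])       # symmetric weights, zero diagonal
theta = np.array([0.2, -0.1, 0.05])  # thresholds
m = mean_field(h, theta)

# At a fixed point, a_i = theta_i + (h m)_i - artanh(m_i) vanishes.
a = theta + h @ m - np.arctanh(m)
```

For small weights the iteration is a contraction, so the damping is not strictly needed here; it guards against the oscillations that undamped mean-field iteration can show for larger weights.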
Acknowledgments
This research is supported by the Technology Foundation STW, Applied Science Division of NWO, and the technology programme of the Ministry of Economic Affairs.
References

Ackley, D., Hinton, G., & Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Barber, D., & Wiegerinck, W. (1998). Tractable undirected approximations for graphical models. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), International Conference on Artificial Neural Networks (Vol. 1, pp. 93–98). New York: Springer-Verlag.

Jaakkola, T., & Jordan, M. (1996a). Recursive algorithms for approximating probabilities in graphical models (Comp. Cogn. Science Tech. Rep. 9604). Cambridge, MA: MIT.

Jaakkola, T. S., & Jordan, M. I. (1996b). Computing upper and lower bounds on likelihoods in intractable networks. In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI–96) (pp. 340–348). San Mateo, CA: Morgan Kaufmann.

Jaakkola, T., Saul, L., & Jordan, M. (1996). Fast learning by bounding likelihoods in sigmoid-type belief networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 528–534). Cambridge, MA: MIT Press.

Kappen, H., & Rodríguez, F. (1999). Boltzmann machine learning using mean field theory and linear response correction. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 280–286). Cambridge, MA: MIT Press.

Leisink, M., & Kappen, H. (1999). Validity of TAP equations in neural networks. In International Conference on Artificial Neural Networks (Vol. 1, pp. 425–430). London: Institution of Electrical Engineers.

Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.

Neal, R., & Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann.

Peterson, C., & Anderson, J. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.

Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen., 15, 1971–1978.

Saul, L., Jaakkola, T., & Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Sherrington, D., & Kirkpatrick, S. (1975). Solvable model of a spin-glass. Physical Review Letters, 35(26), 1792–1796.

Thouless, D., Anderson, P., & Palmer, R. (1977). Solution of "solvable model of a spin glass." Philosophical Magazine, 35(3), 593–601.

Received February 17, 2000; accepted November 17, 2000.
ARTICLE
Communicated by Brendan J. Frey
Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology

Yair Weiss
Computer Science Division, University of California, Berkeley, CA 94720-1776, U.S.A.

William T. Freeman
Mitsubishi Electric Research Labs, Cambridge, MA 02139, U.S.A.
Graphical models, such as Bayesian networks and Markov random fields, represent statistical dependencies of variables by a graph. Local "belief propagation" rules of the sort proposed by Pearl (1988) are guaranteed to converge to the correct posterior probabilities in singly connected graphs. Recently, good performance has been obtained by using these same rules on graphs with loops, a method we refer to as loopy belief propagation. Perhaps the most dramatic instance is the near Shannon-limit performance of "Turbo codes," whose decoding algorithm is equivalent to loopy propagation. Except for the case of graphs with a single loop, there has been little theoretical understanding of loopy propagation. Here we analyze belief propagation in networks with arbitrary topologies when the nodes in the graph describe jointly gaussian random variables. We give an analytical formula relating the true posterior probabilities with those calculated using loopy propagation. We give sufficient conditions for convergence and show that when belief propagation converges, it gives the correct posterior means for all graph topologies, not just networks with a single loop. These results motivate using the powerful belief propagation algorithm in a broader class of networks and help clarify the empirical performance results. 1 Introduction Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition, and image understanding. Typically, a probability distribution is assumed over a set of variables, and the task is to infer the values of the unobserved variables given the observed ones. The assumed probability distribution is described using a graphical model (Lauritzen, 1996); the qualitative aspects of the distribution are specified by a graph structure.
The graph may be either directed, as in a Bayesian network (Pearl, 1988; Jensen, 1996), or undirected, as in a Markov random field (Pearl, 1988; Geman & Geman, 1984).

Neural Computation 13, 2173–2200 (2001)  © 2001 Massachusetts Institute of Technology
2174
Yair Weiss and William T. Freeman
If the graph is singly connected (there is only one path between any two given nodes), then there exist efficient local message-passing schemes to calculate the posterior probability of an unobserved variable given the observed variables. Pearl (1988) derived such a scheme for singly connected Bayesian networks and showed that this "belief propagation" algorithm is guaranteed to converge to the correct posterior probabilities (or "beliefs"). However, as Pearl noted, the same algorithm is not guaranteed to converge in multiply connected networks, and even if it does, it will not calculate the correct beliefs. Several groups have recently reported excellent experimental results by running algorithms equivalent to Pearl's algorithm on networks with loops (Frey, 1998; Murphy, Weiss, & Jordan, 1999; Freeman, Pasztor, & Carmichael, 2000). Perhaps the most dramatic instance of this performance is in an error-correcting code scheme known as Turbo codes (Berrou, Glavieux, & Thitimajshima, 1993). These codes have been described as "the most exciting and potentially important development in coding theory in many years" (McEliece, Rodemich, & Cheng, 1995) and have recently been shown (Kschischang & Frey, 2000; McEliece & MacKay, 1998) to use an algorithm equivalent to belief propagation in a network with loops. Although there is widespread agreement in the coding community that these codes "represent a genuine, and perhaps historic, breakthrough" (McEliece et al., 1995), a theoretical understanding of their performance has yet to be achieved. Progress in the analysis of loopy belief propagation has been made for the case of networks with a single loop (Weiss, 1997, 2000; Forney, Kschischang, Marcus, & Tuncel, 2000; Aji, Horn, & McEliece, 1998). For these networks, it can be shown that:

- Unless all the compatibilities are deterministic, loopy belief propagation will converge.
- An analytic expression relates the correct marginals to the loopy marginals.
- The approximation error is related to the convergence rate of the messages; the faster the convergence, the more exact the approximation.
- If the hidden nodes are binary, then both the loopy beliefs and the true beliefs are maximized by the same assignments, although the confidence in that assignment is wrong for the loopy beliefs.

In this article we analyze belief propagation (BP) in graphs of arbitrary topology where the nodes describe jointly gaussian random variables. We show that if BP converges, then it will give the correct posterior means for all graph topologies, not just networks with a single loop. This result holds for vector-valued or scalar-valued nodes and under arbitrary message scheduling. In particular, it holds for Turbo decoding when the densities are assumed to be gaussian. We also analyze the dynamics of BP under a synchronous message-passing protocol. We derive an exact relationship between the means and
Correctness of Gaussian Belief Propagation
2175
Figure 1: Any Bayesian network can be converted into an undirected graph with pairwise cliques by adding cluster nodes for all parents that share a common child. (a) A Bayesian network. (b) The corresponding undirected graph with pairwise cliques. A cluster node for (x2, x3) has been added. The potentials can be set so that the joint probability in the undirected network is identical to that in the Bayesian network. In this case, the update rules presented in this article reduce to Pearl’s propagation rules in the original Bayesian network (Weiss, 2000).
covariances estimated using BP at every iteration and the correct means and covariances. The covariance estimates will generally be incorrect, but we present a relationship between the error in the covariance estimates and the convergence speed.

2 Belief Propagation in Undirected Graphical Models

Pearl's original algorithm was described for directed graphs, but in this article we focus on undirected graphs. Every directed graphical model can be transformed into an undirected graphical model before doing inference (see Figure 1). An undirected graphical model (or Markov random field, MRF) is a graph in which the nodes represent variables and arcs represent compatibility constraints between them. Assuming all probabilities are nonzero, the Hammersley-Clifford theorem (e.g., Pearl, 1988) guarantees that the probability distribution will factorize into a product of functions of the maximal cliques of the graph.
Denoting by $x$ the values of all unobserved variables in the graph, the factorization has the form
\[ P(x) = \prod_c \Psi_c(x_c), \tag{2.1} \]
where $x_c$ is a subset of $x$ that forms a clique in the graph and $\Psi_c$ is the potential function for the clique. We will assume, without loss of generality, that each $x_i$ node has a corresponding $y_i$ node that is connected only to $x_i$. Thus,
\[ P(x, y) = \prod_c \Psi_c(x_c) \prod_i \Psi_{ii}(x_i, y_i). \tag{2.2} \]
The restriction that all the $y_i$ variables are observed and none of the $x_i$ variables are is just to make the notation simple; $\Psi_{ii}(x_i, y_i)$ may be independent of $y_i$ (equivalent to $y_i$ being unobserved), or $\Psi_{ii}(x_i, y_i)$ may be $\delta(x_i - x^o)$ (equivalent to $x_i$ being observed, with value $x^o$). In describing and analyzing belief propagation, we assume the graphical model has been preprocessed so it becomes a pairwise MRF: an MRF in which the joint probability factorizes into a product of bivariate potentials (potentials involving only two variables). Any graphical model can be converted into this form before doing inference through a suitable clustering of nodes into large nodes (Weiss, 2000). Figure 1 shows an example; a Bayesian network is converted into an MRF in which all the cliques are pairs of units and the probability factorizes into a product of bivariate potentials. The formalism of "factor graphs" (Frey, Koetter, & Vardy, 1998) is more explicit than the MRF formalism in representing the factorization of the joint distribution. In Weiss and Freeman (2001) we show how any factor graph can be converted into a pairwise MRF and that when this conversion is performed, running BP on the pairwise MRF is equivalent to running probability propagation on the factor graph. Equation 2.2 becomes
\[ P(x, y) = \prod_{i,j} \Psi_{ij}(x_i, x_j) \prod_i \Psi_{ii}(x_i, y_i), \tag{2.3} \]
where the first product is over connected pairs of nodes. By preprocessing the graph into one with pairwise cliques, the description and the analysis of belief propagation become simpler. For completeness, we review the belief propagation scheme used in Weiss (2000). At every iteration, each node sends a (different) message to each of its neighbors and receives a message from each neighbor. Let $x_i$ and $x_j$ be two neighboring nodes in the graph. We denote by $m_{ij}(x_j)$ the message that node $x_i$ sends to node $x_j$, and by $b_i(x_i)$ the belief at node $x_i$.
The belief update (or "sum-product" update) rules are:
\[ m_{ij}(x_j) \leftarrow \alpha \int_{x_i} \Psi_{ij}(x_i, x_j)\, \Psi_{ii}(x_i, y_i) \prod_{x_k \in N(x_i) \setminus x_j} m_{ki}(x_i) \tag{2.4} \]
\[ b_i(x_i) \leftarrow \alpha\, \Psi_{ii}(x_i, y_i) \prod_{x_k \in N(x_i)} m_{ki}(x_i), \tag{2.5} \]
where $\alpha$ denotes a normalization constant and $N(x_i) \setminus x_j$ means all nodes neighboring $x_i$, except $x_j$. The procedure is initialized with all message vectors set to constant functions. The normalization of $m_{ij}$ in equation 2.4 is not necessary; whether or not the messages are normalized, the belief $b_i$ will be identical. However, normalizing the messages avoids numerical underflow and adds to the stability of the algorithm. We assume throughout this article that all nodes simultaneously update their messages in parallel. It is easy to show that for singly connected graphs, these updates will converge in a number of iterations equal to the diameter of the graph and the beliefs are guaranteed to give the correct posterior marginals: $b_i(x_i) = \mathrm{Prob}(X_i = x_i \mid Y)$, where $Y$ denotes the set of observed variables. We emphasize that this transformation of a graphical model into a pairwise MRF is simply to make the analysis simpler. The belief propagation scheme described here includes as special cases Pearl's (1988) BP algorithm in directed graphs, the generalized distributive law algorithm of Aji and McEliece (1999), and the factor graph propagation algorithm of Frey et al. (1998). These three algorithms correspond to the algorithm presented here with a particular way of preprocessing the graph in order to obtain pairwise potentials (Weiss & Freeman, 2001). Thus, our analysis of BP holds for these alternative formulations of BP as well (see section 5.1 for an explicit discussion of vector-valued nodes that will generally arise as a result of clustering). For particular graphs with particular settings of the potentials, equations 2.4 and 2.5 yield other well-known Bayesian inference algorithms, such as the forward-backward algorithm in hidden Markov models, the Kalman filter, and even the fast Fourier transform (Aji & McEliece, 1999; Kschischang & Frey, 2000). A related algorithm, max-product, changes the integration in equation 2.4 to a maximization.
This message passing is equivalent to Pearl's "belief revision" algorithm in directed graphs. For particular graphs with particular settings of the potentials, the max-product algorithm is equivalent to the Viterbi algorithm for hidden Markov models and concurrent dynamic programming. We define the max-product assignment at each node to be the value that maximizes its belief (assuming a unique maximizing value exists). For singly connected graphs, the max-product assignment is guaranteed to give the maximum a posteriori (MAP) assignment.
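The sum-product updates of equations 2.4 and 2.5 are easy to state concretely for discrete variables. The sketch below runs them on a three-node chain of binary variables, a singly connected graph, so the converged beliefs should match the brute-force marginals. The potential values and helper names are made up, and the single-node potentials $\Psi_{ii}$ are taken to be constant for simplicity:

```python
import itertools
import numpy as np

# Pairwise potentials for a chain x0 - x1 - x2 of binary variables.
edges = {(0, 1): np.array([[1.0, 0.5], [0.5, 2.0]]),
         (1, 2): np.array([[1.5, 0.2], [0.2, 1.0]])}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def potential(i, j, xi, xj):
    return edges[(i, j)][xi, xj] if (i, j) in edges else edges[(j, i)][xj, xi]

# Messages m[(i, j)] from node i to node j, initialized to constant functions.
msgs = {(i, j): np.ones(2) for i in neighbors for j in neighbors[i]}
for _ in range(5):  # more iterations than the diameter, so messages converge
    new = {}
    for (i, j) in msgs:
        s = np.zeros(2)
        for xj in range(2):        # equation 2.4: sum over x_i of potential
            for xi in range(2):    # times incoming messages except from j
                prod = np.prod([msgs[(k, i)][xi] for k in neighbors[i] if k != j])
                s[xj] += potential(i, j, xi, xj) * prod
        new[(i, j)] = s / s.sum()  # optional normalization
    msgs = new

def belief(i):  # equation 2.5 with constant Psi_ii
    b = np.ones(2)
    for k in neighbors[i]:
        b *= msgs[(k, i)]
    return b / b.sum()

# Brute-force marginal of node 1 for comparison.
joint = np.zeros((2, 2, 2))
for x in itertools.product(range(2), repeat=3):
    joint[x] = potential(0, 1, x[0], x[1]) * potential(1, 2, x[1], x[2])
marg1 = joint.sum(axis=(0, 2)) / joint.sum()
```

On this chain the messages stop changing after two parallel iterations, which is the diameter of the graph, exactly as the convergence statement above predicts.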
2.1 Gaussian Markov Random Fields. A gaussian MRF (GMRF) is an MRF in which the joint distribution is gaussian. We assume, without loss of generality, that the joint mean is zero (the means can be added in later), so the joint distribution of $z = \binom{x}{y}$ is given by
\[ \mathrm{Prob}(z) = \alpha\, e^{-\frac{1}{2} z^T V z}, \tag{2.6} \]
where $\alpha$ is a normalization constant and $V$ has the following structure:
\[ V = \begin{pmatrix} V_{xx} & V_{xy} \\ V_{yx} & V_{yy} \end{pmatrix}. \tag{2.7} \]
It is straightforward to write the inverse covariance matrix describing the GMRF which respects the statistical dependencies within the graphical model (Cowell, 1998). Again we assume that the graph has been preprocessed so that the joint distribution factorizes as a product of pairwise potentials. Thus, there exist matrices $V_{ij}$, one corresponding to each connected pair of nodes, and matrices $V_{ii}$ corresponding to each node, such that
\[ \mathrm{Prob}(z) = \alpha \prod_{i,j} e^{-\frac{1}{2} (x_i\ x_j)^T V_{ij} (x_i\ x_j)} \prod_i e^{-\frac{1}{2} (x_i\ y_i)^T V_{ii} (x_i\ y_i)}, \tag{2.8} \]
where the first product is again over connected pairs of nodes. We will assume that the preprocessing is such that the $V_{ij}$ are all positive semidefinite matrices. Note that the decomposition of $V$ into $V_{ij}$, $V_{ii}$ is not unique. For scalar nodes, any set of $V_{ij}$, $V_{ii}$ that satisfies the following constraints is valid:
\[ V_{xy}(i, i) = V_{ii}(1, 2) \tag{2.9} \]
\[ V_{xx}(i, j) = V_{ij}(1, 2) \tag{2.10} \]
\[ V_{xx}(i, i) = V_{ii}(1, 1) + \sum_{x_j \in N(x_i)} V_{ij}(1, 1). \tag{2.11} \]
2.2 Exact Inference in Gaussian MRFs. Writing out the exponent of equation 2.6 and completing the square shows that the mean $\mu$ of $x$, given $y$, is a solution to
\[ V_{xx} \mu = -V_{xy} y, \tag{2.12} \]
and the covariance matrix $C_{x|y}$ of $x$ given $y$ is
\[ C_{x|y} = V_{xx}^{-1}. \tag{2.13} \]
We will denote by $C_{x_i|y}$ the $i$th row of $C_{x|y}$, so the marginal posterior variance of $x_i$, given the data, is $C_{x_i|y}(i)$.
2.3 Belief Propagation for Gaussian MRFs. Belief propagation in gaussian MRFs gives simpler update formulas than the general case (see equations 2.4 and 2.5). The messages and the beliefs are all gaussians, and the updates can be written directly in terms of the means and inverse covariance matrices. Each node sends and receives a mean vector and inverse covariance matrix to and from each neighbor, in general, each different. To write the updates explicitly in terms of means and covariances, we denote by $\mu_{ij}$ the mean of the message that node $x_i$ sends to node $x_j$ and by $P_{ij}$ the precision or inverse covariance matrix that node $x_i$ sends to node $x_j$. Similarly, we denote by $\mu_i$ the mean of the belief at node $x_i$ and by $P_i$ the inverse covariance matrix of the belief. As before, we use $\mu_{ii}$, $P_{ii}$ for the mean and inverse covariance of $\Psi_{ii}(x_i, y_i)$. We partition the matrix $V_{ij}$ into
\[ V_{ij} = \begin{pmatrix} a & b \\ b^T & c \end{pmatrix}. \tag{2.14} \]
The message update rules are
\[ P_{ij} \leftarrow c - b^T (a + P_0)^{-1} b \tag{2.15} \]
\[ \mu_{ij} \leftarrow -P_{ij}^{-1} b^T (a + P_0)^{-1} P_0 \mu_0, \tag{2.16} \]
with
\[ P_0 = P_{ii} + \sum_{x_k \in N(x_i) \setminus x_j} P_{ki} \tag{2.17} \]
\[ \mu_0 = P_0^{-1} \left( P_{ii} \mu_{ii} + \sum_{x_k \in N(x_i) \setminus x_j} P_{ki} \mu_{ki} \right). \tag{2.18} \]
The beliefs are given by
\[ P_i \leftarrow P_{ii} + \sum_{x_k \in N(x_i)} P_{ki} \tag{2.19} \]
\[ \mu_i \leftarrow P_i^{-1} \left( P_{ii} \mu_{ii} + \sum_{x_k \in N(x_i)} P_{ki} \mu_{ki} \right). \tag{2.20} \]
We can now state the main question of this article. What is the relationship between the true posterior means and covariances (calculated using equation 2.12) and the belief propagation means and covariances (calculated using the belief propagation rules, equations 2.15–2.20)?
3 Dynamics of Belief Propagation

In this section we analyze the dynamics of BP in a gaussian pairwise MRF. To simplify the notation, we assume throughout this section that all nodes are scalar valued. In section 5.1 we generalize the analysis to vector-valued nodes. To compare the correct posteriors and the loopy beliefs, we construct an unwrapped tree. The unwrapped tree is the graphical model that loopy belief propagation is solving exactly when applying the belief propagation rules in a loopy network (Gallager, 1963; Wiberg, 1996; Weiss, 2000). In error-correcting codes, the unwrapped tree is referred to as the computation tree; it is based on the idea that the computation of a message sent by a node at time $t$ depends on messages it received from its neighbors at time $t-1$, and those messages depend on the messages the neighbors received at time $t-2$, and so forth. To construct the topology of the unwrapped tree, set an arbitrary node, say $x_1$, to be the root node, and then iterate the following procedure $t$ times:

1. Find all leaves of the tree (start with the root).
2. For each leaf, find all $k$ nodes in the loopy graph that neighbor the node corresponding to this leaf.
3. Add $k-1$ nodes as children to each leaf, corresponding to all neighbors except the parent node.

The potential matrices and observations for each node in the unwrapped network are copied from the corresponding nodes in the loopy graph. Each node in the loopy graph will have a different unwrapped tree with that node at the root. Figure 2 shows an unwrapped tree around node $x_1$ for the diamond-shaped graph on the left. Each node has a shaded observed node attached to it that is not shown for clarity. Since belief propagation is exact for the unwrapped tree, we can calculate the beliefs in the unwrapped tree by using the marginalization formulas for gaussians. We use a tilde for unwrapped quantities. We scan the tree in breadth-first order and denote by $\tilde{x}$ the vector of values of the hidden nodes of the tree when scanned in this fashion. Similarly, we denote by $\tilde{y}$ the observed nodes scanned in the same order. As before, $\tilde{z} = \binom{\tilde{x}}{\tilde{y}}$. The basic idea behind our analysis is to relate the loopy and unwrapped inverse covariance matrices. From equation 2.11, all elements $\tilde{V}_{xy}(i', j')$ and $\tilde{y}(i')$ are copies of the corresponding elements $V_{xy}(i, j)$ and $y(i)$ (where $\tilde{x}_{i'}$, $\tilde{x}_{j'}$ are replicas of $x_i$, $x_j$). Also, all elements $\tilde{V}_{xx}(i', j')$ are copies of $V_{xx}(i, j)$, and elements $\tilde{V}_{xx}(i', i')$ for nonleaf nodes are replicas of $V_{xx}(i, i)$. However, the elements $\tilde{V}_{xx}(i', i')$ for the leaf nodes are not copies of $V_{xx}(i, i)$, because the leaf nodes are missing some neighbors.
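The construction of the unwrapped tree can be sketched directly from the three-step procedure above; the function name and the return format are made up for illustration:

```python
# Build the unwrapped (computation) tree of a loopy graph: each tree node's
# children are copies of all its neighbors in the loopy graph except the
# neighbor corresponding to its parent.
def unwrapped_tree(neighbors, root, depth):
    """Return a list of (tree_index, original_node, parent_tree_index)."""
    nodes = [(0, root, None)]
    frontier = [(0, root, None)]
    for _ in range(depth):
        new_frontier = []
        for tree_idx, orig, parent in frontier:
            parent_orig = None if parent is None else nodes[parent][1]
            for nb in neighbors[orig]:
                if nb == parent_orig:
                    continue  # skip only the parent; loops may revisit other nodes
                entry = (len(nodes), nb, tree_idx)
                nodes.append(entry)
                new_frontier.append(entry)
        frontier = new_frontier
    return nodes

# Diamond graph of Figure 2: x1 connected to x2 and x3, both connected to x4.
diamond = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
tree = unwrapped_tree(diamond, root=1, depth=3)
```

For the diamond graph, three levels of unwrapping already show the effect of the loops: copies of $x_2$ and $x_3$ reappear below the copies of $x_4$.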
Figure 2: (Left) A Markov network with multiple loops. (Right) The unwrapped network corresponding to this structure. The unwrapped networks are constructed by replicating the potentials $\Psi_{ij}(x_i, x_j)$ and observations $y_i$ while preserving the local connectivity of the loopy network. They are constructed so that the messages received by node A after $t$ iterations in the loopy network are equivalent to those that would be received by A in the unwrapped network. An observed node, $y_i$, not shown, is connected to each depicted node.
Intuitively, we might expect that if all the equations that $\tilde{\mu}$ satisfies are copies of the equations that $\mu$ satisfies, then simply creating $\tilde{\mu}$ from many copies of $\mu$ would give a valid solution in the unwrapped network. However, because some of the equations are not copies, this intuition does not explain why the means are exact in gaussian networks. An additional intuition, which we formalize below, is that the influence of the noncopied equations (those at the leaf nodes) decreases with additional iterations. As the number of iterations is increased, the distance between the leaf nodes and the root node increases and their influence on the root node decreases. When their influence goes to zero, the mean at the root node is exact. Although the elements $\tilde{V}_{xx}(i', j')$ are copies of $V_{xx}(i, j)$ for the nonleaf nodes, the matrix $\tilde{V}_{xx}$ is not simply a block replication of $V_{xx}$. The system of equations that defines $\tilde{\mu}$ is a coupled system of equations. Hence the variance at the root node, $\tilde{V}_{xx}^{-1}(1, 1)$, differs from the correct variance $V_{xx}^{-1}(1, 1)$. In the following section, we prove the following four claims regarding the dynamics of loopy belief propagation in gaussian graphical models. Assume, without loss of generality, that the root node is $x_1$. Let $\tilde{\mu}(1)$ and $\tilde{\sigma}^2(1)$ be the conditional mean and variance at node 1 after $t$ iterations of loopy propagation. Let $\mu(1)$ and $\sigma^2(1)$ be the correct conditional mean and variance of node 1. Let $\tilde{C}_{x_1|y}$ be the conditional correlation of the root node
with all other nodes in the unwrapped tree; then:

Claim 1:
\[ \tilde{\mu}(1) = \mu(1) + \tilde{C}_{x_1|y}\, r, \tag{3.1} \]
where $r$ is a vector that is zero everywhere but the last $L$ components (corresponding to the leaf nodes).

Claim 2:
\[ \tilde{\sigma}^2(1) = \sigma^2(1) + \tilde{C}_{x_1|y}\, r_1 - \tilde{C}_{x_1|y}\, r_2, \tag{3.2} \]
where $r_1$ is a vector that is zero everywhere but the last $L$ components, and the components of $r_2$ are equal to 1 for all nonroot nodes in the unwrapped tree that are replicas of $x_1$. All other components of $r_2$ are zero.

Claim 3: If the conditional correlation between the root node and the leaf nodes decreases rapidly enough, then (1) belief propagation converges, (2) the belief propagation means are exact, and (3) the belief propagation variances are equal to the correct variances minus the summed conditional correlations between $\tilde{x}_1$ and all $\tilde{x}_j$ that are replicas of $x_1$.
Claim 4: Assume all $V_{ij}$ are diagonally dominant. Then (1) belief propagation converges, (2) the belief propagation means are exact, and (3) the belief propagation variances are equal to the correct variances minus the summed conditional correlations between $\tilde{x}_1$ and all $\tilde{x}_j$ that are replicas of $x_1$.

To obtain intuition, Figure 3 shows $\tilde{C}_{x_1|y}$ for the diamond figure in Figure 2. We generated random potential functions and observations for the loopy diamond figure and calculated the conditional correlations in the unwrapped network. Note that the conditional correlation decreases with distance in the tree; we are scanning in breadth-first order, so the last $L$ components correspond to the leaf nodes. As the number of iterations of loopy propagation is increased, the size of the unwrapped tree increases, and the conditional correlation between the leaf nodes and the root node decreases. From equations 3.1 and 3.2, it is clear that if the conditional correlations between the leaf nodes and the root node are zero for all sufficiently large unwrappings, then (1) belief propagation converges, (2) the means are exact, and (3) the belief propagation variances are equal to the correct variances minus the summed conditional correlations between $\tilde{x}_1$ and all $\tilde{x}_j$ that are replicas of $x_1$. In practice, the conditional correlations will not actually be equal to zero for any finite unwrapping, so claim 3 states this more precisely. Claim 4 gives sufficient conditions, in terms of the $V_{ij}$ matrices, for the conditional correlation to decrease rapidly enough.
Correctness of Gaussian Belief Propagation
2183
Figure 3: The conditional correlation between the root node and all other nodes in the unwrapped tree for the diamond figure after seven iterations. Potentials were chosen randomly. Nodes are presented in breadth-first order, so the last elements are the correlations between the root node and the leaf nodes. It can be proven that if this correlation goes to zero, then (1) belief propagation converges, (2) the loopy means are exact, and (3) the loopy variances equal the correct variances minus the summed conditional correlation of the root node and all other nodes that are replicas of the same loopy node. Symbols plotted with a star denote correlations with nodes that correspond to the node $x_1$ in the loopy graph. We prove that the sum of these correlations gives the correct variance of node $x_1$, while loopy propagation uses only the first correlation.
How wrong will the variances be? The term $\tilde{C}_{x_1|y}\, r_2$ in equation 3.2 is simply the sum of many components of $\tilde{C}_{x_1|y}$. Figure 3 shows these components. The correct variance is the sum of all the components, while the loopy variance approximates this sum with the first (and dominant) term. Note that when the conditional correlation decreases rapidly to zero, two things happen: the convergence is faster (because $\tilde{C}_{x_1|y}\, r_1$ approaches zero faster), and the approximation error of the variances is smaller (because $\tilde{C}_{x_1|y}\, r_2$ is smaller). Thus, as in the single-loop case, we find that quick convergence is correlated with good approximation.

3.1 Relation of Loopy and Unwrapped Quantities. The proof of the claims relies on the relationship of the elements of $y$, $V_{xy}$, and $V_{xx}$ to their unwrapped counterparts, described below. Each node in $\tilde{x}$ corresponds to a node in the original loopy network. Let $O$ be a matrix that defines this correspondence: $O(i, j) = 1$ if $\tilde{x}_i$ corresponds to $x_j$ and zero otherwise. Thus, in Figure 2, ordering the nodes alphabetically,
the first rows of $O$ are:

$$O = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots
\end{pmatrix}.$$

(3.3)
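The breadth-first unwrapping that generates these rows can be sketched in code. This is a sketch under the assumption that the diamond figure has four nodes $x_1, \dots, x_4$ with edges $(x_1, x_2)$, $(x_1, x_3)$, $(x_2, x_4)$, $(x_3, x_4)$; the `unwrap` helper is hypothetical, not taken from the paper.

```python
from collections import deque
import numpy as np

def unwrap(adj, root, n_nodes, depth):
    """Breadth-first unwrapping of a loopy graph: each tree node is a copy of
    a loopy node; its children are the loopy neighbors except the parent."""
    order = []                         # loopy index of each tree node, BFS order
    queue = deque([(root, None, 0)])   # (loopy node, parent's loopy node, level)
    while queue:
        node, parent, level = queue.popleft()
        if level > depth:
            break
        order.append(node)
        for nbr in adj[node]:
            if nbr != parent:
                queue.append((nbr, node, level + 1))
    O = np.zeros((len(order), n_nodes))
    for tree_idx, loopy_idx in enumerate(order):
        O[tree_idx, loopy_idx] = 1.0
    return O

# Diamond graph: x1-x2, x1-x3, x2-x4, x3-x4 (0-based indices).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
O = unwrap(adj, root=0, n_nodes=4, depth=4)
print(O)  # each row has a single 1 marking the loopy node it replicates
```

Each row of $O$ contains exactly one 1, and the row order is the breadth-first scan used in Figure 3.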
Using $O$ we can formalize the relationship between the unwrapped quantities and the original ones. The simplest is $\tilde{y}$, which contains only replicas of the original $y$:

$\tilde{y} = O y.$
(3.4)
Since every $x_i$ is connected to a $y_i$, $V_{xy}$ and $\tilde{V}_{xy}$ are zero everywhere but along their diagonals (the block diagonals, for vector-valued variables). The diagonal elements of $\tilde{V}_{xy}$ are simply replications of $V_{xy}$; hence:

$\tilde{V}_{xy} O = O V_{xy}.$
(3.5)
$\tilde{V}_{xx}$ also contains the elements of the original $V_{xx}$, but here special care needs to be taken. Note that by construction, every node in the interior of the unwrapped tree has exactly the same statistical relationship with its neighbors as the corresponding node in the loopy graph has with its neighbors. If a node in the loopy graph has $k$ neighbors, then a node in the unwrapped tree will have, by construction, one parent and $k - 1$ children. The leaf nodes in the unwrapped tree, however, will be missing the $k - 1$ children and hence will not have the same number of neighbors. Thus, for all nodes $\tilde{x}_i$, $\tilde{x}_j$ that are not leaf nodes, $\tilde{V}_{xx}(i, j)$ is a copy of the corresponding $V_{xx}(k, l)$, where unwrapped nodes $i$ and $j$ refer to loopy nodes $k$ and $l$, respectively. Therefore,

$\tilde{V}_{xx} O + E = O V_{xx},$
(3.6)
where $E$ is an error matrix. $E$ is zero for all nonleaf nodes, so the first $N - L$ rows of $E$ are zero.
3.2 Proof of Claim 1. The marginalization equation for the unwrapped problem gives

$\tilde{V}_{xx}\tilde{\mu} = -\tilde{V}_{xy}\tilde{y}.$
(3.7)
Substituting equations 3.4 and 3.5, relating loopy and unwrapped network quantities, into equation 3.7 for the unwrapped posterior mean gives

$\tilde{V}_{xx}\tilde{\mu} = -O V_{xy} y.$
(3.8)
For the true means, $\mu$, of the loopy network, we have

$V_{xx}\mu = -V_{xy} y.$
(3.9)
To relate this to the means of the unwrapped network, we left-multiply by $O$:

$O V_{xx}\mu = -O V_{xy} y.$
(3.10)
Using equation 3.6, relating $V_{xx}$ to $\tilde{V}_{xx}$, we have

$\tilde{V}_{xx} O \mu + E \mu = -O V_{xy} y.$
(3.11)
Comparing equations 3.11 and 3.8 gives

$\tilde{V}_{xx} O \mu + E \mu = \tilde{V}_{xx}\tilde{\mu}$
(3.12)
or

$\tilde{\mu} = O \mu + \tilde{V}_{xx}^{-1} E \mu.$

(3.13)
Using equation 2.13,

$\tilde{\mu} = O \mu + \tilde{C}_{x|y} E \mu.$
(3.14)
The left- and right-hand sides of equation 3.14 are column vectors. We take the first component of both sides and get

$\tilde{\mu}(1) = \mu(1) + \tilde{C}_{x_1|y} E \mu.$
(3.15)
Since $E$ is zero in the first $N - L$ rows, $E\mu$ is zero in the first $N - L$ components. Setting $r = E\mu$ gives claim 1.
3.3 Proof of Claim 2. From equation 2.13,

$V_{xx} C_{x|y} = I.$
(3.16)
Taking the first column of this equation gives

$V_{xx} C_{x_1|y}^T = e_1,$
(3.17)
where $e_1(1) = 1$, $e_1(j > 1) = 0$. Using the same strategy as in the previous proof, we left-multiply by $O$:

$O V_{xx} C_{x_1|y}^T = O e_1,$
(3.18)
and similarly we substitute equation 3.6:

$\tilde{V}_{xx} O C_{x_1|y}^T + E C_{x_1|y}^T = O e_1.$
(3.19)
The analog of equation 3.17 in the unwrapped problem is

$\tilde{V}_{xx} \tilde{C}_{x_1|y}^T = \tilde{e}_1,$
(3.20)
where $\tilde{e}_1(1) = 1$, $\tilde{e}_1(j > 1) = 0$. Subtracting equation 3.19 from equation 3.20 and rearranging terms gives

$\tilde{C}_{x_1|y}^T = O C_{x_1|y}^T + \tilde{V}_{xx}^{-1} E C_{x_1|y}^T + \tilde{V}_{xx}^{-1}(\tilde{e}_1 - O e_1).$
(3.21)
Again, we take the first row of both sides of equation 3.21 and use the fact that the first row of $\tilde{V}_{xx}^{-1}$ is $\tilde{C}_{x_1|y}$ to obtain

$\tilde{\sigma}^2(1) = \sigma^2(1) + \tilde{C}_{x_1|y} E C_{x_1|y}^T + \tilde{C}_{x_1|y}(\tilde{e}_1 - O e_1).$
(3.22)
Again, since $E$ is zero in the first $N - L$ rows, $E C_{x_1|y}^T$ is zero in the first $N - L$ components.

3.4 Proof of Claim 3. Here we need to define what we mean by "rapidly enough." We restate the claim precisely:

Claim 3: Suppose for every $\epsilon$ there exists a $t_\epsilon$ such that for all $t > t_\epsilon$, $|\tilde{C}_{x_1|y}\, r| < \epsilon \max_i |r(i)|$ for any vector $r$ that is nonzero only in the last $L$ components (those corresponding to the leaf nodes). In this case, (1) belief propagation converges, (2) the means are exact, and (3) the variances are equal to the correct variances minus the summed conditional correlations between $\tilde{x}_1$ and all nonroot $\tilde{x}_j$ that are replicas of $x_1$.

This claim follows from the first two claims. The only thing to show is that $E\mu$ and $E C_{x_1|y}^T$ are bounded for all iterations. This is true because the rows of $E$ are bounded and $\mu$, $C_{x_1|y}$ do not depend on the iteration.
3.5 Proof of Claim 4. The proof is based on the following lemma:¹

Conditional correlation lemma. Assume $P(x, y) = \alpha e^{-\frac{1}{2} z^T V z}$ with $z$, $V$ as in equations 2.6 and 2.7. Let $r$ be an arbitrary vector. Then $C_{x|y} r$ can be calculated by (1) modifying the joint probability so that $V_{xy} = -I$, (2) setting the observations $y = r$, and (3) calculating the posterior means of $x$ given this $y$ in the modified joint.

Proof.
This follows directly from equations 2.12 and 2.13.
Using this lemma, we can give sufficient conditions for convergence in terms of the potentials of the loopy network.

Claim 4: Assume all $V_{ij}$ are diagonally dominant ($|V_{ij}(k, l)| < V_{ij}(k, k)$) and all $V_{ii}(1, 1) > 0$. Then (1) belief propagation converges, (2) the belief propagation means are exact, and (3) the belief propagation variances are equal to the correct variances minus the summed conditional correlations between $\tilde{x}_1$ and all $\tilde{x}_j$ that are replicas of $x_1$.

Proof. The proof is based on claim 3 and the conditional correlation lemma. From claim 3, we know that it is sufficient to show that for any $\epsilon$ and any $r$ that is nonzero only at the leaves, $|\tilde{C}_{x_1|y}\, r| < \epsilon \max_i |r(i)|$ for sufficiently large unwrappings. By the conditional correlation lemma, we know we can calculate $|\tilde{C}_{x_1|y}\, r|$ by constructing an unwrapped tree with observations $r$ and computing the conditional mean at the root node. Since the unwrapped tree is singly connected, we can calculate the mean at the root node exactly by running belief propagation. We start by sending messages from the leaf nodes to the layer above and continue sending messages upward until we reach the root node. We use $M_l$ to denote the maximum absolute value of the means of all messages sent upward by layer $l$. By equation 2.18, the value $\mu'$ calculated by all nodes at layer $l - 1$ is a weighted average of the means from the previous layer and the means from the observations at layer $l - 1$. Since $r$ is zero for all layers but the bottom one, the means of the messages sent by observations at the $l - 1$ layer are zero. Thus:

$|\mu'| \le M_l.$
(3.23)
If we rewrite equation 2.16, taking advantage of the fact that $x_i$, $x_j$ are scalars, we have

$$\mu_{ij} = \frac{-b\,(a + P')^{-1} P' \mu'}{c - b^2 (a + P')^{-1}}.$$

(3.24)

¹ We thank Andrew Ng for suggesting this proof.
2188
Yair Weiss and William T. Freeman
Multiplying top and bottom by $(a + P')/P'$,

$$\mu_{ij} = \frac{-b}{(ac - b^2)/P' + c}\, \mu'.$$

(3.25)
Similarly, we can rewrite equation 2.15, taking advantage of $x_i$, $x_j$ being scalar, to give

$$P_{ij} = \frac{ca - b^2 + cP'}{a + P'}.$$

(3.26)
Note that for diagonally dominant matrices, $P_{ij}$ is nonnegative if $P'$ is nonnegative. Note also that if $V_{ii}(1, 1) > 0$ for all the observation potentials, the observations will send a positive precision to the unobserved nodes. Thus, all precisions will be nonnegative. Since $V_{ij}$ is diagonally dominant and $P'$ is nonnegative, both terms in the denominator of equation 3.25 are nonnegative, so

$$|\mu_{ij}| \le \frac{|b|}{|c|}\, |\mu'|.$$

(3.27)
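A quick numerical check of this bound, under the assumption that the pairwise potential is parameterized by a $2 \times 2$ matrix with diagonal $a$, $c$ and off-diagonal $b$ as in equations 3.24 to 3.26 (the sampling ranges below are arbitrary):

```python
import random

random.seed(0)
for _ in range(10_000):
    b = random.uniform(-1.0, 1.0)           # off-diagonal element
    a = abs(b) + random.uniform(0.01, 2.0)  # diagonal dominance: a > |b|
    c = abs(b) + random.uniform(0.01, 2.0)  # diagonal dominance: c > |b|
    P = random.uniform(0.0, 10.0)           # nonnegative incoming precision P'
    mu_in = random.uniform(-5.0, 5.0)       # incoming mean mu'
    # equations 3.25 and 3.26
    mu_out = -b * mu_in / ((a * c - b * b) / P + c) if P > 0 else 0.0
    P_out = (c * a - b * b + c * P) / (a + P)
    assert P_out >= 0                       # precisions stay nonnegative
    assert abs(mu_out) <= (abs(b) / abs(c)) * abs(mu_in) + 1e-12  # eq. 3.27
ok = True
```

Because $ac > b^2$ and $P' \ge 0$, the denominator of equation 3.25 is at least $c$, which is all the bound needs.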
We now denote $\beta = \max_{i,j} |V_{ij}(1, 2) / V_{ij}(2, 2)|$. Since all $V_{ij}$ are diagonally dominant, $\beta < 1$. Combining equations 3.23 and 3.27 gives

$$M_{l-1} \le \beta M_l.$$

(3.28)

So for any $\epsilon$, if we choose $t_\epsilon > \log \epsilon / \log \beta$, then $|\tilde{C}_{x_1|y}\, r| < \epsilon \max_i |r(i)|$.
The conditional correlation lemma can also be used to give bounds on the loopy variances, for example:

Corollary. If the $V_{ij}$ are diagonally dominant and the off-diagonal elements are negative, then the loopy beliefs are overconfident: $\tilde{\sigma}^2(1) < \sigma^2(1)$.

Proof. By the conditional correlation lemma, we can calculate $\tilde{C}_{x_1|y}\, r_2$ by setting the observations to be one at all copies of the root node and zero elsewhere. From equation 3.25, it is clear that when $b < 0$ and the observations are zero or one, the mean messages are weighted averages of positive values; hence, the mean at the root will be positive.

The decomposition of a GMRF into $V_{ij}$, $V_{ii}$ is not unique. The following lemma shows that even when we have a loopy graph where the $V_{ij}$ are not diagonally dominant, belief propagation will still converge when there exists a reparameterization using diagonally dominant matrices.
Reparameterization Lemma. If the unwrapped tree for any iteration of loopy propagation can be parameterized so that the $\tilde{V}_{ij}$ are diagonally dominant and $\tilde{V}_{ii}(1, 1) > 0$, then (1) belief propagation converges, (2) the belief propagation means are exact, and (3) the belief propagation variances are equal to the correct variances minus the summed conditional correlations between $\tilde{x}_1$ and all $\tilde{x}_j$ that are replicas of $x_1$.

Proof. This follows from the proof of claim 4. Since the unwrapped tree is singly connected, belief propagation using any parameterization is exact. We can calculate $\tilde{C}_{x_1|y}\, r$ by running belief propagation on the reparameterized tree in which all the matrices are diagonally dominant.

We emphasize that claim 4 gives only sufficient conditions for convergence. It is easy to construct networks in which the $V_{ij}$ are not all diagonally dominant but loopy belief propagation still converges. In section 6 we show an example.

4 Fixed Points of Loopy Propagation

Each iteration of belief propagation can be thought of as an operator $F$ that inputs a list of messages $m^{(t)}$ and outputs a list of messages $m^{(t+1)} = F m^{(t)}$. Thus, belief propagation can be thought of as an iterative way of finding a solution to the fixed-point equations $Fm = m$ with an initial guess $m^0$ in which all messages are constant functions. Note that this is not the only way of finding fixed points. McEliece et al. (1995) have shown a simple example for which $F$ contains multiple fixed points and belief propagation finds only one. They also showed an example where a fixed point exists but the iterations $m \leftarrow Fm$ do not converge. Murphy et al. (1999) describe an alternative method for finding fixed points of $F$. Another way of finding solutions to $Fm = m$ is to use an asynchronous update order, that is, to update equations 2.4 in some fixed order rather than the synchronous order we use here.

Update Order Lemma.
Let $F_A m$ be an updating of the messages in some fixed order, so that all messages $m_{ij}$ are updated sequentially using equation 2.4. Let $Fm$ be an updating of the messages in parallel. A set of messages $m$ satisfies $Fm = m$ if and only if it satisfies $F_A m = m$.

Proof. By the definition of a fixed point, $m$ is a fixed point of $F_A$ or of $F$ if and only if the left-hand side of equation 2.4 equals the right-hand side.

In this section we ask: suppose a fixed point $m^* = Fm^*$ has been found by some method. How are the beliefs calculated from these messages related to the correct beliefs?
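The synchronous operator $F$ is easy to implement for scalar gaussian nodes. The sketch below is not the paper's code: it runs standard gaussian BP message updates in information form on a small, arbitrarily chosen diagonally dominant loopy model (a 4-cycle) and checks the prediction that the fixed-point means are exact while the variances are overconfident.

```python
import numpy as np

def gaussian_bp(A, h, iters=200):
    """Synchronous gaussian BP on the pairwise model
    p(x) ~ exp(-0.5 x^T A x + h^T x); returns (means, variances)."""
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and A[i, j] != 0]
    P = {e: 0.0 for e in edges}   # message precisions P_{i->j}
    m = {e: 0.0 for e in edges}   # message information h_{i->j}
    for _ in range(iters):
        P_new, m_new = {}, {}
        for (i, j) in edges:
            # cavity quantities at i, excluding the message coming from j
            Pc = A[i, i] + sum(P[(k, l)] for (k, l) in edges if l == i and k != j)
            hc = h[i] + sum(m[(k, l)] for (k, l) in edges if l == i and k != j)
            P_new[(i, j)] = -A[i, j] ** 2 / Pc
            m_new[(i, j)] = -A[i, j] * hc / Pc
        P, m = P_new, m_new
    Pb = np.array([A[i, i] + sum(P[(k, l)] for (k, l) in edges if l == i)
                   for i in range(n)])
    hb = np.array([h[i] + sum(m[(k, l)] for (k, l) in edges if l == i)
                   for i in range(n)])
    return hb / Pb, 1.0 / Pb

A = np.array([[2.0, -0.5, 0.0, -0.5],
              [-0.5, 2.0, -0.5, 0.0],
              [0.0, -0.5, 2.0, -0.5],
              [-0.5, 0.0, -0.5, 2.0]])
h = np.array([1.0, -2.0, 0.5, 3.0])
mu_bp, var_bp = gaussian_bp(A, h)
mu_exact = np.linalg.solve(A, h)
var_exact = np.diag(np.linalg.inv(A))
```

At the fixed point, `mu_bp` matches `mu_exact` to machine precision, while every entry of `var_bp` is smaller than the corresponding entry of `var_exact`, as the corollary above predicts for negative off-diagonal elements.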
Claim 5: For a gaussian graphical model of arbitrary topology, if $m^*$ is a fixed point of the message-passing dynamics, then the means based on that fixed point are exact.

The proof is based on the following lemma.

Periodic Beliefs Lemma. If $m^*$ is a fixed point of the message-passing dynamics in a graphical model $G$, then one can construct a modified unwrapped tree $T$ of arbitrarily large depth such that (1) all nonleaf nodes in $T$ have the same statistical relationship with their neighbors as the corresponding nodes in $G$ and (2) all nodes in $T$ will have the same beliefs as the beliefs in $G$ derived from $m^*$.

Proof. The proof is by construction. We first construct an unwrapped tree $T$ of the desired depth. We then modify the potentials and the observations at the leaf nodes in the following manner. For each leaf node $\tilde{x}_i$, find the $k - 1$ nodes in $G$ that neighbor $x_{i'}$ (where $\tilde{x}_i$ is a replica of $x_{i'}$), excluding the parent of $\tilde{x}_i$. Calculate the product of the $k - 1$ messages that these neighbors send to the corresponding node in $G$ under the fixed-point messages $m^*$ and the message that $y_{i'}$ sends to $x_{i'}$. Set $\tilde{y}_i$ and $\Psi(\tilde{y}_i, \tilde{x}_i)$ such that the message $\tilde{y}_i$ sends to $\tilde{x}_i$ is equal to this product. By this construction, all leaf nodes in $T$ will send their neighbors a message from $m^*$. Since all nonleaf nodes in $T$ have the same statistical relationship to their neighbors as the corresponding nodes in $G$, the local message-passing updates in $T$ are identical to those in $G$. Thus, all messages in $T$ will be replicas of messages in $m^*$.

Proof of Claim 5. Using this lemma, we can prove claim 5. Let $\tilde{\mu}$ be the conditional mean in the modified unwrapped tree. Then, by the periodic beliefs lemma,

$\tilde{\mu} = O \mu^0,$
(4.1)
where $\mu^0(i)$ is the posterior mean at node $i$ under $m^*$. We also know that $\tilde{\mu}$ is a solution to

$\tilde{V}_{xx}\tilde{\mu} = -\tilde{V}_{xy}\tilde{y},$
(4.2)
where $\tilde{V}_{xx}$, $\tilde{V}_{xy}$, $\tilde{y}$ refer to quantities in the modified unwrapped tree. So

$\tilde{V}_{xx} O \mu^0 = -\tilde{V}_{xy}\tilde{y}.$
(4.3)
We use the notation $[A]_m$ to indicate taking the first $m$ rows of a matrix $A$. Note that for any two matrices, $[AB]_m = [A]_m B$. Taking the first $m$ rows of equation 4.3 gives

$[\tilde{V}_{xx} O]_m \mu^0 = -[\tilde{V}_{xy}\tilde{y}]_m.$
(4.4)
As in the previous proofs, the key idea is to relate the inverse covariance matrix of the modified unwrapped tree to that of the original loopy graph. Since all nonleaf nodes in the modified unwrapped tree have the same neighborhood relationships with their neighbors as the corresponding nodes in the loopy graph, we have, for any $m < N - L$,

$[\tilde{V}_{xx} O]_m = [O V_{xx}]_m$
(4.5)
and

$[\tilde{V}_{xy}\tilde{y}]_m = [O V_{xy} y]_m.$

(4.6)
Substituting these relationships into equation 4.4 gives

$[O]_m V_{xx} \mu^0 = -[O]_m V_{xy} y.$
(4.7)
This equation holds for any $m < N - L$. Since we can unwrap the tree to arbitrarily large size, we can choose $m$ such that $[O]_m$ has $n$ independent rows (this happens once all nodes in the loopy graph appear at least once in the modified unwrapped tree). Thus:

$V_{xx} \mu^0 = -V_{xy} y.$
(4.8)
Hence, the means derived from the fixed-point messages are exact.

5 Extensions

5.1 Vector-Valued Nodes. Most of the results we have derived so far hold for vector-valued nodes as well, but the indexing notation is rather more cumbersome. We use a stacking convention, in which we define the vector $x$ by

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \end{pmatrix}.$$

(5.1)
Thus, supposing $x_1$ is a vector of length 2, then $x(1)$ is the first component of $x_1$ and $x(2)$ is the second component of $x_1$ (not $x_2$). We define $y$ in a similar fashion. Using this stacking notation, the equations for exact inference in gaussians remain unchanged, but we need to be careful in reading out the posterior means and covariances from the stacked vectors. Thus, we can still complete the square in stacked notation to obtain

$V_{xx}\mu = -V_{xy} y$
(5.2)
and $C_{x|y} = V_{xx}^{-1}$. Assuming $x_1$ is of length 2, $\mu_1$, the posterior mean of $x_1$, is given by

$$\mu_1 = \begin{pmatrix} \mu(1) \\ \mu(2) \end{pmatrix},$$

(5.3)
and the posterior covariance matrix $\Sigma_1$ is given by

$$\Sigma_1 = \begin{pmatrix} C_{x|y}(1, 1) & C_{x|y}(1, 2) \\ C_{x|y}(2, 1) & C_{x|y}(2, 2) \end{pmatrix}.$$

(5.4)
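A small numerical illustration of the stacked readout, with arbitrary made-up matrices (not from the paper): $x$ stacks two 2-vectors, so $\mu_1$ and $\Sigma_1$ are read out of the first two stacked components.

```python
import numpy as np

rng = np.random.default_rng(0)
# x stacks two 2-vectors x1, x2 -> 4 scalar components
B = rng.standard_normal((4, 4))
V_xx = B @ B.T + 4 * np.eye(4)      # any positive definite precision block
V_xy = -np.eye(4)                   # each x(i) observed through one y(i)
y = np.array([1.0, 2.0, -1.0, 0.5])

mu = np.linalg.solve(V_xx, -V_xy @ y)   # equation 5.2
C = np.linalg.inv(V_xx)                 # C_{x|y} = V_xx^{-1}

mu_1 = mu[:2]        # posterior mean of the vector node x1 (eq. 5.3)
Sigma_1 = C[:2, :2]  # posterior covariance of x1 (eq. 5.4)
```

The block readout is the only place where vector-valuedness matters; the linear algebra itself is unchanged.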
We use the same stacked notation for $\tilde{x}$ and define the matrix $O$ such that $O(i, j) = 1$ if $\tilde{x}(i)$ is a replica of $x(j)$ and zero otherwise. Using this notation, the relationships between unwrapped and loopy quantities (e.g., $[\tilde{V}_{xx} O]_m = [O V_{xx}]_m$) still hold. Thus, all the analysis done in the previous sections holds. The only difference is the semantics of quantities such as $\mu(1)$, which need to be understood as a scalar component of a (possibly) larger vector $\mu_1$. For explicitness, we restate the five claims for vector-valued nodes. For any $i$, $j$ less than or equal to the number of components in $x_1$, we have:

Claim 1a:

$\tilde{\mu}(i) = \mu(i) + \tilde{C}_{x_i|y}\, r,$
(5.5)
where $r$ is a vector that is zero everywhere but the last $L$ components (corresponding to the leaf nodes).

Claim 2a:

$\tilde{C}_{x|y}(i, j) = C_{x|y}(i, j) + \tilde{C}_{x_j|y}\, r_1 - \tilde{C}_{x_j|y}\, r_2,$
(5.6)
where $r_1$ is a vector that is zero everywhere but the last $L$ components (corresponding to the leaf nodes), and $r_2$ is equal to 1 for all components corresponding to nonroot nodes in the unwrapped tree that reference $x(i)$. All other components of $r_2$ are zero.

Claim 3a: If the conditional correlation between all components of the root node and the leaf nodes decreases rapidly enough, then (1) belief propagation converges, (2) the belief propagation means are exact, and (3) the $i, j$ component of the belief propagation covariance matrices is equal to the $i, j$ component of the true covariance matrices minus the summed conditional correlations between $\tilde{x}(j)$ and all nonroot $\tilde{x}(k)$ that are replicas of $x(i)$.

Claim 5a: For a (possibly vector-valued) gaussian graphical model of arbitrary topology, if $m^*$ is a fixed point of the message-passing dynamics, then the means based on that fixed point are exact.
Claim 6a: For a (possibly vector-valued) graphical model of arbitrary topology with continuous potential functions, if $m^*$ is a fixed point of the max-product message-passing dynamics, then the assignment based on that fixed point is a local maximum of the posterior probability.

We emphasize that these claims do not need to be reproven. All the equations used in proving the scalar-valued case still hold; only the semantics we place on the individual components are different. We end this analysis with two simple corollaries:

Corollary 1. Let $m^*$ be a fixed point of Pearl's belief propagation algorithm on a gaussian Bayesian network of arbitrary topology and arbitrary clique size. Then the means based on $m^*$ are exact.

Corollary 2. Let $m^*$ be a fixed point of Pearl's belief revision (max-product) algorithm on a Bayesian network with continuous joint probability, arbitrary topology, and arbitrary clique size. The assignment based on $m^*$ is at least a local maximum of the posterior probability.

These corollaries follow from claims 5a and 6a, along with the equivalence between Pearl's propagation rules and the propagation rules for the pairwise undirected graphical models analyzed here (Weiss, 2000). Note that even if the Bayesian network contained only scalar nodes, the conversion to pairwise cliques may necessitate using vector-valued nodes.

5.2 Non-gaussian Variables. In section 2, we described the max-product belief propagation algorithm that finds the MAP estimate for each node (Pearl, 1988; Weiss, 2000) of a network without loops. As with sum-product, iterating this algorithm is a method of finding a fixed point of the message-passing dynamics. How does the assignment derived from this fixed point compare to the MAP assignment?

Claim 6: For a graphical model of arbitrary topology with continuous potential functions, if $m^*$ is a fixed point of the max-product message-passing dynamics, then the assignment based on that fixed point is a local maximum of the posterior probability.

Proof.
Since the posterior probability factorizes into a product of pairwise potentials, the log posterior will have the form

$$\log P(x \mid y) = \sum_{ij} J_{ij}(x_i, x_j) + \sum_i J_{ii}(x_i, y_i).$$

(5.7)
Assuming the clique potential functions are differentiable and finite, the MAP solution, $u$, will satisfy

$$\frac{\partial}{\partial x_i} \log P(x \mid y) \Big|_{x = u} = 0.$$

(5.8)
We will write this as

$V u = 0,$
(5.9)
where $V$ is a nonlinear operator. As in the previous section, we can use the periodic beliefs lemma to construct a modified unwrapped tree of arbitrary size based on $m^*$. If we denote by $\tilde{V}$ the nonlinear set of equations that the solution to the modified unwrapped problem must satisfy, we have

$\tilde{V}\tilde{u} = 0.$
(5.10)
Because of the periodic beliefs lemma,

$\tilde{u} = O u^0.$
(5.11)
Similarly, as in the previous section, all the nonleaf nodes will have the same statistical relationship with their neighbors as do the corresponding nodes in the loopy network, so

$[\tilde{V} O]_m = [O V]_m,$
(5.12)
where the left- and right-hand sides are nonlinear operators. Substituting equations 5.11 and 5.12 into 5.10 gives

$V u^0 = 0.$
(5.13)
A similar substitution can be made with the second-derivative equations to show that the Hessian at $u^0$ is positive definite. Thus, the assignment based on $m^*$ is at least a local maximum of the posterior.

Claim 6 can be extended to discrete-valued nodes as well. In Weiss and Freeman (2001) we prove that an assignment based on a fixed point of the max-product BP algorithm is a "neighborhood maximum" of the posterior probability. The posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a large neighborhood around that assignment. The neighborhood includes all assignments that differ from the max-product assignment in any subset of nodes that form no more than a single loop in the graph.

6 Simulations

To illustrate the analysis, we ran belief propagation on a 25 × 25 two-dimensional grid. The joint probability was

$$P(x, y) = \exp\left(-\sum_{ij} w_{ij}(x_i - x_j)^2 - \sum_i w_{ii}(x_i - y_i)^2\right),$$

(6.1)
where $w_{ij} = 0$ if nodes $x_i$, $x_j$ are not neighbors and $0.01$ otherwise, and $w_{ii}$ was randomly selected to be $10^{-6}$ or $1$ for all $i$, with the probability of selecting $1$ set to $0.2$. When $w_{ii} = 1$, we set the observations $y_i$ to be samples from the surface $z(x, y) = x + y$. When $w_{ii} = 10^{-6}$, we set $y_i = 0$. This problem corresponds to an approximation problem from sparse data, where only 20% of the points are visible and there is a weak prior on the unobserved nodes pulling them toward zero.

We found the exact posterior by solving equation 2.12. We also ran loopy belief propagation and found that when it converged, the loopy means were identical to the true means up to machine precision. Also, as predicted by the theory, the loopy variances were too small; the loopy estimate was overconfident.

How is this predicted from our analysis? This illustrates the reparameterization lemma in section 3.5. The parameterization we used in the loopy network was
$$V_{ij} = \begin{pmatrix} w_{ij} & -w_{ij} \\ -w_{ij} & w_{ij} \end{pmatrix}, \qquad V_{ii} = \begin{pmatrix} w_{ii} & -w_{ii} \\ -w_{ii} & w_{ii} \end{pmatrix}.$$
Thus, Vij are not diagonally dominant. However, we can also use a different parameterization in which we shift the diagonal component from Vii to Vij : 0
wii ¡ 2 Bwij C deg ( xQi ) VQ ij D B @ ¡wij
1 ¡wij
C C wij ¡ 2 A , wij C deg ( xQj )
³ VQ ii D
2
¡wii
¡wii wii
´ .
In this parameterization, all the $\tilde{V}_{ij}$ are diagonally dominant. Thus, claim 4 guarantees that belief propagation will converge and the variances will be overconfident.

We also ran belief propagation on this problem with $w_{ii} = 0$ for the unobserved nodes. Note that in this case, we cannot reparameterize the unwrapped tree so that all $V_{ij}$ are diagonally dominant, so claim 4 does not hold. We found that belief propagation still converged with a similar convergence rate and that the means were exact. This illustrates the fact that claim 4 gives only sufficient (but not necessary) conditions for the conditional correlation to decrease rapidly enough. Claim 5 guarantees that the means will be exact.

In many applications, the solution of equation 2.12 by matrix inversion is intractable and iterative methods are used. Figure 4 compares the error in the means as a function of iterations for loopy propagation, successive overrelaxation (SOR), and conjugate gradient (CG). We used an overrelaxation constant of 1.9. Note that after five iterations, loopy propagation has a mean squared error of order $10^{-1}$, while SOR and CG require many more iterations (although the asymptotic performance of CG is the best). As expected by the
Figure 4: (a) 25 × 25 graphical model for the simulation. (b) The error of the estimates of belief propagation (BP), conjugate gradient (CG), and successive overrelaxation (SOR) as a function of iteration. (c) The same plot as in b but with log scaling on the y axis. (d–e) Similar plots when the connections between unobserved nodes were increased by four orders of magnitude. BP converges very quickly to a good solution, although SOR and CG may have better asymptotic properties.
Figure 5: (a) The correct interpolation. (b) Results of BP after 10 iterations. (c) Results of SOR after 10 iterations. (d) Results of CG after 10 iterations.
fast convergence, the approximation error in the variances was quite small. The median error was 0.018. For comparison, the true variances ranged from 0.01 to 0.94, with a mean of 0.322. Also, the nodes for which the approximation error was worst were indeed the nodes that converged most slowly.

The analysis in section 3.5 suggests that the convergence rate is related to the ratio of diagonal to off-diagonal elements in $\tilde{V}_{ij}$. To illustrate this, we redid the simulations with the same network but set $w_{ij} = 10.0$ between neighboring nodes. Indeed, convergence is much slower in this case (see figures 4d and 4e). As expected from the slower convergence, the error in the variance estimates is larger here, with a median error of 0.299. For visual comparison, Figure 5 shows the optimal answer and the results of BP, CG, and SOR after 10 iterations.

The slow convergence of SOR on problems such as these led to the development of multiresolution models, in which the MRF is approximated by a tree (Luettgen, Karl, & Willsky, 1994; Daniel & Willsky, 1999) and an algorithm equivalent to belief propagation is then run on the tree. Although the multiresolution models are much more efficient for inference, the tree
structure often introduces block artifacts in the estimate. Our results suggest that one can simply run belief propagation on the original MRF and get the exact posterior means (in theory, BP may fail to converge, but in many simulation runs we never observed a failure of BP to converge in these two-dimensional interpolation problems). It would be interesting to see whether one can use the convergence rate of the beliefs to improve the variance estimate at every node.

7 Discussion

Independent of our results, two groups have recently analyzed special cases of gaussian graphical models. Frey (2000) analyzed the graphical model corresponding to factor analysis and gave conditions for the existence of a stable fixed point. Rusmevichientong and Van Roy (2000) analyzed a graphical model with the topology of turbo decoding but a gaussian joint density. They showed that for this specific case, belief propagation converges and the means are exact.

Our main interest in analyzing the gaussian case was to understand the performance of belief propagation in networks with multiple loops. Although there are many special properties of gaussians, we are struck by the similarity of the analytical results reported here for multiloop gaussians and the analytical results for single loops and general distributions reported in Weiss (2000). The most salient similarities are these:

In single-loop networks with binary nodes, the mode at each node is guaranteed to be correct, but the confidence in the mode may be incorrect. In gaussian networks with multiple loops, the mean at each node is guaranteed to be correct, but the confidence around that mean will in general be incorrect.

In single-loop networks, fast convergence is correlated with good approximation of the beliefs. This is also true for gaussian networks with multiple loops.

In single-loop networks, the convergence rate and the approximation error are determined by a ratio of eigenvalues $\lambda_1 / \lambda_2$.
This ratio determines the extent of the statistical dependencies between the root and the leaf nodes in the unwrapped network for a single loop. In gaussian networks, the convergence rate and the approximation error are determined by the off-diagonal terms of $\tilde{C}_{x|y}$. These terms quantify the extent of conditional dependencies between the root nodes and the leaf nodes of the unwrapped network.

These similarities are even more intriguing when one considers how different gaussian graphical models are from discrete models with arbitrary potentials and a single loop. In gaussians, the conditional mean is equal to the conditional mode, and there is only one maximum in the posterior probability, while the single-loop discrete models may have multiple maxima, none of which will be equal to the mean. Furthermore, in terms of approximate inference, the two classes behave quite differently. For example, mean-field approximations give the exact means for gaussian MRFs, while they work poorly in discrete networks with a single loop in which the connectivity is sparse (Weiss, 1996).

The resemblance of the results for gaussian graphical models and for single loops leads us to believe that similar results may hold for a larger class of networks. The sum-product and max-product belief propagation algorithms are appealing, fast, and easily parallelizable algorithms. Due to the well-known hardness of probabilistic inference in graphical models, belief propagation will obviously not work for arbitrary networks and distributions. Nevertheless, there is a growing body of empirical evidence showing its success in many loopy networks. Our results give a theoretical justification for applying belief propagation in networks with multiple loops. This may enable fast, approximate probabilistic inference in a range of new applications.

Acknowledgments

We thank A. Ng, K. Murphy, P. Pakzad, B. Frey, and M. I. Jordan for comments on previous versions of this article. Y. W. is supported by MURI-ARO-DAAH04-96-1-0341.

References

Aji, S., Horn, G., & McEliece, R. (1998). On the convergence of iterative decoding on graphs with a single cycle. In Proc. 1998 ISIT.

Aji, S. M., & McEliece, R. (1999). The generalized distributive law. IEEE Transactions on Information Theory, 46, 325–343.

Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proc. IEEE International Communications Conference '93.

Cowell, R. (1998). Advanced inference in Bayesian networks. In M. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.

Daniel, M. M., & Willsky, A. S. (1999).
The modeling and estimation of statistically self-similar processes in a multiresolution framework. IEEE Trans. Info. Theory, 45(3), 955–970. Forney, G., Kschischang, F., Marcus, B., & Tuncel, S. (2000). Iterative decoding of tail-biting trellises and connections with symbolic dynamics. In Proc. IMA Workshop on Codes, Systems, and Graphical Models. New York: Springer-Verlag. Freeman, W., Pasztor, E., & Carmichael, O. (2000). Learning low-level vision. Intl. J. of Comp. Vision, 40(1), 25–47. Frey, B. J. (1998). Graphical models for pattern classication, data compression and channel coding. Cambridge, MA: MIT Press.
Frey, B. (2000). Local probability propagation for factor analysis. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Frey, B. J., Koetter, R., & Vardy, A. (1998). Skewness and pseudocodewords in iterative decoding. In Proc. 1998 ISIT.
Gallager, R. (1963). Low density parity check codes. Cambridge, MA: MIT Press.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6), 721–741.
Jensen, F. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
Kschischang, F. R., & Frey, B. J. (2000). Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communication, 16(2), 219–230.
Lauritzen, S. (1996). Graphical models. New York: Oxford University Press.
Luettgen, M. R., Karl, W. C., & Willsky, A. S. (1994). Efficient multiscale regularization with application to the computation of optical flow. IEEE Transactions on Image Processing, 3(1), 41–64.
McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.
McEliece, R., Rodemich, E., & Cheng, J. (1995). The Turbo decision algorithm. In Proc. 33rd Allerton Conference on Communications, Control and Computing (pp. 366–379). Monticello, IL.
Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Rusmevichientong, P., & Van Roy, B. (2000). An analysis of Turbo decoding with gaussian densities. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Weiss, Y. (1996). Interpreting images by propagating Bayesian beliefs. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Weiss, Y. (1997). Belief propagation and revision in networks with loops (Tech. Rep. No. 1616). Cambridge, MA: MIT AI Lab.
Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12, 1–41.
Weiss, Y., & Freeman, W. (2001). On the optimality of solutions of the max-product belief propagation algorithm. IEEE Transactions on Information Theory, 37(2), 736–744.
Wiberg, N. (1996). Codes and decoding on general graphs. Unpublished doctoral dissertation, University of Linköping, Sweden.
Received August 6, 1999; accepted January 17, 2001.
LETTER
Communicated by Andrew Barto
MOSAIC Model for Sensorimotor Learning and Control

Masahiko Haruno
ATR Human Information Processing Research Laboratories, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Daniel M. Wolpert
Sobell Department of Neurophysiology, Institute of Neurology, University College London, London WC1N 3BG, U.K.
Mitsuo Kawato Dynamic Brain Project, ERATO, JST, Kyoto, Japan, and ATR Human Information Processing Research Laboratories, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Humans demonstrate a remarkable ability to generate accurate and appropriate motor behavior under many different and often uncertain environmental conditions. We previously proposed a new modular architecture, the modular selection and identification for control (MOSAIC) model, for motor learning and control based on multiple pairs of forward (predictor) and inverse (controller) models. The architecture simultaneously learns the multiple inverse models necessary for control as well as how to select the set of inverse models appropriate for a given environment. It combines both feedforward and feedback sensorimotor information so that the controllers can be selected both prior to movement and subsequently during movement. This article extends and evaluates the MOSAIC architecture in the following respects. First, the learning in the architecture was implemented by both the original gradient-descent method and the expectation-maximization (EM) algorithm. Unlike gradient descent, the newly derived EM algorithm is robust to the initial starting conditions and learning parameters. Second, simulations of an object manipulation task demonstrate that the architecture can learn to manipulate multiple objects and switch between them appropriately. Moreover, after learning, the model shows generalization to novel objects whose dynamics lie within the polyhedra of already learned dynamics. Finally, when each of the dynamics is associated with a particular object shape, the model is able to select the appropriate controller before movement execution. When presented with a novel shape-dynamic pairing, inappropriate activation of modules is observed, followed by on-line correction.
Neural Computation 13, 2201–2220 (2001) © 2001 Massachusetts Institute of Technology
1 Introduction

Given the multitude of contexts, that is, the properties of objects in the world and the prevailing environmental conditions, there are two qualitatively distinct strategies for motor control and learning. The first is to use a single controller, which would need to be highly complex to allow for all possible scenarios. If this controller were unable to encapsulate all the contexts, it would need to adapt every time the context of the movement changed before it could produce appropriate motor commands. This would produce transient and possibly large performance errors. Alternatively, a modular approach can be used in which multiple controllers coexist, with each controller suitable for one or a small set of contexts. Such a modular strategy has been introduced in the mixture-of-experts architecture for supervised learning (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jacobs, Jordan, & Barto, 1991; Jordan & Jacobs, 1992). This architecture comprises a set of expert networks and a gating network that performs classification by combining each expert's output. These networks are trained simultaneously so that the gating network splits the input space into regions in which particular experts can specialize.

To apply such a modular strategy to motor control, two problems must be solved. First, how does the set of inverse models (controllers) learn to cover the contexts that might be experienced (the module learning problem)? Second, given a set of such inverse models, how are the correct subsets selected for the current context (the module selection problem)? From human psychophysical data, we know that such a selection process must be driven by two distinct processes: feedforward selection based on sensory signals, such as the perceived size of an object, and selection based on feedback of the outcome of a movement. For example, on picking up an object that appears heavy, feedforward selection may choose controllers responsible for generating a large motor impulse.
However, feedback processes, based on contact with the object, can indicate that it is in fact light, thereby causing a switch to the inverse models appropriate for such an object.

Narendra and Balakrishnan (1997) and Narendra, Balakrishnan, and Ciliz (1995) have conducted pioneering work on a multiple-module system consisting of paired forward and inverse models. At any given time, the system selects only one control module, that is, one inverse model, according to the accuracy of the prediction of its paired forward model. Under this condition, they were able to prove stability of the system when the modules are used in the context of model-reference control of an unknown linear time-invariant system. To guarantee this stability, Narendra's model has no learning of the inverse models and little in the forward models. In addition, no mixing of the outputs of the inverse models is allowed, thereby limiting generalization to novel contexts. A hallmark of biological control is the ability to learn from a naive state and to generalize to novel contexts. Therefore, we have sought to examine the modular architecture in a framework in which the forward and inverse models are learned de novo, and multiple inverse models can simultaneously contribute to control. In addition, switching between modules, as well as being driven by prediction errors as in Narendra's model, can be initiated by purely sensory cues. Our goal is thus an empirical system that shows suitable generalization, can learn both predictors and controllers from a naive state, and can switch control based on both prediction errors during a movement and sensory cues prior to movement initiation.

In an attempt to incorporate these goals into a model of motor control and learning, Gomi and Kawato (1993) combined the feedback-error-learning approach (Kawato, 1990) and the mixture-of-experts architecture to learn multiple inverse models for different manipulated objects. They used both the visual shape of the manipulated objects and intrinsic signals, such as somatosensory feedback and an efference copy of the motor command, as the inputs to the gating network. Using this architecture, it was possible, although quite difficult, to acquire multiple inverse models. This difficulty arose because a single gating network needed to divide up the large input space into complex regions based solely on the control error. Furthermore, Gomi and Kawato's (1993) model cannot demonstrate feedforward controller selection prior to movement execution.

Wolpert and Kawato (1998) proposed a conceptual model of sensorimotor control that addresses these problems and can potentially solve the module learning and selection problems in a computationally coherent manner. The basic idea of the model, termed the modular selection and identification for control (MOSAIC) model, is that the brain contains multiple pairs (modules) of forward (predictor) and inverse (controller) models. Within a module, the forward and inverse models are tightly coupled during both their acquisition and use.
The forward models learn to divide up the contexts experienced so that, for any given context, a set of forward models can predict the consequences of a given motor command. The prediction errors of each forward model are then used to gate the learning of its paired inverse model. This ensures that, within a module, the inverse model learns the appropriate control for the context in which its paired forward model makes accurate predictions. The selection of the inverse models is derived from the combination of the forward models' prediction errors and the sensory contextual cues, which enables the MOSAIC model to select controllers prior to movement.

This article extends the MOSAIC model and provides the first simulations of its performance. First, the model is extended to incorporate context transitions under the Markovian assumption (hidden Markov model, HMM). The expectation-maximization (EM) learning rule for the HMM-based MOSAIC model is derived and compared with the originally proposed gradient-based method. Second, we evaluate the architecture in several respects by using the same task as Gomi and Kawato (1993). This confirms that the MOSAIC model learns more efficiently than the mixture-of-experts architecture. Next, we examine the performance of the MOSAIC model on novel contexts (generalization). Finally, we discuss its performance when each context is linked to a sensory stimulus, and the effects of a mismatch of the sensory stimulus with the context. These simulations clearly demonstrate the usefulness of combining the forward models' feedback-driven selection with feedforward selection based on sensory contextual cues. Section 2 introduces the MOSAIC model and describes its gradient-based learning method. In section 3, the Markovian assumption of context switching is introduced, and the EM learning rule is derived. Section 4 reports several simulation results using the multiple-object manipulation task (Gomi & Kawato, 1993; Haruno, Wolpert, & Kawato, 1999b). Finally, the article concludes with the discussion in section 5.

2 MOSAIC Model

2.1 Feedforward and Feedback Selection. Figure 1 illustrates how the MOSAIC model can be used to learn and control movements when the controlled object (plant) can have distinct and changing dynamics. For example, in the simulations we will consider controlling the arm while the hand manipulates objects with distinct dynamics. Central to the MOSAIC model is the notion of dividing up experience using predictive forward models (Kawato, Furukawa, & Suzuki, 1987; Jordan & Rumelhart, 1992). We consider $n$ undifferentiated forward models, each of which receives the current state, $x_t$, and motor command, $u_t$, as input. The output of the $i$th forward model is $\hat{x}^i_{t+1}$, the prediction of the next state at time $t$:

\hat{x}^i_{t+1} = \varphi(w^i_t, x_t, u_t), \qquad (2.1)
where $w^i_t$ are the parameters of a function approximator $\varphi$ (e.g., neural network weights) used to model the forward dynamics. These predicted next states are compared to the actual next state to determine the likelihood that each forward model accounts for the behavior of the system. Based on the prediction errors of the forward models, the likelihood $l^i_t$ of the $i$th forward-inverse model pair (module) generating $x_t$ is calculated by assuming that the data are generated by that forward model's dynamics but contaminated with gaussian noise with standard deviation $\sigma$:

l^i_t = P(x_t \mid w^i_t, u_t, i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-|x_t - \hat{x}^i_t|^2 / 2\sigma^2}, \qquad (2.2)
where $x_t$ is the true state of the system. To normalize these likelihoods across the modules, the softmax function of the likelihoods in equation 2.3 could be used,

\frac{l^i_t}{\sum_{j=1}^{n} l^j_t}, \qquad (2.3)

in which the value for each module lies between 0 and 1, and the values sum to 1 over the modules. Those forward models that capture the current behavior, and therefore produce small prediction errors, will have large softmax values.¹

¹The softmax function represents Bayes' rule in the case of equal priors.

Figure 1: Schematic of the MOSAIC model with $n$ paired modules. Each module consists of three interacting parts. The first two, the forward model and the responsibility predictor, are used to determine the responsibility of the module, $\lambda^i_t$, reflecting the degree to which the module captures the current context and should therefore participate in control. The $i$th forward model receives a copy of the motor command $u_t$ and makes a state prediction $\hat{x}^i_t$. The likelihood $l^i_t$ that a particular forward model captures the current behavior is determined from its prediction error, $x_t - \hat{x}^i_t$, using a likelihood model. A responsibility predictor estimates the responsibility before movement onset using sensory contextual cues $y_t$ and is trained to approximate the final responsibility estimate. By multiplying this estimate, the prior $\pi^i_t$, by the likelihood $l^i_t$ and normalizing across the modules, an estimate of the module's responsibility $\lambda^i_t$ is achieved. The third part is an inverse model that learns to provide suitable control signals, $u^i_t$, to achieve a desired state, $x^*_t$, when the paired forward model provides accurate predictions. The responsibilities are used to weight the inverse-model outputs to compute the final motor command $u_t$ and to control the learning within both forward and inverse models, with models having high responsibilities receiving proportionally more of their error signal.
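As a concrete illustration, equations 2.2 and 2.3 can be sketched as follows. This is a minimal sketch assuming scalar states and a hand-chosen noise scale sigma; the function names are ours, not part of the original model:

```python
import numpy as np

def likelihoods(x_t, x_pred, sigma=0.2):
    """Gaussian likelihood of the observed state x_t under each forward
    model's prediction x_pred[i] (equation 2.2)."""
    x_pred = np.asarray(x_pred, dtype=float)
    return np.exp(-(x_t - x_pred) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def softmax_normalize(l):
    """Normalize the likelihoods across modules (equation 2.3); since
    each l[i] is an exponential of a negative squared error, dividing by
    the sum is exactly the softmax the text describes."""
    l = np.asarray(l, dtype=float)
    return l / l.sum()
```

The module whose prediction lies closest to the observed state receives the largest normalized value, which is what drives the competition described in the next paragraph.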
They could then be used to control the learning of the forward models in a competitive manner, with models having a large softmax value receiving proportionally more of their error signal than modules with a small one. While the softmax provides information about the context during a movement, it cannot do so before a motor command has been generated and the consequences of this action evaluated. To allow the system to select controllers based on contextual information,² we introduce a new component, the responsibility predictor (RP). The input to this module, $y_t$, contains contextual sensory information (see Figure 1), and each responsibility predictor produces a prior probability, $\pi^i_t$, as in equation 2.4, that each model is responsible in a given context,

\pi^i_t = \eta(\delta^i_t, y_t), \qquad (2.4)

where $\delta^i_t$ are the parameters of a function approximator $\eta$. These estimated prior probabilities can then be combined with the likelihoods obtained from the forward models by using the general framework of Bayes' rule. This produces a posterior probability, as in equation 2.5,

\lambda^i_t = \frac{\pi^i_t\, l^i_t}{\sum_{j=1}^{n} \pi^j_t\, l^j_t}, \qquad (2.5)

that each model is responsible for a given context. We call these posterior probabilities the responsibility signals, and they form the training signal for the RP. This ensures that the priors will learn to reflect the posterior probabilities. In summary, the estimates of the responsibilities produced by the RP can be considered prior probabilities because they are computed before movement execution, based only on extrinsic signals, and do not rely on knowing the consequences of the action. Once an action takes place, the forward models' errors can be calculated; this can be thought of as the likelihood after movement execution, based on knowledge of the result of the movement. The final responsibility, which is the product of the prior and likelihood, normalized across the modules, represents the posterior probability. Adaptation of the RP ensures that the prior probability becomes closer to the posterior probability.

2.2 Gradient-Based Learning. This section describes gradient-descent learning in the MOSAIC model. The responsibility signal is used to gate the learning of the forward and inverse models and to determine inverse-model selection. It ensures competition between the forward models.

²We regard all sensory information that is not an element of the dynamical differential equations as contextual signal.
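The Bayesian combination in equation 2.5 can be sketched as follows, assuming the RP priors and forward-model likelihoods are already available as arrays (the helper name is ours):

```python
import numpy as np

def responsibilities(priors, liks):
    """Combine RP priors pi_t with forward-model likelihoods l_t via
    Bayes' rule and normalize across modules (equation 2.5)."""
    p = np.asarray(priors, dtype=float) * np.asarray(liks, dtype=float)
    return p / p.sum()
```

For instance, if the visual cue makes module 0 the prior favorite but the movement outcome makes module 1 far more likely, the posterior responsibility shifts to module 1, which is exactly the on-line reselection behavior reported in section 4.2.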
The following gradient-descent algorithm, equation 2.6, is similar in spirit to the "annealed competition of experts" architecture (Pawelzik, Kohlmorgen, & Müller, 1996):

\Delta w^i_t = \epsilon\, \lambda^i_t\, \frac{d\varphi^i}{dw^i_t}\,(x_t - \hat{x}^i_t), \qquad (2.6)

where $\epsilon$ is the learning rate. For each forward model, a paired inverse model in equation 2.7 is assumed whose inputs are the desired next state $x^*_{t+1}$ and the current state $x_t$. The $i$th inverse model produces a motor command $u^i_t$ as output,

u^i_t = \psi(v^i_t, x^*_{t+1}, x_t), \qquad (2.7)

where $v^i_t$ are the parameters of some function approximator $\psi$. For physical dynamic systems, an inverse model will exist. There may not be a unique inverse, but feedback-error learning, described below, will find one of the inverses that is sufficient for our model. The total motor command in equation 2.8 is the summation of the outputs from these inverse models, using the responsibilities, $\lambda^i_t$, to weight the contributions:

u_t = \sum_{i=1}^{n} \lambda^i_t\, u^i_t = \sum_{i=1}^{n} \lambda^i_t\, \psi(v^i_t, x^*_{t+1}, x_t). \qquad (2.8)

Therefore, each inverse model contributes in proportion to the accuracy of its paired forward model's prediction. One of the goals of the simulations will be to test the appropriateness of this proposed mixing strategy. Once again, the responsibilities are used to weight the learning of each inverse model, as in equation 2.9. This ensures that inverse models learn only when their paired forward models make accurate predictions. Although supervised learning would require the desired control command $u^*_t$ (which is generally not available), we can approximate $(u^*_t - u_t)$ with the feedback motor command signal $u_{fb}$ (Kawato, 1990):

\Delta v^i_t = \epsilon\, \lambda^i_t\, \frac{d\psi^i}{dv^i_t}\,(u^*_t - u_t) \approx \epsilon\, \lambda^i_t\, \frac{du^i_t}{dv^i_t}\, u_{fb}. \qquad (2.9)
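A minimal sketch of the responsibility-gated updates and the motor-command mixture (equations 2.6, 2.8, and 2.9), assuming scalar states and linear forward/inverse models as in the simulations of section 4; the function names and the explicit per-step form are ours:

```python
import numpy as np

EPS = 0.01  # learning rate (epsilon in equations 2.6 and 2.9)

def forward_update(w, lam, x_t, u_t, x_next):
    """Gated gradient step (eq. 2.6) for a linear forward model
    x_hat = w @ [x_t, u_t]; lam is the module's responsibility."""
    phi_in = np.array([x_t, u_t])
    x_hat = w @ phi_in
    return w + EPS * lam * (x_next - x_hat) * phi_in

def inverse_update(v, lam, x_des, x_t, u_fb):
    """Gated feedback-error-learning step (eq. 2.9) for a linear inverse
    model u_i = v @ [x_des, x_t]; u_fb approximates (u* - u)."""
    psi_in = np.array([x_des, x_t])
    return v + EPS * lam * u_fb * psi_in

def motor_command(vs, lams, x_des, x_t):
    """Responsibility-weighted mixture of inverse-model outputs (eq. 2.8)."""
    psi_in = np.array([x_des, x_t])
    return sum(lam * (v @ psi_in) for v, lam in zip(vs, lams))
```

A single gated step moves a module's prediction toward the observed next state in proportion to its responsibility, so modules with near-zero responsibility are effectively frozen.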
In summary, the responsibility signals $\lambda^i_t$ are used in three ways: to gate the learning of the forward models (see equation 2.6), to gate the learning of the inverse models (see equation 2.9), and to determine the contribution of the inverse models to the final motor command (see equation 2.8).

3 HMM-Based Learning

The learning rule described so far for the forward models (see equations 2.5 and 2.6) is a standard gradient-based method and thus involves the following drawbacks. First, the scaling parameter $\sigma$ must be manually tuned. This parameter determines the extent of competition among modules and is critical for successful learning. Second, the responsibility signal $\lambda^i_t$ at time $t$ is computed only from the single observation of the sensory feedback made at time $t$. Without considering the dynamics of context evolution, the modules may switch too frequently due to any noise associated with the observations. To overcome these difficulties, we can use the EM algorithm to adjust the scaling parameter $\sigma$ and incorporate a probabilistic model of the context transitions. This places the MOSAIC model selection process within the framework of hidden-state estimation: the computational machinery cannot directly access the true value of the responsibility, although it can observe the physical state (i.e., position, velocity, and acceleration) and the motor command. More specifically, we introduce here another way of learning and switching based on the hidden Markov model (HMM) (Huang, Ariki, & Jack, 1990). It is related to methods proposed in the speech-processing community (Poritz, 1982; Juang & Rabiner, 1985). Let $X_i = [x^i_1, \ldots, x^i_t]'$ be a sequence of outputs of the $i$th forward model and $Y_i = [x^i_0 u_1, \ldots, x^i_{t-1} u_t]'$ be a sequence of forward-model inputs of the $i$th module. The forward model is assumed to be a linear system with the relation $X_i = Y_i W_i$, where $W_i$ is a column vector representing the true $i$th forward-model parameters. By assuming a gaussian noise process with standard deviation $\sigma_i$, the likelihood $L_i(X \mid W_i, \sigma_i)$ that the $i$th module generates the given data is computed as follows:

L_i(X \mid W_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{1}{2\sigma_i^2}(X - Y_i W_i)'(X - Y_i W_i)}.
To consider dynamics, the HMM introduces a stationary transition probability matrix $P$, in which $a_{ij}$ represents the probability of switching from the $i$th to the $j$th module:

P = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}.

The gradient-based method described in the previous section implicitly assumes that $a_{ij} = 1/n$, so that all transitions are equally likely. As contexts tend to change smoothly, this assumption may lead to more frequent and noisier switching than desired. With this notation, the total expected likelihood $L(X \mid \theta)$ that the $n$ modules generate the given data is computed as in equation 3.1:

L(X \mid \theta) = \sum_{\xi} \prod_{t=0}^{T-1} a_{s_{t-1} s_t}\, L_{s_t}(X(t) \mid W_{s_t}, \sigma_{s_t}), \qquad (3.1)
in which $\theta$ represents the parameters of the forward models ($W_i$ and $\sigma_i$), and $\xi$ and $s_t$ are, respectively, all possible sequences of modules and the module selected at time $t$. The objective here is to determine the optimal parameters $P$, $W_i$, and $\sigma_i$ that maximize the expected likelihood $L(X \mid \theta)$. The standard iterative technique for this problem is the Baum-Welch algorithm (Baum, Petrie, Soules, & Weiss, 1970), a specialized instance of EM (Dempster, Laird, & Rubin, 1977). The first step (E-step) is to compute the expectation in equation 3.1. To avoid the combinatorial explosion in enumerating $\xi$, the Baum-Welch algorithm introduces $\alpha_t(i)$ and $\beta_t(i)$ for dynamic programming. $\alpha_t(i)$ represents the forward probability $\Pr(X_1, \ldots, X_t, s_t = i \mid \theta)$ of the partial observation sequence up to time $t$ and of module $i$ being reached at time $t$, given the parameters $\theta$. $\beta_t(i)$, on the other hand, is the backward probability $\Pr(X_{t+1}, \ldots, X_T \mid s_t = i, \theta)$ of the partial observation sequence from $t+1$ to the final observation $T$, given module $i$ at time $t$ and the parameters $\theta$. The two variables can be computed recursively as follows, and the expectation reduces to the sum of products of $\alpha_t(i)$ and $\beta_t(i)$:

\alpha_t(i) = \sum_{j=1}^{n} \alpha_{t-1}(j)\, a_{ji}\, L_i(X(t))

\beta_t(i) = \sum_{j=1}^{n} \beta_{t+1}(j)\, a_{ij}\, L_j(X(t+1))

L(X \mid \theta) = \sum_{i=1}^{n} \alpha_{T-1}(i) = \sum_{i=1}^{n} \alpha_t(i)\, \beta_t(i).
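The alpha/beta recursions above can be sketched as follows. This is a sketch assuming uniform initial module probabilities; `lik[t, i]` stands for $L_i(X(t))$ and `A` for the transition matrix:

```python
import numpy as np

def forward_backward(lik, A):
    """Alpha/beta recursions for the module sequence.
    lik: (T, n) array of per-module likelihoods L_i(X(t)).
    A:   (n, n) transition matrix, A[i, j] = a_ij.
    Returns alpha, beta, and the total likelihood L(X | theta)."""
    T, n = lik.shape
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))
    alpha[0] = lik[0] / n                       # uniform initial prior (assumption)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]  # alpha_t(i) = sum_j alpha_{t-1}(j) a_ji L_i
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (beta[t + 1] * lik[t + 1])
    return alpha, beta, alpha[-1].sum()
```

A useful sanity check is the identity above: the quantity $\sum_i \alpha_t(i)\beta_t(i)$ equals $L(X \mid \theta)$ at every time step.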
In the framework of the MOSAIC model, the weighted average of inverse-model outputs is needed to compute the motor command. Equation 3.2 replaces equation 2.8 for computing the motor command:

u_t = \sum_{i=1}^{n} \alpha_{t-1}(i)\, u^i_t = \sum_{i=1}^{n} \alpha_{t-1}(i)\, \psi(v^i_t, x^*_{t+1}, x_t). \qquad (3.2)
The second step (M-step) determines the optimal parameters for the current value of the expectation. We introduce two variables, $\gamma_t(i)$ and $\gamma_t(i, j)$, which represent, respectively, the posterior probability of being in module $i$ at time $t$, given the observation sequence and parameters, and the posterior probability of a transition from module $i$ to module $j$, conditioned on the observation sequence and parameters. Because $\gamma_t(i)$ can be regarded as an instance of $\lambda^i_t$ in the previous section, we can also use $\gamma_t(i)$ (and $\alpha_t(i)$ for on-line computation) instead of $\lambda^i_t$ in the rest of the article:

\gamma_t(i) = P(s_t = i \mid X, \theta) = \alpha_t(i)\, \beta_t(i) / L(X \mid \theta)

\gamma_t(i, j) = P(s_t = i, s_{t+1} = j \mid X, \theta) = \alpha_t(i)\, a_{ij}\, L_j(X(t+1))\, \beta_{t+1}(j) / L(X \mid \theta).

With these definitions, the state transition probabilities are reestimated as in equation 3.3,

\hat{a}_{ij} = \frac{\sum_{t=0}^{T-2} \gamma_t(i, j)}{\sum_{t=0}^{T-2} \gamma_t(i)}, \qquad (3.3)

which is the expected number of transitions from module $i$ to module $j$ divided by the expected number of times the process stays in module $i$. Since the maximization step for the linear parameters is a maximum-likelihood estimation weighted by $\gamma_t(i)$ (Poritz, 1982; Fraser & Dimitriadis, 1993), the new estimates of $W$ and $\sigma$ ($\hat{W}$ and $\hat{\sigma}$, respectively) are computed as in equations 3.4 and 3.5,

\hat{W}_i = (Y' \gamma(i)\, Y)^{-1}\, Y' \gamma(i)\, X, \qquad (3.4)

\hat{\sigma}_i^2 = \frac{(X - Y\hat{W}_i)'\, \gamma(i)\, (X - Y\hat{W}_i)}{\sum_{t=0}^{T-1} \gamma_t(i)}, \qquad (3.5)

which are the standard equations for the weighted least-squares method (Jordan & Jacobs, 1992). In these equations, $\gamma(i)$ represents the vector of $\gamma_t(i)$, used as a diagonal weighting. These new parameters are used to recompute the likelihood in equation 3.1. Although we describe a batch algorithm for simplicity, on-line estimation schemes (Krishnamurthy & Moore, 1993) can be used in the same way (see Jordan & Jacobs, 1994, for details).

4 Simulation of Arm Tracking While Manipulating Objects

4.1 Learning and Control of Different Objects. To examine motor learning and control in the MOSAIC model, we simulated a task in which the hand had to track a given trajectory (30 sec, shown in Figure 3b) while holding different unknown objects (see Figure 2). Each object had a particular mass, damping, and spring constant ($M$, $B$, $K$), illustrated in Figure 2. The manipulated object was switched periodically every 5 seconds among three different objects $a$, $b$, and $c$, in this order. The task was exactly the same as that of Gomi and Kawato (1993). We assumed the existence of a perfect inverse dynamics model of the arm for the control of reaching movements. The controller therefore needs to learn the motor commands that compensate for the dynamics of the different objects. The sampling rate was 1000 Hz, and a trial was a 30-second run. The inputs to the forward models were the motor command and the position and velocity of the hand. Each forward model outputs a prediction of the acceleration of the hand at the next time step. The inputs to the inverse model were the hand's position, velocity,
Figure 2: Schematic illustration of the simulation experiment, in which the arm makes reaching movements while grasping different objects with mass M, damping B, and spring K. The object properties are shown in the table.

Object   M (kg)   B (N m^-1 s)   K (N m^-1)
a        1.0      2.0            8.0
b        5.0      7.0            4.0
c        8.0      3.0            1.0
d        2.0      10.0           1.0
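Each object in the table obeys the linear dynamics M x'' + B x' + K x = u. A simple Euler integration at the paper's 1000 Hz sampling rate can be sketched as follows (our own helper, not part of the original simulation code):

```python
import numpy as np

def simulate_object(M, B, K, u, dt=0.001, x0=0.0, v0=0.0):
    """Semi-implicit Euler simulation of one manipulated object,
    M*x'' + B*x' + K*x = u, with dt matching the 1000 Hz sampling
    rate used in the simulations."""
    x, v = x0, v0
    xs = []
    for u_t in u:
        a = (u_t - B * v - K * x) / M  # acceleration from the dynamics
        v += a * dt
        x += v * dt
        xs.append(x)
    return np.array(xs)
```

For a constant command, the position settles at the static equilibrium u/K; for example, object a (M=1.0, B=2.0, K=8.0) driven with u=8.0 settles near x=1.0.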
and acceleration of the desired state. Each inverse model outputs a motor command. In the first simulation, we used the gradient-based method to train three forward-inverse model pairs (modules): the same number of modules as objects. The learning rate for all modules was set to 0.01. In each module, both the forward ($\varphi$ in equation 2.1) and inverse ($\psi$ in equation 2.7) models were implemented as linear neural networks. The scaling parameter $\sigma$ was tuned by hand over the course of the simulation. The use of linear networks allowed $M$, $B$, and $K$ to be estimated from the forward- and inverse-model weights.³ Let $M^F_j$, $B^F_j$, $K^F_j$ be the estimates from the $j$th forward model and $M^I_j$, $B^I_j$, $K^I_j$ be the estimates from the $j$th inverse model. Figure 3a shows the evolution of the forward-model estimates $M^F_j$, $B^F_j$, $K^F_j$ for the three modules during learning. The three modules started from randomly selected initial conditions (open arrows) and converged over 200 trials to very good approximations⁴ of the three objects (filled arrows), as

³The object dynamics can be described by a linear differential equation whose coefficients are defined by M, B, and K. By comparing these coefficients with the weights of the linear network, we can calculate the estimated values of M, B, and K from both the forward and the inverse weights.
⁴The weights were assumed to have converged when the predicted acceleration error per point became less than 0.001.
Figure 3: (a) Learning of the three forward models, represented in terms of the parameters of the three corresponding objects. (b) Responsibility signals from the three modules (top three panels) and tracking performance (bottom panel; gray is the desired trajectory) before (left) and after (right) learning.
Figure 5: Responsibility predictions based on contextual information of two-dimensional object shapes (top three traces) and the corresponding acceleration error of control induced by an unusual shape-dynamic pairing (bottom trace).
Table 1: Learned Object Characteristics.

Module   M_j^F    B_j^F    K_j^F    M_j^I    B_j^I    K_j^I
1        1.0020   2.0080   8.0000   1.0711   2.0080   8.0000
2        5.0071   7.0040   4.0000   5.0102   6.9554   4.0089
3        8.0029   3.0010   0.9999   7.8675   3.0467   0.9527
shown in Table 1. Each of the three modules converged to objects a, b, and c, respectively. It is interesting to note that the estimates of the forward models are superior to those of the inverse models. This is because the inverse-model learning depends on how the modules are switched by the forward models. Figure 3b shows the performance of the model before (left) and after (right) learning. The top three panels show, from top to bottom, the responsibility signals of the first, second, and third modules, and the bottom panel shows the hand's actual and desired trajectories. At the start of learning, the three modules were equally poor and thus generated almost equal responsibilities (one-third each) and were equally involved in control. As a result, the overall control performance was poor, with large trajectory errors. By the end of learning, the three modules switched quite well, although they did show occasional inappropriate spikes even though the learning parameter was carefully chosen. If we compare these results with Figure 7 of Gomi and Kawato (1993) for the same task, the superiority of the MOSAIC model over the gating-expert architecture is apparent. Note that the number of free parameters (synaptic weights) is smaller in the current architecture than in the other. The difference in performance comes from two features of the basic architecture. First, in the gating architecture a single gating network tries to divide up the space, whereas in the MOSAIC model many forward models split the space. Second, in the gating architecture only a single control error is used to divide up the space, whereas multiple prediction errors are used simultaneously in the MOSAIC model.
The visual cue was a two-dimensional (2-D) shape represented as a 3×3 binary matrix, which was randomly placed at one of four possible locations on a 4×4 retinal matrix. The retinal matrix was used as the contextual input to the RP (a three-layer sigmoid feedforward network with 16 inputs, 8 hidden units, and 3 output units). During the course of learning, the combination of manipulated objects and visual cues was fixed as A-a, B-b, and C-c. After 200 iterations of the trajectory, the combination A-c was presented for the first time. Figure 5 plots the responsibility signals of the three modules (top three traces) and corresponding acceleration error
2214
M. Haruno, D. Wolpert, and M. Kawato
Figure 4: Visual representations of the objects (A, B, C) used as input to the responsibility predictors.
of the control induced by the conflicting visual-dynamic pairing (bottom trace). Until the onset of movement (time 0), A was always associated with the light mass a, and C was always associated with the heavy mass c. Prior to movement, when A was associated with c, the a module was switched on by the visual contextual information, but soon after the movement was initiated, the likelihood signal from the forward model's prediction dominated, and the c module was properly selected. This demonstrates how the interplay of feedforward and feedback selection processes can lead to on-line reselection of appropriate control modules.

4.3 Feedback Selection by HMM. We will show here that the MOSAIC model with the EM algorithm was able to learn both the forward and inverse models and the transition matrix. In particular, when the true transitions were made cyclic or asymmetric, the estimated transition matrix closely captured the true transitions. For example, when we tested a hidden state dynamics generated by the transition matrix $P$, the EM algorithm converged to the following transition matrix $\hat{P}$, which was very close to the true $P$:

$$P = \begin{pmatrix} 0.99 & 0.01 & 0 \\ 0 & 0.99 & 0.01 \\ 0.01 & 0 & 0.99 \end{pmatrix}, \qquad
\hat{P} = \begin{pmatrix} 0.9816 & 0.0177 & 0.0007 \\ 0.0021 & 0.9882 & 0.0097 \\ 0.0079 & 0.0000 & 0.9920 \end{pmatrix}.$$
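The cyclic structure of $P$ can be checked with a small simulation: sample a long state sequence from $P$ and re-estimate the transition probabilities by counting. This is a minimal sanity check that assumes the states are directly observed; the paper needs EM precisely because the states are hidden.

```python
import numpy as np

# True cyclic transition matrix from the text.
P = np.array([[0.99, 0.01, 0.00],
              [0.00, 0.99, 0.01],
              [0.01, 0.00, 0.99]])

def simulate(P, steps, rng):
    """Sample a state sequence from the Markov chain defined by P
    (inverse-CDF sampling on each row)."""
    cum = P.cumsum(axis=1)
    s, seq = 0, [0]
    for _ in range(steps):
        s = int((rng.random() < cum[s]).argmax())
        seq.append(s)
    return seq

def count_estimate(seq, n):
    """Row-normalized transition counts -- valid only because the states
    are observed here (the paper uses EM because they are hidden)."""
    C = np.zeros((n, n))
    for a, b in zip(seq[:-1], seq[1:]):
        C[a, b] += 1.0
    return C / C.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
P_hat = count_estimate(simulate(P, 60_000, rng), 3)
# P_hat approaches P as the sequence length grows.
```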
Table 2: Acceleration Error Rates: HMM-MOSAIC/Gradient-MOSAIC with Two Types of Initial Conditions.

                          Switching Period (Steps)
Initial Conditions S.D.     5      25     50     100    200
20%                       0.998  0.887  1.031  0.636  0.443
40%                       0.855  0.676  1.147  0.598  0.322
Now, we compare the gradient-based (see equation 2.6) and EM (see equation 3.4) learning methods under the same conditions as Figure 2. We tested five switching periods of 5, 25, 50, 100, and 200 steps for a total trial length of 600 steps. Separate simulations were run in which the initial weights of the inverse and forward models were generated by random processes with two different standard deviations.5 The scaling parameter s in the gradient-based method was fixed throughout the simulation, whereas the EM method adjusted this parameter automatically. Fifty repetitions of 20 trials were performed for each pair of switching period and initial weight variance. Table 2 shows the ratio of the final acceleration errors of the two learning methods: the error of the EM algorithm divided by the error of the gradient-based method. The EM version of the MOSAIC model (HMM-MOSAIC) provides more robust estimation in two senses. First, HMM-MOSAIC achieved a precise estimation for a broad range of switching periods. In particular, it significantly outperforms gradient-MOSAIC for infrequent switching. This is probably because gradient-MOSAIC does not take the switching dynamics into account and switches too frequently. Figure 6 exemplifies typical changes in responsibility and actual acceleration for the two learning methods. It confirms the more stable switching and better performance of HMM-MOSAIC. The second superiority of HMM-MOSAIC is its better performance when learning starts from worse (larger-variance) initial values. These two results, with the additional benefit that the scaling parameter s need not be hand-tuned, confirm that HMM-MOSAIC is more suitable for learning and detecting the switching dynamics when the true switching is infrequent (i.e., less often than every 100 steps).

4.4 Generalization to a Novel Object. A natural question regarding the MOSAIC model is how many modules need to be used.
In other words, what happens if the number of objects exceeds the number of modules, or an already trained MOSAIC model is presented with an unfamiliar object? To examine this, HMM-MOSAIC, trained with the four objects a, b, c, and

5 The standard deviations of the initial conditions were 20% and 40%, respectively. In both cases, the mean was equal to the true parameters.
Figure 6: Actual accelerations and responsibilities of HMM-MOSAIC and gradient-MOSAIC. (Top panels) Actual (black) and desired (blue) accelerations. (Bottom panels) Responsibilities of the three modules.
d in Figure 2, was presented with a novel object g (its (M, B, K) is (2.61, 5.76, 4.05)). Because the object dynamics can be represented in a three-dimensional parameter space and the four modules already acquired define
four vertices of a tetrahedron within the 3-D space, arbitrary object dynamics contained within the tetrahedron can be decomposed into a weighted average of the existing four forward modules (an internal division point of the four vertices). The theoretically calculated linear weights of g were (0.15, 0.20, 0.35, 0.30). Interestingly, each module's responsibility signal averaged over the trajectory was (0.14, 0.25, 0.33, 0.28), and the mean squared Euclidean deviation from the true value was 0.16. This combination produced an error comparable to the performance with the already learned objects. When tested on 50 different objects within the tetrahedron,6 the average Euclidean distance between the true internal division vector and the estimated responsibility vector was 0.17. Therefore, although the responsibility was computed in the space of acceleration and had no direct relation to the space of (M, B, K), the two vectors had very similar values. This demonstrates the flexible adaptation of the MOSAIC model to unknown objects, which originates from its probabilistic soft-switching mechanism. This is in sharp contrast to the hard switching of Narendra and Balakrishnan (1997), in which only one module can be selected at a time. Extrapolation outside the tetrahedron is likely to produce poorer performance, and it is an open research question under what conditions extrapolation could be achieved and how the existing modules relearn in such a situation.

5 Discussion

We have proposed a novel modular architecture, the MOSAIC model for motor learning and control. We have shown through simulation that this model can learn the controllers necessary for different contexts, such as object dynamics, and switch between them appropriately based on both sensory cues, such as vision, and feedback. The architecture also has the ability to generalize to novel dynamics by blending the outputs from its learned controllers.
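The internal-division (barycentric) weights of a point inside the tetrahedron can be obtained by solving a small linear system over the four learned (M, B, K) vertices. The vertex values below are hypothetical stand-ins (the paper's four object parameters are not restated in this passage); only g = (2.61, 5.76, 4.05) is taken from the text.

```python
import numpy as np

def barycentric_weights(vertices, g):
    """Decompose object dynamics g = (M, B, K) as an affine combination
    of four vertices (rows of `vertices`), solving [V^T; 1] w = [g; 1].
    Inside the tetrahedron, all weights are non-negative and sum to 1."""
    A = np.vstack([np.asarray(vertices, dtype=float).T,
                   np.ones(len(vertices))])
    b = np.append(np.asarray(g, dtype=float), 1.0)
    return np.linalg.solve(A, b)

# Hypothetical (M, B, K) vertices chosen so that g lies inside them.
V = [(1.0, 3.0, 8.0), (2.0, 8.0, 1.0), (4.0, 5.0, 6.0), (3.0, 7.0, 2.0)]
w = barycentric_weights(V, g=(2.61, 5.76, 4.05))
```

The responsibility vector the model produces plays the same role as `w`: a soft blend of the four learned modules that reconstructs the novel dynamics.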
The architecture can also correct on-line for inappropriate selection based on erroneous visual cues. We introduced the Markovian assumption primarily for computational efficiency. However, this can also be viewed as a model of the way context is estimated in human motor control. Vetter and Wolpert (2000b) have examined context estimation in humans and shown that the central nervous system (CNS) estimates the current context using information both from prior knowledge of how the context might evolve over time and from the comparison of predicted and actual sensory feedback. In one experiment, subjects learned to point to a target under two different visuomotor contexts, for example, rotated and veridical visual feedback. The contexts were
6 Since responsibility signals always take a positive value, we will focus on the inside of the tetrahedron.
chosen so that at the start of the movement, subjects could not tell which context they were acting in. After learning, the majority of the movement was made without visual feedback. However, during these movements, brief instances of visual feedback were provided that corresponded to one of the two contexts. The relationship between the feedback provided and the subjects' estimate of the current movement context was examined by measuring where they pointed relative to where they should point for each of the different contexts. The results showed that in the presence of uncertainty about the current context, subjects produced an estimate of the current context intermediate to the two learned contexts rather than simply choosing the more likely context. There were also systematic relationships between the timing and nature of the feedback provided and the final context estimate. A two-state HMM provided a very good fit to these data. This model incorporates an estimate of the context transition matrix and of the sensory feedback process (visual location corrupted by noise), and optimally estimates the current context. Vetter and Wolpert also studied whether the CNS could model the prior probability of context transitions. This was examined in cases where the context changed either discretely between movements (Vetter & Wolpert, 2000b) or continuously during a movement (Vetter & Wolpert, 2000a). In both cases, when feedback was withheld, subjects' estimates of context showed behavior demonstrating that the CNS was using knowledge about context transitions to estimate the context in the absence of feedback. Taken together, these results support the conclusion that the CNS uses a model of context transitions and compares predicted and actual sensory feedback to estimate the current context. Although their HMM model does not contain inverse models and responsibility predictors, it is very close in spirit to our forward models.
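The context estimate computed by such a two-state HMM can be sketched as a forward (Bayesian) filter that combines a transition prior with the likelihood of noisy sensory feedback. All parameter values below are illustrative assumptions, not quantities fitted to Vetter and Wolpert's data.

```python
import math

def forward_step(prior, obs, means, sigma, T):
    """One step of the HMM forward filter: propagate the two-context
    belief through the transition matrix T, then weight by the Gaussian
    likelihood of the noisy observation and normalize."""
    # Predict: belief after a possible context transition.
    pred = [sum(T[j][i] * prior[j] for j in range(2)) for i in range(2)]
    # Update: multiply by the observation likelihood, then normalize.
    lik = [math.exp(-0.5 * ((obs - m) / sigma) ** 2) for m in means]
    post = [p * l for p, l in zip(pred, lik)]
    z = sum(post)
    return [p / z for p in post]

T = [[0.95, 0.05], [0.05, 0.95]]   # sticky context transitions (assumed)
belief = [0.5, 0.5]                # initially uncertain between contexts
for obs in [0.9, 1.1, 1.0]:        # feedback near context 1's mean of 1.0
    belief = forward_step(belief, obs, means=(0.0, 1.0), sigma=0.5, T=T)
```

Because the belief is a probability rather than a hard label, the filter naturally produces intermediate context estimates under uncertainty, the behavior observed in the subjects.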
The HMM was also tested as a model of human context estimation when either the context transition or sensory feedback noise was removed. The result is compatible with our simulation in that neither could explain the experimental data. One may ask how the MOSAIC model works when the context evolution has a hierarchical structure of depth greater than one. A natural extension of the MOSAIC model is to a hierarchical structure that consists of several layers, each comprising a MOSAIC model. At each layer, the higher-level MOSAIC model interacts with its subordinate MOSAIC models through access to the lower-level MOSAIC model’s responsibility signals, representing which module should be selected in the lower level given the current behavioral situation. The higher level generates descending commands, which represent the prior probability of the responsibility signal for the lower-level modules, and thus prioritizes which lower-level modules should be selected. This architecture could provide a general framework that is capable of sequential motor learning, chunking of movement patterns, and autonomous symbol formation (Haruno, Wolpert, & Kawato, 1999a).
6 Conclusion

We have described an extension and evaluation of the MOSAIC model. Learning was augmented by the EM algorithm for HMMs. Unlike gradient descent, it was found to be robust to the initial starting conditions and learning parameters. Second, simulations of an object manipulation task demonstrate that our model can learn to manipulate multiple objects and to switch among them appropriately. Moreover, after learning, the model showed generalization to novel objects whose dynamics lay within the tetrahedron of already learned dynamics. Finally, when each of the dynamics was associated with a particular object shape, the model was able to select the appropriate controller before movement execution. When presented with a novel shape-dynamic pairing, inappropriate activation of modules was observed, followed by on-line correction. In conclusion, the proposed MOSAIC model for motor learning and control, like the human motor system, can learn multiple tasks and shows generalization to new tasks and an ability to switch between tasks appropriately.

Acknowledgments

We thank Zoubin Ghahramani for helpful discussions on the Bayesian formulation of this model. This work was partially supported by Special Coordination Funds for promoting Science and Technology at the Science and Technology Agency of the Japanese government, the Human Frontiers Science Program, the Wellcome Trust, and the Medical Research Council.

References

Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat., 41, 164–171.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Fraser, A., & Dimitriadis, A. (1993). Forecasting probability densities by using hidden Markov models with mixed states. In A. Weigend & N.
Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 265–282). Reading, MA: Addison-Wesley.
Gomi, H., & Kawato, M. (1993). Recognition of manipulated objects by motor learning with modular architecture networks. Neural Networks, 6, 485–497.
Haruno, M., Wolpert, D., & Kawato, M. (1999a). Emergence of sequence-specific neural activities with hierarchical multiple paired forward and inverse models. Neuroscience Meeting Abstract, p. 1910 (760.2).
Haruno, M., Wolpert, D., & Kawato, M. (1999b). Multiple paired forward-inverse models for human motor learning and control. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 31–37). Cambridge, MA: MIT Press.
Huang, X., Ariki, Y., & Jack, M. (1990). Hidden Markov models for speech recognition. Edinburgh: Edinburgh University Press.
Jacobs, R., Jordan, M., & Barto, A. (1991). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science, 15, 219–250.
Jacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M., & Jacobs, R. (1992). Hierarchies of adaptive experts. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 985–993). San Mateo, CA: Morgan Kaufmann.
Jordan, M., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Jordan, M., & Rumelhart, D. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354.
Juang, B., & Rabiner, L. (1985). Mixture autoregressive hidden Markov models for speech signals. IEEE Trans. Acoust. Speech Signal Processing, ASSP-33(6), 1404–1413.
Kawato, M. (1990). Feedback-error-learning neural network for supervised learning. In R. Eckmiller (Ed.), Advanced neural computers (pp. 365–372). Amsterdam: North-Holland.
Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural network model for the control and learning of voluntary movements. Biol. Cybern., 56, 1–17.
Krishnamurthy, V., & Moore, J. (1993). On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure. IEEE Transactions on Signal Processing, 41(8), 2557–2573.
Narendra, K., & Balakrishnan, J. (1997). Adaptive control using multiple models. IEEE Transactions on Automatic Control, 42(2), 171–187.
Narendra, K., Balakrishnan, J., & Ciliz, M. (1995). Adaptation and learning using multiple models, switching and tuning. IEEE Contr. Syst. Mag., 37–51.
Pawelzik, K., Kohlmorgen, J., & Müller, K. (1996).
Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8, 340–356.
Poritz, A. (1982). Linear predictive hidden Markov models and the speech signal. In Proc. ICASSP 82 (pp. 1291–1294).
Vetter, P., & Wolpert, D. (2000a). The CNS updates its context estimate in the absence of feedback. NeuroReport, 11, 3783–3786.
Vetter, P., & Wolpert, D. (2000b). Context estimation for sensorimotor control. Journal of Neurophysiology, 84, 1026–1034.
Wolpert, D., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11, 1317–1329.

Received August 23, 1999; accepted November 22, 2000.
LETTER
Communicated by Laurence Abbott
Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning

Rajesh P. N. Rao
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, U.S.A.

Terrence J. Sejnowski
Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92037, U.S.A.

A spike-timing-dependent Hebbian mechanism governs the plasticity of recurrent excitatory synapses in the neocortex: synapses that are activated a few milliseconds before a postsynaptic spike are potentiated, while those that are activated a few milliseconds after are depressed. We show that such a mechanism can implement a form of temporal difference learning for prediction of input sequences. Using a biophysical model of a cortical neuron, we show that a temporal difference rule used in conjunction with dendritic backpropagating action potentials reproduces the temporally asymmetric window of Hebbian plasticity observed physiologically. Furthermore, the size and shape of the window vary with the distance of the synapse from the soma. Using a simple example, we show how a spike-timing-based temporal difference learning rule can allow a network of neocortical neurons to predict an input a few milliseconds before the input's expected arrival.

1 Introduction
Neocortical circuits are dominated by massive excitatory feedback: more than 80% of the synapses made by excitatory cortical neurons are onto other excitatory cortical neurons (Douglas, Koch, Mahowald, Martin, & Suarez, 1995). Why is there such massive recurrent excitation in the neocortex, and what is its role in cortical computation? Previous modeling studies have suggested a role for excitatory feedback in amplifying feedforward inputs (Douglas et al., 1995; Suarez, Koch, & Douglas, 1995; Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Somers, Nelson, & Sur, 1995; Chance, Nelson, & Abbott, 1999). Recently, however, it has been shown that recurrent excitatory connections between cortical neurons are modified according to a spike-timing-dependent temporally asymmetric Hebbian learning rule: synapses that are activated slightly before the cell fires are strengthened whereas those that

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2221–2237 (2001)
2222
Rajesh P. N. Rao and Terrence J. Sejnowski
are activated slightly after are weakened (Levy & Steward, 1983; Markram, Lübke, Frotscher, & Sakmann, 1997; Gerstner, Kempter, van Hemmen, & Wagner, 1996; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Sejnowski, 1999). Information regarding the postsynaptic activity of the cell is conveyed back to the dendritic locations of synapses by backpropagating action potentials from the soma (Stuart & Sakmann, 1994). In this article, we explore the hypothesis that recurrent excitation in neocortical circuits subserves the function of prediction and generation of temporal sequences (for related ideas, see Jordan, 1986; Elman, 1990; Minai & Levy, 1993; Montague & Sejnowski, 1994; Abbott & Blum, 1996; Rao & Ballard, 1997; Barlow, 1998; Westerman, Northmore, & Elias, 1999). In particular, we show that a temporal-difference-based learning rule for prediction (Sutton, 1988), when applied to backpropagating action potentials in dendrites, reproduces the temporally asymmetric window of Hebbian plasticity obtained in physiological experiments (see section 3). We examine the stability of the learning rule in section 4 and discuss possible biophysical mechanisms for implementing this rule in section 5. We also provide a simple example demonstrating how such a learning mechanism may allow cortical networks to learn to predict their inputs using recurrent excitation. The model predicts that cortical neurons may employ different temporal windows of plasticity at different dendritic locations to allow them to capture correlations between pre- and postsynaptic activity at different timescales (see section 6). A preliminary report of this work appeared as Rao and Sejnowski (2000).

2 Temporal Difference Learning
To predict input sequences accurately, the recurrent excitatory connections in a given network need to be adjusted such that the appropriate set of neurons is activated at each time step. This can be achieved by using a temporal difference (TD) learning rule (Sutton, 1988; Montague & Sejnowski, 1994). In this paradigm of synaptic plasticity, an activated synapse is strengthened or weakened based on whether the difference between two temporally separated predictions is positive or negative. This minimizes the errors in prediction by ensuring that the prediction generated by the neuron after synaptic modification is closer to the desired value than before. The simplest example of a TD learning rule arises in the problem of predicting a scalar quantity $z$ using a neuron with synaptic weights $w(1), \ldots, w(k)$ (represented as a vector $\mathbf{w}$). The neuron receives as presynaptic input the sequence of vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m$. The output of the neuron at time $t$ is assumed to be given by $P_t = \sum_i w(i)\, x_t(i)$. The goal is to learn a set of synaptic weights such that the prediction $P_t$ is as close as possible to the target $z$. According to the temporal difference (TD(0)) learning rule (Sutton, 1988), the weights are changed at time $t$ by an amount given by:
$$\Delta \mathbf{w}_t = \alpha\,(P_{t+1} - P_t)\,\mathbf{x}_t, \qquad (2.1)$$
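Equation 2.1 can be exercised on a toy sequence (a minimal sketch; the learning rate, the epoch loop, and the input vectors below are illustrative assumptions, not the paper's simulation):

```python
import numpy as np

def td0_train(xs, z, alpha=0.1, epochs=200):
    """TD(0) rule of equation 2.1: Delta w_t = alpha * (P_{t+1} - P_t) * x_t,
    with the target appended as the final prediction, P_{m+1} = z."""
    w = np.zeros(len(xs[0]))
    for _ in range(epochs):
        P = [float(w @ x) for x in xs] + [z]   # predictions, then target
        for t, x in enumerate(xs):
            w = w + alpha * (P[t + 1] - P[t]) * x
    return w

# Toy sequence of two orthogonal presynaptic input vectors, target z = 1.
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w = td0_train(xs, z=1.0)
# Each prediction w @ x_t is driven toward z.
```

Because each update moves $P_t$ toward $P_{t+1}$ and the final prediction toward $z$, the error propagates backward through the sequence over repeated epochs, which is the property exploited in the next section.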
Spike-Timing-Dependent Plasticity
Figure 1: Model neuron response properties. (A) Response of a model neuron to a 70 pA current pulse injected into the soma for 900 milliseconds. (B) Response of the same model neuron to Poisson-distributed excitatory and inhibitory synaptic inputs at random locations on the dendrite. (C) Example of a backpropagating action potential in the dendrite of the model neuron as compared to the corresponding action potential in the soma (enlarged from the initial portion of the trace in B).
where $\alpha$ is a learning rate or gain parameter and $P_{m+1} = z$. Note that in such a learning paradigm, synaptic plasticity is governed by the TD in postsynaptic activity at time instants $t+1$ and $t$ in conjunction with presynaptic activity $\mathbf{x}_t$ at time $t$. We use this paradigm of learning in the next section to model spike-timing-dependent plasticity.

3 Modeling Spike-Timing-Dependent Plasticity as TD Learning
In order to ascertain whether spike-timing-dependent temporally asymmetric plasticity in cortical neurons can be interpreted as a form of TD learning, we used a two-compartment model of a cortical neuron consisting of a dendrite and a soma-axon compartment (see Figure 1). The compartmental model was based on a previous study that demonstrated the
ability of such a model to reproduce a range of cortical response properties (Mainen & Sejnowski, 1996). Four voltage-dependent currents and one calcium-dependent current were simulated (as in Mainen & Sejnowski, 1996): fast Na⁺, $I_{Na}$; fast K⁺, $I_{Kv}$; slow noninactivating K⁺, $I_{Km}$; high-voltage-activated Ca²⁺, $I_{Ca}$; and a calcium-dependent K⁺ current, $I_{KCa}$. The following active conductance densities were used in the soma-axon compartment (in pS/μm²): $\bar{g}_{Na} = 40{,}000$ and $\bar{g}_{Kv} = 1400$. For the dendritic compartment, we used $\bar{g}_{Na} = 20$, $\bar{g}_{Ca} = 0.2$, $\bar{g}_{Km} = 0.1$, and $\bar{g}_{KCa} = 3$, with leak conductance 33.3 μS/cm² and specific membrane resistance 30 kΩ·cm². The presence of voltage-activated sodium channels in the dendrite allowed backpropagation of action potentials from the soma into the dendrite, as shown in Figure 1C. Conventional Hodgkin-Huxley-type kinetics were used for all currents (integration time step = 25 μs, temperature = 37°C). Ionic currents $I$ were calculated using the ohmic equation,

$$I = \bar{g}\,A^x B\,(V - E), \qquad (3.1)$$
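Equation 3.1 is a pointwise ohmic relation and can be evaluated directly; the gating values in the example below are arbitrary illustrative numbers, not the model's simulated state.

```python
def ionic_current(g_bar, A, B, V, E, x=3):
    """Ohmic current of equation 3.1: I = g_bar * A**x * B * (V - E).
    A and B are activation/inactivation gating variables in [0, 1];
    x is the order of the activation kinetics."""
    return g_bar * (A ** x) * B * (V - E)

# E.g., a sodium-like current at V = -20 mV with E_Na = 60 mV
# (gating values 0.6 and 0.5 are made up for illustration):
i_na = ionic_current(g_bar=40_000e-12, A=0.6, B=0.5, V=-20.0, E=60.0)
# Negative sign: inward (depolarizing) current below the reversal potential.
```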
where $\bar{g}$ is the maximal ionic conductance density, $A$ and $B$ are activation and inactivation variables, respectively ($x$ denotes the order of kinetics; see Mainen & Sejnowski, 1996), and $E$ is the reversal potential for the given ion species ($E_K = -90$ mV, $E_{Na} = 60$ mV, $E_{Ca} = 140$ mV, $E_{leak} = -70$ mV). For all compartments, the specific membrane capacitance was 0.75 μF/cm². Two key parameters governing the response properties of the model neuron are (Mainen & Sejnowski, 1996): the ratio of axo-somatic area to dendritic membrane area (r) and the coupling resistance between the two compartments (k). For the simulations, we used the values r = 150 (with an area of 100 μm² for the soma-axon compartment) and a coupling resistance of k = 8 MΩ. Poisson-distributed synaptic inputs to the dendrite (see Figure 1B) were simulated using alpha-function-shaped (Koch, 1999) current pulse injections (time constant = 5 ms) at Poisson intervals with a mean presynaptic firing frequency of 3 Hz. To study plasticity, excitatory postsynaptic potentials (EPSPs) were elicited at different time delays with respect to postsynaptic spiking by presynaptic activation of a single excitatory synapse located on the dendrite. Synaptic currents were calculated using a kinetic model of synaptic transmission with model parameters fitted to whole-cell recorded AMPA currents (see Destexhe, Mainen, & Sejnowski, 1998, for more details). Synaptic plasticity was simulated by incrementing or decrementing the value of the maximal synaptic conductance by an amount proportional to the TD in the postsynaptic membrane potential at time instants $t + \Delta t$ and $t$ for presynaptic activation at time $t$. The delay parameter $\Delta t$ was set to 10 ms to yield results consistent with previous physiological experiments (Markram et al., 1997; Bi & Poo, 1998). Presynaptic input to the model neuron was paired with postsynaptic spiking by injecting a depolarizing current pulse (10 ms, 200 pA) into
the soma. Changes in synaptic efficacy were monitored by applying a test stimulus before and after pairing and recording the EPSP evoked by the test stimulus. Figure 2 shows the results of pairings in which the postsynaptic spike was triggered 5 ms after and 5 ms before the onset of the EPSP, respectively. While the peak EPSP amplitude was increased 58.5% in the former case, it was decreased 49.4% in the latter case, qualitatively similar to experimental observations (Markram et al., 1997; Bi & Poo, 1998). The critical window for synaptic modifications in the model depends on the parameter $\Delta t$ as well as the shape of the backpropagating action potential. This window of plasticity was examined by varying the time interval between presynaptic stimulation and postsynaptic spiking (with $\Delta t = 10$ ms). As shown in Figure 2C, changes in synaptic efficacy exhibited a highly asymmetric dependence on spike timing, similar to physiological data (Markram et al., 1997). Potentiation was observed for EPSPs that occurred between 1 and 12 ms before the postsynaptic spike, with maximal potentiation at 6 ms. Maximal depression was observed for EPSPs occurring 6 ms after the peak of the postsynaptic spike, and this depression gradually decreased, approaching zero for delays greater than 10 ms. As in rat neocortical neurons (Markram et al., 1997), Xenopus tectal neurons (Zhang et al., 1998), and cultured hippocampal neurons (Bi & Poo, 1998), a narrow transition zone (roughly 3 ms in the model) separated the potentiation and depression windows.

4 Stability of the Learning Rule
A crucial question regarding the spike-based Hebbian learning rule described above is whether it produces a stable set of weights for a given training set of inputs. In the case of the conventional Hebbian learning rule, which only prescribes increases in synaptic weights based on pre- and postsynaptic correlations, numerous methods have been suggested to ensure stability, such as weight normalization and weight decay (see Sejnowski, 1977, and Montague & Sejnowski, 1994, for reviews). The classical TD learning rule is self-stabilizing because weight modications are dependent on the error in prediction ( Pt C 1 ¡ P t ) , which can be positive or negative. Thus, when the weights w become large, the predictions Pt tend to become large compared to Pt C 1 (e.g., at the end of the training sequence). This results in a decrease in synaptic strength, as needed to ensure stability. In the case of the spike-based learning rule, convergence to a stable set of weights depends on both the biophysical properties of the neuron and the TD parameter D t. To see this, consider the example in Figure 3, where a presynaptic spike occurs at time tpre and causes a backpropagating action potential (AP) at time tBPAP in a postsynaptic model neuron (model neuron parameters were the same as those given in section 3). Using a xed D t D 10 milliseconds, the synaptic conductance initially increases because the TD error ( Ptpre C D t ¡ Ptpre ) is positive. The increase in synaptic conduc-
2226
Rajesh P. N. Rao and Terrence J. Sejnowski
A
B before
40 mV
pairing
10 mV
after
15 ms
15 ms
15 ms
15 ms
10 mV
after
40 mV
pairing
15 ms
15 ms
S2
Change in EPSP Amplitude (%)
S1
C
10 mV
10 mV
before
15 0 10 0 50 0 50 10 0 40
20
0
20
40
Time of Synaptic Input (ms) Figure 2: Synaptic plasticity in a model neocortical neuron. (A) EPSP in the model neuron evoked by a presynaptic spike (S1) at an excitatory synapse (“before”). Pairing this presynaptic spike with postsynaptic spiking after a 5 ms delay (“pairing”) induces long-term potentiation (“after”). (B) If presynaptic stimulation (S2) occurs 5 ms after postsynaptic ring, the synapse is weakened, resulting in a corresponding decrease in peak EPSP amplitude. (C) Temporally asymmetric window of synaptic plasticity obtained by varying the delay between pre- and postsynaptic spiking (negative delays refer to presynaptic before postsynaptic spiking).
tance causes the postsynaptic neuron to re earlier, which in turn causes an increase in the TD error, until Ptpre C D t reaches a value close to Ptpre . This happens when the time of the somatic AP is almost equal to tpre and the
Spike-Timing-Dependent Plasticity
2227
40 mV
BPAP before Learning BPAP after Learning
A
t pre
t pre t BPAP
D t
25 ms t BPAP= t 0
D t
TD Error (mV)
Synaptic Weight ( m S)
40 20 0
0.007
0.005 0.04
60
B
0.02 0 D t
0
D t
20
C
10
D t
D t
0 0
0.007 0.005 50 100 0 50 100 Number of Trials Number of Trials
Figure 3: Stability of spike-timing based temporal difference learning. (A) A presynaptic spike at time tpre results in a backpropagating action potential (BPAP) a few milliseconds later at time tBPAP , causing an increase in synaptic strength due to spike-timing-dependent plasticity. The TD parameter D t was set to 10 milliseconds. After 100 trials of learning (pre- and postsynaptic pairing), the TD error is reduced to zero, and the maximal synaptic conductance converges to a stable value of 8.1 nS. The postsynaptic BPAP now occurs at an earlier time t 0 compared to its time of occurrence before learning. (B) When a shorter D t of 5 ms is used (with the same initial conditions as in A), the synaptic strength continues to grow without bound after 100 trials because the TD error remains positive. A burst of two spikes is elicited due to an unrealistically large synaptic conductance. (C) An upper bound of 8 nS on the maximal synaptic conductance prevents unbounded synaptic growth and ensures that a single spike is elicited.
backpropagating AP occurs slightly later at time tBPAP = t0 (Figure 3A, second graph). The synaptic conductance then remains stable for this pair of pre- and postsynaptic spikes (Figure 3A, right-most graph). On the other hand, if Δt is much smaller than the width of the backpropagating AP, the maximal synaptic conductance may grow without bound, as shown in Figure 3B (right-most graph). In this case (Δt = 5 ms), the backpropagating AP does not peak soon enough after the presynaptic spike, causing the TD error to remain positive. As a result, the synaptic conductance does not converge to a stable value, and multiple spikes, in the form of a burst, are elicited (see Figure 3B, second graph). One solution to the problem of unbounded growth in synaptic conductance is to place an upper bound on the synaptic strength, as suggested by Abbott and Song (1999) and Gerstner et al. (1996). In this case, although the TD error remains positive, the synaptic conductance remains clamped at the upper bound value, and only a single spike is elicited (see Figure 3C). The use of an upper bound is partially supported by physiological experiments showing a dependence of spike-timing-based synaptic plasticity on initial synaptic strength: long-term potentiation (LTP) was found to occur mostly in weak synapses, while connections that already had large values for their synaptic strength consistently failed to show LTP (Bi & Poo, 1998). In summary, the stability of the TD learning rule for spike-timing-dependent synaptic plasticity depends crucially on whether the temporal window parameter Δt is comparable in magnitude to the width of the backpropagating AP at the location of the synapse. An upper bound on the maximal synaptic conductance may be required to ensure stability in general (Gerstner et al., 1996; Abbott & Song, 1999; Song, Miller, & Abbott, 2000). Such a saturation constraint is partially supported by experimental data (Bi & Poo, 1998). An alternative approach that might be worth considering is to explore a learning rule that uses a continuous form of the TD error, where, for example, an average of postsynaptic activity is subtracted from the current activity (Montague & Sejnowski, 1994; Doya, 2000). Such a rule may offer better stability properties than the discrete TD rule that we have used, although other parameters, such as the window over which average activity is computed, may still need to be chosen carefully.

5 Biophysical Mechanisms
An interesting question is whether a biophysical basis can be found for the TD learning model described above. Neurophysiological and imaging studies suggest a role for dendritic Ca2+ signals in the induction of spike-timing-dependent LTP and LTD (long-term depression) in hippocampal and cortical neurons (Magee & Johnston, 1997; Koester & Sakmann, 1998; Paulsen & Sejnowski, 2000). In particular, when an EPSP preceded a postsynaptic action potential, the Ca2+ transient in dendritic spines, where most excitatory synaptic connections occur, was observed to be larger than the sum of the Ca2+ signals generated by the EPSP or AP alone, causing LTP; on the other hand, when the EPSP occurred after the AP, the Ca2+ transient was found to be a sublinear sum of the signals generated by the EPSP or AP alone, resulting in LTD (Koester & Sakmann, 1998; Linden, 1999; Paulsen & Sejnowski, 2000). Possible sources contributing to the spinous Ca2+ transients include Ca2+ ions entering through NMDA receptors (Bliss & Collingridge, 1993; Koester & Sakmann, 1998), voltage-gated Ca2+ channels in the dendrites (Schiller, Schiller, & Clapham, 1998), and calcium-induced calcium release from intracellular stores (Emptage, 1999). How do the above experimental observations support the TD model? In a recent study (Franks, Bartol, Egelman, Poo, & Sejnowski, 1999), a Monte
Carlo simulation program MCell (Stiles, Bartol, Salpeter, Salpeter, & Sejnowski, 2000) was used to model the Ca2+ dynamics in dendritic spines following pre- and postsynaptic activity and to track the binding of Ca2+ to endogenous proteins. The influx of Ca2+ into a spine is governed by the rapid depolarization pulse caused by the backpropagating AP. The width of the backpropagating AP is much smaller than the time course of glutamate binding to the NMDA receptor. As a result, the dynamics of Ca2+ influx and binding to calcium-binding proteins such as calmodulin depends highly nonlinearly on the relative timing of presynaptic activation (with release of glutamate) and postsynaptic depolarization (due to the backpropagating AP). In particular, due to its kinetics, the binding protein calmodulin could serve as a differentiator of intracellular calcium concentration, causing synapses either to potentiate or depress depending on the spatiotemporal profile of the dendritic Ca2+ signal (Franks et al., 1999). As a consequence of these biophysical mechanisms, the change in synaptic strength depends, to a first approximation, on the time derivative of the postsynaptic activity, as postulated by the TD model.

6 Model Predictions
The shape and size of backpropagating APs at different locations in a cortical dendrite depend on the distance of the dendritic location from the soma. For example, as backpropagating APs propagate from the soma toward the distal parts of a dendrite, they tend to become broader than at more proximal locations. This is shown in Figure 4 for a reconstructed layer 5 neocortical neuron (Douglas, Martin, & Whitteridge, 1991) from cat visual cortex with ionic channels based on those used for this neuron in the study by Mainen and Sejnowski (1996). Since synaptic plasticity in our model depends on the TD in postsynaptic activity, the model predicts that synapses situated at different locations on a dendrite should exhibit different temporally asymmetric windows of plasticity. This is illustrated in Figure 4 for the reconstructed model neuron. Learning windows were calculated by applying the TD operator to the backpropagating APs with Δt = 2 ms. The model predicts that the window of plasticity for distal synapses should be broader than for proximal synapses. Broader windows would allow distal synapses to encode longer-timescale correlations between pre- and postsynaptic activity, whereas proximal synapses would encode correlations at shorter timescales due to their sharper learning windows. Thus, by distributing its synapses throughout its dendritic tree, a cortical neuron could in principle capture a wide range of temporal correlations between its inputs and its output. This in turn would allow a network of cortical neurons to predict sequences accurately and reduce possible ambiguities such as aliasing between learned sequences by tracking the sequence at multiple timescales (Rao & Ballard, 1997; Rao, 1999).
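The window calculation just described can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes a Gaussian-shaped BPAP (the two widths are invented for illustration) and applies the discrete TD operator with Δt = 2 ms, showing that a broader BPAP yields a temporally broader LTP lobe.

```python
import math

def bpap(t, sigma):
    """Gaussian sketch of a backpropagating AP peaking at t = 0 (sigma in ms).
    The Gaussian shape is an assumption of this sketch, not the model's waveform."""
    return math.exp(-t * t / (2.0 * sigma * sigma))

def td_window(delay, sigma, dt=2.0):
    """Discrete TD operator applied at a pre-post delay (ms): postsynaptic
    activity dt ms after the presynaptic spike minus activity at the spike.
    Negative delays place the presynaptic spike before the BPAP peak."""
    return bpap(delay + dt, sigma) - bpap(delay, sigma)

def window_halfwidth(sigma, dt=2.0):
    """Extent (ms) of delays over which the LTP lobe exceeds 10% of its peak."""
    delays = [d / 10.0 for d in range(-500, 501)]
    values = [td_window(d, sigma, dt) for d in delays]
    peak = max(values)
    above = [d for d, v in zip(delays, values) if v > 0.1 * peak]
    return max(above) - min(above)

# Hypothetical proximal (narrow) vs distal (broad) BPAP widths.
proximal = window_halfwidth(sigma=2.0)
distal = window_halfwidth(sigma=8.0)
assert distal > proximal          # broader BPAP -> broader learning window
assert td_window(-4.0, 2.0) > 0   # pre before BPAP peak -> potentiation
assert td_window(4.0, 2.0) < 0    # pre after BPAP peak -> depression
```

Because the TD operator differentiates the postsynaptic waveform, the sign flip at the BPAP peak and the scaling of the window with BPAP width both fall out directly, matching the qualitative prediction of the figure.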
Figure 4: Dependence of learning window on synaptic location. The traces in the middle column show size and shape of a backpropagating AP at different dendritic locations in a compartmental model of a reconstructed layer 5 neocortical neuron. The corresponding TD learning window for putative synapses at these dendritic locations is shown on the right. Note the gradual broadening of the learning window in time as one progresses from proximal to distal synapses.
The TD model also predicts asymmetries in size and shape between the LTP and LTD windows. For example, in Figure 4, the LTD window for the two apical synapses (labeled a and b) is much broader and shallower than the corresponding LTP window. Such an asymmetry between LTP and LTD has recently been reported for synapses in rat primary somatosensory cortex (S1) (Feldman, 2000). In particular, the range of time delays between pre- and postsynaptic spiking that induces LTD was found to be much longer than the range of delays that induces LTP, generating learning windows similar to the top two windows in Figure 4. One computational consequence of such a learning window is that synapses that elicit subthreshold EPSPs in a manner uncorrelated with postsynaptic spiking will become depressed over time. In rat primary somatosensory cortex, plucking one whisker but sparing its neighbor causes neuronal responses to the deprived whisker in layer II/III to become rapidly depressed. The asymmetry in the LTP/LTD learning windows provides an explanation for this phenomenon: spontaneously spiking inputs from plucked whiskers are uncorrelated with postsynaptic spiking, and therefore synapses receiving these inputs will become depressed (Feldman, 2000). Such a mechanism may contribute to experience-dependent depression of responses and related changes in the receptive field properties of neurons in other cortical areas as well.
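The net depression of uncorrelated inputs can be illustrated with a toy calculation. This sketch assumes exponential LTP and LTD lobes with the LTD lobe broader and shallower, as in the upper windows of Figure 4; all amplitudes and time constants are invented for illustration and are not fitted to data.

```python
import math
import random

# Illustrative asymmetric STDP window: sharp LTP lobe, broad shallow LTD lobe.
A_LTP, TAU_LTP = 1.0, 14.0   # hypothetical amplitude and time constant (ms)
A_LTD, TAU_LTD = 0.6, 34.0   # broader, shallower depression lobe

def dw(delta):
    """Weight change for a spike pair with delay delta = t_post - t_pre (ms)."""
    if delta >= 0:  # pre before post -> potentiation
        return A_LTP * math.exp(-delta / TAU_LTP)
    return -A_LTD * math.exp(delta / TAU_LTD)  # post before pre -> depression

random.seed(0)
# Uncorrelated pre/post firing: relative timing uniform over +/-100 ms.
n = 100_000
drift = sum(dw(random.uniform(-100.0, 100.0)) for _ in range(n)) / n
assert drift < 0  # net depression, because A_LTD * TAU_LTD > A_LTP * TAU_LTP
```

The sign of the drift is set by the areas of the two lobes (roughly A·τ for each), so any window whose depression lobe integrates to more than its potentiation lobe depresses uncorrelated inputs, consistent with the whisker-deprivation account above.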
7 Discussion
Our results suggest that spike-timing-dependent plasticity in neocortical synapses can be interpreted as a form of TD learning for prediction. To see how a network of model neurons can learn to predict sequences using such a learning mechanism, consider the simple case of two excitatory neurons N1 and N2 connected to each other, receiving inputs from two separate input neurons I1 and I2 (see Figure 5A). Model neuron parameters were the same as those used in section 3. Suppose input neuron I1 fires before input neuron I2, causing neuron N1 to fire (see Figure 5B). The spike from N1 results in a subthreshold EPSP in N2 due to the synapse S2. If input arrives from I2 between 1 and 12 ms after this EPSP and if the temporal summation of these two EPSPs causes N2 to fire, synapse S2 will be strengthened. The synapse S1, on the other hand, will be weakened because the EPSP due to N2 arrives a few milliseconds after N1 has fired. After several exposures to the I1–I2 training sequence, when I1 causes neuron N1 to fire, N1 in turn causes N2 to fire several milliseconds before input I2 occurs, due to the potentiation of the recurrent synapse S2 in previous trials (see Figure 5C). Input neuron I2 can thus be inhibited by the predictive feedback from N2 just before the occurrence of imminent input activity (marked by the asterisk in Figure 5C). This inhibition prevents input I2 from further exciting N2, thereby implementing a negative feedback–based predictive coding circuit (Rao & Ballard, 1999). Similarly, a positive feedback loop between neurons N1 and N2 is avoided because the synapse S1 was weakened in previous trials (see the arrows in Figures 5B and 5C, top row). Figure 6A depicts the potentiation and depression of the two synapses as a function of the number of exposures to the I1–I2 input sequence. The decrease in latency of the predictive spike elicited in N2 with respect to the timing of input I2 is shown in Figure 6B.
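The trial-by-trial fate of the two synapses can be caricatured with a toy update rule. Only the 0.03 μS saturation bound and the 40-trial count come from the text; the per-trial step sizes, the initial conductances, and the exponential form of the depression are assumptions of this sketch, not the conductance-based model actually simulated.

```python
# Toy per-trial update for the recurrent synapses S1 and S2 of Figure 5.
W_MAX = 0.03        # saturation constraint on maximal conductance (uS), from text
LTP_STEP = 0.004    # hypothetical potentiation increment per trial (uS)
LTD_FRAC = 0.15     # hypothetical fractional depression per trial

s1 = s2 = 0.01      # hypothetical initial conductances (uS)
for trial in range(40):                 # 40 trials, as in Figure 5C
    s2 = min(W_MAX, s2 + LTP_STEP)      # N1 fires before N2 -> S2 potentiates
    s1 *= (1.0 - LTD_FRAC)              # N2's EPSP arrives after N1 fired -> S1 depresses

assert s2 == W_MAX   # S2 clamps at the saturation bound
assert s1 < 0.001    # S1 is effectively eliminated
```

The qualitative outcome, S2 pinned at its upper bound while S1 decays toward zero, is the shape of the trajectories in Figure 6A.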
Notice that before learning, the spike occurs 3.2 ms after the occurrence of the input, whereas after learning, it occurs 7.7 ms before the input. Although the postsynaptic spike continues to occur shortly after the activation of synapse S2, this synapse is prevented from assuming larger values by a saturation constraint of 0.03 μS on the maximal synaptic conductance (see section 4 for a discussion of this constraint). In related work (Rao & Sejnowski, 2000), we have shown how such a learning mechanism can explain the development of direction selectivity in recurrent cortical networks, yielding receptive field properties similar to those observed in awake monkey V1. The precise biophysical mechanisms underlying spike-timing-dependent temporal difference learning remain unclear. However, as discussed in section 5, calcium fluctuations in dendritic spines are known to be strongly dependent on the timing between pre- and postsynaptic spikes. Such calcium transients may cause, via calcium-mediated signaling cascades, asymmetric synaptic modifications that are dependent, to a first approximation,
Figure 5: Learning to predict using spike-timing-dependent Hebbian learning. (A) Network of two model neurons N1 and N2 recurrently connected via excitatory synapses S1 and S2, with input neurons I1 and I2. N1 and N2 inhibit the input neurons via inhibitory interneurons (dark circles). (B) Network activity elicited by the sequence I1 followed by I2. (C) Network activity for the same sequence after 40 trials of learning. Due to strengthening of recurrent synapse S2, recurrent excitation from N1 now causes N2 to fire several milliseconds before the expected arrival of input I2 (dashed line), allowing it to inhibit I2 (asterisk). Synapse S1 has been weakened, preventing reexcitation of N1 (downward arrows show a decrease in EPSP).
Figure 6: Synaptic strength and latency reduction due to learning. (A) Potentiation and depression of synapses S2 and S1, respectively, during the course of learning. Synaptic strength was defined as maximal synaptic conductance in the kinetic model of synaptic transmission (Destexhe et al., 1998). (B) Latency of predictive spike in N2 during the course of learning, measured with respect to the time of the input spike in I2 (dotted line).
on the temporal derivative of postsynaptic activity. An interesting topic worthy of further investigation is therefore the development of more realistic implementations of TD learning based on, for instance, the temporal derivative of postsynaptic calcium activity rather than the TD in postsynaptic membrane potential as modeled here. An alternate approach to analyzing spike-timing-dependent learning rules is to decompose an observed asymmetric learning window into a TD component plus noise and to analyze the noise component. However, the advantage of our approach is that one can predict the shape of plasticity windows at different dendritic locations based on an estimate of postsynaptic activity, as described in section 6. Conversely, given a particular learning window, one can use the model to explain the temporal asymmetry of the window as a function of the neuron's postsynaptic activity profile. Temporally asymmetric learning has previously been suggested as a possible mechanism for sequence learning in the hippocampus (Minai & Levy, 1993; Abbott & Blum, 1996) and as an explanation for the asymmetric expansion of hippocampal place fields during route learning (Mehta, Barnes, & McNaughton, 1997). Some of these models used relatively long temporal windows of synaptic plasticity, on the order of several hundreds of milliseconds (Abbott & Blum, 1996), while others used temporal windows in the submillisecond range for coincidence detection (Gerstner et al., 1996). Sequence learning in our model is based on a window of plasticity that spans approximately ±20 milliseconds, which is roughly consistent with recent physiological observations (Markram et al., 1997; see also Abbott & Song, 1999; Roberts, 1999; Westerman et al., 1999; Mehta & Wilson, 2000; Song et al., 2000).
Several theories of prediction and sequence learning in the hippocampus and neocortex have been proposed based on statistical and information theoretic ideas (Minai & Levy, 1993; Montague & Sejnowski, 1994; Abbott & Blum, 1996; Dayan & Hinton, 1996; Daugman & Downing, 1995; Barlow, 1998; Rao & Ballard, 1999). Our biophysical simulations suggest a possible implementation of such models in cortical circuitry. Given the universality of the problem of encoding and generating temporal sequences in both sensory and motor domains, the hypothesis of TD-based sequence learning in recurrent neocortical circuits may help provide a unifying principle for studying cortical structure and function.
Acknowledgments
This research was supported by the Sloan Center for Theoretical Neurobiology at the Salk Institute and the Howard Hughes Medical Institute. We thank Peter Dayan and an anonymous referee for their constructive comments and suggestions.
References

Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cereb. Cortex, 6, 406–416.
Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neural response variability. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.
Barlow, H. (1998). Cerebral predictions. Perception, 27, 885–888.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 92, 3844–3848.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bliss, T. V., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (1999). Complex cells as cortically amplified simple cells. Nature Neuroscience, 2, 277–282.
Daugman, J. G., & Downing, C. J. (1995). Demodulation, predictive coding, and spatial vision. J. Opt. Soc. Am. A, 12, 641–660.
Dayan, P., & Hinton, G. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.
Destexhe, A., Mainen, Z., & Sejnowski, T. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). An intracellular analysis of the visual responses of neurons in cat visual cortex. J. Physiol., 440, 659–696.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Emptage, N. J. (1999). Calcium on the up: Supralinear calcium signaling in central neurons. Neuron, 24, 495–497.
Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Franks, K. M., Bartol, T. M., Egelman, D. M., Poo, M. M., & Sejnowski, T. J. (1999). Simulated dendritic influx of calcium ions through voltage- and ligand-gated channels using MCell. Soc. Neurosci. Abstr., 25, 1989.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–81.
Jordan, M. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531–546). Hillsdale, NJ: Erlbaum.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Koester, H. J., & Sakmann, B. (1998). Calcium dynamics in single spines during coincident pre- and postsynaptic activity depend on relative timing of back-propagating action potentials and subthreshold excitatory postsynaptic potentials. Proc. Natl. Acad. Sci. U.S.A., 95(16), 9596–9601.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791–797.
Linden, D. J. (1999). The return of the spike: Postsynaptic action potentials and the induction of LTP and LTD. Neuron, 22(4), 661–666.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Mainen, Z., & Sejnowski, T. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Mehta, M. R., Barnes, C. A., & McNaughton, B. L. (1997). Experience-dependent, asymmetric expansion of hippocampal place fields. Proc. Natl. Acad. Sci. U.S.A., 94, 8918–8921.
Mehta, M. R., & Wilson, M. (2000). From hippocampus to V1: Effect of LTP on spatiotemporal dynamics of receptive fields. Neurocomputing, 32, 905–911.
Minai, A. A., & Levy, W. B. (1993). Sequence learning in a single trial. In Proceedings of the 1993 INNS World Congress on Neural Networks II (pp. 505–508). Hillsdale, NJ: Erlbaum.
Montague, P. R., & Sejnowski, T. J. (1994). The predictive brain: Temporal coincidence and temporal order in synaptic learning mechanisms. Learning and Memory, 1, 1–33.
Paulsen, O., & Sejnowski, T. J. (2000). Natural patterns of activity and long-term synaptic plasticity. Curr. Opin. Neurobiol., 10(2), 172–179.
Rao, R. P. N. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39(11), 1963–1989.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4), 721–763.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive field effects. Nature Neuroscience, 2(1), 79–87.
Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Computational Neurosci., 7, 235–246.
Schiller, J., Schiller, Y., & Clapham, D. E. (1998). NMDA receptors amplify calcium influx into dendritic spines during associative pre- and postsynaptic activation. Nature Neurosci., 1(2), 114–118.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 203–211.
Sejnowski, T. J. (1999). The book of Hebb. Neuron, 24(4), 773–776.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Stiles, J. S., Bartol, T. M., Salpeter, M. M., Salpeter, E. E., & Sejnowski, T. J. (2000). Synaptic variability: New insights from reconstructions and Monte Carlo simulations with MCell. In M. Cowan & K. Davies (Eds.), Synapses. Baltimore: Johns Hopkins University Press.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72.
Suarez, H., Koch, C., & Douglas, R. (1995). Modeling direction selectivity of simple cells in striate visual cortex with the framework of the canonical microcircuit. J. Neurosci., 15(10), 6700–6719.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1), 9–44.
Westerman, W. C., Northmore, D. P. M., & Elias, J. G. (1999). Antidromic spikes drive Hebbian learning in an artificial dendritic tree. Analog Integrated Circuits and Signal Processing, 18, 141–152.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received February 24, 2000; accepted December 27, 2000.
LETTER
Communicated by Michael Henson
Analysis and Neuronal Modeling of the Nonlinear Characteristics of a Local Cardiac Reflex in the Rat

Rajanikanth Vadigepalli
Francis J. Doyle III*
Department of Chemical Engineering, University of Delaware, Newark, DE 19716, U.S.A.

James S. Schwaber
Department of Pathology, Cell Biology and Anatomy, Thomas Jefferson University, Philadelphia, PA 19107, U.S.A.

Previous experimental results have suggested the existence of a local cardiac reflex in the rat. In this study, the putative role of such a local reflex in cardiovascular regulation is quantitatively analyzed. A model for the local reflex is developed from anatomical experimental results and physiological data in the literature. Using this model, a systems-level analysis is conducted. Simulation results indicate that the neuromodulatory mechanism of the local reflex attenuates the nonlinearity of the relationship between cardiac vagal drive and arterial pressure. This behavior is characterized through coherence analysis. Furthermore, the modulation of phase-related characteristics of the cardiovascular system is suggested as a plausible mechanism for the nonlinear attenuation. Based on these results, it is plausible that the functional role of the local reflex is highly robust nonlinear compensation at the heart, which results in less complex dynamics in other parts of the reflex.

1 Introduction
Traditionally it has been considered that the central nervous system solely governs the neuronal regulation of arterial blood pressure, heart rate, atrioventricular conduction, and myocardial contractility. These cardiovascular variables have been shown to be regulated by the vagus nerve (Massari, Johnson, & Gatti, 1995; Standish, Enquist, & Schwaber, 1994), and so the baroreceptor vagal reflex plays a key role in cardiovascular regulation. The vagal reflex contains the three principal components of a feedback control system: sensors, a controller or central integrator, and actuators. The information regarding changes in arterial pressure is transduced by the
* Address correspondence to [email protected].
© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2239–2271 (2001).
baroreceptors (sensors) of the vagal afferent nerves. The nucleus of the solitary tract (NTS) in the brain stem receives and processes this information (the central integrator). The final control action is delivered by the motor neurons (actuators) in the nucleus ambiguus (NA) and the dorsal motor nucleus of the vagus (DmnX). The motor neurons in the NA and DmnX innervate the cardiac ganglionic plexuses around the heart. Traditionally these cardiac ganglia have been thought of as relay stations that transmit the central control signals to the sinoatrial node (SAN), atrioventricular node (AVN), and myocardium. Mounting experimental evidence on the cardiac ganglia has challenged this convention (Cheng, Powley, Schwaber, & Doyle, 1997; Hardwick, Mawe, & Parsons, 1995; Parsons, Neel, McKeon, & Carraway, 1987). In addition, various neural modulators and transmitters have been found in cardiac ganglia (Neel & Parsons, 1986a; Parsons, Neel, Konopka, & McKeon, 1989; Sosunov, Hassall, Loesch, Turmaine, & Burnstock, 1996; Sosunov et al., 1997; Steele, Gibbins, Morris, & Mayer, 1994). These neuromodulators could participate in the local reflex through a variety of extra- and intracellular signaling mechanisms. The dynamic interaction between these signaling pathways results in a complex cellular network capable of delivering functional variety and complexity (Weng, Bhalla, & Iyengar, 1999). Based on these results, it has been proposed recently that the cardiac ganglia may not be merely simple relay stations but may also function as integrative control centers (Hardwick et al., 1995; Burnstock, 1995). Several models for baroreflex control of cardiovascular dynamics have been published in the literature.
Earlier approaches concentrated on input-output empirical modeling methods (Allison, Sagawa, & Kumada, 1969; Chen, Chai, Tung, & Chen, 1979; Korner, 1979; Sagawa, Kumada, & Shoukas, 1973; Sagawa, 1983), whereas recent modeling efforts have focused on fundamental mathematical descriptions of reflex control of the cardiovascular system (Dexter, Levy, & Rudy, 1989; Karemaker, 1987; Rose & Shoukas, 1993; Schwaber, Graves, & Paton, 1993). However, the existing models for cardiovascular control do not include the intracardiac ganglia. In this study, we have developed a mathematical model for the local cardiac reflex and, through simulations, have characterized the role of the reflex. Experimental results from our group (Cheng, Powley, et al., 1997) have shown that cardiac ganglia contain heterogeneous types of neurons, including principal neurons (PNs) and small, intensely fluorescent (SIF) cells, and that vagal afferent nerve fibers selectively innervate SIF cells (interneurons). These results suggested the existence of a local cardiac reflex in the rat (Cheng, Powley, et al., 1997). Similar local reflexes have been proposed in guinea pig (Sosunov et al., 1996; Tanaka, Hassall, & Burnstock, 1993; Hassall, James, & Burnstock, 1995) and mudpuppy (Hardwick et al., 1995; Kriebel, Angel, & Parsons, 1991; Parsons & Neel, 1987) based on experimental observations. Figure 1 shows the proposed reflex circuitry for short-term regulation of the cardiovascular system. The local reflex architecture in Figure 1
Figure 1: Schematic of parasympathetic and local reflex control of the cardiovascular system. This figure shows the modular implementation of the local cardiac reflex model. The model for the central reflex (shaded portion of the figure) is not included in this study. Instead, the vagal input is modeled as a pulse train of given frequency. The principal neuron is modeled as a Hodgkin-Huxley neuron, and its effect on the sinoatrial node is empirically modeled (Rose & Schwaber, 1996). The heart and the peripheral vasculature are modeled using a simplified equivalent circuit model shown in Figure 3. Atrial receptors are modeled as linear filters that measure average atrial pressure. The SIF cells release 5-HT and dopamine and are modeled as linear filters with different parameters for 5-HT and dopamine.
is consistent with the experimental results in Cheng, Powley, et al. (1997). Two assumptions underlie the hypothesis (Cheng, Kwatra, Schwaber, Powley, & Doyle, 1997) of a local reflex: (1) SIF neurons receive cardiac sensory inputs and then project to PNs, and (2) selected PNs in specific locations and subserving specific functions receive discrete and extremely dense vagal input from the dorsal motor nucleus of the vagus (DMV). A schematic of the local reflex is shown in Figure 2. Based on these results and assumptions, Cheng, Kwatra, et al. (1997) constructed a computer simulation model for the local cardiac reflex, using a simple model of principal neuron and blood pressure dynamics. Preliminary results (Cheng, Kwatra, et al., 1997) based on this simple model show that the local reflex predicts behavior consistent with the Bainbridge reflex (Bainbridge, 1915). This reflex operates as follows: venous infusion
Figure 2: Schematic of local cardiac reflex control. The local cardiac reflex consists mainly of three components: (1) SIF cells receive sensory inputs from the atrial receptors and project to the principal neurons (PNs), (2) PNs also receive input from vagal efferents from the dorsal motor nucleus of the vagus, and (3) PN activity has a phase-dependent effect on the SA node pacemaker and hence on the heartbeat cycle. Shaded triangles denote inhibitory connections, and open triangles denote excitatory connections.
causes atrial stretch, which increases the heart rate. However, this model does not incorporate the intracardiac neurons of the rat. Also, this model uses a simplified lumped model for the SA node and cardiovascular dynamics. Explicit incorporation of SA node dynamics is crucial to the study of the local reflex, as will be discussed in the subsequent sections. This article extends the results of Cheng, Kwatra, et al. (1997) by developing a model for the local cardiac reflex based on the anatomical experimental results and physiological data in the literature (Adams & Harper, 1995; Dolphin, 1990; Jeong & Wurster, 1997; Trautwein & Hescheler, 1990; Xi-Moy & Dun, 1995; Xu & Adams, 1992a, 1992b). Simulation results indicate that the local cardiac reflex could attenuate the nonlinearity of the relationship between cardiac vagal drive and the arterial blood pressure. We explore the hypothesis that the functional role of dynamic neuromodulation by SIF cells is an input-output "linearizing" effect on the actuator (heart) dynamics. We employ coherence analysis to characterize the nonlinearity between the frequency of the vagus nerve pulsatile input and the mean arterial pressure. We also investigate the role of modulatory synaptic transmitters in this function.
Nonlinear Characteristics of a Local Cardiac Reflex in the Rat
2 Modeling the Local Cardiac Reflex
The local cardiac reflex model developed in this study has three main components: a PN model, a representative model of the cardiovascular system, and a simplified dynamic model of the SIF cell. The relevant details of modeling these components are discussed in this section.

2.1 Principal Neuron Model. The PN model used in this work consists of two components: a Hodgkin-Huxley type model (Hodgkin & Huxley, 1952) of the membrane dynamics and a fluid compartment model characterizing the intra- and extracellular media associated with the soma. Under space-clamp conditions, the dynamic change of the neuron membrane potential V is described as
C_m \frac{dV}{dt} = \sum_i g_i \, (E_i - V) + I_{inj},   (2.1)
where C_m is the neuron membrane capacitance, t is the time, I_{inj} is the injected current (in the case of synaptic inputs, I_{inj} = 0), and g_i and E_i are the conductance and reversal potential of the channel i, respectively. The membrane channels used include specific Na+, K+, and Ca2+ ionic channels; the leakage channel (L); and excitatory (SynE) synaptic channels. Following the Hodgkin-Huxley formalism, conductances of the ionic channels are represented in the following form:

g_i = \bar{g}_i \, F_{i,mod} \, m_i^{a_i}(V, t) \, h_i^{b_i}(V, t)
\tau_{m_i}(V) \frac{dm_i}{dt} = m_{\infty i}(V) - m_i
\tau_{h_i}(V) \frac{dh_i}{dt} = h_{\infty i}(V) - h_i,   (2.2)
where \bar{g}_i is the maximal conductance of the channel i; F_{i,mod} is the modulation factor for channel i; m_i and h_i are the variables that define the dynamics of, respectively, activation and inactivation of the channel i; and a_i and b_i are the powers of m_i and h_i, respectively. The steady-state values of m_i and h_i (m_{\infty i} and h_{\infty i}) and their time constants (\tau_{m_i} and \tau_{h_i}) have a nonlinear dependence on the membrane potential V and, in some cases, on the calcium concentration inside the cell, [Ca2+]_in. F_{i,mod} is a nonlinear function of the concentrations of the neurotransmitters that participate in the modulation of the channel i. The following ionic channels have been included in the model: fast sodium channels, Na_fast (g_i = g_Na); potassium delayed-rectifier channels, K_DR (g_i = g_DR); calcium-dependent potassium channels that provide long-lasting afterhyperpolarization, K_AHP(Ca) (g_i = g_AHP); transient potassium A-channels, K_A (g_i = g_A); anomalously rectifying potassium channels,
K_AR (g_i = g_AR); and high-threshold calcium channels consisting of L-type (g_i = g_CaL) and N-type (g_i = g_CaN) channels. The channels Na_fast and K_DR are necessary for the generation of action potentials and are present in all spiking neurons. The K_AHP(Ca), K_A, K_AR, CaL, and CaN channels are incorporated into the model based on the available experimental data on intracardiac neurons (Adams & Harper, 1995; Dolphin, 1990; Jeong & Wurster, 1997; Trautwein & Hescheler, 1990; Xi-Moy & Dun, 1995; Xu & Adams, 1992a, 1992b) (see section 1). The leakage conductance is considered constant (g_i = g_leak = \bar{g}_leak). The equations and the parameters for the principal neuron model are given in the appendix.

The reversal potentials for sodium (E_Na) and potassium (E_K) channels are usually considered constant in models of the Hodgkin-Huxley style (Hodgkin & Huxley, 1952; McCormick & Huguenard, 1992; Schwaber et al., 1993). The calcium reversal potential E_Ca depends strongly on [Ca2+]_in in accordance with the Nernst equation. The reversal potentials for the synaptic (E_SynE) and the leakage (E_leak) channels are constant. The value of E_leak is tuned to provide the required resting membrane potential in the absence of input signals.

The inputs to the PN are the excitatory synaptic input from the vagus nerve and the neuromodulators serotonin (5-HT) and dopamine (DA) released by the SIF neurons. The effect of the neuromodulators on the dynamics of channel i is represented using the modulation factor F_{i,mod}. In the current model, the inwardly rectifying potassium channel K_AR and the high-threshold L-type calcium channel CaL are the modulated ionic channels. In the current PN model, the inwardly rectifying K+ channel is modulated by 5-HT via a cAMP-mediated pathway. The model for cAMP modulation has been adapted from Butera, Clark, Canavier, Baxter, and Byrne (1995).
Here, the maximal conductance for the anomalous rectifier channel \bar{g}_AR is multiplied by a factor F_{AR,mod} to account for the modulation by cAMP. Boltzmann-type characterization of the binding site for cAMP leads to the following expressions for F_{AR,mod} and I_AR:

F_{AR,mod} \equiv 1 + \frac{K_{AR,mod}}{1 + \exp\left(-\frac{[cAMP] - K_{AR,cAMP}}{D_{AR,cAMP}}\right)}   (2.3)

I_{AR} = \bar{g}_{AR} \, F_{AR,mod} \cdot \frac{V - E_K - 5.66}{1 + \exp\left(\frac{(V - E_K - (-15.3)) Z F}{R T}\right)},   (2.4)
where \bar{g}_AR is the maximal unmodulated conductance (\mu S) and F_{AR,mod} is a dimensionless modulation term that represents the effect of cAMP on I_AR. The parameters K_{AR,mod}, K_{AR,cAMP}, and D_{AR,cAMP} are constants for the Boltzmann-type binding. The parameters in the mathematical description of this current, I_AR, are chosen to be consistent with the experimental results on the potassium channel currents in rat intracardiac neurons (Xi-Moy &
Dun, 1995). The specific values used for the parameters are given in the appendix. Similarly, the L-type calcium channel maximal conductance is modulated with a factor F_{L,mod} that represents the effects of both cAMP and DA (Dolphin, 1990; Trautwein & Hescheler, 1990):

F_{L,mod} = \left[1 + \frac{K_{L,mod}}{1 + \exp\left(-\frac{[cAMP] - K_{L,cAMP}}{D_{L,cAMP}}\right)}\right] \times \left(\frac{K_{DA}}{[DA] + K_{DA}}\right),   (2.5)
where the parameters K_{L,mod}, K_{L,cAMP}, and D_{L,cAMP} are constants for the Boltzmann-type binding, and K_DA represents the half-inactivation concentration value for DA on I_CaL. The dynamics of synaptic stimulation are described as follows:

\tau_E \frac{dg_{SynE}}{dt} = -g_{SynE} + I_{vagal},   (2.6)
where I_vagal is the excitatory synaptic stimulation by the vagus nerve and \tau_E is the time constant of the stimulation.

2.1.1 Calcium Regulation. Incorporating calcium and calcium-dependent channels into the model requires a description of the dynamics of the free calcium concentration below the membrane (inside a so-called thin shell; d = 0.1 \mu m) (Yamada, Koch, & Adams, 1989; Schwaber et al., 1993). The influx of calcium ions into the shell is through voltage-gated calcium channels, after which they are buffered and pumped out of the neuron. Dynamics of the free calcium concentration inside the shell are described by the following equation (Schwaber et al., 1993):

\frac{d[Ca^{2+}]_{in}}{dt} = \frac{I_{Ca} (1 - P_B)}{2 F \upsilon} + \frac{[Ca^{2+}]_{in0} - [Ca^{2+}]_{in}}{\tau_{pump}(V)},   (2.7)

where the first term constitutes buffering and influx, and the second term describes the pump kinetics; \upsilon is the volume of the shell, with a thickness of d = 0.1 \mu m and an area equal to the area of the neuronal membrane, A: \upsilon = d \cdot A; [Ca^{2+}]_{in0} is the equilibrium Ca^{2+} concentration of the pump; and \tau_{pump}(V) is the time constant of the calcium pump, which depends on the membrane potential V (Yamada et al., 1989) and is described as \tau_{pump}(V) = 17.7 \cdot \exp(V / 35). The combined contribution of all the calcium channels provides the influx, and the resulting expression for the calcium current is

I_{Ca} = (g_{CaL} + g_{CaN}) \cdot \left(E_{Ca}([Ca^{2+}]_{in}) - V\right),   (2.8)
where E_{Ca}([Ca^{2+}]_{in}) is the reversal potential of calcium, which is a function of the internal calcium concentration [Ca^{2+}]_{in} and is described as follows:

E_{Ca}([Ca^{2+}]_{in}) = 13.27 \cdot \log\left(4.0 / [Ca^{2+}]_{in}\right).   (2.9)
Calcium buffering can be considered a second-order instantaneous reaction with a forward rate k_f and a backward rate k_b (Yamada et al., 1989; Schwaber et al., 1993). A probability P_B is associated with a Ca^{2+} ion being buffered when it enters the shell. The reaction mechanism can be described as

B + Ca^{2+} \rightleftharpoons Ca \cdot B   (forward rate k_f, backward rate k_b),

and the reaction dynamics are given by

\frac{d[Ca^{2+}]_{in}}{dt} = \frac{d[B]}{dt} = k_b \, [Ca \cdot B] - k_f \, [Ca^{2+}]_{in} \, [B].   (2.10)
The sum of the concentrations of free [B] and bound [Ca \cdot B] buffer is considered to be constant:

[B] + [Ca \cdot B] = [B]_{total}.   (2.11)
From equations 2.10 and 2.11, the expression for the bound calcium [Ca \cdot B] can be readily derived for the stationary state of equation 2.10:

[Ca \cdot B] = \frac{[Ca^{2+}] \cdot [B]_{total}}{[Ca^{2+}] + k_b / k_f}.   (2.12)
The probability of Ca^{2+} ion buffering can be derived as

P_B = \frac{[B]_{total}}{[Ca^{2+}] + k_b / k_f + [B]_{total}}.   (2.13)
All the values used for the parameters in the model for calcium buffering are given in the appendix.

2.1.2 Cyclic-AMP Regulation. A lumped model for the modulation of the cAMP concentration by 5-HT is employed (Butera et al., 1995). cAMP production from adenosine triphosphate (ATP) is catalyzed by adenylyl cyclase (AC). 5-HT increases the activity of AC (Lotshaw, Levitan, & Levitan, 1986), thus leading to an increase in the intracellular concentration of cAMP. The degradation term in the species balance is due to the cleavage of cAMP by phosphodiesterase. These effects are lumped into the following single
equation, which describes the modulation of cAMP by 5-HT with sufficient accuracy:

\frac{d[cAMP]}{dt} = k_{adc} \left(1 + K_{mod} \frac{[5HT]}{[5HT] + K_{5HT}}\right) - \nu_{pde} \left(\frac{[cAMP]}{[cAMP] + K_{pde}}\right),   (2.14)
where the values of the kinetic parameters k_{adc}, K_{5HT}, \nu_{pde}, and K_{pde} are taken from Butera et al. (1995) and are specified in the appendix. The equation for the membrane potential of the principal neuron, with the ion channels incorporated in the model, is as follows:

\frac{dV}{dt} = \left(I_{Na} + I_{DR} + I_{AR} + I_A + I_{AHP} + I_{CaL} + I_{CaN} + I_{leak} + I_{SynE}\right) / C_m.   (2.15)
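The structure of equations 2.1, 2.2, and 2.15 can be illustrated with a minimal explicit-Euler update for a single generic channel. The rate functions and all constants below are invented for illustration; the actual channel kinetics and appendix parameters of the PN model are not reproduced here.

```python
import math

# Minimal sketch of one explicit-Euler step of the PN membrane update
# (eqs. 2.1, 2.2, 2.15) with a single illustrative channel and no
# inactivation gate. All constants here are assumed placeholders.

CM = 1.0    # membrane capacitance (assumed units)
DT = 0.01   # time step (ms)

def m_inf(v):
    """Steady-state activation m_inf(V): illustrative sigmoid."""
    return 1.0 / (1.0 + math.exp(-(v + 40.0) / 5.0))

def tau_m(v):
    """Activation time constant tau_m(V) in ms: illustrative bell curve."""
    return 1.0 + 4.0 * math.exp(-((v + 40.0) / 30.0) ** 2)

def step(v, m, gbar, e_rev, f_mod=1.0, i_inj=0.0, a=3):
    """Advance (V, m) by one Euler step.

    g = gbar * F_mod * m**a               (eq. 2.2, activation only)
    tau_m(V) dm/dt = m_inf(V) - m         (eq. 2.2)
    C_m dV/dt = g * (E_rev - V) + I_inj   (eq. 2.1)
    """
    g = gbar * f_mod * m ** a
    dv = (g * (e_rev - v) + i_inj) / CM
    dm = (m_inf(v) - m) / tau_m(v)
    return v + DT * dv, m + DT * dm
```

The modulation factor f_mod enters exactly where F_{i,mod} multiplies the maximal conductance in equation 2.2; in the full model it would be driven by equations 2.3 and 2.5 rather than held constant.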
2.2 Heart and Circulatory System Model. The model for the heart with the peripheral circulatory system has been adapted from Rose and Schwaber (1996). This model includes vascular hemodynamics and left ventricular dynamics, and it captures the important effects of heart rate on ventricular filling and ejection. In particular, the time-varying elastance model of the heart shows a decrease in the stroke volume as the heart rate increases, and the vascular model includes features such as conservation of blood volume and realistic arterial and venous pressure-flow dynamics (Rose & Shoukas, 1993; Rose & Schwaber, 1996). The specifics of this model are presented in Figure 3, which shows the schematic model of the heart and the systemic vascular bed. The heart is represented as an elastic chamber whose elastance (pressure-to-volume ratio) varies periodically with time. The time-varying elastance of the left ventricle, E_lv, is described as follows:
E_{lv}(t) = E_{lv,min} + \left[E_{lv,max} - E_{lv,min}\right] \cdot e(t)

e(t) = \begin{cases} \frac{1 - \cos\left(2 \pi W_{SAN} / (T_s / HR)\right)}{2}, & 0 \le W_{SAN} \le T_s / HR \\ 0, & T_s / HR \le W_{SAN} \le 1, \end{cases}   (2.16)
where E_{lv,max} and E_{lv,min} are the parameters that characterize the pressure-volume relationship; W_SAN is the activity of the sinoatrial node (SAN); HR is the heart rate; and T_s is the duration of the systole. Several detailed models of the sinoatrial node are available in the literature (Demir, Clark, Murphey, & Giles, 1994; Dokos, Celler, & Lovell, 1996). We have chosen to use a simplified model in which the SAN activity (W_SAN) varies periodically between 0 and 1, based on the heart rate.
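The elastance waveform of equation 2.16 can be sketched directly, with W_SAN treated as a phase sweeping from 0 to 1 over one beat. The values of E_lv,min, E_lv,max, and T_s/HR below are assumed placeholders, not the model's parameters.

```python
import math

# Sketch of the time-varying elastance of eq. 2.16.
# All parameter values are assumed for illustration only.

E_LV_MIN = 0.1   # minimal elastance (assumed)
E_LV_MAX = 2.5   # maximal elastance (assumed)
TS = 0.3         # systolic duration T_s (assumed)
HR = 1.0         # heart rate scaling in T_s / HR (assumed)

def activation(w_san):
    """e(t) of eq. 2.16 as a function of the SAN phase W_SAN in [0, 1]."""
    frac = TS / HR
    if 0.0 <= w_san <= frac:
        return (1.0 - math.cos(2.0 * math.pi * w_san / frac)) / 2.0
    return 0.0   # diastole: elastance stays at its minimum

def elastance(w_san):
    """E_lv = E_lv,min + (E_lv,max - E_lv,min) * e(t)."""
    return E_LV_MIN + (E_LV_MAX - E_LV_MIN) * activation(w_san)
```

The raised-cosine form gives a smooth rise from E_lv,min to E_lv,max at mid-systole and back, and a constant E_lv,min throughout diastole, which is what produces the decreasing stroke volume as HR shortens the filling interval.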
Figure 3: Schematic diagram of the model of the heart and blood vessels. The vascular bed is represented by a network of three resistances, arterial (Ra), small vessel (Rc), and venous (Rv), and two compliances, arterial (Ca) and venous (Cv). The heart is modeled as a time-varying elastance e(t). The mitral and aortic valves (MiV and AoV) force the flow through the heart to be unidirectional, as indicated by the cardiac output arrow. Redrawn, with permission, from Rose and Schwaber (1996).
This model also accounts for the phase-dependent effect of the PN action potentials on the SAN activity cycle. This is found to be sufficient to capture the essential characteristics of the relationship among PN activity, the SAN cycle, and the heartbeat cycle. In the original model (Rose & Schwaber, 1996), the heart rate (HR) is controlled by the arterial baroreflex. In this model, HR is controlled by the activity of the PN, and the vagal input to the PNs is varied to represent the arterial baroreflex activity. The heartbeat cycle is maintained by the periodic SAN activity, which is regulated by the PN. Hence, the PN activity can effectively regulate the heart rate based on "how late" or "how early" the PN action potentials occur with respect to the SAN cycle. This phase-dependent relationship between the PN activity and the SA node cycle is critical to cardiac regulation. The local reflex action apparently modulates this phase-dependent behavior of the system, resulting in the dynamic characteristics discussed in the next section.
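To make the phase-dependent interaction concrete, the following hypothetical sketch advances a SAN phase variable and applies a phase-dependent delay on each PN spike. The phase-response curve used here is invented for illustration; the actual coupling rule belongs to the Rose and Schwaber (1996) implementation and is not reproduced.

```python
# Hypothetical sketch of the phase-dependent PN -> SAN interaction:
# the SAN phase advances linearly and a beat is emitted on wrap-around,
# while a PN action potential delays the phase by an amount that grows
# with how far the cycle has progressed. The reset curve is invented.

DT = 0.001      # integration step (s)
BASE_HR = 6.0   # intrinsic SAN rate (assumed, beats/s)

def advance_san(phase, pn_spike):
    """Advance the SAN phase by one step; return (new_phase, beat_emitted)."""
    phase += BASE_HR * DT
    if pn_spike:
        # Vagal (PN) slowing: strongest late in the cycle (illustrative).
        phase -= 0.3 * phase
    if phase >= 1.0:
        return phase - 1.0, True
    return phase, False
```

With no PN spikes, the beat rate equals the intrinsic SAN rate; sustained PN firing holds the phase below threshold and lengthens or suppresses beats, which is the mechanism by which spike timing relative to the SAN cycle shapes the heart rate.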
2.3 SIF Cell Model. In the absence of sufficient electrophysiological data on SIF cell activity as modulated by atrial pressure receptors, a linear filter is employed as a representation of the signal being relayed. A second-order linear filter is used because it provided smoother changes in the neuromodulator release dynamics than a first-order filter. The SIF cells receive afferent input from the atrial receptors and release the neuromodulators 5-HT and DA. The linear filter characteristics (gain and time constants) are different for 5-HT and DA release. This structure captures the plausibility of different populations of SIF cells containing different neuromodulators. The activity of the atrial pressure receptors is modeled as a linear filter of the atrial pressure calculated from the vascular system model. This is consistent with the experimental observations published in the literature (Mary, 1987; Paintal, 1979). A threshold is included in the SIF cell model so that the SIF cell does not release any neuromodulator if the activity of the atrial receptors is less than or equal to the nominal mean atrial pressure. This "black box" model of the SIF cells closes the loop on the local reflex model via modulation of the principal neuron. In the absence of clear experimental evidence, it is reasonable to assume that different populations of SIF cells do not necessarily modulate separate populations of PNs. In the current model, both 5-HT and DA modulate the PN dynamics.

2.4 Vagal Input Model. The vagal stimulation model used in Clairambault (1995) is considered. This model incorporates parameters for respiratory and baroreflex-linked frequencies. The vagus nerve input is modeled as an impulse train of varying amplitude and frequency, and this evokes action potentials in the PN based on a threshold. Experimental observations indicate that the cardiac vagal motoneurones fire at the cardiac rhythm and are also affected by the respiratory rhythm (McAllen & Spyer, 1978).
Based on these results and the threshold-based pulse train model (Clairambault, 1995), a simplified equivalent vagal stimulation model is used in the present study. The excitatory synaptic input to the PN from the vagus nerve is modeled as a pulse train of varying frequency, which will be referred to as the vagal frequency. The amplitude of the pulse train is not varied. Instead, a stochastic component is added to the vagal frequency to sufficiently capture dynamic changes in the vagus nerve input as related to synaptic excitation of the PN. This vagal frequency can be interpreted as the average frequency of the vagal pulse train that evokes action potentials in the PN. In this study, a range of vagal frequencies between 2 and 9 Hz is considered.

2.5 Computing Methods. The model has been implemented in a modular form, as represented in Figure 1, using the SIMULINK environment in MATLAB. This allows easy incorporation of new modules for simulation of the local reflex. The PN and the heart models were written in the C programming language using MATLAB's C-MEX interface. Other components used in the model are constructed using MATLAB/SIMULINK's object library. The
simulations were performed using an explicit fourth-order Runge-Kutta method with variable step size. The mathematical model for the various components and the values for the associated parameters are given in the appendix.

3 Results
In this study, we consider the vagal frequency as the "input" and the mean arterial blood pressure (MAP) as the "output" of the system. In all the simulations, the output (MAP) of the system responding to positive and negative changes in the input (vagal frequency) is analyzed. The coherence function, in addition to the MAP response, is analyzed to characterize the role of the local reflex dynamics. Relevant details of the coherence analysis are presented in the next section.

3.1 Coherence Analysis. Coherence has been used as a nonlinearity measure in process modeling and control (Pearson & Ogunnaike, 1994; Stack & Doyle, 1997). The coherence between two stationary random processes u(t) and x(t) is given by Carter (1993):

\gamma_{ux}^2(\omega) = \frac{|S_{ux}(\omega)|^2}{S_{uu}(\omega) \, S_{xx}(\omega)},

where the S(\omega) are the two-sided cross-spectral density functions defined by

S_{yw}(\omega) = \frac{1}{T} E[Y^*(\omega) W(\omega)], \quad -\infty < \omega < \infty.
The coherence is a frequency-dependent function, and at each frequency \omega it can be interpreted as the fraction of the mean square value of the output of the system that can be accounted for by a linear relationship with the input to the system. Conveniently, the coherence is bounded by zero and one and is equal to unity at frequencies where the input-output relationship is linear. The extent of coherence suppression, that is, how "far away" the coherence of the system is from one, is a direct indicator of the degree of nonlinearity in the input-output relationship of the dynamical system. Because the coherence is a frequency-dependent quantity, it can represent dynamic nonlinearity in the system. For the coherence analysis performed in this work, the weighted overlapped segment averaging (WOSA) method (Carter, 1993) is used. In this method, the input-output data are divided into segments, with a 50% overlap between adjacent segments, which are detrended. Each segment is then multiplied by a Hanning weighting function, and a fast Fourier transform (FFT) is performed on each segment. This is used to compute the coherence as a frequency-dependent function. The mathematical issues related to the calculation of the coherence
function can be found in Stack and Doyle (1997) and are omitted here. In this study, the vagal pulse input and arterial pressure data are divided into segments, each of length 10 seconds of simulation time, with a 50% overlap between adjacent segments. The sampling rate of these data is 100 Hz, and the 10-second-long segments were found to be appropriate for the coherence calculations. The psd and spectrum functions in the System Identification Toolbox (Ljung, 1991) in MATLAB are used for performing the coherence analysis. Using the techniques described above, we perform an analysis of the difference in the system response in the presence and absence of local reflex action to characterize the dynamic functional role of the local reflex. The nonlinearity of the system response is readily demonstrated by introducing equal positive and negative changes in the input "vagal frequency." We carry out this analysis for two different cases: (1) only 5-HT participates in the local reflex, and (2) neuromodulation by both 5-HT and dopamine is active. We present the results for each case separately.

3.2 5-HT Neuromodulation. In this case, neuromodulation of the PNs is carried out by the 5-HT released by the SIF cells. 5-HT increases I_AR and I_CaL via cAMP-mediated pathways. An increase in I_AR directly reduces the excitability of the PNs: both the resting potential and the firing frequency of the PN are reduced with an increase in I_AR. An increase in the intracellular calcium concentration increases I_AHP. This reduces the excitability of the PNs, leading to calcium-dependent adaptation. The calcium influx into the cell is due to the CaN and CaL channels, the latter of which is modulated by 5-HT. Thus, two of the crucial characteristics of the PN, firing frequency and adaptation, are dynamically modulated by the SIF cells. For low vagal frequencies (about 2–5 Hz), the adaptation characteristics are less important, and the modulation of I_AR by 5-HT is the dominant factor in the local reflex action.
Figure 4: 5-HT neuromodulation only. Comparison of mean arterial pressure (MAP) response with and without local reflex action. The vagal pulsatile input is kept the same for both cases. The positive and negative changes in the frequency of the vagal pulse train elicit a nonlinear response in the MAP. When the local reflex is not represented in the model, the MAP changes in a nonlinear fashion for positive and negative changes in the vagal frequency, whereas when the local reflex is taken into account in the simulation, the MAP changes in the expected direction: a decrease in MAP with an increase in vagal frequency, and vice versa.

For higher vagal frequencies (about 5–9 Hz), the adaptation characteristics of the PN are crucial. Figure 4 displays the change in the MAP with positive and negative changes in the mean frequency of the vagal pulse input. The MAP response is more linear in the presence of local reflex action. It should be noted that the MAP response in the presence of the local reflex is not truly linear, as shown by the asymmetric change of MAP to positive and negative changes in the vagal frequency. However, the increase in the linearity of the MAP response is the interesting result of this simulation.

Figure 5 shows the plot of coherence between the vagal pulsatile input and the arterial pressure. The increase in coherence in the presence of the local reflex demonstrates that the local reflex dynamics result in attenuation of nonlinearity. But the nonlinearity in this case is between the vagal input drive and the cyclic arterial pressure. The increase in coherence seen in Figure 5 is significant and cannot be attributed to the noise-related aspects of the coherence calculations. In general, the variance and the bias of the coherence estimate are approximately inversely proportional to the number of segments in the coherence calculation. In this work, we use 30,000 data points divided into 30 segments of 1024 points each, with 50% overlap between segments. The empirical results from Benignus (1969) indicate that the 95% confidence intervals for coherence computations employing 32 segments with 64-point windows may be less than ±0.04. Hence, the coherence profiles presented in this work are not sensitive to the apparent "fluctuations" in the arterial pressure signal introduced by the moving-average process. This nonlinearity attenuation occurs only for a short range of vagal frequencies. This is due to the 5-HT-modulated increase in adaptation of PNs at higher input frequencies. This demonstrates the limitations of the local reflex with only 5-HT as the neuromodulator. In the next section, the combined neuromodulatory influence of 5-HT and dopamine is considered.
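The WOSA coherence estimate described in section 3.1 (detrended, Hann-windowed segments with 50% overlap, averaged cross- and auto-spectra) can be sketched as follows. This is a generic NumPy implementation, not the MATLAB psd/spectrum code used in the study; the segment length is an assumed example value.

```python
import numpy as np

# Sketch of the WOSA coherence estimate of section 3.1: detrended,
# Hann-windowed segments with 50% overlap; cross- and auto-spectra are
# averaged over segments before forming gamma^2. Generic illustration,
# not the MATLAB toolbox code used in the paper.

def wosa_coherence(u, x, nperseg=256):
    step = nperseg // 2                  # 50% overlap between segments
    win = np.hanning(nperseg)
    suu = sxx = sux = 0.0
    for start in range(0, len(u) - nperseg + 1, step):
        su = u[start:start + nperseg]
        sx = x[start:start + nperseg]
        su = (su - su.mean()) * win      # detrend (mean removal) + window
        sx = (sx - sx.mean()) * win
        U = np.fft.rfft(su)
        X = np.fft.rfft(sx)
        suu = suu + (U * np.conj(U)).real
        sxx = sxx + (X * np.conj(X)).real
        sux = sux + np.conj(U) * X
    return np.abs(sux) ** 2 / (suu * sxx)   # gamma^2, one value per bin
```

For a purely linear input-output relationship the estimate is close to one at all frequencies, while a memoryless quadratic nonlinearity applied to a zero-mean Gaussian input drives it toward zero.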
Figure 5: 5-HT only. (Top) Finite Fourier transform (FFT) of the vagal pulse input. The peaks correspond to the frequency (including harmonics) of the vagal pulse train. (Bottom) Comparison of the coherence between arterial pressure and vagal pulse input, without and with local reflex. The increase in the coherence when the local reflex is considered in the simulation demonstrates the nonlinearity attenuation by the local reflex. This nonlinearity attenuation is between the vagal pulse train input and the cyclic arterial pressure.
3.3 5-HT and Dopamine Neuromodulation. For the simulations in this section, the release of both 5-HT and dopamine by the SIF cells is considered. The release characteristics (gain and time constants) are different for 5-HT and dopamine, to capture the difference in the required concentration for effective neuromodulation of the appropriate ion channels. Dopamine decreases I_CaL, thus decreasing adaptation. This provides the interaction between the neuromodulatory effects of dopamine and 5-HT through changes in I_CaL. Figure 6 shows the change in the MAP with positive and negative changes in the vagal frequency. Even in the presence of both neuromodulators, the local reflex action does result in attenuation of the nonlinearity. The important change now, however, is that the nonlinearity reduction is extended to higher frequencies as compared to the case with 5-HT neuromodulation only. Figures 6 and 7, respectively, show the nonlinearity attenuation and the
Figure 6: 5-HT and dopamine. Comparison of MAP with and without local reflex action. The vagal pulsatile input is kept the same for both cases. The positive and negative changes in the frequency of the vagal pulse train elicit a nonlinear response in the MAP. The gain is more asymmetric when local reflex action is not present. However, the local reflex action still does not result in a truly linear response.
Figure 7: 5-HT and dopamine. (Top) The finite Fourier transform (FFT) of the vagal pulse input. The peaks correspond to the frequency of the vagal pulse train. (Bottom) Comparison of the coherence without and with local reflex. The increase in the coherence when the local reflex is considered in the simulation demonstrates the nonlinearity attenuation by the local reflex. This nonlinearity attenuation is between the vagal pulse train input and the cyclic arterial pressure.

increase in the coherence in the presence of the local reflex action when both 5-HT and dopamine participate in the local reflex. The release profiles of 5-HT and dopamine during the time course of the simulation are shown in Figure 8. The differences in the release characteristics are directly related to the changes in the phase-dependent behavior of the system, as discussed in the next section.

Figures 5 and 7 show the nonlinearity attenuation between the pulsatile vagal input and the cyclic arterial pressure. The increase in coherence in the presence of the local reflex indicates attenuation of nonlinearity between the vagus nerve input and the arterial pressure. This provides some preliminary understanding of the relationship between vagus nerve input and arterial pressure and the modulation of the same by the local reflex. However, the effect of changes in vagal frequency is not evident in these coherence analysis results, although it is clearly seen in Figure 6. Although the amplitude of the vagal pulse train is kept constant in the results depicted in Figures 4 and 6, the results shown in Figures 5 and 7 do not allow explicit separation between the effects of variation in the amplitude and the frequency of the vagal pulse train. This is because the coherence in Figures 5 and 7 corresponds to the vagal pulsatile input, whose net increase could arise from an increase in either the vagal frequency or the amplitude of a single pulse. Therefore, the coherence between the varying vagal frequency signal and the MAP needs to be examined. However, the typical variation in vagal frequency does not contain rich frequency content for coherence analysis. Simulations were performed by changing the vagal frequency randomly in the range of 5 to 9 Hz for durations of 30 seconds for each value. The coherence between this "input" and the 20 second MAP is calculated. For the same changes in the vagal frequency, Figure 9 compares the 10 second MAP with and without local reflex action. The difference in MAP response
for certain changes in the "vagal frequency" is due to the SIF cell modulation of the "phase-locking" behavior of the system, as discussed in the next section.

Figure 8: Time profiles of 5-HT and dopamine release by the SIF cells in response to the changes in the vagal frequency shown in Figure 6. The profiles that correspond to the case without local reflex are shown here only for comparison; the PN does not receive this neuromodulatory input when the local reflex action is not considered. Differences in the release profiles when the local reflex loop is closed and when it is open are due to different PN activity resulting in different atrial and arterial pressures. In both cases, the same vagal input pulse train is used.

Figure 10 shows that the dynamic relationship between changes in the vagal frequency and the MAP is more linear in the presence of dynamic neuromodulation by the local reflex. However, the actual value of the coherence is closer to 0 than to 1, indicating the extent of nonlinearity between vagal frequency and MAP.

4 Discussion
The results discussed in section 3 demonstrate that the local cardiac reflex modeled in this study has a linearizing effect on certain nonlinear cardiovascular dynamics. The nonlinearity here corresponds to the relationship
Figure 9: 5-HT and dopamine. Vagal frequency changes and the corresponding dynamics of the 10 second MAP are shown. The difference between the absence and presence of the local reflex is compared here. The vagal frequency is changed uniformly randomly between 5 and 9 Hz. These input-output data are well suited for coherence analysis to compare the input-output nonlinearity. The results are shown in Figure 10.
between the frequency of the vagal input pulse train and the mean arterial pressure, and the source of this nonlinearity has been analyzed. It arises because of phase locking between PN activity and the heartbeat cycle. The phase relationship between the PN activity and the heartbeat cycle is one of the critical variables in cardiac regulation. Figure 11 shows the phasic relationship among arterial pressure, the SA node cycle, and PN activity, and the effect of the local reflex on this relationship. In the absence of the local reflex, PN activity almost directly corresponds to the vagal input when the adaptation effects are not critical. Also, the PN activity and the SA node phase are in almost constant phase with each other. This appears to cause the heart cycle to be "locked in phase" with the vagal input pulse train, resulting in a nonlinear response of the MAP to vagal frequency changes. This is also demonstrated in Figures 4 and 6. This strong entrainment between the vagal rhythm and the SA node phase is consistent with the observations
2258
R. Vadigepalli, F. J. Doyle, and J. S. Schwaber
[Figure 10 appears here: coherence (0–1) as a function of frequency (0–0.3 Hz), with and without the local reflex.]

Figure 10: 5-HT and dopamine. Coherence between the vagal frequency and 10-second mean arterial pressure (MAP). Vagal frequency is changed as shown in Figure 9. Increased coherence for slow changes in vagal frequency is seen in the presence of local reflex neuromodulation, indicating attenuation of the nonlinearity between the vagal frequency and the MAP. However, the final response is not completely linear even in the presence of the local reflex action, as evidenced by the low values of the coherence (0.1–0.4).
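The coherence analysis behind Figure 10 can be sketched with a Welch-style magnitude-squared coherence estimate. The following is a minimal illustration, not the authors' analysis code: the input and pressure signals, segment length, and the static nonlinearity are all invented for demonstration. A linear input-output map yields coherence near 1 at all frequencies, while a strongly nonlinear map drives the estimate toward 0.

```python
import numpy as np

def msc(x, y, nperseg=256):
    """Magnitude-squared coherence, averaged over non-overlapping
    Hann-windowed segments (Welch-style estimate)."""
    x = x - x.mean()
    y = y - y.mean()
    w = np.hanning(nperseg)
    k = len(x) // nperseg
    Pxx = np.zeros(nperseg // 2 + 1)
    Pyy = np.zeros(nperseg // 2 + 1)
    Pxy = np.zeros(nperseg // 2 + 1, dtype=complex)
    for i in range(k):
        X = np.fft.rfft(w * x[i * nperseg:(i + 1) * nperseg])
        Y = np.fft.rfft(w * y[i * nperseg:(i + 1) * nperseg])
        Pxx += np.abs(X) ** 2
        Pyy += np.abs(Y) ** 2
        Pxy += X * np.conj(Y)
    return np.abs(Pxy) ** 2 / (Pxx * Pyy)

rng = np.random.default_rng(0)
u = rng.uniform(5.0, 9.0, 4096)           # "vagal frequency" input, 5-9 Hz as in Figure 9
map_lin = 92.0 - 2.0 * (u - 7.0)          # hypothetical linear pressure response
map_nl = 92.0 - 1.5 * (u - 7.0) ** 2      # hypothetical strongly nonlinear response

coh_lin = msc(u, map_lin)
coh_nl = msc(u, map_nl)
```

Low coherence for the quadratic map mirrors the interpretation in the text: coherence measures the fraction of output variance that is linearly related to the input at each frequency.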
in other models of SA rhythm (Demir et al., 1994; Dokos et al., 1996). The dynamic modulation of PNs by SIF cells attenuates this nonlinear response by modulating the phase characteristics of the PN response to the vagal input and hence the phase-locking behavior of the system. This can be seen in Figure 11, where there is no strong phase relationship between the PN activity and the SA node cycle. The results from section 3.2 show that modulation of the KAR channel by 5-HT results in linearization. However, increased 5-HT also increases adaptation through an increase of ICaL, and hence the net range of linearization by 5-HT alone is limited. In the current model, the interaction between the 5-HT and DA effects on the PN enhances the linearizing effect produced by 5-HT neuromodulation alone. This is due to the suppression of the adaptation characteristics by DA, so that the net effect of 5-HT is modulation of PN firing frequency and excitability through modulation of the anomalously rectifying potassium channel kinetics.

[Figure 11 appears here: time courses of arterial pressure, SAN phase, and PN activity over roughly 233–236 seconds, without (top) and with (bottom) the local reflex.]

Figure 11: 5-HT only. Comparison of the phase relationship between arterial pressure, SA node, and the PN activity. (Top) Without local reflex action. (Bottom) With local reflex neuromodulation. The phase-locking aspects of the local reflex are clearly seen by comparing the top and the bottom plots. The PN activity and the SA node phase are almost in constant phase when the local reflex is not considered (top plot); the PN activity almost always occurs within the initial phase of the SA node cycle. However, this phase locking is not as pronounced in the case with the local reflex, as there is no consistent phase relationship between PN activity and the beat-to-beat SA node cycle (bottom plot).

However, dopamine modulation of the L-type calcium channel alone is not sufficient to fine-tune the interspike interval of the PN to effect nonlinear attenuation: an increase in coherence is not observed in simulations performed with dopamine modulation only. Further incorporation of other ion channels that are potentially modulated by neurotransmitters could produce similar effects through dynamic interactions. The simulations illustrate that the SIF cells, through release of the neurotransmitters 5-HT and DA, can regulate the ways in which the PN processes synaptic inputs from the vagus nerve and translates this input signal into an output. The modulation of firing frequency and adaptation characteristics provides mechanisms for the PN to amplify, attenuate, or dynamically
modulate the input signal. The second-messenger pathways involved in this regulation are crucial to the functioning of the local reflex. These results demonstrate that it is necessary to take into account the indirect effects of the second messengers that occur through the dynamic interactions of voltage-dependent and calcium-dependent mechanisms regulating other, unmodulated ion channels. Incorporation of other modulatory pathways involving release of other neurotransmitters, such as galanin, calcitonin gene-related peptide, and substance P, might provide more degrees of freedom for the local reflex to fine-tune the PN response to the vagal input. The previous local reflex model of Cheng, Kwatra, et al. (1997) produced results consistent with the Bainbridge reflex (Bainbridge, 1915). In this reflex, venous infusion causes atrial stretch, which in turn increases heart rate and arterial blood pressure. In the local reflex model of Cheng, Kwatra, et al. (1997), this is due to the positive feedback effect of 5-HT neuromodulation. The reduction in PN activity as a result of neuromodulation by 5-HT is the primary aspect of the positive feedback. Our model extends these results in that both the net activity and the phase-related dynamics of the PN form important aspects of the positive feedback dynamics of the local reflex. In the current model, 5-HT has a positive feedback effect, whereas dopamine modulation results in negative feedback. This dynamic interplay between opposing feedback effects is crucial to the resulting nonlinear attenuation. Incorporation of additional neuromodulatory dynamics would provide for more complex interactions and, hence, a more complex integrating role for the local cardiac reflex. As long as the model structure is preserved, the general principles uncovered by this study would remain valid even with alternate sets of parameters: only the quantitative details of the simulations would vary (Vadigepalli, Doyle, & Schwaber, 2000).
In addition, the model results are not strongly dependent on the choice of anatomical assignment, as long as the fundamental modulatory relationships remain. Figure 12 shows the increase in coherence in the presence of the local reflex as a function of two key parameters, KSIF,5HT and KSIF,DA. These SIF cell “gain” parameters are varied by ±50% from their nominal values given in Table 1. Note that the coherence consistently increases when local reflex neuromodulation is present, even though the quantitative amount of increase varies. In this study, we explore the functional consequences of the dynamic interaction of these modulatory mechanisms. The essential elements are the dynamic release of neurotransmitters by the SIF cells, the direct or indirect effect of these neurotransmitters on PN dynamics, and the phase-related interaction of the SA node and the PN activity. In the local reflex system, the phase-dependent interaction of the various components of the reflex plays a crucial role. The modulation of the adaptation and excitability of the PN directly affects the phasic behavior of the overall local reflex system. Thus, the dynamic release of neurotransmitters by the SIF cells provides a mechanism for control over certain cardiac dynamics.
[Figure 12 appears here: surface plot of the increase in coherence (ΔCoherence, 0–0.3) as a function of 100 × KSIF,5HT and KSIF,DA, each varied over roughly 0.3–0.7.]

Figure 12: Increase in coherence in the presence of the local reflex action is shown as a function of the SIF cell “gain” parameters KSIF,5HT and KSIF,DA. Up to 50% change from the nominal values is considered. The increase in coherence is consistent, albeit at different levels for different parameter sets.
The baroreflex control of the cardiovascular system is more linear than the control due to the individual reflexes that participate in the overall regulation (Sagawa, 1983). The counteraction from parallel feedback reflexes results in this reduction in the nonlinearity of the overall system. Sequential opening of the various feedback loops results in increased nonlinearity between cardiac variables such as carotid sinus pressure and arterial pressure. The linearizing local cardiac reflex presented in this study is consistent with these experimental observations and the analysis summarized in Sagawa (1983). One well-studied example of a linearizing reflex is the vestibulo-ocular reflex (VOR) (Robinson, 1986; Smith & Galiana, 1991). This reflex is part of the oculomotor system and produces involuntary eye movements during head rotation. The VOR behaves much more linearly than expected from its components, and the linearization arises out of the dynamics of interaction among the various neuron subsystems participating in the reflex (Smith & Galiana, 1991). In the local cardiac reflex system discussed in this study, the linearization is likewise achieved through reflex dynamics.
Table 1: Parameters Used in the Local Reflex Model.

  Quantity      Value              |  Quantity      Value
  Cm            0.25 nF            |  [Ca2+]in0     5e-5 mM
  ḡNa           3 μS               |  u             2.5e-4 nl
  ḡDR           0.9 μS             |  [B]total      0.03 mM
  ḡAHP          0.15 μS            |  kb/kf         0.001 mM
  ḡA            0.15 μS            |  kadc          0.6e-6 mM/ms
  ḡCaL          0.001 μS           |  Kmod          1.5
  ḡCaN          0.002 μS           |  νpde          2.4e-6 mM/ms
  ḡAR           0.09 μS            |  Kpde          3e-3 mM
  ḡleak         0.05 μS            |  Elv,max       140 mmHg·kg/ml
  ENa           55 mV              |  Elv,min       6.5 mmHg·kg/ml
  EK            −94 mV             |  Ea            100 mmHg·kg/ml
  ESynE         10 mV              |  Ev            1.25 mmHg·kg/ml
  Eleak         −43 mV             |  TPR           48 mmHg·kg·s/ml
  τE            30 ms              |  Ra            0.025 · TPR
  KAR,mod       1.5                |  Rv            0.01 · TPR
  KAR,cAMP      1e-3 mM            |  Rc            0.965 · TPR
  DAR,cAMP      0.4e-3 mM          |  Vus0          60 ml/kg
  KL,cAMP       2.5e-3 mM          |  HR0           7.0 Hz
  DL,cAMP       0.4e-3 mM          |  Ts            0.5 · HR0
  KL,mod        10                 |  KSIF,5HT      5e-3 mM/mmHg
  Z             2                  |  τ5HT          1.5 sec
  F             96480 C/mol        |  KSIF,DA       5e-1 mM/mmHg
  R             8314 J/(kmol·K)    |  τDA           3 sec
  T             308 K              |  Tavg          10 sec
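For quick numerical sketches of the model components, a subset of the Table 1 values can be transcribed into plain Python dictionaries. This is only a convenience transcription (values as printed, units in comments); the grouping into "PN" and "SIF" dictionaries is ours, not the paper's.

```python
# Subset of Table 1: principal-neuron (PN) electrical parameters.
PN_PARAMS = {
    "Cm": 0.25,       # nF
    "g_Na": 3.0,      # uS
    "g_DR": 0.9,      # uS
    "g_leak": 0.05,   # uS
    "E_Na": 55.0,     # mV
    "E_K": -94.0,     # mV
    "E_leak": -43.0,  # mV
    "tau_E": 30.0,    # ms
}

# Subset of Table 1: SIF cell "gain" and time-constant parameters.
SIF_PARAMS = {
    "K_SIF_5HT": 5e-3,  # mM/mmHg
    "tau_5HT": 1.5,     # sec
    "K_SIF_DA": 5e-1,   # mM/mmHg
    "tau_DA": 3.0,      # sec
}
```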
The concept of input-output linearizing “controllers” is well known in systems engineering (Isidori, 1995) and has been extended to linearize nonlinear actuators such as control valves (Kayihan & Doyle, 2000). The results of this study suggest that biological systems use such linearizing loops in the autonomic regulation of the cardiovascular system. In process systems, the utility of linearization typically lies in simplifying the controller design. The degree of linearization that is significant in feedback controller design has been studied in chemical process systems (Stack & Doyle, 1997). It has been shown that seemingly small changes in the coherence (or degree of linearization) can have a significant effect on the complexity of the feedback controller that achieves a given performance. Under the assumption that the objective of feedback control is not significantly nonlinear, the complexity of the feedback controller is directly related to the nonlinearity of the system that it is designed to regulate: the less nonlinear the regulated system, the less complex the feedback controller needs to be. From the control engineering perspective, reduction of the nonlinearity of the plant is naturally guided by this incentive in feedback controller design. The linearization result of this study is significant in this context. Our interpretation of the significance of the degree of linearization effected by the local reflex is guided by the results from nonlinearity analysis in chemical process control. These results support the general intuition regarding the intracardiac ganglion: that it must be performing nontrivial integration of signals at the principal-neuron level. The role of neuromodulators in “placing” the activity of neurons at different states of activity has been studied extensively (Canavier, Baxter, Clark, & Byrne, 1994). This effect is typically invoked in the cognitive context involving plasticity, long-term memory, and so on (Byrne, Zwartjes, Homayouni, Critz, & Eskin, 1993). Our approach has been to interpret the effect of neuromodulation in terms of the signal processing occurring in the system and its effect on feedback control. It is well established that neurons are nonlinear and nontrivial integrators of input signals. The results of this study indicate that dynamic neuromodulation not only modulates the nonlinear response at the neuronal level but could do so in such a way as to modulate the overall nonlinearity of the system significantly. This is a novel hypothesis of neuromodulatory function that needs to be validated through extensive experimental studies. The connectivity in the model of the local cardiac reflex presented in this study is based on the anatomical results of Cheng, Powley, et al. (1997). However, experimental evidence also points to other types of local reflexes involving both baro- and chemoreceptors (Hardwick et al., 1995) and interacting parasympathetic and sympathetic neurons in cardiac ganglia (Neel & Parsons, 1986b). It is plausible that different organisms use differing mechanisms to achieve a similar objective. Whether that is true for the local cardiac reflex can be known only through extensive modeling and simulation studies that complement the experimental observations. Our efforts in local reflex modeling and analysis are a step in that direction.

5 Conclusion
A model of the local cardiac reflex controlling heart rate has been presented. The simulation results indicate that dynamic neuromodulation in the local reflex results in attenuation of certain input-output nonlinearities in the cardiovascular dynamics. These results are in agreement with the idea that the local cardiac loop uses local information efficiently to facilitate smooth central regulation of cardiovascular dynamics.

Appendix
This appendix contains the equations and values of parameters for the components that constitute the local cardiac reflex model.

A.1 Principal Neuron Model. In these equations, time is in msec, current is in nA, potential is in mV, concentrations are in mM, and temperature is in K.
Membrane potential, V:
$$\frac{dV}{dt} = \big( I_{Na} + I_{DR} + I_{AR} + I_{A} + I_{AHP} + I_{CaL} + I_{CaN} + I_{leak} + I_{SynE} \big) / C_m$$

Fast sodium, Na:
$$I_{Na} = \bar{g}_{Na} \cdot m_{Na} \cdot h_{Na} \cdot (E_{Na} - V)$$
$$m_{\infty Na} = \frac{\alpha_{mNa}}{\alpha_{mNa} + \beta_{mNa}}, \qquad \tau_{mNa} = \frac{1}{\alpha_{mNa} + \beta_{mNa}}$$
$$\alpha_{mNa} = \frac{0.091\,(V+38)}{1 - \exp(-(V+38)/5)}, \qquad \beta_{mNa} = \frac{-0.062\,(V+38)}{1 - \exp((V+38)/5)}$$
$$h_{\infty Na} = \frac{\alpha_{hNa}}{\alpha_{hNa} + \beta_{hNa}}, \qquad \tau_{hNa} = \frac{1}{\alpha_{hNa} + \beta_{hNa}}$$
$$\alpha_{hNa} = 0.016\,\exp(-(V+55)/15), \qquad \beta_{hNa} = \frac{2.07}{1 + \exp(-(V-17)/21)}$$

Potassium delayed rectifier, KDR:
$$I_{DR} = \bar{g}_{DR} \cdot m_{DR}^{4} \cdot (E_K - V)$$
$$m_{\infty DR} = \frac{\alpha_{mDR}}{\alpha_{mDR} + \beta_{mDR}}, \qquad \tau_{mDR} = \frac{1}{\alpha_{mDR} + \beta_{mDR}}$$
$$\alpha_{mDR} = \frac{0.01\,(V+45)}{1 - \exp(-(V+45)/5)}, \qquad \beta_{mDR} = 0.17\,\exp(-(V+50)/40)$$

Calcium-dependent potassium, KAHP:
$$I_{AHP} = \bar{g}_{AHP} \cdot m_{AHP}^{2} \cdot (E_K - V)$$
$$m_{\infty AHP} = \frac{1.25 \times 10^{8}\,[\mathrm{Ca}^{2+}]_{in}^{2}}{1.25 \times 10^{8}\,[\mathrm{Ca}^{2+}]_{in}^{2} + 2.5}, \qquad \tau_{mAHP} = \frac{10^{3}}{1.25 \times 10^{8}\,[\mathrm{Ca}^{2+}]_{in}^{2} + 2.5}$$
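The gating equations above all share the Hodgkin-Huxley form m∞ = α/(α+β), τ = 1/(α+β). A minimal sketch for the Na activation gate follows, using the rate expressions as reconstructed above (the removable singularity at V = −38 mV is replaced by its analytic limit); this is an illustration, not the authors' simulation code.

```python
import math

def alpha_m(V):
    # alpha_mNa = 0.091 (V+38) / (1 - exp(-(V+38)/5)); limit 0.091*5 at V = -38
    x = V + 38.0
    return 0.091 * 5.0 if abs(x) < 1e-9 else 0.091 * x / (1.0 - math.exp(-x / 5.0))

def beta_m(V):
    # beta_mNa = -0.062 (V+38) / (1 - exp((V+38)/5)); limit 0.062*5 at V = -38
    x = V + 38.0
    return 0.062 * 5.0 if abs(x) < 1e-9 else -0.062 * x / (1.0 - math.exp(x / 5.0))

def m_inf(V):
    return alpha_m(V) / (alpha_m(V) + beta_m(V))

def tau_m(V):
    return 1.0 / (alpha_m(V) + beta_m(V))   # ms

def gate_step(m, V, dt):
    # exponential-Euler update of a first-order gate relaxing toward m_inf(V)
    return m_inf(V) + (m - m_inf(V)) * math.exp(-dt / tau_m(V))
```

The same three-function pattern (steady state, time constant, exponential update) applies to every gate in the model, with only the rate expressions changing.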
Transient potassium-A, KA:
$$I_{A} = \bar{g}_{A} \cdot \big( 0.6\, m_{A1}^{4}\, h_{A1} + 0.4\, m_{A2}^{4}\, h_{A2} \big) \cdot (E_K - V)$$
$$m_{\infty A1} = \frac{1}{1 + \exp(-(V+60)/8.5)}$$
$$\tau_{mA1} = \frac{1}{\exp((V+35.82)/19.69) + \exp(-(V+79.69)/12.7)} + 0.37$$
$$h_{\infty A1} = \frac{1}{1 + \exp((V+78)/6)}$$
If $V \ge -63$, then $\tau_{hA1} = 19.0$; otherwise
$$\tau_{hA1} = \frac{1}{\exp((V+46.05)/5) + \exp(-(V+238.4)/37.45)} + 0.37$$
$$m_{\infty A2} = \frac{1}{1 + \exp(-(V+36)/20)}$$
$$\tau_{mA2} = \frac{1}{\exp((V+35.82)/19.69) + \exp(-(V+79.69)/12.7)} + 0.37$$
$$h_{\infty A2} = \frac{1}{1 + \exp((V+78)/6)}$$
If $V \ge -73$, then $\tau_{hA2} = 60.0$; otherwise
$$\tau_{hA2} = \frac{1}{\exp((V+46.05)/5) + \exp(-(V+238.4)/37.45)} + 0.37$$
Inwardly (anomalously) rectifying potassium, KAR:
$$I_{AR} = \bar{g}_{AR} \cdot F_{AR,mod} \cdot \frac{5.66\,(E_K - V)}{1 + \exp\!\big( (V - E_K + 15.3)\,ZF/(RT) \big)}$$
$$F_{AR,mod} \equiv 1 + \frac{K_{AR,mod}}{1 + \exp\!\big( -([\mathrm{cAMP}] - K_{AR,cAMP}) / D_{AR,cAMP} \big)}$$

L-type calcium, CaL:
$$I_{CaL} = \bar{g}_{CaL} \cdot F_{L,mod} \cdot m_{CaL}^{2} \cdot \big( E_{Ca}([\mathrm{Ca}^{2+}]_{in}) - V \big)$$
$$F_{L,mod} = \left[ \frac{K_{DA}}{[\mathrm{DA}] + K_{DA}} \right] \left( 1 + \frac{K_{L,mod}}{1 + \exp\!\big( -([\mathrm{cAMP}] - K_{L,cAMP}) / D_{L,cAMP} \big)} \right)$$
$$m_{\infty CaL} = \frac{\alpha_{mCaL}}{\alpha_{mCaL} + \beta_{mCaL}}, \qquad \tau_{mCaL} = \frac{1}{\alpha_{mCaL} + \beta_{mCaL}}$$
$$\alpha_{mCaL} = \frac{1.6}{1 + \exp(-0.072\,(V-5))}, \qquad \beta_{mCaL} = \frac{0.02\,(V-1.31)}{\exp((V-1.31)/5.36) - 1}$$

N-type calcium, CaN:
$$I_{CaN} = \bar{g}_{CaN} \cdot m_{CaN} \cdot \big( 0.55\, h_{CaN1} + 0.45\, h_{CaN2} \big) \cdot \big( E_{Ca}([\mathrm{Ca}^{2+}]_{in}) - V \big)$$
$$m_{\infty CaN} = \frac{1.0}{1 + \exp(-(V+20)/4.5)}, \qquad \tau_{mCaN} = 0.364\,\exp\!\big( -0.042^{2}\,(V+31)^{2} \big) + 0.442$$
$$h_{\infty CaN1} = \frac{1.0}{1 + \exp((V+20)/25)}, \qquad \tau_{hCaN1} = 3.752\,\exp\!\big( -0.0395^{2}\,(V+30)^{2} \big) + 0.56$$
$$h_{\infty CaN2} = \frac{0.2}{1 + \exp(-(V+40)/10)} + \frac{1.0}{1 + \exp((V+20)/40)}$$
$$\tau_{hCaN2} = 25.2\,\exp\!\big( -0.0275^{2}\,(V+40)^{2} \big) + 8.4$$
$$E_{Ca}\big([\mathrm{Ca}^{2+}]_{in}\big) = 13.27\,\log\!\big( 4.0 / [\mathrm{Ca}^{2+}]_{in} \big)$$
Excitatory synapse, I_SynE:
$$\tau_E\,\frac{d g_{SynE}}{dt} = -g_{SynE} + I_{vagal}$$

Leak current, I_leak:
$$I_{leak} = \bar{g}_{leak} \cdot (E_{leak} - V)$$

Calcium regulation, [Ca²⁺]_in:
$$\frac{d[\mathrm{Ca}^{2+}]_{in}}{dt} = \frac{I_{Ca}\,(1 - P_B)}{2\,F\,u} + \frac{[\mathrm{Ca}^{2+}]_{in0} - [\mathrm{Ca}^{2+}]_{in}}{\tau_{pump}(V)}$$
$$I_{Ca} = I_{CaL} + I_{CaN}$$
$$P_B = \frac{[B]_{total}}{[\mathrm{Ca}^{2+}] + k_b/k_f + [B]_{total}}$$
$$\tau_{pump}(V) = 17.7\,\exp(V/35)$$

cAMP regulation, [cAMP]:
$$\frac{d[\mathrm{cAMP}]}{dt} = k_{adc}\left( 1 + K_{mod}\,\frac{[\text{5-HT}]}{[\text{5-HT}] + K_{5HT}} \right) - \nu_{pde}\,\frac{[\mathrm{cAMP}]}{[\mathrm{cAMP}] + K_{pde}}$$
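The cAMP balance above can be integrated directly to see the 5-HT effect: production rises with 5-HT while Michaelis-Menten degradation saturates, so [cAMP] settles at a higher level. The sketch below uses the Table 1 values; K5HT does not appear in the printed table, so the value here (1e-3 mM) is an illustrative assumption.

```python
# Table 1 values: adenylyl-cyclase rate, 5-HT gain, PDE rate, PDE half-saturation.
k_adc, K_mod, nu_pde, K_pde = 0.6e-6, 1.5, 2.4e-6, 3e-3   # mM/ms, -, mM/ms, mM
K_5HT = 1e-3                                              # mM (assumed; not in Table 1)

def dcAMP_dt(c, sero):
    """Right-hand side of the cAMP balance for [cAMP] = c, [5-HT] = sero (mM)."""
    production = k_adc * (1.0 + K_mod * sero / (sero + K_5HT))
    degradation = nu_pde * c / (c + K_pde)
    return production - degradation

# With no 5-HT the fixed point is c* = k_adc*K_pde/(nu_pde - k_adc) = 1e-3 mM.
# A sustained 5-HT signal raises production, and [cAMP] climbs toward a new level.
c = 1e-3
for _ in range(5000):                 # 5000 ms of forward Euler with dt = 1 ms
    c += 1.0 * dcAMP_dt(c, sero=1e-2)
```

The rise in [cAMP] is what feeds the modulation factors F_AR,mod and F_L,mod above, increasing I_AR and I_CaL in the model.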
A.2 Heart and Peripheral Vasculature Model. In these equations, time is in sec, pressure is in mmHg, and volume is in ml/kg.

$$\frac{dV_{lv}}{dt} = \frac{\mathrm{ifpos}\big( E_v \tilde{V}_v - E_{lv} V_{lv} \big)}{R_v} - \frac{\mathrm{ifpos}\big( E_{lv} V_{lv} - E_a V_a \big)}{R_a}$$
$$\frac{dV_a}{dt} = \frac{\mathrm{ifpos}\big( E_{lv} V_{lv} - E_a V_a \big)}{R_a} - \frac{\mathrm{ifpos}\big( E_a V_a - E_v \tilde{V}_v \big)}{R_c}$$
$$\frac{dV_v}{dt} = \frac{\mathrm{ifpos}\big( E_a V_a - E_v \tilde{V}_v \big)}{R_c} - \frac{\mathrm{ifpos}\big( E_v \tilde{V}_v - E_{lv} V_{lv} \big)}{R_v}$$
$$\tilde{V}_v = V_v - V_{us0}$$
$$\mathrm{ifpos}(x) = \begin{cases} 0 & (x \le 0) \\ x & (x > 0) \end{cases}$$
$$E_{lv}(t) = E_{lv,min} + \big[ E_{lv,max} - E_{lv,min} \big]\, e(t)$$
$$e(t) = \begin{cases} \dfrac{1 - \cos\!\big( 2\pi\, W_{SAN} / (T_s/HR) \big)}{2} & \big( 0 \le W_{SAN} \le T_s/HR \big) \\[2mm] 0 & \big( T_s/HR \le W_{SAN} \le 1 \big) \end{cases}$$

If $E_a V_a < E_{lv} V_{lv}$, then the arterial pressure $P_a = E_{lv} V_{lv}$; otherwise $P_a = E_a V_a$. Mean arterial pressure (MAP):
$$\mathrm{MAP} = \frac{1}{T_{avg}} \int_0^{T_{avg}} P_a\, dt$$
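The valve-like rectification ifpos(·) and the time-varying left-ventricular elastance can be sketched directly. The values below are the nominal Table 1 entries, with Ts/HR = 0.5 at the nominal heart rate; this is an illustration of the two functions, not a full circulation simulation.

```python
import math

E_LV_MAX, E_LV_MIN = 140.0, 6.5   # mmHg*kg/ml, from Table 1
FRAC_SYS = 0.5                    # Ts/HR at the nominal Table 1 values

def ifpos(x):
    """One-way (valve-like) flow: only forward pressure drops drive flow."""
    return x if x > 0.0 else 0.0

def e_act(W):
    """Normalized elastance activation over one SA-node cycle, W in [0, 1]."""
    if 0.0 <= W <= FRAC_SYS:
        return 0.5 * (1.0 - math.cos(2.0 * math.pi * W / FRAC_SYS))
    return 0.0

def E_lv(W):
    """Time-varying left-ventricular elastance E_lv(t) via the cycle phase W."""
    return E_LV_MIN + (E_LV_MAX - E_LV_MIN) * e_act(W)

# Elastance sampled over one normalized cycle: peaks mid-systole, rests at E_lv,min.
elastance = [E_lv(i / 100.0) for i in range(101)]
```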
A.3 SIF Cell Model. In these equations, concentration is in mM and time is in sec. The linear ordinary differential equation model of the SIF cell is given as
$$\tau_{5HT}^{2}\,\frac{d^{2}[\text{5-HT}]}{dt^{2}} + 2\tau_{5HT}\,\frac{d[\text{5-HT}]}{dt} + [\text{5-HT}] = K_{SIF,5HT} \cdot \Delta atbract$$
$$\tau_{DA}^{2}\,\frac{d^{2}[\mathrm{DA}]}{dt^{2}} + 2\tau_{DA}\,\frac{d[\mathrm{DA}]}{dt} + [\mathrm{DA}] = K_{SIF,DA} \cdot \Delta atbract,$$
where $atbract$ is the atrial baroreceptor activity and $\Delta atbract$ represents the activity above its nominal value.
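The second-order SIF-cell equation is critically damped (repeated pole at −1/τ), so a sustained step Δatbract = 1 mmHg should drive [5-HT] to KSIF,5HT · Δatbract = 5e-3 mM without overshoot. A minimal forward-Euler integration of the equation rewritten as two first-order ODEs:

```python
# tau_5HT (sec) and K_SIF,5HT (mM/mmHg) from Table 1.
TAU, K = 1.5, 5e-3

def step_response(t_end=15.0, dt=1e-3, u=1.0):
    """Integrate tau^2 y'' + 2 tau y' + y = K u for a step input u."""
    y, v = 0.0, 0.0                       # [5-HT] and its time derivative
    for _ in range(int(round(t_end / dt))):
        a = (K * u - y - 2.0 * TAU * v) / TAU ** 2
        y += dt * v
        v += dt * a
    return y

y_final = step_response()                 # ~K*u = 5e-3 mM after 10 time constants
```

The same integration with TAU = 3.0 and K = 5e-1 gives the [DA] response; the larger DA gain and slower time constant are what set up the opposing 5-HT/DA feedback discussed in section 4.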
A.4 Parameters. The values of the parameters used in the local reflex model are given in Table 1. With the exception of the parameters in the SIF cell model, i.e., KSIF,5HT, τ5HT, KSIF,DA, and τDA, all parameter values given in Table 1 are based on the cited literature.

Acknowledgments
We gratefully acknowledge support from the U.S. Office of Naval Research (grant N00014-96-1-0695) and the National Science Foundation (grant BES-9896061). We acknowledge fruitful discussions with Robert F. Rogers, Thomas Jefferson University, Philadelphia, PA, on interpreting the results presented in this work.

References

Adams, D. J., & Harper, A. A. (1995). Electrophysiological properties of autonomic ganglion neurons. In E. M. McLachlan (Ed.), Autonomic ganglia (pp. 153–212). Philadelphia: Harwood Academic Publishers.
Allison, J., Sagawa, K., & Kumada, M. (1969). An open-loop analysis of the aortic arch barostatic reflex. Am. J. Physiol., 217, 1576–1584.
Bainbridge, F. A. (1915). The influence of venous filling upon the rate of the heart. J. Physiol., 50, 65–84.
Benignus, V. (1969). Estimation of the coherence spectrum and its confidence interval using the fast Fourier transform. IEEE Trans. Audio and Electroacoustics, AU-17, 145–150.
Burnstock, G. (1995). Preface to the series: Historical and conceptual perspective of the autonomic nervous system book series. In E. M. McLachlan (Ed.), Autonomic ganglia (pp. vii–xi). Philadelphia: Harwood Academic Publishers.
Butera, R. J., Clark, J. W., Canavier, C. C., Baxter, D. A., & Byrne, J. H. (1995). Analysis of the effects of modulatory agents on a modeled bursting neuron: Dynamic interactions between voltage and calcium dependent systems. J. Comp. Neurosci., 2, 19–44.
Byrne, J. H., Zwartjes, R., Homayouni, R., Critz, S. D., & Eskin, A. (1993). Roles of second messenger pathways in neuronal plasticity and in learning and memory. In S. Shenolikar & A. C. Nairn (Eds.), Advances in second messenger and phosphoprotein research (pp. 47–107). New York: Raven Press.
Canavier, C. C., Baxter, D. A., Clark, J. W., & Byrne, J. H. (1994). Multiple modes of activity in a model neuron suggest a novel mechanism for the effects of neuromodulators. J. Neurophysiol., 72(2), 872–882.
Carter, G. C. (Ed.). (1993). Coherence and time delay estimation. New York: IEEE Press.
Chen, H., Chai, Y., Tung, C., & Chen, H. (1979). Modulation of the carotid baroreflex function during volume expansion. Am. J. Physiol., 237, H153–H158.
Cheng, Z., Kwatra, H. S., Schwaber, J. S., Powley, T. L., & Doyle III, F. J. (1997). A simulation model for putative neuromodulatory mechanisms in local cardiac control. In Proceedings of the 19th Annual Conference of the IEEE Engineering in Medicine and Biology Society (pp. 2176–2179). Chicago.
Cheng, Z., Powley, T. L., Schwaber, J. S., & Doyle III, F. J. (1997). Vagal afferent innervation of the atria of the rat heart reconstructed with laser confocal microscopy. J. Comp. Neurology, 381, 1–17.
Clairambault, J. (1995). A model of the autonomic control of heart rate at the pacemaker cell level through G-proteins. In Proceedings of the 17th Annual Conference of the IEEE Engineering in Medicine and Biology Society, no. 105. Montreal, Canada.
Demir, S. S., Clark, J. W., Murphey, C. R., & Giles, W. (1994). A mathematical model of a rabbit sinoatrial node cell. Am. J. Physiol., 266, C832–C852.
Dexter, F., Levy, M., & Rudy, Y. (1989). Mathematical model of the changes in heart rate elicited by vagal stimulation. Circ. Res., 65(5), 1330–1339.
Dokos, S., Celler, B. G., & Lovell, N. H. (1996). Vagal control of sinoatrial rhythm: A mathematical model. J. Theor. Biol., 182(1), 21–44.
Dolphin, A. C. (1990). G protein modulation of calcium currents in neurons. Annu. Rev. Physiol., 52, 243–255.
Hardwick, J. C., Mawe, G. M., & Parsons, R. L. (1995). Evidence for afferent fiber innervation of parasympathetic neurons of the guinea-pig cardiac ganglion. J. Aut. Nerv. Sys., 53, 166–174.
Hassall, C. J. S., James, S., & Burnstock, B. (1995). A subpopulation of intracardiac neurons from the guinea pig heart expresses substance P binding sites. Cardioscience, 6, 157–163.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Isidori, A. (1995). Nonlinear control systems (3rd ed.). New York: Springer-Verlag.
Jeong, S., & Wurster, R. D. (1997). Calcium channel currents in acutely dissociated intracardiac neurons from adult rats. J. Neurophysiol., 77(4), 1769–1778.
Karemaker, J. M. (1987). Neurophysiology of the baroreceptor reflex. In R. I. Kitney & O. Rompelman (Eds.), The beat-by-beat investigation of cardiovascular function (pp. 27–49). Oxford: Clarendon Press.
Kayihan, A., & Doyle III, F. (2000). Friction compensation for a process control valve. Cont. Eng. Prac., 8, 799–812.
Korner, P. (1979). Central nervous control of autonomic cardiovascular function. In R. Berne & N. Sperelakis (Eds.), Handbook of physiology: The cardiovascular system: The heart (pp. 691–739). Bethesda, MD: American Physiological Society.
Kriebel, R., Angel, A., & Parsons, R. (1991). Biogenic amine localization in cardiac ganglion intrinsic neurons: Electron microscope histochemistry of SIF cells. Brain Res. Bulletin, 27, 175–179.
Ljung, L. (1991). System identification toolbox user's guide. Natick, MA: MathWorks.
Lotshaw, D. P., Levitan, E. S., & Levitan, I. B. (1986). Fine tuning of neuronal electrical activity: Modulation of several ion channels by intracellular messengers in a single identified nerve cell. J. Exp. Biol., 124, 307–322.
Mary, D. A. S. G. (1987). Electrophysiology of atrial receptors. In R. Hainsworth, P. N. McWilliam, & D. A. S. G. Mary (Eds.), Cardiogenic reflexes (pp. 3–17). Oxford: Oxford University Press.
Massari, V. J., Johnson, T. A., & Gatti, P. J. (1995). Cardiotropic organization of the nucleus ambiguus? An anatomical and physiological analysis of neurons regulating atrioventricular conduction. Brain Res., 679, 227–240.
McAllen, R. M., & Spyer, K. M. (1978). The baroreceptor input to cardiac vagal motoneurones. J. Physiol., 282, 365–374.
McCormick, D. A., & Huguenard, J. R. (1992). A model of the electrophysiological properties of thalamocortical relay neurons. J. Neurophysiol., 68, 1384–1400.
Neel, D. S., & Parsons, R. L. (1986a). Anatomical evidence for the interaction of sympathetic and parasympathetic neurons in the cardiac ganglion of Necturus. J. Aut. Nerv. Sys., 15, 297–308.
Neel, D. S., & Parsons, R. L. (1986b). Catecholamine, serotonin and substance P-like peptide containing intrinsic neurons in the mudpuppy parasympathetic cardiac ganglion. J. Neurosci., 6, 1970–1975.
Paintal, A. S. (1979). Electrophysiology of atrial receptors. In R. Hainsworth, C. Kidd, & R. J. Linden (Eds.), Cardiac receptors (pp. 73–88). Cambridge: Cambridge University Press.
Parsons, R. L., & Neel, D. S. (1987). Distribution of calcitonin gene-related peptide immunoreactive nerve fibers in the mudpuppy cardiac septum. J. Aut. Nerv. Sys., 21, 135–143.
Parsons, R., Neel, D., Konopka, L., & McKeon, T. (1989). The presence and possible role of a galanin-like peptide in the mudpuppy heart. Neuroscience, 29, 749–759.
Parsons, R. L., Neel, D. S., McKeon, T. W., & Carraway, R. E. (1987). Organization of a vertebrate cardiac ganglion: A correlated biochemical and histochemical study. J. Neurosci., 7, 837–846.
Pearson, R. K., & Ogunnaike, B. A. (1994). Detection of unmodelled disturbance effects by coherence analysis. In IFAC Symposium on Advanced Control of Chemical Processes (pp. 458–463). Kyoto, Japan.
Robinson, D. A. (1986). The systems approach to the oculomotor system. Vision Res., 26(1), 91–99.
Rose, W. C., & Schwaber, J. S. (1996). Analysis of heart rate-based control of arterial blood pressure. Am. J. Physiol., 271, H812–H822.
Rose, W., & Shoukas, A. (1993). Two-port analysis of systemic venous and arterial impedances. Am. J. Physiol., 265, H1577–H1587.
Sagawa, K. (1983). Baroreflex control of systemic arterial pressure and vascular bed. In K. Berne & N. Sperelakis (Eds.), Handbook of physiology: The cardiovascular system (pp. 453–496). Bethesda, MD: American Physiological Society.
Sagawa, K., Kumada, M., & Shoukas, A. (1973). A systems approach to baroreceptor reflex control of the circulation. Aust. J. Exp. Biol. Med. Sci., 51, 33–52.
Schwaber, J. S., Graves, E. B., & Paton, J. F. R. (1993). Computational modeling of neuronal dynamics for systems analysis: Application to neurons of the cardiorespiratory NTS in the rat. Brain Res., 604, 126–141.
Smith, H. L. H., & Galiana, H. L. (1991). The role of structural symmetry in linearizing ocular reflexes. Biol. Cyber., 65(1), 11–22.
Sosunov, A., Hassall, C., Loesch, A., Turmaine, M., & Burnstock, G. (1996). Nitric oxide synthase-containing neurones and nerve fibres within cardiac ganglia
of rat and guinea-pig: An electron-microscopic immunocytochemical study. Cell and Tissue Res., 284, 19–28.
Sosunov, A., Hassall, C., Loesch, A., Turmaine, M., Fehér, E., & Burnstock, G. (1997). Neuropeptide Y-immunoreactive intracardiac neurones, granule-containing cells and nerves associated with ganglia and blood vessels in rat and guinea-pig heart. Cell and Tissue Res., 289, 445–454.
Stack, A. J., & Doyle III, F. J. (1997). Application of a control-law nonlinearity measure to a chemical reactor. AIChE J., 43, 425–447.
Standish, A., Enquist, L. W., & Schwaber, J. S. (1994). Innervation of the heart and its central medullary origin defined by viral tracing. Science, 263, 232–234.
Steele, P. A., Gibbins, I. L., Morris, J. L., & Mayer, B. (1994). Multiple populations of neuropeptide-containing intrinsic neurons in the guinea-pig heart. Neuroscience, 62, 241–250.
Tanaka, K., Hassall, C., & Burnstock, G. (1993). Distribution of intracardiac neurones and nerve terminals that contain a marker for nitric oxide, NADPH-diaphorase, in the guinea-pig heart. Cell and Tissue Res., 273, 293–300.
Trautwein, W., & Hescheler, J. (1990). Regulation of cardiac L-type calcium current by phosphorylation and G proteins. Annu. Rev. Physiol., 52, 257–274.
Vadigepalli, R., Doyle III, F. J., & Schwaber, J. S. (2000). Modeling and analysis of local cardiac control in the rat. FASEB J., 14, A376.
Weng, G., Bhalla, U. S., & Iyengar, R. (1999). Complexity in biological signalling systems. Science, 284(5411), 92–96.
Xi-Moy, S. X., & Dun, N. J. (1995). Potassium currents in adult rat intracardiac neurones. J. Physiol., 486(1), 15–31.
Xu, Z. J., & Adams, D. J. (1992a). Resting membrane potential and potassium currents in cultured parasympathetic neurones from rat intracardiac ganglia. J. Physiol., 456, 405–424.
Xu, Z. J., & Adams, D. J. (1992b). Voltage-dependent sodium and calcium currents in cultured parasympathetic neurones from rat intracardiac ganglia. J. Physiol., 456, 425–441.
Yamada, W. M., Koch, C., & Adams, P. R. (1989). Multiple channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 97–133). Cambridge, MA: MIT Press.

Received January 26, 2000; accepted November 10, 2000.
LETTER
Communicated by John Lazzaro
Evaluating Auditory Performance Limits: I. One-Parameter Discrimination Using a Computational Model for the Auditory Nerve

Michael G. Heinz
Speech and Hearing Sciences Program, Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A., and Hearing Research Center, Biomedical Engineering Department, Boston University, Boston, MA 02215, U.S.A.

H. Steven Colburn
Laurel H. Carney
Hearing Research Center, Biomedical Engineering Department, Boston University, Boston, MA 02215, U.S.A.

A method for calculating psychophysical performance limits based on stochastic neural responses is introduced and compared to previous analytical methods for evaluating auditory discrimination of tone frequency and level. The method uses signal detection theory and a computational model for a population of auditory nerve (AN) fiber responses. The use of computational models allows predictions to be made over a wider parameter range and with more complete descriptions of AN responses than in analytical models. Performance based on AN discharge times (all-information) is compared to performance based only on discharge counts (rate-place). After the method is verified over the range of parameters for which previous analytical models are applicable, the parameter space is then extended. For example, a computational model of AN activity that extends to high frequencies is used to explore the common belief that rate-place information is responsible for frequency encoding at high frequencies, due to the rolloff in AN phase locking above 2 kHz. This rolloff is thought to eliminate temporal information at high frequencies. Contrary to this belief, the results of this analysis show that rate-place predictions for frequency discrimination are inconsistent with human performance in their dependence on frequency at high frequencies, and that there is significant temporal information in the AN up to at least 10 kHz.
In fact, the all-information predictions match the functional dependence of human performance on frequency, although optimal performance is much better than human performance. The use of computational AN models in this study provides new constraints on hypotheses of neural encoding of frequency in the auditory system; however, the method is limited to simple tasks with deterministic stimuli. A companion article in this issue

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2273–2316 (2001).
(“Evaluating Auditory Performance Limits: II”) describes an extension of this approach to more complex tasks that include random variation of one parameter, for example, random-level variation, which is often used in psychophysics to test neural encoding hypotheses.

1 Introduction
One of the challenges in understanding sensory perception is to test, quantitatively, hypotheses for the physiological basis of psychophysical performance. The random nature of neural responses (i.e., that two presentations of an identical stimulus produce different discharge patterns) imposes limitations on performance. The fundamental requirement for relating neural encoding to psychophysical performance is a statistical description of neural discharge patterns in response to the relevant stimuli. In general, if the statistics of the response change significantly as a stimulus parameter is varied, then accurate discrimination of that parameter is possible. Attempts to relate neural responses to psychophysical performance often involve the question of whether a small set of individual neurons can statistically account for performance, but can also involve studies of encoding by an entire population of neurons (see Parker & Newsome, 1998). Implicit in any hypothesis focused on a small set of neurons is the assumption that the brain ignores statistically significant information in other neurons within the population. To avoid this assumption, the total amount of information in a neural population can be quantified. The amount of total information, and the trends in information as stimulus parameters are varied, can be used to test hypotheses of neural encoding and suggest neural information-processing mechanisms. These studies, which require modeling approaches due to the inability to record from all neurons within a population, have typically been limited to simple psychophysical tasks with simple stimuli that can be studied with analytical models. This article describes a general method based on detection and estimation theory that allows computational models for stochastic neural responses to be used in studies that relate physiological properties to psychophysical performance.
Computational models allow one to calculate the response pattern of neural activity for an arbitrary stimulus, whereas analytical models are typically functional descriptions of activity to a well-defined class of stimuli. A companion article in this issue (“Evaluating Auditory Performance Limits: II”) describes a further generalization of this approach to include more complex psychophysical tasks in which one stimulus parameter is randomly varied in order to limit the cues available to a subject.

1.1 Relating Physiology to Psychophysics in the Auditory System.
The study reported here uses the auditory system as an example of the application of signal detection theory (SDT) with computational neural models to quantify the total information in a neural population; however, the general types of questions that this method is able to address quantitatively are relevant to neural encoding in any sensory system. The auditory nerve (AN) is an obligatory pathway between the cochlea (inner ear) and the central nervous system (Ryugo, 1992), and thus contains all of the information about an auditory stimulus that a listener can process. Siebert (1965, 1968, 1970) was the first to evaluate limits on psychophysical performance in auditory discrimination tasks by combining methods from SDT with models of peripheral auditory signal processing.1 (See Delgutte, 1996, for a review of Siebert’s approach and many subsequent studies.) While Siebert (1965, 1968, 1970) applied his approach to the auditory system, the general questions addressed with his method are fundamental to theoretical neuroscience, such as whether physiological noise inherent in neural encoding can account for human performance limits and whether temporal information in neural responses is used to encode sensory stimuli (Abbott & Sejnowski, 1999). In addressing these questions, Siebert (1965, 1968, 1970) used simple analytical models that included only those AN response properties considered to be of primary importance: bandpass frequency tuning, saturating rate-level curves, phase locking to tonal stimuli, a population of 30,000 AN fibers with logarithmic characteristic frequency (CF, the frequency of best response) spacing, and nonstationary Poisson statistics of AN discharges (for reviews, see Kiang, Watanabe, Thomas, & Clark, 1965; Ruggero, 1992; and Delgutte, 1996). These studies were limited to simple psychophysical tasks, such as pure-tone level and frequency discrimination, because of their relatively simple interpretations and the limits of the analytical AN models.
Valuable insight into the encoding of frequency and level in the peripheral auditory system was garnered from these previous studies; however, fundamental questions are still debated, such as the frequency ranges over which average rate and temporal information are used to encode frequency and how a system in which the majority of AN fibers have a small dynamic range can encode sound level over a broad dynamic range.

1.2 Auditory Encoding of Frequency. The physiological mechanism of auditory frequency encoding has been a disputed issue since (at least) the 1940s, when spectral (place) cues (von Helmholtz, 1863) were challenged by temporal cues (Schouten, 1940; Wever, 1949). The task of pure-tone frequency discrimination has been widely used to test rate-place and temporal hypotheses for frequency encoding by measuring the just-noticeable difference (JND) in frequency between two tones (see Moore, 1989, and Houtsma, 1995, for reviews of human performance; also see Figure 4). Rate-place schemes are based on the frequency selectivity (or tuning) of AN fibers (Kiang et al., 1965; Patuzzi & Robertson, 1988). Two tones of differing
1 Fitzhugh (1958) previously used this approach to investigate visual psychophysical performance based on the random nature of optic nerve responses.
frequency excite different AN fibers, and thus the listener can discriminate these tones by detecting differences in the sets of AN fibers that are activated. Temporal schemes are based on the ability of AN fibers to phase-lock in response to tones, that is, AN discharges tend to occur at a particular phase of the tone (Kiang et al., 1965; Johnson, 1980; Joris, Carney, Smith, & Yin, 1994). It is generally believed that average-rate information must be used at high frequencies (at least above 4–5 kHz; Wever, 1949; Moore, 1973, 1989; Dye & Hafter, 1980; Wakefield & Nelson, 1985; Javel & Mott, 1988; Pickles, 1988; Moore & Glasberg, 1989), because of the rolloff in phase locking above 2–3 kHz (Johnson, 1980; Joris et al., 1994; or see Figure 1c). For low frequencies, there is continuing debate as to whether rate-place or temporal information is used to encode frequency (Pickles, 1988; Javel & Mott, 1988; Moore, 1989; Cariani & Delgutte, 1996a, 1996b; Kaernbach & Demany, 1998). The assumption that there is no temporal information above 4–5 kHz has been used to interpret numerous psychophysical experiments and develop theories for the encoding of sound (Moore, 1973; Viemeister, 1983). However, no study has accurately quantified the total amount of temporal information in the AN at high frequencies or compared the trends in rate-place and temporal information versus frequency at high frequencies using the same AN model. Siebert (1970) used an analytical AN model and SDT to evaluate the ability of temporal and rate-place information to account for human frequency discrimination. His model included a nonstationary Poisson process with a time-varying discharge rate that described the AN fibers’ phase locking to low-frequency tonal stimuli; however, the model did not include rolloff in phase locking above 2–3 kHz, and thus Siebert’s predictions were limited to low frequencies.
Siebert (1970) found that temporal information significantly outperformed rate-place information at low frequencies, but that optimal performance based on rate-place is more similar to levels of human performance. Goldstein and Srulovicz (1977) found that when a simple description of the rolloff in phase locking at high frequencies was included in Siebert’s (1970) AN model, predicted performance based on temporal information (in terms of Δf/f) became worse as frequency increased above 1 kHz, similar to human performance (Moore, 1973). However, the description of the lowpass rolloff in phase locking used by Goldstein and Srulovicz (1977) had a cutoff frequency that was too low (600 Hz rather than 2500 Hz) and a slope that was too shallow (40 dB/dec rather than 100 dB/dec) when compared to more recent data from cat (see Figure 1c and Johnson, 1980; Weiss & Rose, 1988). In addition, Goldstein and Srulovicz (1977) quantified performance based on only a small set of AN fibers with CF equal to the tone frequency, and thus it is not possible from their predictions to know whether there is significant temporal information at high frequencies or to calculate performance bounds. Another significant result from Siebert (1970) was that optimal performance based on all information improved much too quickly as a function of duration compared to human performance, at least for long durations.
Siebert concluded that human listeners are severely limited in their ability to use temporal information in AN discharges and that there is sufficient rate-place information to account for human performance if this information were used efficiently. Goldstein and Srulovicz (1977) demonstrated that a restricted temporal scheme based on only the intervals between adjacent discharges (first-order intervals) corrected the duration dependence at long durations, and they concluded that temporal schemes should not be ruled out. However, both of these analytical models are based on steady-state responses and therefore are limited to long-duration stimuli due to the absence of onset and offset responses as well as neural adaptation.2 This limitation is significant because the slope of the dependence of human frequency-discrimination performance on duration has been shown to differ for long- and short-duration stimuli (Moore, 1973; Freyman & Nelson, 1986).

1.3 Auditory Encoding of Level. The encoding of level in the auditory system is an interesting problem given the discrepancy between the wide dynamic range of human hearing (at least 120 dB; Viemeister and Bacon, 1988) and the limited dynamic range of the majority of AN fibers (less than 30 dB; May & Sachs, 1992) (see Viemeister, 1988a, 1988b, for reviews of this “dynamic-range problem”). For many types of stimuli, the JND in level is essentially constant across a wide range of levels. The observation of constant performance across level (i.e., constant JND in decibels or, equivalently, that the smallest detectable change in intensity is proportional to intensity) is referred to as Weber’s law. Weber’s law is observed for broadband noise (Miller, 1947) and for narrow-band signals in the presence of band-reject noise (Viemeister, 1974, 1983; Carlyon & Moore, 1984).
However, for narrowband signals in quiet where spread of excitation across the population of AN fibers is possible, performance improves slightly as a function of level, an effect referred to as the “near-miss” to Weber’s law (McGill & Goldberg, 1968; Rabinowitz, Lim, Braida, & Durlach, 1976; Jesteadt, Wier, & Green, 1977; Florentine, Buus, & Mason, 1987). Siebert (1965, 1968) demonstrated that a rate-place encoding scheme based on a population of AN fibers with low thresholds and a limited dynamic range (i.e., the majority of AN fibers; Liberman, 1978) can produce Weber’s law for pure-tone level discrimination through spread of excitation. However, Siebert’s (1965, 1968) model did not predict the near-miss to Weber’s law for pure tones and would not be expected to predict Weber’s law for broadband noise or narrowband stimuli in band-reject noise. Other studies have implicated the role of a small population of AN fibers that have higher thresholds and broader dynamic ranges (Liberman, 1978) in producing Weber’s law for conditions in which spread
2 An analytical model that included transient responses was used by Duifhuis (1973) to evaluate the consequences of peripheral filtering on nonsimultaneous masking; however, his model did not include neural adaptation.
of excitation is assumed to be restricted (Colburn, 1981; Delgutte, 1987; Viemeister, 1988a, 1988b; Winslow & Sachs, 1988; Winter & Palmer, 1991). However, when models have been used to quantify the total information available in a restricted CF region (i.e., assuming no spread of excitation) with physiologically realistic distributions of threshold and dynamic range, performance degrades as level increases above 40 dB SPL (Colburn, 1981; Delgutte, 1987), inconsistent with Weber’s law and with trends in human performance. In addition, no model has accurately quantified the effect of noise maskers on the spread of rate or temporal information across the population of AN fibers; instead simple assumptions for the influence of the noise maskers have been used.

1.4 Combining Computational Models with Signal Detection Theory. There are now computational AN models (Payton, 1988; Carney, 1993;
Giguère & Woodland, 1994a, 1994b; Patterson, Allerhand, & Giguère, 1995; Robert & Eriksson, 1999; Zhang, Heinz, Bruce, & Carney, 2001) that can provide more accurate physiological responses over a wider range of stimulus parameters than the analytical models already described. In addition, computational models can simulate AN responses to arbitrary stimuli and thus can be used to study a wider range of psychophysical tasks, such as those involving noise maskers. This article focuses on extension of the SDT approach to allow the use of computational models to predict psychophysical performance limits for auditory discrimination based on information encoded in the stochastic AN discharge patterns. Several studies have used SDT to relate computational auditory models to psychophysical performance (Dau, Püschel, & Kohlrausch, 1996a, 1996b; Dau, Kollmeier, & Kohlrausch, 1997a, 1997b; Gresham & Collins, 1998; Huettel & Collins, 1999). Dau et al. (1996a, 1996b, 1997a, 1997b) developed computational models of effective auditory processing with the goal of matching predicted and human performance (i.e., in terms of both absolute values and trends for various stimulus parameters). Their auditory models are physiologically motivated but are not intended to describe the processing at specific locations in the auditory pathway, and therefore are not compared directly to physiological responses. Gresham and Collins (1998) and Huettel and Collins (1999) used SDT to evaluate information loss at different stages of several more physiologically based computational auditory models. Psychophysical performance was limited in their analysis only by the random variability associated with the noise stimulus (i.e., their analysis did not include any form of internal physiological noise).
These previous studies using computational auditory models and SDT have not taken into consideration the fact that information is encoded in the population of AN fibers in terms of neural discharges, that is, stochastic all-or-none events that represent a severe transformation of information from the continuous waveforms analyzed in previous studies using computational models with SDT. In contrast, the approach of Siebert (1965, 1968, 1970) and Colburn (1969, 1973) quantified the effect of neural
encoding on AN population information, but was limited to analytical AN models. This and the companion study describe a general method that combines and extends previous approaches by quantifying the information in the discharge patterns of a specific neural population of AN fibers using physiologically based computational models. The potential of rate-place (based on AN discharge counts) and all-information (based on AN discharge times) encoding schemes to account for human performance is characterized for both frequency and level discrimination. The adjective all-information is used to emphasize the fact that this encoding scheme makes use of all the information in the distribution of all discharge times across the population of AN fibers and includes average-rate information. The contribution of temporal information in the responses can be inferred by comparing the predictions of the rate-place and all-information models. Note that the all-information model does not assume any specific form of temporal processing, such as calculating synchrony coefficients or creating interval histograms. All-information predictions thus provide an absolute limit on achievable performance given the total information available in AN discharge patterns. It is critical, before proceeding to more complex computational AN models, to verify the computational method by comparing it to previous analytical model predictions (Siebert, 1965, 1968, 1970). Thus, optimal performance predictions were made for pure-tone frequency and level discrimination as a function of frequency, level, and duration based on a relatively simple computational AN model. This simple AN model does not include several complex physiological response properties that have been hypothesized to influence auditory frequency and level discrimination, and thus these limitations are discussed in terms of the conclusions about auditory encoding that can be made from this study.
The general agreement between computational and analytical predictions provides confidence that the computational method is valid. Previous studies have typically focused on either frequency or level discrimination, have often examined only a limited parameter range, and have used different AN models. By using the same AN model to make both rate-place and all-information predictions for both frequency and level discrimination versus frequency, level, and duration, this study provides a unified description of previously suggested constraints on rate-place and temporal models. This study also quantifies that there is significant temporal information in the AN for frequency discrimination up to at least 10 kHz, and that all-information performance demonstrates the same trends versus frequency as human performance, whereas rate-place does not. These new predictions are important to theories of frequency encoding in the auditory system because they contradict the generally held belief that the rolloff in AN phase locking eliminates all temporal information at high frequencies. Thus, this study and the companion study demonstrate several benefits of using computational neural models to predict psychophysical performance limits and illustrate the types of general neural encoding questions that can be addressed by quantifying the total information (based on discharge times or counts) in a specific neural population. In addition to the general applicability of this quantitative approach to theoretical neuroscience, several engineering applications could potentially benefit from the use of this approach to study basic auditory encoding issues. Audio coding algorithms for speech or music are often based on auditory masking phenomena that can be studied using this general approach (Heinz, 2000). In addition, the discriminability of the original and encoded versions of an audio sample could be evaluated based on AN information with this method. Hearing aid algorithms could be developed and tested by quantifying the effect of the algorithm on AN information in the impaired auditory system. This method could be used to develop optimal processors for specific types of auditory information, which could be useful in algorithms for speech recognition or blind-source separation.

2 General Methods

2.1 Computational Auditory Nerve Model. The type of computational AN model for which this study was designed can process an arbitrary stimulus and produce a time-varying discharge rate that can be used with a nonstationary point process (e.g., Poisson) to simulate AN discharge times. Examples of this type of model that have been compared directly to extensive physiological AN response properties have been described by Payton (1988), Carney (1993), and Robert and Eriksson (1999). A simplified version of the Carney (1993) model was used in this study to provide a better comparison to the analytical AN model used by Siebert (1970) and to provide a simple comparison for future predictions with more complex AN models.3 Figure 1 shows pure-tone response properties of the computational AN model that are important for this study (see Ruggero, 1992, for a review of basic AN response properties).
Table 1 provides details of the implementation of the model. A linear fourth-order gamma-tone filter bank was used to represent the frequency selectivity of AN fibers (see the population response in Figure 1a). Model filter bandwidths were based on estimates of human bandwidths from the psychophysical notched-noise method for estimating auditory filter shapes (Glasberg & Moore, 1990; see Table 1). Psychophysically measured filters have been shown to match AN frequency tuning when both were measured in guinea pig (Evans, Pratt, Spenner, & Cooper, 1992). The bandpass filter was followed by a memoryless, asymmetric, saturating nonlinearity (implemented as an arctan with a 3:1 asymmetry), which represents the mechano-electric transduction of the inner hair cell (IHC; see
3 Code for the AN model used in this study is available online at http://earlab.bu.edu/.
Mountain & Hubbard, 1996, for a review). The saturating nonlinearity contributes to the limited dynamic range of AN fibers (see Figure 1b). (Note that the dynamic range is also affected by the IHC-AN synapse; see below and Patuzzi & Robertson, 1988.) All AN model fibers had a rate threshold of roughly 0 dB SPL, a spontaneous rate of 50 spikes per second, and a maximum sustained rate of roughly 200 spikes per second. The model dynamic range for the sustained rate of these high-spontaneous-rate (HSR) fibers was roughly 20–30 dB, while the dynamic range for onset rate was much larger (see Figure 1b). The HSR fibers described by the AN model represent the majority (61%) of the total AN population (Liberman, 1978). Medium- (23%) and low-SR (16%) fibers, which have higher thresholds and larger dynamic ranges, are not described by our model. The synchrony-level curves had a threshold roughly 20 dB below rate threshold, a maximum at a level that was just above rate threshold, and a slight decrease in synchrony as level was increased further (comparable to Johnson, 1980, and Joris et al., 1994). An important property for this study is the rolloff in phase locking as frequency increases above 2–3 kHz (see Figure 1c; Johnson, 1980; Joris et al., 1994). Weiss and Rose (1988) compared synchrony versus frequency in five species on a log-log scale and reported that the data from all species were well described by a lowpass filter with roughly 100 dB per decade rolloff (the only difference across species was the 3 dB cutoff frequency, e.g., fc = 2.5 kHz for cat, and fc = 1.1 kHz for guinea pig). To achieve the proper rolloff in synchrony (for all species) and cutoff frequency (for cat), seven first-order lowpass filters were used, each with a first-order cutoff frequency of 4800 Hz. The resulting rolloff in phase locking had a 3 dB cutoff frequency near 2500 Hz and ~100 dB/decade rolloff in the frequency range 4–6 kHz (see Figure 1c).
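The combined magnitude response of such a cascade can be checked analytically: n identical first-order lowpass stages have a magnitude response equal to the single-stage response raised to the nth power. A minimal sketch (Python; illustrative only, since the mapping from this filter cascade to the plotted synchrony coefficients involves the rest of the model):

```python
import math

def cascade_lowpass_gain(f, fc=4800.0, n_stages=7):
    """Magnitude response of n_stages identical first-order lowpass
    filters in cascade, each with cutoff fc (Hz), as used to shape the
    rolloff in phase locking."""
    single = 1.0 / math.sqrt(1.0 + (f / fc) ** 2)  # one first-order stage
    return single ** n_stages

# Gain is 1 at DC, 2**(-7/2) at the per-stage cutoff, and falls
# monotonically with frequency.
gains = [cascade_lowpass_gain(f) for f in (0.0, 1000.0, 2500.0, 5000.0, 10000.0)]
```

Each first-order stage contributes 20 dB/decade asymptotically, so the cascade steepens the rolloff well beyond what a single stage (as in earlier analytical models) provides.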
The model synchrony coefficients above 5 kHz are a simple extrapolation of the physiological data, consistent with the lowpass shape and slope across many species reported by Weiss and Rose (1988). For comparison, a typical description of synchrony rolloff used in analytical AN models is also shown (Goldstein & Srulovicz, 1977) and is seen to underestimate the slope at high frequencies. Neural adaptation was introduced through a simple three-stage diffusion model based on data from Westerman and Smith (1988) for the IHC-AN synapse. The continuous-time version of this adaptation model used by Carney (1993) was simplified by using fixed values for the immediate and local volumes and the local and global permeabilities (Lin & Goldstein, 1995; and see Figure 3 of Westerman & Smith, 1988). The immediate permeability was a function of the input amplitude, where the relation was essentially linear above the resting permeability, and exponentially decayed to zero for negative inputs (see Table 1). The model’s time constants for rapid adaptation (1.3 ms at high levels) and short-term adaptation (63 ms at high levels) are consistent with those reported for AN fibers (Westerman & Smith, 1988). The output of the model (see Figure 1d) represents the instantaneous discharge rate r(t, f, L) of an individual high-spontaneous-rate, low-threshold AN fiber.
2.2 Signal Detection Theory. The application of SDT to evaluate auditory discrimination performance limits in this study is an extension of Siebert’s (1970) study of frequency discrimination (see Figure 2). A model of the auditory periphery describes the time-varying discharge rate r_i(t, f, L) of the ith AN fiber in response to a tonal stimulus of level L, frequency f,
and duration T. In order to calculate performance limits based on the activity in the AN, the information in all fibers in the AN population must be considered, that is, 30,000 total AN fibers (Rasmussen, 1940) spanning the entire CF range (20 Hz–20 kHz; Greenwood, 1990). The total information in the AN population depends on how the stochastic activity in each AN fiber varies with the stimulus and on the statistical relations between AN fibers. A Poisson process provides a good description of the random nature of AN discharges (e.g., exponential interval distribution, and independent successive intervals; Colburn, 1969, 1973; Siebert, 1970). All AN fibers in the population were assumed in this study to have independent discharge-generating mechanisms, including the 10–30 AN fibers that innervate each IHC (Spoendlin, 1969; Liberman, 1980; Ryugo, 1992). This assumption implies that the stochastic discharge activity of any two AN fibers will be independent when the effects due to a common stimulus drive have been removed and is consistent with evidence to date from studies that have investigated correlated AN activity (Johnson & Kiang, 1976; Young & Sachs, 1989; Kiang, 1990). For the deterministic tonal stimuli used in this study, the independence assumption appears reasonable. Two types of analyses are considered in this study. Rate-place predictions are based on the assumption that the set of AN discharge counts over the duration of the stimulus, {K_i}, is the only information used by the listener. Thus, the rate-place analysis (left branch of Figure 2) is based on a set of independent, stationary Poisson processes with rates equal to the average discharge rates r_i(f, L) of the AN-model fibers (Siebert, 1968). In contrast, the all-information predictions (right branch of Figure 2) are based on the assumption that the set of observations are the population of discharge times and counts {t_i1, …, t_iKi} produced by a set of independent, nonstation-

Figure 1: Facing page.
Basic pure-tone response properties of the computational AN model. (a) Sustained-rate population responses for the 60 AN-model CFs used in this study to pure tones at several levels. Tones were 970 Hz, 62 ms (10 ms rise/fall), with levels from 0 to 80 dB SPL. Sustained rate (calculated as the average rate over all full stimulus cycles within a temporal window from 10 to 52 ms) is plotted versus characteristic frequency (CF) of each model fiber. (b) Onset rate, sustained rate, and synchrony for a 970 Hz model fiber responding to a 970 Hz, 62 ms tone as a function of level. Onset rate was calculated as the maximum average rate over one stimulus cycle. The synchrony coefficient (units on right axis) represents the vector strength (Johnson, 1980) of the model response calculated over one cycle beginning at 40 ms. (c) Maximum synchrony coefficient over level for tones at CF as a function of frequency. Circles are data from cat (Johnson, 1980). Dashed line is from analytical model used by Goldstein and Srulovicz (1977). (d) Stimulus waveform and instantaneous discharge rates r(t, f, L) for a 970 Hz fiber in response to f = 970 Hz, 25 ms (2 ms rise/fall) tones over a range of levels (L).
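The vector strength used for the synchrony coefficient in panel (b) has a standard definition (Johnson, 1980): each spike contributes a unit phasor at its phase within the stimulus cycle, and synchrony is the length of the mean phasor, ranging from 1 for perfect phase locking to near 0 for spikes spread uniformly over the cycle. A small illustrative sketch (not the model code):

```python
import math

def vector_strength(spike_times, freq):
    """Vector strength of spike times (s) relative to a tone of
    frequency freq (Hz): |mean of exp(j*2*pi*freq*t_k)|."""
    n = len(spike_times)
    c = sum(math.cos(2 * math.pi * freq * t) for t in spike_times)
    s = sum(math.sin(2 * math.pi * freq * t) for t in spike_times)
    return math.hypot(c, s) / n

f = 970.0  # Hz, as in Figure 1
locked = [k / f for k in range(1, 101)]  # one spike per cycle, fixed phase
vs_locked = vector_strength(locked, f)   # ~ 1.0 (perfect phase locking)
```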
Table 1: Equations and Parameters Used to Implement Computational AN Model.

Human cochlear map (a):
  x        distance from apex (mm)
  f(x)     frequency corresponding to a position x (Hz): f(x) = 165.4(10^(0.06x) − 0.88)

Gamma-tone filters (b):
  CF       characteristic frequency (kHz)
  ERB      equivalent rectangular bandwidth (Hz): ERB = 24.7(4.37 CF + 1)
  τ        time constant of gamma-tone filter (s): τ = [2π(1.019)ERB]^(−1)
  gtf[k]   output of gamma-tone filter

Inner-hair cell:
  ihc[k]   output of saturating nonlinearity:
           ihc[k] = {arctan(K·gtf[k] + b) − arctan(b)} / {π/2 − arctan(b)}
  K        controls sensitivity: 1225
  b        sets 3:1 asymmetric bias: −1
  ihcL[k]  lowpass-filtered inner-hair-cell output (see text)

Neural adaptation model (c):
  Ts       sampling period (s) (see text)
  ro       spontaneous discharge rate (spikes/s): 50
  VI       immediate “volume”: 0.0005
  VL       local “volume”: 0.005
  PG       global permeability (“volume”/s): 0.03
  PL       local permeability (“volume”/s): 0.06
  PIrest   resting immediate permeability (“volume”/s): 0.012
  PImax    maximum immediate permeability (“volume”/s): 0.6
  CG       global concentration (“spikes/volume”):
           CG = CL[0](1 + PL/PG) − CI[0]·PL/PG = 6666.67
  PI[k]    immediate permeability (“volume”/s):
           PI[k] = 0.0173 ln{1 + exp(34.657·ihcL[k])}
  CI[k]    immediate concentration (“spikes/volume”):
           CI[k+1] = CI[k] + (Ts/VI){−PI[k]·CI[k] + PL·(CL[k] − CI[k])}
           CI[0] = ro/PIrest = 4166.67
  CL[k]    local concentration (“spikes/volume”):
           CL[k+1] = CL[k] + (Ts/VL){−PL·(CL[k] − CI[k]) + PG·(CG − CL[k])}
           CL[0] = CI[0](PIrest + PL)/PL = 5000.00
  r[k]     instantaneous discharge rate (spikes/s): r[k] = PI[k]·CI[k]

Notes: (a) Greenwood (1990). (b) See Patterson, Nimmo-Smith, Holdsworth, and Rice (1987); Glasberg & Moore (1990); Carney (1993). (c) See Westerman and Smith (1988); Carney (1993).
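The resting behavior of the adaptation equations in Table 1 can be sanity-checked numerically: at rest (ihcL = 0) the immediate permeability evaluates to 0.0173 ln 2 ≈ PIrest, so the initial concentrations form (very nearly) a fixed point and the output rate stays at the 50 spikes/s spontaneous rate. A minimal sketch, assuming an illustrative 10 μs sampling period (the article defers Ts to the text):

```python
import math

# Constants from Table 1
Ts = 1e-5   # sampling period (s); value assumed for this sketch
ro = 50.0   # spontaneous discharge rate (spikes/s)
VI, VL = 0.0005, 0.005
PG, PL = 0.03, 0.06
PIrest = 0.012

# Initial concentrations from Table 1
CI = ro / PIrest                        # 4166.67
CL = CI * (PIrest + PL) / PL            # 5000.00
CG = CL * (1 + PL / PG) - CI * PL / PG  # 6666.67 (held fixed)

def PI(ihcL):
    """Immediate permeability: ~linear above rest for positive input,
    saturating near PImax, decaying to zero for negative input."""
    return 0.0173 * math.log(1.0 + math.exp(34.657 * ihcL))

# Iterate the discrete-time diffusion updates with zero input for 10 ms;
# the state should stay (very nearly) at its resting fixed point.
for _ in range(1000):
    p = PI(0.0)
    CI_new = CI + (Ts / VI) * (-p * CI + PL * (CL - CI))
    CL_new = CL + (Ts / VL) * (-PL * (CL - CI) + PG * (CG - CL))
    CI, CL = CI_new, CL_new

rate = PI(0.0) * CI  # stays near the 50 spikes/s spontaneous rate
```

For a large positive input, PI saturates near PImax = 0.6, producing the large onset rate followed by adaptation toward the sustained rate.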
ary Poisson processes with time-varying rates r_i(t, f, L) provided by the AN model (Siebert, 1970). Descriptions of the probability density of the observations, p(t_i1, …, t_iKi | r_i), are used to evaluate the optimal performance achievable by any decision process (see section 3). Optimal performance based on
[Figure 2 schematic: a tonal stimulus A cos(2πft), 0 ≤ t ≤ T, with level L = 20 log10(A/Aref), drives an auditory periphery model that produces the time-varying discharge rate r_i(t, f, L) of the ith AN fiber (CF_i, i = 1, …, n). Rate-place branch: the average rate r_i(f, L) = (1/T)∫_0^T r_i(t, f, L) dt drives a stationary Poisson process; the probability of discharge counts p(K_i | r_i) feeds SDT to give optimal performance Δf_RP(f, L, T). All-information branch: r_i(t, f, L) drives a nonstationary Poisson process; the probability of discharge times p(t_i1, …, t_iKi | r_i(t)) feeds SDT to give Δf_AI(f, L, T). Both are compared with human performance Δf_HUMAN(f, L, T).]
Figure 2: Overview of signal detection theory (SDT) used with stochastic models of the auditory periphery to evaluate fundamental limits of performance on a psychophysical task (frequency discrimination shown).
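A nonstationary Poisson process with a bounded time-varying rate, as assumed for the all-information branch, can be simulated by Lewis–Shedler thinning: draw candidate events from a homogeneous process at a bound rmax and keep each with probability rate(t)/rmax. A minimal sketch (not from the article; the rate function is illustrative):

```python
import math
import random

def thinning(rate, T, rmax, rng=random.Random(0)):
    """Simulate event times on [0, T] for a nonstationary Poisson
    process with intensity rate(t) <= rmax (Lewis-Shedler thinning)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rmax)  # next candidate from homogeneous process
        if t > T:
            return times
        if rng.random() < rate(t) / rmax:  # accept with prob rate(t)/rmax
            times.append(t)

# Example: a discharge rate modulated at 970 Hz around 100 spikes/s,
# loosely resembling a phase-locked AN-model rate waveform.
rate = lambda t: 100.0 * (1.0 + 0.9 * math.cos(2 * math.pi * 970.0 * t))
spikes = thinning(rate, T=1.0, rmax=190.0)
```

The expected spike count equals the integral of the rate (here 100 over 1 s), and the accepted times cluster at the preferred stimulus phase.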
rate-place, Δf_RP(f, L, T), and all-information, Δf_AI(f, L, T), is compared directly to human JNDs measured psychophysically, Δf_HUMAN(f, L, T). It is worthwhile, as Siebert (1965, 1968, 1970) has done, to discuss the potential outcomes of a comparison between human and optimal performance based on a particular type of neural information (e.g., rate-place or all-information) in the AN. The best possible outcome is if human performance matches optimal performance in both absolute value and the trends versus stimulus parameters. This result would suggest that human performance is essentially determined by the peripheral transformation from acoustic waveform to action potentials in the AN and that the central nervous system can be viewed roughly as an optimal processor for this task. Another straightforward result is if human performance is superior to optimal performance. This result suggests that the information represented by the model of peripheral transformations is inadequate to account for human performance. A more likely result is that optimal performance is superior to human performance. In this case, the strongest conclusion that can be drawn is that there is sufficient information in the AN discharge patterns to account for human performance. However, if optimal performance is uniformly better than human performance (i.e., parallel to human
M. G. Heinz, H. S. Colburn, & L. H. Carney
performance across the entire range of stimulus parameters), then a parsimonious hypothesis is that the AN information represented by the neural model determines human performance, but is used in an inefficient manner that uniformly degrades optimal performance. On the other hand, if optimal performance exceeds human performance in a nonuniform way and there is no realistic processor that would be nonuniformly inefficient in the required manner, then a fairly strong hypothesis can be made that although the information provided to the optimal detector is sufficient, this information is not likely to account solely for human performance.

3 General Analytical Results: Performance Limits for One-Parameter Discrimination

3.1 Overview of Basic Result. The results in this section are presented in terms of discrimination of a single stimulus parameter $a$, which is either frequency $f$ (Hz) or level $L$ (dB SPL) in this study. A measure $(d'_a[{\rm CF}])^2$, which is closely related to the sensitivity measure $d'$ from SDT (Green & Swets, 1966), is used to represent the information from an individual AN fiber with characteristic frequency CF, where $d'_a$ represents the sensitivity per unit $a$ (see section 3.2).^4 Based on the assumption of independent AN fibers (see section 2.2), the total information from the population is the sum of the information from individual AN fibers. The JND in $a$, $\Delta a_{\rm JND}$, is inversely related to the amount of information available from the AN population and can be calculated as
$$\Delta a_{\rm JND} = \left( \sum_i \left( d'_a[{\rm CF}_i] \right)^2 \right)^{-1/2} = \left( \sum_i \int_0^T \frac{1}{r_i(t, a)} \left[ \frac{\partial r_i(t, a)}{\partial a} \right]^2 dt \right)^{-1/2}, \tag{3.1}$$
as described in section 3.2. Equation 3.1 describes the optimal performance possible from all information available in the AN in terms of the time-varying discharge rates $r_i(t, a)$ of the population of AN fibers. Optimal rate-place performance is calculated by using the average discharge rate $r_i(a)$ in equation 3.1 rather than $r_i(t, a)$. Equation 3.1 is extremely general because it is applicable to any single-parameter discrimination experiment,
Footnote 4: Durlach and Braida (1969) used the metric $d'$ to represent sensitivity per bel for intensity discrimination [i.e., $d' = d'(I_1, I_2) / \log_{10}(I_2 / I_1)$] (see also Braida & Durlach, 1988). In our study, $d'_a$ is used to represent sensitivity per unit $a$, where $a$ can be level $L$ in dB, or frequency $f$ in Hz. Thus, the normalized sensitivity metric $d'_L$ differs from that used by Durlach and Braida (1969) only in the unit of level used (i.e., $d'_L$ represents sensitivity per decibel).
Performance Limits Using Computational Models
and it can be used with any neural model (e.g., analytical or computational) that describes the transformation from sensory stimulus to neural response and assumes statistically independent Poisson discharges.

3.2 Mathematical Analysis. A complete analysis of a general one-parameter discrimination experiment based on stochastic neural responses is described below using SDT and extends several previous results (e.g., Siebert, 1970; Colburn, 1981; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Prediction of performance limits based on the Cramér-Rao bound is described first, because this was the method first described by Siebert (1970); however, this approach lacks an explicit description of the processing that is required to achieve these performance limits. The use of a likelihood ratio test, which is described second, provides insight into the meaning of the Cramér-Rao bound result, describes the form of an optimal processor, and demonstrates that the Cramér-Rao bound can be met with equality in this case (i.e., an efficient estimator exists; van Trees, 1968). The assumption that a nonstationary Poisson process is a good model for the AN discharge patterns allows the random nature of the observations to be described (see Parzen, 1962; Snyder & Miller, 1991; Rieke et al., 1997). The joint probability density of the unordered discharge times $\{t_1^i, \ldots, t_{K_i}^i\}$ on the $i$th AN fiber is given by
$$p(t_1^i, \ldots, t_{K_i}^i \mid a) = \frac{\prod_{n=1}^{K_i} r_i(t_n^i, a)}{K_i!} \exp\left[ -\int_0^T r_i(t, a)\, dt \right], \tag{3.2}$$
where $r_i(t, a)$ is the time-varying discharge rate, and $a$ is the stimulus parameter to be discriminated. (Note that this formula applies to unordered discharge times; i.e., the times are not constrained such that $t_1^i < t_2^i < \cdots < t_{K_i}^i$.)

3.2.1 Cramér-Rao Bound. The Poisson description of the random nature of the observations can be used in the Cramér-Rao bound (Cramér, 1951; van Trees, 1968) to describe performance limits for estimating the value of $a$ based on the response of a single AN fiber in terms of its time-varying discharge rate $r_i(t, a)$. The variance $\sigma_{\hat a}^2[i]$ of any unbiased estimate of the parameter $a$ from the observed discharge times on the $i$th fiber is bounded by the Cramér-Rao inequality,

$$\frac{1}{\sigma_{\hat a}^2[i]} \le E\left\{ \left[ \frac{\partial}{\partial a} \ln p(t_1^i, \ldots, t_{K_i}^i \mid a) \right]^2 \right\} \tag{3.3}$$

$$= \sum_{K_i = 0}^{\infty} \int_0^T \cdots \int_0^T \left[ \frac{\partial}{\partial a} \ln p(t_1^i, \ldots, t_{K_i}^i \mid a) \right]^2 \times p(t_1^i, \ldots, t_{K_i}^i \mid a)\, dt_1^i \cdots dt_{K_i}^i, \tag{3.4}$$
where the expectation in equation 3.3 is over the random observations $\{t_1^i, \ldots, t_{K_i}^i\}$ and represents the Fisher information in the random discharge times about $a$. This inequality provides a lower bound on the variance of any unbiased estimator by relating the accuracy of the estimate to the rate of change with $a$ of the likelihood of the observations. The variance $\sigma_{\hat a}^2[i]$ can be related to the time-varying discharge rate $r_i(t, a)$ by substituting the joint probability density from equation 3.2 into equation 3.4 and performing some extensive simplifications to obtain

$$\frac{1}{\sigma_{\hat a}^2[i]} \le \int_0^T \frac{1}{r_i(t, a)} \left[ \frac{\partial r_i(t, a)}{\partial a} \right]^2 dt. \tag{3.5}$$
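As a numerical illustration of the bound in equation 3.5, consider the simplest case in which the parameter being estimated is the rate of a homogeneous Poisson process, $r_i(t, a) = a$, so that $\partial r_i / \partial a = 1$ and the bound reduces to $\sigma_{\hat a}^2 \ge a / T$. The following sketch (not from the article; the parameter values are hypothetical) checks by Monte Carlo that the unbiased estimator $\hat a = K / T$ attains this bound, which previews the efficiency result shown below:

```python
import numpy as np

# Sketch: Cramer-Rao bound of Eq. 3.5 for a homogeneous Poisson process
# r(t, a) = a, where the Fisher information is T / a and the bound is
# Var[a_hat] >= a / T.  The count-based estimator a_hat = K / T attains it.
rng = np.random.default_rng(0)

a_true = 50.0   # discharge rate (spikes/s), hypothetical value
T = 1.0         # observation window (s)

fisher_info = T / a_true                 # integral of (1/r)(dr/da)^2 dt
cr_variance = 1.0 / fisher_info          # lower bound on Var[a_hat] = 50.0

# Monte Carlo: K ~ Poisson(a*T); a_hat = K / T.
counts = rng.poisson(a_true * T, size=200_000)
mc_variance = (counts / T).var()

print(cr_variance)            # 50.0
print(round(mc_variance, 1))  # close to 50.0: the bound is met with equality
```

The agreement between `mc_variance` and `cr_variance` is the one-fiber analogue of the efficiency argument made for the LRT below.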
Equation 3.5 provides a lower bound on the variance of any unbiased estimate of $a$ based on the observations from a single AN fiber in terms of the time-varying discharge rate $r_i(t, a)$. The JND $\Delta a_{\rm JND}$ of the ideal observer is equal to the minimum standard deviation of any estimator based on the observations, when threshold is defined as 75% correct in a two-interval, two-alternative, forced-choice task (Green & Swets, 1966; Siebert, 1968, 1970). This definition of threshold corresponds to a sensitivity $d' = 1$, where $d' = \Delta a_{\rm JND} / \sigma_{\hat a}[i]$, and produces predicted JNDs that are directly comparable to JNDs measured psychophysically. The JND based on the population of AN fibers is given by equation 3.1, because the information from each AN fiber adds under the assumption that the discharge patterns of all fibers are statistically independent (see section 2.2).

3.2.2 Likelihood Ratio Test. This analysis extends previous results from Colburn (1981) and Rieke et al. (1997). It is shown that the performance of a sufficient statistic for a nonstationary Poisson process derived from a likelihood ratio test (LRT) meets the Cramér-Rao bound with equality (i.e., that an efficient estimator exists). This alternative approach provides more intuition in the present case and is therefore more accessible than the Cramér-Rao bound analysis; however, the LRT approach does not always lead to a form of an optimal processor that is simple enough to allow the evaluation of performance analytically, in which case the Cramér-Rao bound may be more useful (see the companion article in this issue). In the present case, the LRT analysis shows that the quantity

$$\left( d'_a[{\rm CF}_i] \right)^2 = \int_0^T \frac{1}{r_i(t, a)} \left[ \frac{\partial r_i(t, a)}{\partial a} \right]^2 dt \tag{3.6}$$
represents the information available on the $i$th AN fiber for discriminating $a$, where $d'_a[{\rm CF}_i]$ represents the normalized sensitivity (see note 4) of the $i$th fiber, and is defined as the sensitivity $d'$ per unit $a$ (see Durlach & Braida, 1969; Braida & Durlach, 1988).
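The information integral of equation 3.6 can be evaluated numerically for any rate function. As a sketch (the sinusoidally phase-locked rate below is a hypothetical stand-in for the AN model, not the article's model), consider $r(t, f) = r_0[1 + m \sin(2\pi f t)]$, whose frequency derivative $\partial r / \partial f = 2\pi r_0 m\, t \cos(2\pi f t)$ contains a factor of $t$, so the information grows roughly as $T^3$:

```python
import numpy as np

# Sketch (hypothetical rate function): per-fiber information of Eq. 3.6 for a
# phase-locked rate r(t, f) = r0 * (1 + m*sin(2*pi*f*t)), with the analytic
# derivative dr/df = r0 * m * 2*pi*t * cos(2*pi*f*t).
def info_per_fiber(f, T, r0=100.0, m=0.5, n=200_000):
    t = np.linspace(0.0, T, n, endpoint=False)
    dt = T / n
    r = r0 * (1.0 + m * np.sin(2.0 * np.pi * f * t))
    dr_df = r0 * m * 2.0 * np.pi * t * np.cos(2.0 * np.pi * f * t)
    return np.sum(dr_df ** 2 / r) * dt        # (d'_f)^2 for this fiber

# The factor t in dr/df reflects the growing phase difference between two
# nearby tones over the stimulus duration; doubling T gives ~2^3 the info.
i1 = info_per_fiber(f=500.0, T=0.025)
i2 = info_per_fiber(f=500.0, T=0.050)
print(i2 / i1)   # close to 8
```

This $T^3$ growth of temporal information is the same effect that produces the $\Delta f \propto \sqrt{T^{-3}}$ all-information trend discussed in section 5.1.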
The form of the optimal processor for discriminating between $a$ and $a + \Delta a$ can be derived from a log-likelihood test,

$$\ln \frac{p(t_1^i, \ldots, t_{K_i}^i \mid a + \Delta a)}{p(t_1^i, \ldots, t_{K_i}^i \mid a)} \ \mathop{\gtrless}_{a}^{a + \Delta a} \ 0, \tag{3.7}$$
where the threshold of 0 corresponds to the minimum-probability-of-error criterion with equal a priori probabilities (van Trees, 1968, pp. 23–30). Substituting the joint probability density of the discharge times, equation 3.2, into equation 3.7 and simplifying, one obtains the following form of an optimal test,

$$\Upsilon(t_1^i, \ldots, t_{K_i}^i) = \sum_{n=1}^{K_i} \ln \frac{r_i(t_n^i, a + \Delta a)}{r_i(t_n^i, a)} \ \mathop{\gtrless}_{a}^{a + \Delta a} \ \int_0^T \left[ r_i(t, a + \Delta a) - r_i(t, a) \right] dt, \tag{3.8}$$
where the decision variable $\Upsilon$ depends on the observed data, and the right side of the comparison in equation 3.8 is a criterion based on a priori information. The sensitivity metric $d'$ can be used to characterize performance completely if the decision variable $\Upsilon$ follows a gaussian distribution and has equal variance under both hypotheses, $a$ and $a + \Delta a$ (Green & Swets, 1966; van Trees, 1968). These two conditions can be reasonably assumed to hold in this analysis. Since the decision variable $\Upsilon$ is a sum of random variables, a gaussian assumption is reasonable based on the central limit theorem if $K_i$ (the number of discharges on the $i$th fiber) is large. Even for conditions in which $K_i$ is small (e.g., a 25 ms stimulus and an average discharge rate of 200 spikes per second), the gaussian assumption is reasonable for the total population decision variable (i.e., based on all 30,000 AN fibers; see below and section 4). The total population decision variable is a linear combination of the individual fiber decision variables based on the independence assumption. The variance of the decision variable $\Upsilon$ is shown below to depend on $a$, and thus the variances under the two hypotheses are not strictly equal. However, for the small values of $\Delta a$ associated with optimal JNDs, the equal-variance assumption is reasonable and makes the performance analysis significantly simpler. Thus, the $d'$ metric will be used to characterize performance of the optimal detector $\Upsilon(t_1^i, \ldots, t_{K_i}^i)$, where

$$\left( d'[{\rm CF}_i] \right)^2 = \frac{\left\{ E[\Upsilon(t_1^i, \ldots, t_{K_i}^i) \mid a + \Delta a] - E[\Upsilon(t_1^i, \ldots, t_{K_i}^i) \mid a] \right\}^2}{{\rm Var}[\Upsilon(t_1^i, \ldots, t_{K_i}^i) \mid a]}. \tag{3.9}$$
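The single-fiber test of equation 3.8 can be sketched directly in code. The helper names below are hypothetical; the toy check uses a homogeneous rate $r(t, a) = a$, for which both sides of the comparison have closed forms ($\Upsilon = K \ln[(a + \Delta a)/a]$ and criterion $\Delta a \cdot T$):

```python
import math

# Sketch (hypothetical helper names): the optimal test of Eq. 3.8 for one
# fiber.  `rate` plays the role of r_i(t, a); Upsilon sums log rate ratios
# over the observed discharge times, and the criterion depends only on a
# priori information (the rate functions, not the data).
def lrt_statistic(times, rate, a, da):
    return sum(math.log(rate(t, a + da) / rate(t, a)) for t in times)

def lrt_criterion(rate, a, da, T, n=100_000):
    dt = T / n
    return sum((rate((k + 0.5) * dt, a + da) - rate((k + 0.5) * dt, a)) * dt
               for k in range(n))

# Toy check with a homogeneous rate r(t, a) = a:
rate = lambda t, a: a
upsilon = lrt_statistic([0.1, 0.2, 0.3], rate, a=10.0, da=2.0)
criterion = lrt_criterion(rate, a=10.0, da=2.0, T=1.0)
print(upsilon)    # 3 * ln(1.2), about 0.547
print(criterion)  # 2.0 (= da * T)
# upsilon < criterion, so this observation favors hypothesis a over a + da.
```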
It can be shown (e.g., see Rieke et al., 1997) that for any decision variable $\Upsilon$ of the form

$$\Upsilon(t_1, \ldots, t_K) = \sum_{n=1}^{K} g(t_n), \tag{3.10}$$

where $g(t_n)$ is any function of the Poisson discharge time $t_n$, the expected value and variance of $\Upsilon$ are given by

$$E[\Upsilon(t_1, \ldots, t_K) \mid a] = \int_0^T g(t)\, r_i(t, a)\, dt, \tag{3.11}$$

$$\mathrm{Var}[\Upsilon(t_1, \ldots, t_K) \mid a] = \int_0^T g^2(t)\, r_i(t, a)\, dt. \tag{3.12}$$
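Equations 3.11 and 3.12 can be spot-checked by simulation. The sketch below (toy parameters, not from the article) uses a homogeneous Poisson process with rate $\lambda$ on $[0, T]$ and $g(t) = t$, for which the formulas give $E[\Upsilon] = \lambda T^2/2$ and ${\rm Var}[\Upsilon] = \lambda T^3/3$:

```python
import numpy as np

# Sketch: Monte Carlo check of Eqs. 3.11 and 3.12 for a homogeneous Poisson
# process with rate lam on [0, T] and g(t) = t, where
#   E[Upsilon]   = lam * T^2 / 2 = 25
#   Var[Upsilon] = lam * T^3 / 3 = 16.67
rng = np.random.default_rng(1)
lam, T, trials = 50.0, 1.0, 50_000

samples = np.empty(trials)
for j in range(trials):
    k = rng.poisson(lam * T)          # number of discharges in [0, T]
    t = rng.uniform(0.0, T, size=k)   # unordered discharge times
    samples[j] = t.sum()              # Upsilon with g(t) = t

print(samples.mean())  # close to 25.0
print(samples.var())   # close to 16.67
```

The same sampling scheme (a Poisson count, then uniform unordered times) is exactly the "unordered discharge times" description underlying equation 3.2.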
Given these relations with $g(t) = \ln[r_i(t, a + \Delta a) / r_i(t, a)]$,

$$\left( d'[{\rm CF}_i] \right)^2 = \left\{ \int_0^T \ln\left[ \frac{r_i(t, a + \Delta a)}{r_i(t, a)} \right] \left[ r_i(t, a + \Delta a) - r_i(t, a) \right] dt \right\}^2 \Bigg/ \left\{ \int_0^T \left( \ln\left[ \frac{r_i(t, a + \Delta a)}{r_i(t, a)} \right] \right)^2 r_i(t, a)\, dt \right\}. \tag{3.13}$$

With the additional assumption that $r_i(t, a)$ varies linearly and slowly over the incremental range from $a$ to $a + \Delta a$ (Colburn, 1981), that is,

$$r_i(t, a + \Delta a) \simeq r_i(t, a) + \dot r_i(t, a)\, \Delta a, \tag{3.14}$$

where $\dot r_i(t, a) = \partial r_i(t, a) / \partial a$, one obtains

$$\left( d'[{\rm CF}_i] \right)^2 \simeq \left\{ \int_0^T \ln\left[ 1 + \frac{\dot r_i(t, a)\, \Delta a}{r_i(t, a)} \right] \left[ \dot r_i(t, a)\, \Delta a \right] dt \right\}^2 \Bigg/ \left\{ \int_0^T \left( \ln\left[ 1 + \frac{\dot r_i(t, a)\, \Delta a}{r_i(t, a)} \right] \right)^2 r_i(t, a)\, dt \right\}. \tag{3.15}$$

The approximation $\ln(1 + x) \simeq x$ for small $x$ then leads to the equations^5

$$\left( d'[{\rm CF}_i] \right)^2 \simeq \left\{ (\Delta a)^2 \int_0^T \frac{[\dot r_i(t, a)]^2}{r_i(t, a)}\, dt \right\}^2 \Bigg/ \left\{ (\Delta a)^2 \int_0^T \frac{[\dot r_i(t, a)]^2}{r_i(t, a)}\, dt \right\}$$

$$= (\Delta a)^2 \int_0^T \frac{1}{r_i(t, a)} \left[ \frac{\partial r_i(t, a)}{\partial a} \right]^2 dt \tag{3.16}$$

$$= (\Delta a)^2 \left( d'_a[{\rm CF}_i] \right)^2. \tag{3.17}$$

Footnote 5: Note that the "small-$x$" approximation may not be appropriate for low discharge rates $r_i(t, a)$. To avoid this potential problem (e.g., in the valleys of the phase-locked response), a constant was added to all discharge rates $r_i(t, a)$ prior to the SDT analysis in this study. A value of $\simeq 7$ spikes per second was used, based on the minimum value of $r_i(t, f, L)$ in Siebert's (1970) model. The precise value of this constant is not significant, as long as the instantaneous discharge rate does not go to zero.
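The step from equation 3.15 to equation 3.16 can be verified numerically. The sketch below (hypothetical toy rate, not the article's AN model) evaluates both expressions for a rate that stays well above zero, so the small-$x$ approximation is valid without the rate floor of note 5:

```python
import numpy as np

# Sketch: compare the exact LRT expression (Eq. 3.15) with its small-x
# approximation (Eq. 3.16) for the toy rate r(t, a) = a + 30*sin(2*pi*500*t),
# so dr/da = 1 and x = da / r stays tiny (r ranges over [70, 130] here).
a, da, T, n = 100.0, 0.01, 0.1, 200_000
t = np.linspace(0.0, T, n, endpoint=False)
dt = T / n
r = a + 30.0 * np.sin(2.0 * np.pi * 500.0 * t)
rdot = np.ones_like(t)

x = rdot * da / r
num = (np.sum(np.log1p(x) * rdot * da) * dt) ** 2
den = np.sum(np.log1p(x) ** 2 * r) * dt
d2_exact = num / den                                # Eq. 3.15
d2_approx = da ** 2 * np.sum(rdot ** 2 / r) * dt    # Eq. 3.16

print(d2_approx / d2_exact)   # close to 1 for small da
```

When the rate dips toward zero (as in the valleys of a phase-locked response), $x$ is no longer small and the two expressions diverge, which is why the article adds a constant rate floor (note 5).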
Finally, taking $d' = 1$ as the JND and combining optimally over independent fibers, one obtains equation 3.1. The summation over fibers in equation 3.1 results from the independence assumption (see section 2.2) and comes directly in the LRT analysis for the population response since the probability function in equation 3.2 is a product of the individual fiber probabilities. In this case, the total population decision variable is a weighted combination of the individual fiber decision variables given in equation 3.8. The equivalence of the equations derived using the Cramér-Rao bound and LRT analyses demonstrates that the Cramér-Rao bound provides a tight bound on achievable performance in this case (i.e., an efficient estimator exists) and thus describes optimal performance.

4 Computational Methods: Use of Auditory Nerve Models
The time-varying discharge rates $r_i(t, a)$ from the computational AN model can be substituted into equation 3.1 to evaluate optimal performance for various stimulus conditions. However, the partial derivative of the rate waveform with respect to $a$ must be calculated as a function of time. The partial derivative was approximated from the responses of the computational AN model, for example,

$$\frac{\partial r_i(t, a)}{\partial a} \simeq \frac{r_i(t, a + \Delta a) - r_i(t, a)}{\Delta a}. \tag{4.1}$$
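Putting equations 3.1 and 4.1 together, the population JND can be computed from any set of rate functions. The sketch below (hypothetical function names; the constant-rate "fibers" are a toy population, not the AN model) uses the forward difference of equation 4.1 inside the information sum of equation 3.1:

```python
import numpy as np

# Sketch: evaluate Eq. 3.1 with the finite-difference derivative of Eq. 4.1,
# summing information across a population of model fibers.
def jnd(rate_fns, a, da, T, n=50_000):
    """rate_fns: list of functions r_i(t, a) -> discharge rate (spikes/s)."""
    t = np.linspace(0.0, T, n, endpoint=False)
    dt = T / n
    total_info = 0.0
    for r_fn in rate_fns:
        r = r_fn(t, a)
        dr_da = (r_fn(t, a + da) - r_fn(t, a)) / da   # Eq. 4.1
        total_info += np.sum(dr_da ** 2 / r) * dt     # per-fiber Eq. 3.6
    return total_info ** -0.5                         # Eq. 3.1

# Toy population of N = 100 constant-rate fibers r_i(t, a) = a: each
# contributes information T/a, so the JND is sqrt(a / (N * T)).
fibers = [lambda t, a: a * np.ones_like(t)] * 100
print(jnd(fibers, a=50.0, da=1e-4, T=1.0))   # sqrt(50/100), about 0.707
```

For the constant-rate toy case the forward difference is exact, which is why a very small step (here $\Delta a = 10^{-4}$, mirroring the $\Delta f = 0.0001$ Hz used in the article) is safe.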
An approximation of the partial derivative with respect to frequency using $\Delta f = 0.0001$ Hz is illustrated in Figure 3. The top panel shows the two stimuli (although only one waveform is apparent due to the small $\Delta f$), and the middle panel illustrates the two time-varying discharge rate waveforms from the AN model fiber. The bottom panel represents the approximation of the partial derivative squared as a function of time. The distinct growth of the derivative squared as time increases up to 30 ms (bottom panel of Figure 3) is a result of the increasing difference in the phase between the two tones as a function of time. Even if a random phase were used, as in some psychophysical experiments, the ability to distinguish between a difference in phase and a difference in frequency increases with duration. The bimodal shape within each stimulus period of the derivative squared arises because the two rate waveforms differ in both the upward- and downward-going portions of each response period. Values of $\Delta f = 0.0001$ Hz and $\Delta L = 0.0001$ dB were used in equation 4.1 to approximate the partial derivatives with respect to frequency and level. These values were chosen by verifying that
Figure 3: Approximation of the partial derivative in equation 3.1 for evaluating optimal frequency-discrimination performance for the stimulus condition: $L = 40$ dB SPL, $f = 500$ Hz, and $T = 25$ ms (2 ms rise/fall). Two tonal stimuli differing in frequency by $\Delta f = 0.0001$ Hz are shown in the top panel. The middle panel shows the two time-varying rate waveforms from the computational AN model (CF = 500 Hz) in response to the two stimuli. The two functions cannot be distinguished visually in the top and middle panels. The bottom panel illustrates the partial derivative squared, as approximated by equation 4.1.
the derivatives for a variety of conditions were unaffected by increases in $\Delta f$ or $\Delta L$ by an order of magnitude. All predictions of performance limits using the computational AN model (see section 2.1) are based on the total high-spontaneous-rate (HSR), low-threshold AN fiber population, which makes up 61% of the AN population in cat (Liberman, 1978). The total HSR fiber population response was simulated using 60 model CFs, which ranged from 100 Hz to 10 kHz and were uniformly spaced in location according to a human cochlear map (Greenwood, 1990; see Table 1).^6 The model CF range represents two orders of magnitude, while the commonly accepted human CF range (20 Hz–20 kHz;

Footnote 6: The choice of 60 model CFs represents a (somewhat arbitrary) compromise between a dense sampling of the frequency range and reduced computation time.
Greenwood, 1990) represents three orders of magnitude. Thus, roughly two-thirds of the 30,000 total AN fibers in human (Rasmussen, 1940) are represented by the model CF range. The total number of HSR fibers represented by the 60 model CFs was therefore 12,200, and thus roughly 200 independent AN fibers were assumed to be driven by each of the 60 model responses. Stimulus conditions were chosen based on psychophysical experiments of interest,^7 and a sampling rate of 500 kHz was used.^8 Stimulus duration was defined to be the duration between half-amplitude points on the stimulus envelope. All rise-fall ramps were generated from raised cosine functions. The temporal analysis window (see equation 3.1) included the model response beginning at stimulus onset and ending 25 ms after stimulus offset, to allow for the response delay and the transient onset and offset responses associated with AN fibers (see Figures 1d and 3) over the range of CFs and stimulus parameters used in this study.

5 Computational Results
Human and predicted optimal performance on psychophysical discrimination tasks are compared in this section in terms of the JND in either tone frequency or level. Higher JNDs correspond to worse performance in the psychophysical task and in the optimal case indicate a reduction in the amount of sensory information about changes in the given stimulus parameter. Comparisons between optimal and human performance are made in terms of both absolute values and trends versus stimulus parameters, as discussed in section 2.2.

5.1 Frequency Discrimination in Quiet. Predicted optimal frequency JNDs for the population of HSR AN fibers were calculated as a function of frequency $f$, level $L$, and duration $T$ for the all-information and rate-place encoding schemes using equation 3.1. Figure 4 shows the frequency-discrimination predictions along with predictions from Siebert's (1970) study and human performance. Rate-place and all-information predictions from
Footnote 7: Two factors occasionally led to a slight mismatch between the stimulus parameters used in this study and those that were used in the psychophysical studies. (1) Stimulus conditions were chosen to be the same for frequency- and level-discrimination predictions in order to allow for direct comparisons of model results between frequency and level discrimination. (2) Tone frequencies were always chosen to be equal to one of the 60 model CFs in order to avoid artifacts related to the spatial undersampling of the human frequency range.

Footnote 8: A sampling frequency of 500 kHz was used for all predictions, except for three high-frequency conditions for which higher sampling rates were chosen to avoid subharmonic distortion generated by the saturating nonlinearity for specific relations between stimulus and sampling frequency. In general, this sampling artifact was most prominent for high levels and high frequencies, and thus for high frequencies the predictions were limited to low to medium levels.
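The 60 model CFs described above, uniformly spaced in cochlear position, can be sketched using the human frequency-position map of Greenwood (1990), $F(x) = 165.4\,(10^{2.1x} - 0.88)$ with $x$ the fractional distance from the apex. The helper names below are hypothetical; the Greenwood constants are the published human values:

```python
import numpy as np

# Sketch of the CF-spacing scheme: 60 CFs uniformly spaced in position x
# between the cochlear places for 100 Hz and 10 kHz, via the human map
# F(x) = A * (10^(alpha*x) - k) of Greenwood (1990).
A, alpha, k = 165.4, 2.1, 0.88

def place(f):   # inverse map: frequency (Hz) -> fractional position from apex
    return np.log10(f / A + k) / alpha

def freq(x):    # forward map: fractional position -> frequency (Hz)
    return A * (10.0 ** (alpha * x) - k)

x = np.linspace(place(100.0), place(10_000.0), 60)
cfs = freq(x)
print(len(cfs), round(cfs[0], 1), round(cfs[-1], 1))   # 60 100.0 10000.0
```

Uniform spacing in $x$ rather than in log-frequency is what lets each model CF stand in for an equal length of the basilar membrane, and hence for roughly 200 independent fibers each.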
Figure 4: Optimal performance for pure-tone frequency discrimination. Predictions from the computational AN model are shown by filled circles for the rate-place (RP) scheme, and by filled squares for the all-information (AI) scheme. Predictions from Siebert's (1970) analytical model are shown for the two encoding schemes by corresponding open symbols. Typical human performance is illustrated by stars. (a) The Weber fraction $\Delta f / f$ is plotted as a function of frequency. All model predictions are for $L = 40$ dB SPL and $T = 200$ ms (20 ms rise/fall). Human data are from Moore (1973) for $T = 200$ ms and equal loudness across frequency ($L = 60$ dB SPL at 1 kHz). The two high-frequency points not connected by lines for each encoding scheme are conditions for which the upper limit of the computational model CF range influenced the results. (b) The difference limen $\Delta f$ is plotted as a function of duration. Predictions are for $L = 40$ dB SPL, $f = 970$ Hz, and 4 ms rise/fall ramps. Predictions from Siebert (1970) follow $\Delta f \propto \sqrt{T^{-1}}$ for rate-place and $\Delta f \propto \sqrt{T^{-3}}$ for all-information. Human data are from Moore (1973) for $L = 60$ dB SPL, $f = 1000$ Hz, and 2 ms rise/fall ramps. (c) The difference limen $\Delta f$ is plotted as a function of level. Predictions were made for $f = 970$ Hz and $T = 500$ ms (20 ms rise/fall). Human data are from Wier et al. (1977) for $f = 1000$ Hz and $T = 500$ ms.
the computational AN model match predictions from Siebert well, both qualitatively and quantitatively, over the range of stimulus parameters for which Siebert's analytical model is applicable ($200 < f < 2000$ Hz, $T \ge 100$ ms, and $20 \le L \le 50$ dB SPL). This general agreement supports the validity of the computational approach. The effect of frequency on pure-tone frequency discrimination is shown in Figure 4a. Rate-place predictions are essentially flat as a function of frequency due to the roughly constant filter shape on a logarithmic axis. The predictions from the computational AN model show a slight improvement in performance as frequency increases due to the small increase in the sharpness of tuning (quality factor $Q = {\rm CF}/{\rm ERB}$) as frequency increases (Table 1; Glasberg & Moore, 1990). Rate-place predictions match human performance very closely below 2 kHz but deviate significantly at high frequencies. As discussed in section 6.2.1, none of the limitations of the present simple AN model is likely to account for the large discrepancy between rate-place and human performance at high frequencies. All-information performance is roughly two orders of magnitude better than rate-place and human performance, consistent with Siebert (1970). At low frequencies, predicted performance based on all-information improves as frequency increases, due to the increased number of stimulus cycles as frequency increases (i.e., the two tones become more out of phase by the end of the stimulus duration for higher frequencies; see the bottom panel of Figure 3). All-information predictions, when extended to high frequencies, demonstrate a sharp worsening in performance above 2–3 kHz due to the rolloff of phase locking at high frequencies included in the computational AN model (see Figure 1c). This worsening matches the general trend observed in human data (Moore, 1973; Wier, Jesteadt, & Green, 1977; Moore & Glasberg, 1989). The all-information predictions converge toward the rate-place predictions as frequency increases, as expected due to the loss of temporal information. However, all-information performance is still one order of magnitude better than rate-place performance at 10 kHz (see Figure 4a), despite low synchrony coefficients at high frequencies (see Figure 1c).
The upper-CF limit in the model slightly influenced both rate-place and all-information predictions for high frequencies by underestimating the amount of information available (by up to a factor of 2), and thus overestimating the JND (by up to a factor of $\sqrt{2}$). Performance as a function of duration is shown in Figure 4b. Computational predictions for long durations can generally be described by $\Delta f \propto \sqrt{T^{-1}}$ for rate-place and $\Delta f \propto \sqrt{T^{-3}}$ for all-information, consistent with the analytical predictions from Siebert (1970). Predicted performance from the all-information scheme improves much faster with duration than the rate-place predictions or the human data. The rapid improvement with increased duration is due to allowing the decision process to make all possible temporal comparisons throughout the entire time course of the stimulus (see the distinct growth of information with time in the bottom panel of Figure 3). Both the rate-place and all-information predictions have a slightly smaller slope at shorter durations than at longer durations, with the all-information slope matching the slope of human performance for $T \le 16$ ms. The computational AN model included realistic onsets and offsets, as well as neural adaptation, which would be expected to be significant for shorter durations.
Frequency discrimination performance as a function of level is shown in Figure 4c. At low levels, both rate-place and all-information predictions from the computational AN model demonstrate the same sharp improvement in performance with increasing level, as seen in the human data. Above 20 dB SPL, rate-place performance is roughly flat, with the predicted JNDs from the computational AN model demonstrating a slight increase as level increases. Both all-information and human performance continue to improve slowly as level increases. The improvement with level in the all-information model is due to the increase in the number of excited fibers as level increases. The lack of improvement in rate-place performance with level results from the limited dynamic range and low thresholds of HSR fibers. Inclusion of AN fibers with higher thresholds and broader dynamic ranges may produce an improvement in rate-place performance with level, as discussed in section 6.2.1. In order to understand the basis for predicted rate-place and all-information frequency-discrimination performance, it is useful to examine the distribution of information across the population of AN fibers. Information profiles $(d'_f[{\rm CF}])^2$ are plotted in Figure 5 for all 60 AN model CFs for low-, medium-, and high-frequency tones. The information profiles $(d'_f[{\rm CF}])^2$ are plotted on a logarithmic scale to allow comparison across the three tone frequencies shown. The information profiles for the all-information and rate-place schemes are different in shape for the HSR fibers included in the computational AN model. In the all-information scheme, a contiguous range of CFs centered at the frequency of the tone provides information. Above 10 dB SPL, fibers near the frequency of the tone have similar high levels of synchrony, and thus contribute nearly equally to frequency discrimination.
In contrast, in the rate-place encoding scheme, model AN fibers firing at the maximum average rate do not provide any information for estimating frequency. For HSR fibers, only AN fibers at the edge of the rate-response area provide useful information for estimating frequency based on rate-place (Siebert, 1968). The small populations of AN fibers with higher thresholds and broader dynamic ranges could provide useful rate-place information for CFs close to the tone frequency.

5.2 Level Discrimination in Quiet. Predictions of optimal performance for pure-tone level discrimination are compared with human performance in terms of $\Delta L$ as a function of level, duration, and frequency in Figure 6.^9 Rate-place predictions from an analytical AN model with HSR fibers are

Footnote 9: Level discrimination will be discussed in terms of the level difference between the two tones, defined as $\Delta L = 20 \log_{10}[(p + \Delta p)/p]$ dB, where $p$ is the sound pressure of the standard tone. While there has been much debate about the most appropriate metric for level discrimination (see Grantham & Yost, 1982; Green, 1988), $\Delta L$ has been reported to be proportional to sensitivity $d'$ by several groups (e.g., Rabinowitz et al., 1976; Buus & Florentine, 1991), and thus this metric will be used in this study.
Figure 5: Information responsible for optimal frequency discrimination for low- (487 Hz), medium- (1951 Hz), and high-frequency (6803 Hz) conditions (indicated by arrows) with $L = 40$ dB SPL and $T = 200$ ms (20 ms rise/fall). The top and bottom panels show the information $(d'_f[{\rm CF}])^2$ for each of the 60 AN model CFs (roughly 200 independent AN fibers per model CF) in the all-information and rate-place encoding schemes, respectively. The linear decline in the peaks of the rate-place information profiles as frequency increases contrasts with the sharp drop in the all-information peaks at the highest frequency. These trends correspond directly to the flat rate-place $\Delta f / f$ and to the increasing all-information $\Delta f / f$ seen in Figure 4a as frequency increases.
also shown for comparison (Siebert, 1968). Optimal performance is roughly an order of magnitude better than human performance, and all-information JNDs are roughly a factor of two lower than rate-place predictions (due to the level information from AN phase locking; Colburn, 1981). Both the human and computational model predictions demonstrate a decrease in $\Delta L$ as level is increased in Figure 6a, that is, the near-miss to Weber's law. The slope of $\Delta L$ versus level in the computational model predictions is similar to that in the human data for levels above 30 dB SPL. In contrast, Siebert's (1968) model demonstrates Weber's law above 40 dB SPL. This difference is a consequence of the nontriangular filters used in the computational model (see section 6.3). In addition, the inclusion of nonlinear tuning and distributions of AN-fiber threshold and dynamic range contributes additional information about changes in level that is not included in the model's HSR population (Heinz, 2000), as discussed in section 6.3.

Figure 6: Optimal performance in terms of $\Delta L$ for pure-tone level discrimination. Predictions from the computational AN model are shown by filled circles for the rate-place scheme and filled squares for the all-information scheme. Rate-place predictions from an analytical AN model (Siebert, 1968) are shown as open circles. Typical human performance is illustrated by stars. (a) Effect of level. All model predictions are for $f = 970$ Hz and $T = 500$ ms (20 ms rise/fall). Human data are from Florentine et al. (1987) for $f = 1000$ Hz and $T = 500$ ms (20 ms rise/fall). (b) Effect of duration. Predictions are for $L = 40$ dB SPL, $f = 970$ Hz, and 4 ms rise/fall ramps. Human data are from Florentine (1986) for $L = 40$ dB SPL, $f = 1000$ Hz, and 1 ms rise/fall ramps. The solid line illustrates the slope associated with $\Delta L \propto T^{-1/2}$. (c) Effect of frequency. Predictions are for $L = 40$ dB SPL and $T = 200$ ms (20 ms rise/fall). Human data are from Florentine et al. (1987) for $L = 40$ dB SPL and $T = 500$ ms (20 ms rise/fall). The two high-frequency points not connected by lines for each encoding scheme are conditions for which the upper limit of the computational model CF range resulted in overestimates of the JNDs.

As a function of duration, the computational model predictions for both all-information and rate-place illustrate the same rate of improvement in performance as is observed in the human data (see Figure 6b). The rate of improvement is shallower in slope than the predictions from the analytical AN model (open circles) and than the solid line representing $\Delta L \propto T^{-1/2}$. The decay in discharge rate due to neural adaptation is likely to be responsible for the shallow slope in optimal performance demonstrated by the computational AN model.

Figure 7: Information responsible for optimal level discrimination for three level conditions (20, 40, and 60 dB SPL) with $f = 970$ Hz (indicated by arrow) and $T = 500$ ms (20 ms rise/fall). The top and bottom panels show the information $(d'_L[{\rm CF}])^2$ for each of the 60 AN model CFs (roughly 200 independent AN fibers per model CF) in the all-information and rate-place encoding schemes, respectively.

As a function of frequency (see Figure 6c), the rate-place predictions are essentially flat except for the highest-frequency conditions for which the upper limit of the model CF range limited performance. The all-information predictions converge toward the rate-place predictions as frequency increases due to the loss of temporal phase-locking information at high frequencies. The shallow slope versus frequency in the all-information predictions is similar to that in the human JNDs for this stimulus level. A much sharper increase with frequency is observed in human data at high levels (Florentine et al., 1987), but this effect is not likely to be present in model predictions at higher levels. Level-discrimination information profiles (see Figure 7) for rate-place and all-information have the same general shape, unlike the information profiles for frequency discrimination (see Figure 5). For level discrimination based on HSR fibers with a limited dynamic range, the edges of the
2300
M. G. Heinz, H. S. Colburn, & L. H. Carney
response area provide the only information in the model for both the allinformation and rate-place encoding schemes due to the saturation of both rate and synchrony as level increases. Thus, spread of excitation across the population of AN bers is the primary source of robust level encoding across a wide range of levels based on HSR bers with a limited dynamic range (Siebert, 1968). Additional information (both rate and temporal) about changes in level is present in AN bers with CFs near the tone frequency based on nonlinear cochlear tuning (Heinz, 2000), which is not included in the present AN model and is discussed further in section 6.3. The primary source of the near-miss in the model predictions is an increase in the number of bers that contribute signicantly as level increases, especially above the frequency of the tone. 6 Discussion 6.1 Analytical Versus Computational Approach. The limits on performance for level and frequency discrimination imposed by the random nature of AN responses were rst examined roughly 30 years ago by Siebert (1965, 1968, 1970). The primary goal of this study is to extend the signal detection theory approach Siebert used to allow the use of more general (computational, rather than analytical) AN models. In order to verify the computational method, a simplied AN model was used in this study to permit direct comparisons to Siebert’s predictions based on his analytical AN model. The predictions from this study matched Siebert’s (1965, 1968, 1970) predictions well in almost all aspects over the parameter range for which the analytical AN model was applicable:
1. There is significantly more temporal information than rate-place information in the AN for encoding changes in frequency.

2. Rate-place performance in terms of Δf/f is roughly flat as a function of frequency.

3. All-information performance in terms of Δf/f improves as frequency increases for low frequencies.

4. All-information performance improves as Δf ∝ T^{-3/2}, while rate-place performance improves as Δf ∝ T^{-1/2}.

5. All-information Δf continues to decrease as level increases, while rate-place Δf is roughly asymptotic above 20 dB SPL.

6. Spread of excitation can produce constant level-discrimination performance over a wide range of levels based on only low-threshold fibers with a limited dynamic range.

7. Level discrimination is essentially independent of frequency.
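Point 4 can be illustrated numerically. The sketch below is a rough Python illustration, not the authors' model: the phase-locked rate form r(t) = R(1 + m cos(2πft)) and all parameter values are assumptions. Because the sensitivity of discharge times to f grows linearly with t, the Fisher information about frequency grows as T³ and the Cramér-Rao JND falls as T^{-3/2}, whereas a pure spike count gains information only linearly in T, giving T^{-1/2}:

```python
import numpy as np

def fisher_info_freq(f, T, R=100.0, m=0.5, n=200_000):
    """Fisher information about f in an inhomogeneous Poisson process with
    phase-locked rate r(t) = R*(1 + m*cos(2*pi*f*t)) observed on [0, T]:
    I(f) = integral of (dr/df)**2 / r(t) dt."""
    dt = T / n
    t = (np.arange(n) + 0.5) * dt  # midpoint time grid
    r = R * (1.0 + m * np.cos(2.0 * np.pi * f * t))
    drdf = -R * m * 2.0 * np.pi * t * np.sin(2.0 * np.pi * f * t)
    return float(np.sum(drdf**2 / r) * dt)

f = 970.0                                        # tone frequency (Hz)
jnd = lambda T: fisher_info_freq(f, T) ** -0.5   # Cramer-Rao JND (arb. units)

# Doubling the duration shrinks this all-information JND by about 2**1.5,
# whereas a count-based (rate-place) JND would shrink only by 2**0.5.
print(jnd(0.1) / jnd(0.2))
```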
The predictions from the computational AN model differed from Siebert's (1965, 1968, 1970) predictions in two cases: (1) level discrimination in the computational model improved at a slower rate versus duration, similarly to human data, due to the effect of neural adaptation, which was not included in Siebert's (1965, 1968) model; and (2) level-discrimination performance in our model demonstrated the near-miss to Weber's law that is observed in human performance, due to the nontriangular filters used in the computational model, whereas Siebert's (1965, 1968) model predicted Weber's law. Both of these discrepancies represent cases in which the computational model provides a better match to human performance. The general agreement between the predictions from the computational and analytical models establishes the validity of the computational approach. The predictions from the computational AN model extended Siebert's (1965, 1968, 1970) predictions in two cases: (1) all-information predictions could be made at high frequencies due to the inclusion of the rolloff in phase locking, and (2) predictions could be made at short durations due to the inclusion of realistic onset and offset responses and neural adaptation. Both of these extensions raise important issues for auditory encoding of frequency and level, as discussed below.

6.2 Encoding of Frequency.
6.2.1 Rate-Place. The absolute JNDs for optimal use of rate-place information are close to human performance; they are much closer to human JNDs than are the all-information predictions (see Figure 4). Further, rate-place predictions match the trends in human performance versus frequency below 2 kHz and versus duration above 20 ms. However, there are several inconsistencies between predicted rate-place and human performance that must be explained if a rate-place model is to account for human performance.

First, predicted frequency-discrimination performance based on rate-place information is inconsistent with the distinct worsening in human performance above 2 kHz (see Figure 4a). Our prediction of invariant performance with frequency is consistent with all other rate-place models for frequency discrimination that incorporate realistic filter slopes as a function of frequency (e.g., Siebert, 1968, 1970; Javel & Mott, 1988). Thus, in order for a rate-place model to account for human frequency discrimination, some alteration must be included to account for the order-of-magnitude increase in Δf/f between 2 and 8 kHz. Several possible explanations are considered:

1. A decrease in the efficiency of processing rate-place information above 2 kHz could explain human performance; however, it is not clear why a mechanism for counting AN discharges would be optimal at low frequencies but an order of magnitude less efficient at high frequencies.

2. A reduction in the innervation density of independent AN fibers at high frequencies by a factor of 100 could explain the order-of-magnitude worsening in human performance. However, this is inconsistent with the increase in AN-fiber innervation density from apex to base of the cat cochlea (Keithley & Schreiber, 1987; Liberman, Dodds, & Pierce, 1990).

3. The maximum potential increase in Δf/f that would result from the disappearance of the entire upper side of the information profile (see Figure 5) off the basal end of the cochlea as tone frequency increases is √2, which is too small to account for human performance.

4. A decrease in the sharpness of tuning at high frequencies is inconsistent with reports of human (Glasberg & Moore, 1990, used in this study) and cat tuning (Liberman, 1978; Miller, Schilling, Franck, & Young, 1997), which become sharper as frequency increases.

Thus, there appears to be no physiologically realistic explanation for the worsening of rate-place performance as frequency increases above 2 kHz, as observed in the human data.

Second, asymptotic rate-place performance as level increases (actually a slight increase in computational-model JNDs) is inconsistent with the shallow improvement in human performance as level increases (see Figure 4c). The inclusion of high-threshold, low-spontaneous-rate AN fibers could be expected to lead to improved performance at high levels. However, Erell (1988) demonstrated that inclusion of medium- and low-spontaneous-rate fibers (with higher thresholds and larger dynamic ranges; Liberman, 1978) did not significantly extend the limited range of levels over which rate-place performance improves.

6.2.2 All-Information. In contrast to rate-place, all-information performance matches the trends in human performance across all frequencies, especially at high frequencies, and across level (see Figures 4a and 4c). The current predictions quantify that there is significant temporal information in the AN up to at least 10 kHz, despite very low synchrony coefficients for AN fibers at high frequencies (see Figure 1c).
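The population JNDs behind these comparisons follow from adding per-fiber information: for independent fibers, the inverse-squared single-fiber JNDs sum, and the population JND is the inverse square root of that sum. A minimal sketch (the Gaussian information profile below is a made-up stand-in for the profiles of Figure 5) shows why deleting the entire upper side of a symmetric profile raises the JND by only a factor of √2:

```python
import numpy as np

# Hypothetical symmetric information profile across 60 model CFs;
# info[i] plays the role of the inverse-squared single-CF JND.
cf = np.arange(60)
info = np.exp(-((cf - 29.5) / 10.0) ** 2)

jnd_full = info.sum() ** -0.5            # optimal combination over all CFs
jnd_no_upper = info[:30].sum() ** -0.5   # upper side of the profile deleted

# For a symmetric profile, exactly half the information remains, so the
# JND grows by exactly sqrt(2).
print(jnd_no_upper / jnd_full)
```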
However, there are several inconsistencies between predicted all-information performance and human performance. First, there is a two-orders-of-magnitude discrepancy between human performance and optimal all-information performance at both low and high frequencies. Second, all-information performance for frequency discrimination improves much too rapidly as duration increases (Δf ∝ T^{-3/2} rather than Δf ∝ T^{-1/2}; see Figure 4b). This rapid improvement is due to allowing all possible comparisons between discharges to be used by the optimal processor (see Figure 3).

6.2.3 Potential Signal-Processing Mechanisms. It is of interest to hypothesize signal-processing mechanisms that would incorporate the useful aspects of both rate-place and all-information encoding schemes for explaining human performance. A processor that makes use of the temporal discharge patterns appears to be required to produce the U-shaped dependence of Δf/f on frequency (see Figure 4a); however, the temporal information must be used inefficiently to account for the duration effect (see Figure 4b). In a pilot modeling study, the computational AN model population was used to make predictions based on Goldstein and Srulovicz's (1977) analysis of first-order intervals.¹⁰ While the first-order-interval behavior demonstrated a Δf ∝ T^{-1/2} relation at long durations (T ≥ 50 ms), a Δf ∝ T^{-3/2} relation was observed at short durations. The steep improvement at short durations is similar to Goldstein and Srulovicz's predictions, but is steeper than the improvement in human performance at short durations (see Figure 4b; Moore, 1973). The U-shaped dependence of performance versus frequency was maintained by the computational first-order-interval scheme, similar to Goldstein and Srulovicz, with a parallel upward shift of the entire curve by roughly one order of magnitude (i.e., roughly halfway closer to human performance). A temporal processor that considered all discharge times (i.e., all-order intervals) within a limited temporal window would be expected to demonstrate the desired Δf ∝ T^{-1/2} dependence for stimulus durations much longer than the length of the temporal window. For durations shorter than the temporal window, such a scheme would be expected to behave similarly to the all-information processor, which matches the rate of improvement in human performance with duration for T ≤ 16 ms in this study (see Figure 4b).
A restricted temporal processor of this type would demonstrate the same shallow improvement with level as the all-information scheme, consistent with human performance, and would be considerably closer to overall human performance levels due to the large amount of temporal information discarded (see Figures 3 and 4b). It is reasonable to assume that neurons in the central nervous system perform signal processing over restricted temporal windows, given that cell membranes have finite time constants; however, specific restricted temporal processors must be proposed and tested quantitatively to determine whether the above expectations hold.

6.3 Encoding of Level. Our predictions for pure-tone level discrimination did not demonstrate a large difference between performance based on
10 There are several simplifying assumptions in Goldstein and Srulovicz's (1977) analysis of frequency-discrimination performance based on first-order intervals. Several of the simplifications, made by assuming that RT was large (where R is average rate), were removed in the pilot study with the computational AN model and had little effect. However, the assumption that all intervals within the duration of the stimulus are independent, which is not true for a limited-duration stimulus or for a nonstationary Poisson process, could not be removed. The significance of this assumption for predicting optimal performance based on first-order intervals is unclear.
the all-information and rate-place encoding schemes. Performance based on optimal use of the information available in the AN was roughly one order of magnitude better than human performance. Optimal performance demonstrated the same general trends seen in human performance for the conditions studied (see Figure 6). This result is of interest because the AN model used in this study did not include many physiological properties that have been hypothesized to be important for level discrimination, as discussed below. The computational AN model used in this study did not include several important aspects of the nonlinear tuning on the basilar membrane (BM): changes in gain, bandwidth, and phase with level (Rhode, 1971; Anderson, Rose, Hind, & Brugge, 1971; Geisler & Rhode, 1982; Ruggero, Rich, Recio, Narayan, & Robles, 1997; Recio, Rich, Narayan, & Ruggero, 1998), and suppression (Sachs & Kiang, 1968; Delgutte, 1990; Ruggero, Robles, & Rich, 1992). The perceptual significance of the compressive gain observed on the BM has been examined primarily psychophysically (see Moore, 1995, and Moore & Oxenham, 1998, for reviews). The compressive gain can extend the dynamic range over which changes in level can be detected in individual AN fibers (Heinz, 2000). The level-dependent phase response of AN fibers has a much larger dynamic range than rate information (Anderson et al., 1971) and thus can provide additional temporal information for level discrimination at high levels, for which the majority of AN fibers are saturated in terms of average rate (Carney, 1994; Heinz, 2000). The absence of suppression in our model prohibits the accurate simulation of AN responses to complex stimuli, as discussed in section 6.5. The near-miss to Weber's law in the computational model predictions matches human performance (see Figure 6a), but contrasts with Siebert's (1965, 1968) prediction of Weber's law.
Both the computational AN model and Siebert's (1965, 1968) model have only high-spontaneous-rate (HSR), low-threshold fibers and rely strongly on spread of excitation across the AN population to encode changes in level at medium and high levels. In the computational model, an increase in the number of CFs contributing useful information as level increases, especially above the frequency of the tone, produces the near-miss (see Figure 7). Potentially significant differences between the computational AN model and Siebert's (1965, 1968) analytical model include (1) the use of gammatone filters rather than triangular filters, (2) the form of the variation in filter bandwidth with CF, and (3) the distribution of CF across the population of AN fibers. Our results, and those of Teich and Lachs (1979), demonstrate that the near-miss does not require Weber's law to hold in narrow frequency regions (an assumption that is often made in psychophysical models; e.g., Florentine & Buus, 1981). However, the simplified AN model used in this study has several significant limitations that preclude any definitive conclusions regarding level encoding. Our model would not likely predict Weber's law for broadband noise (Miller, 1947) or for narrowband noise in band-reject noise (Viemeister,
1983), or the nonmonotonic dependence of ΔL on level at high frequencies (Carlyon & Moore, 1984; Florentine et al., 1987). A significant limitation of the AN model used in this study is the exclusion of LSR, high-threshold AN fibers (16% of the population; Liberman, 1978), which have been implicated in accounting for Weber's law in limited frequency regions (Colburn, 1981; Delgutte, 1987; Viemeister, 1988a, 1988b; Winslow & Sachs, 1988; Winter & Palmer, 1991). However, more efficient processing of the LSR fibers than of the HSR fibers is required to achieve Weber's law in narrow frequency regions, and thus the near-miss when information is combined across the population of CFs (Delgutte, 1987). A more complex AN model that includes the nonlinear changes in gain and phase as well as the LSR population is needed to quantify the relative contributions of these physiological properties to the encoding of sound level at high levels (Heinz, 2000). An interesting finding from this study, despite the limitations mentioned, is that the shallow improvement in ΔL with duration observed in the computational model predictions matches human performance (see Figure 6b). This shallow improvement is likely due to neural adaptation, as suggested by Buus and Florentine (1992), and indicates that a limitation in the temporal integration capabilities of a central processor is not needed to explain the shallow improvement in level-discrimination performance as duration increases.

6.4 Relation Between Frequency and Level Encoding. The bandpass tuning in the auditory system implies that a change in rate on an AN fiber can result from a change in either stimulus level or frequency. The implications of this property for random-level frequency discrimination are evaluated quantitatively in the companion study. Siebert (1968) used a simple AN model with triangular-shaped filters to demonstrate a fundamental relationship between level and frequency discrimination based on rate-place information.
A shortcoming of rate-place models has been their inability to account for both frequency and level discrimination in terms of the ratio of Weber fractions W_A/W_F (Siebert, 1968; Erell, 1988). Siebert (1968) predicted that the ratio of the Weber fraction for amplitude, W_A = ΔA/A, to the Weber fraction for frequency, W_F = Δf/f, should be roughly 15 for rate-place models, based on the shapes of AN tuning curves. Values of 10 to 15 for this ratio are typical of most rate-place models; however, human psychophysical data typically yield a ratio closer to 50 (Erell, 1988). Erell (1988) suggested that this discrepancy could be resolved by using filter bandwidths that were four to five times narrower than those based on tuning in cat, and that human tuning is sharper than in cat. Typical values of the ratio W_A/W_F from this study (based on human tuning estimated psychophysically) were 11 for rate-place and 710 for all-information (for f = 970 Hz, L = 40 dB SPL, T = 64 ms, and 4 ms rise/fall). Human behavior lies between the predicted behavior based on the rate-place and all-information encoding schemes. This result supports the notion that a restricted temporal encoding scheme (one intermediate to rate-place and all-information) should be pursued.

6.5 Limitations of This Study.

6.5.1 Computational Auditory Nerve Model. The computational AN model used in this study has a basic filter shape that is the same for low and high CFs, with the only difference being filter bandwidth. This assumption ignores the tails of AN tuning curves observed for high-CF AN fibers (Kiang & Moxon, 1974). Inclusion of tails would affect the details of the predicted performance as a function of level; however, the significance of this assumption is minimized for the predictions versus frequency and duration, which were limited to low- and mid-level stimuli. The omission of tuning-curve tails and of nonlinear interaction between different CFs, for example, suppression (Sachs & Kiang, 1968; Delgutte, 1990), also limits the applicability of this AN model for complex (broadband) sounds. The Poisson assumption used in our model ignores refractory effects and could affect some of the details of the performance limits predicted in this study. Siebert (1965, 1968, 1970) and Colburn (1969, 1973) have argued that this assumption is not significant because AN fibers never respond at a sustained rate for which the typical interspike interval is less than four to five times the absolute refractory period; however, this argument does not rule out a significant influence of the relative refractory period for sustained responses and does not address short-duration stimuli. The standard deviation of AN discharge counts for long-duration stimuli (i.e., ≥ 50 ms) has been reported to be less than that predicted by the Poisson model (Teich & Khanna, 1985; Young & Barta, 1986; Delgutte, 1987, 1996; Winter & Palmer, 1991). The maximum deviation from the Poisson model occurs at high discharge rates, and the Poisson model becomes more accurate as discharge rate decreases.
Young and Barta (1986) reported that the standard deviation of the highest discharge counts for 200-ms stimuli ranged from 55% to 80% of the Poisson value, quantitatively consistent with other reports that used 50-ms stimuli (e.g., Delgutte, 1987, and Winter & Palmer, 1991). Thus, the maximum discrepancy from using the Poisson model for long-duration stimuli (≥ 50 ms) could be an overestimate of some rate-place JNDs by a factor of two; however, the discrepancy will typically be much less, since fibers with high discharge rates do not contribute significant rate-place information (see Figures 5 and 7; Colburn, 1981; Winter & Palmer, 1991). Miller, Barta, and Sachs (1987) provided evidence that single-unit phase-locked responses are consistent with a nonstationary Poisson model. They showed that the mean and variance of period histograms (from sustained responses) had roughly the same pattern and absolute value. Thus, for stimulus durations greater than 50 ms, the errors from using a Poisson model without refractory effects are small compared with the important trends in our predictions, which often range over more than an order of magnitude.
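The Poisson count statistics invoked here are easy to check directly. The sketch below is a toy check, not the AN model; the rate, duration, and trial count are arbitrary assumptions. Simulated Poisson counts have a standard deviation equal to the square root of the mean, which is the quantity that measured AN counts undershoot by the 55% to 80% reported above:

```python
import numpy as np

rng = np.random.default_rng(1)
rate, T, trials = 150.0, 0.2, 20_000   # sustained rate (sp/s), duration (s)

counts = rng.poisson(rate * T, size=trials)
mean, std = counts.mean(), counts.std()

# Poisson prediction: std == sqrt(mean). Measured AN standard deviations
# of 55-80% of this value imply a count variance up to roughly 3x smaller,
# so Poisson-based rate-place JNDs can be overestimated by at worst about
# a factor of two.
print(mean, std, np.sqrt(mean))
```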
For short-duration stimuli, refractory effects could be expected to have a larger influence than for long-duration stimuli; however, there are insufficient data to estimate quantitatively the effect of refractoriness on rate-place and all-information predictions for short-duration stimuli. The overestimation of rate-place JNDs may be more significant for short-duration stimuli because onset responses have higher discharge rates (and larger dynamic ranges; see Figure 1b) than sustained responses and therefore may deviate more from the Poisson model. The effect of refractoriness on the all-information model can be thought of as removing those discharges that fall within the refractory period of a previous discharge. Thus, all-information predictions underestimate JNDs, because refractoriness results in less information (i.e., fewer observations) than the Poisson model. However, all-information predictions for frequency discrimination may not be significantly influenced by refractoriness, even for short-duration stimuli, because low-order intervals are removed, and all-information performance is dominated by information from large-order intervals (see Figure 3). Thus, refractoriness would generally act to worsen all-information performance and improve rate-place performance, and this prediction can be used to bound the potential effects of refractoriness in this study; for example, a small effect is predicted for level discrimination (see Figure 6).

6.5.2 Signal Detection Theory. The Cramér-Rao bound is restricted in that it only provides a limit on the performance of all possible decision processes for a Poisson process, and thus should be viewed primarily as a method for quantifying an upper bound on the total information available in the auditory nerve. When an efficient estimator is shown to exist, as is the case here, the Cramér-Rao bound quantifies the total information exactly.
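As an illustration of this point, the toy example below (not the AN analysis itself; the homogeneous rate, duration, and trial counts are arbitrary assumptions) estimates a Poisson discharge rate by maximum likelihood and compares the estimator's variance to the Cramér-Rao bound, which it attains because the count-based ML estimator is efficient:

```python
import numpy as np

rng = np.random.default_rng(2)
r_true, T, n_trials, n_reps = 80.0, 0.5, 100, 4000

# ML estimate of a homogeneous Poisson rate from n_trials spike counts:
# mean count divided by duration.
counts = rng.poisson(r_true * T, size=(n_reps, n_trials))
r_hat = counts.mean(axis=1) / T

# Cramer-Rao bound for this problem: Var(r_hat) >= r / (T * n_trials);
# the empirical variance of the efficient ML estimator matches it.
cr_bound = r_true / (T * n_trials)
print(r_hat.var(), cr_bound)
```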
The Cramér-Rao bound method by itself does not describe the type of processing required to achieve the performance limit, which is often significantly better than human performance (e.g., see Figure 4). Derived optimal processors (e.g., using a likelihood-ratio test) typically include many physiologically unrealistic operations on the observed population of AN discharges, such as the ability to compare all discharge times across the entire duration of the stimulus, as well as the ability to compare discharges across all AN fibers (Colburn, 1969, 1973; Siebert, 1970). It is often the case that physiologically realistic, but nonoptimal, processors can be described that perform at levels more consistent with human performance (e.g., Colburn, 1969, 1973, 1977a, 1977b; Goldstein & Srulovicz, 1977; see Delgutte, 1996). This approach provides intuition regarding the actual processing performed in the auditory system and is thus a natural second step after the fundamental performance limits have been described based on the total AN information. There is particular interest in the potential to evaluate quantitatively the effect of noise maskers in experiments that have been used to test rate-place and temporal encoding schemes (e.g., Viemeister, 1983; Moore & Glasberg,
1989). However, the SDT analysis used in this study is limited to single-parameter discrimination tasks with deterministic stimuli (e.g., frequency or level discrimination). Analysis of performance limits in tasks with random stimuli requires several extensions of this study. The influence of stimulus (external) variability on the information in individual AN fibers must be calculated in addition to the influence of the physiological (internal) variability that was evaluated in this study. Random stimuli also result in correlations between the activity of AN fibers with overlapping frequency tuning, due to the common stimulus drive (Young & Sachs, 1989). Thus, the independence assumption used in this study is invalid for random stimuli, and the correlation between AN fibers must be considered in evaluating performance limits based on the entire AN population. Suppression (i.e., nonlinear interaction between different CFs; Sachs & Kiang, 1968; Delgutte, 1990), which is not included in our AN model, influences AN responses to complex stimuli. Suppression could potentially produce significant interactions between a target and an off-frequency noise masker that are not intuitively clear or consistent with the assumption that all fiber information is eliminated within the noise masker (e.g., Viemeister, 1983; Moore & Glasberg, 1989). Therefore, suppression must be included in AN models to evaluate performance limits for tasks with random-noise stimuli. There are now computational AN models that include the effects of suppression and produce responses to arbitrary stimuli (e.g., Robert & Eriksson, 1999; Zhang et al., 2001). An initial extension of the SDT analysis to random stimuli is presented in the companion paper in this issue, which evaluates performance limits for psychophysical discrimination tasks in which a single nuisance parameter is randomly varied (e.g., random-level frequency discrimination).
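The effect of such across-fiber correlation can be previewed with a standard signal detection result. In this sketch (a generic Gaussian calculation, not the companion paper's analysis; the equicorrelated structure and the values of d′ and ρ are assumptions), the optimal combined sensitivity of n equally informative, equicorrelated observations is d²_comb = n·d²/(1 + (n−1)ρ), so assuming independence (ρ = 0) overstates the available information whenever ρ > 0:

```python
import numpy as np

def combined_dprime(d, rho, n):
    """Optimal combined d' for n unit-variance Gaussian observations with
    equal single-observation sensitivity d and pairwise correlation rho:
    d_comb**2 = n * d**2 / (1 + (n - 1) * rho)."""
    return np.sqrt(n * d**2 / (1.0 + (n - 1) * rho))

d = 1.0
print(combined_dprime(d, 0.0, 2))   # sqrt(2) under the independence assumption
print(combined_dprime(d, 0.5, 2))   # smaller once the shared drive is counted
```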
A more general extension of the SDT analysis to random stimuli has been developed and applied to the detection of tones in noise maskers (Heinz, 2000). In that study, in addition to the effects of suppression and across-fiber correlation, the information contained in the temporal patterns (either fine-time or envelope) of AN fiber responses was shown to be significant.

7 Conclusion
Signal detection theory can be combined with computational models to predict psychophysical performance limits based on an entire neural population. Auditory discrimination performance limits based on the computational AN model matched predictions from previous studies over the parameter range for which analytical AN models were applicable. The parameter range over which psychophysical performance can be predicted was extended using the computational AN model. The following three conclusions related to the encoding of frequency are not likely to depend on the details of the AN model used in this study:

1. Optimal use of rate-place information is consistent with human frequency-discrimination performance at low frequencies but inconsistent with the trends in performance versus frequency at high frequencies.

2. Frequency discrimination based on optimal use of all-information is roughly two orders of magnitude better than human performance at all frequencies but demonstrates the correct trends versus frequency, based on the frequency dependence of phase locking in the cat.

3. There is significant temporal information in the AN for frequency discrimination up to at least 10 kHz, and thus temporal schemes cannot be rejected at high frequencies based on the decrease in phase locking in the AN.

Future studies using the computational approach with more complex AN models are now justified and will be particularly interesting for the task of level discrimination. In addition, this approach is applicable to any neural population in a sensory system for which a model of the discharge-pattern statistics is available.

Acknowledgments
We thank Bertrand Delgutte, Christopher Long, Christine Mason, Martin McKinney, Susan Moscynski, Bill Peake, Tom Person, Timothy Wilson, Xuedong Zhang, and Ling Zheng for providing valuable comments on an earlier version of this article. We greatly appreciate the constructive comments and suggestions provided by three anonymous reviewers. We thank Mary Florentine for providing the data from Florentine (1986) shown in Figure 4b and Don Johnson for his synchrony data from cat shown in Figure 1c. This study was part of a graduate dissertation in the Speech and Hearing Sciences Program of the Harvard-MIT Division of Health Sciences and Technology (Heinz, 2000). Portions of this work were presented at the 21st and 22nd Midwinter Meetings of the Association for Research in Otolaryngology. The simulations in this study were performed on computers provided by the Scientific Computing and Visualization group at Boston University. This work was supported in part by the National Institutes of Health, Grants T32DC00038 and R01DC00100, as well as by the Office of Naval Research, Grant Agmt Z883402.

References

Abbott, L. F., & Sejnowski, T. J. (1999). Introduction. In L. F. Abbott & T. J. Sejnowski (Eds.), Neural codes and distributed representations: Foundations of neural computation. Cambridge, MA: MIT Press.

Anderson, D. J., Rose, J. E., Hind, J. E., & Brugge, J. F. (1971). Temporal position of discharges in single auditory nerve fibers within the cycle of a sine-wave stimulus: Frequency and intensity effects. J. Acoust. Soc. Am., 49, 1131–1139.
Braida, L. D., & Durlach, N. I. (1988). Peripheral and central factors in intensity perception. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Auditory function: Neurobiological bases of hearing (pp. 559–583). New York: Wiley.

Buus, S., & Florentine, M. (1991). Psychometric functions for level discrimination. J. Acoust. Soc. Am., 90, 1371–1380.

Buus, S., & Florentine, M. (1992). Possible relation of auditory-nerve adaptation to slow improvement in level discrimination with increasing duration. In Y. Cazals, L. Demany, & K. Horner (Eds.), Auditory physiology and perception (pp. 279–288). New York: Pergamon.

Cariani, P. A., & Delgutte, B. (1996a). Neural correlates of the pitch of complex tones: I. Pitch and pitch salience. J. Neurophysiol., 76, 1698–1716.

Cariani, P. A., & Delgutte, B. (1996b). Neural correlates of the pitch of complex tones: II. Pitch shift, pitch ambiguity, phase invariance, pitch circularity, rate pitch, and the dominance region for pitch. J. Neurophysiol., 76, 1717–1734.

Carlyon, R. P., & Moore, B. C. J. (1984). Intensity discrimination: A severe departure from Weber's law. J. Acoust. Soc. Am., 76, 1369–1376.

Carney, L. H. (1993). A model for the responses of low-frequency auditory-nerve fibers in cat. J. Acoust. Soc. Am., 93, 401–417.

Carney, L. H. (1994). Spatiotemporal encoding of sound level: Models for normal encoding and recruitment of loudness. Hear. Res., 76, 31–44.

Colburn, H. S. (1969). Some physiological limitations on binaural performance. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.

Colburn, H. S. (1973). Theory of binaural interaction based on auditory-nerve data. I. General strategy and preliminary results on interaural discrimination. J. Acoust. Soc. Am., 54, 1458–1470.

Colburn, H. S. (1977a). Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise. J. Acoust. Soc. Am., 61, 525–533.

Colburn, H. S. (1977b).
Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise. Supplementary material. (AIP Document No. PAPS JASMA-61-525-98). Available online at: http://www.aip.org/epaps/howorder.html.

Colburn, H. S. (1981). Intensity perception: Relation of intensity discrimination to auditory-nerve firing patterns (Internal Memorandum). Cambridge, MA: Research Laboratory of Electronics, Massachusetts Institute of Technology.

Cramér, H. (1951). Mathematical methods of statistics. Princeton: Princeton University Press.

Dau, T., Kollmeier, B., & Kohlrausch, A. (1997a). Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am., 102, 2892–2905.

Dau, T., Kollmeier, B., & Kohlrausch, A. (1997b). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J. Acoust. Soc. Am., 102, 2906–2919.

Dau, T., Püschel, D., & Kohlrausch, A. (1996a). A quantitative model of the "effective" signal processing in the auditory system I. Model structure. J. Acoust. Soc. Am., 99, 3615–3622.
Performance Limits Using Computational Models
2311
Dau, T., Püschel, D., & Kohlrausch, A. (1996b). A quantitative model of the "effective" signal processing in the auditory system II. Simulations and measurements. J. Acoust. Soc. Am., 99, 3623–3631.
Delgutte, B. (1987). Peripheral auditory processing of speech information: Implications from a physiological study of intensity discrimination. In M. E. H. Schouten (Ed.), The psychophysics of speech perception (pp. 333–353). Dordrecht: Nijhoff.
Delgutte, B. (1990). Two-tone rate suppression in auditory-nerve fibers: Dependence on suppressor frequency and level. Hear. Res., 49, 225–246.
Delgutte, B. (1996). Physiological models for basic auditory percepts. In H. L. Hawkins, T. A. McMullen, A. N. Popper, & R. R. Fay (Eds.), Auditory computation (pp. 157–220). New York: Springer-Verlag.
Duifhuis, H. (1973). Consequences of peripheral frequency selectivity for nonsimultaneous masking. J. Acoust. Soc. Am., 54, 1471–1488.
Durlach, N. I., & Braida, L. D. (1969). Intensity perception I: Preliminary theory of intensity resolution. J. Acoust. Soc. Am., 46, 372–383.
Dye, R. H., & Hafter, E. R. (1980). Just-noticeable differences of frequency for masked tones. J. Acoust. Soc. Am., 67, 1746–1753.
Erell, A. (1988). Rate coding model for discrimination of simple tones in the presence of noise. J. Acoust. Soc. Am., 84, 204–214.
Evans, E. F., Pratt, S. R., Spenner, H., & Cooper, N. P. (1992). Comparisons of physiological and behavioural properties: Auditory frequency selectivity. In Y. Cazals, L. Demany, & K. Horner (Eds.), Auditory physiology and perception (pp. 159–169). New York: Pergamon.
Fitzhugh, R. (1958). A statistical analyser for optic nerve messages. J. Gen. Physiol., 41, 675–692.
Florentine, M. (1986). Level discrimination of tones as a function of duration. J. Acoust. Soc. Am., 79, 792–798.
Florentine, M., & Buus, S. (1981). An excitation pattern model for intensity discrimination. J. Acoust. Soc. Am., 70, 1646–1654.
Florentine, M., Buus, S., & Mason, C. R. (1987). Level discrimination as a function of level for tones from 0.25 to 16 kHz. J. Acoust. Soc. Am., 81, 1528–1541.
Freyman, R. L., & Nelson, D. A. (1986). Frequency discrimination as a function of tonal duration and excitation-pattern slopes in normal and hearing-impaired listeners. J. Acoust. Soc. Am., 79, 1034–1044.
Geisler, C. D., & Rhode, W. S. (1982). The phases of basilar-membrane vibrations. J. Acoust. Soc. Am., 71, 1201–1203.
Giguère, C., & Woodland, P. C. (1994a). A computational model of the auditory periphery for speech and hearing research. I. Ascending path. J. Acoust. Soc. Am., 95, 331–342.
Giguère, C., & Woodland, P. C. (1994b). A computational model of the auditory periphery for speech and hearing research. II. Descending paths. J. Acoust. Soc. Am., 95, 343–349.
Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of auditory filter shapes from notched-noise data. Hear. Res., 47, 103–138.
Goldstein, J. L., & Srulovicz, P. (1977). Auditory-nerve spike intervals as an adequate basis for aural spectrum analysis. In E. F. Evans & J. P. Wilson (Eds.),
Psychophysics and physiology of hearing (pp. 337–347). New York: Academic Press.
Grantham, D. W., & Yost, W. A. (1982). Measures of intensity discrimination. J. Acoust. Soc. Am., 72, 406–410.
Green, D. M. (1988). Profile analysis: Auditory intensity discrimination. New York: Oxford University Press.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwood, D. D. (1990). A cochlear frequency-position function for several species—29 years later. J. Acoust. Soc. Am., 87, 2592–2605.
Gresham, L. C., & Collins, L. M. (1998). Analysis of the performance of a model-based optimal auditory signal processor. J. Acoust. Soc. Am., 103, 2520–2529.
Heinz, M. G. (2000). Quantifying the effects of the cochlear amplifier on temporal and average-rate information in the auditory nerve. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Houtsma, A. J. M. (1995). Pitch perception. In B. C. J. Moore (Ed.), Hearing. New York: Academic Press.
Huettel, L. G., & Collins, L. M. (1999). Using computational auditory models to predict simultaneous masking data: Model comparison. IEEE Trans. Biomed. Eng., 46, 1432–1440.
Javel, E., & Mott, J. B. (1988). Physiological and psychophysical correlates of temporal processing in hearing. Hear. Res., 34, 275–294.
Jesteadt, W., Wier, C. C., & Green, D. M. (1977). Intensity discrimination as a function of frequency and sensation level. J. Acoust. Soc. Am., 61, 169–177.
Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J. Acoust. Soc. Am., 68, 1115–1122.
Johnson, D. H., & Kiang, N. Y. S. (1976). Analysis of discharges recorded simultaneously from pairs of auditory-nerve fibers. Biophys. J., 16, 719–734.
Joris, P. X., Carney, L. H., Smith, P. H., & Yin, T. C. T. (1994). Enhancement of neural synchrony in the anteroventral cochlear nucleus. I. Responses to tones at the characteristic frequency. J. Neurophysiol., 71, 1022–1036.
Kaernbach, C., & Demany, L. (1998). Psychophysical evidence against the autocorrelation theory of auditory temporal processing. J. Acoust. Soc. Am., 104, 2298–2306.
Keithley, E. M., & Schreiber, R. C. (1987). Frequency map of the spiral ganglion in the cat. J. Acoust. Soc. Am., 81, 1036–1042.
Kiang, N. Y. S. (1990). Curious oddments of auditory-nerve studies. Hear. Res., 49, 1–16.
Kiang, N. Y. S., & Moxon, E. C. (1974). Tails of tuning curves of auditory-nerve fibers. J. Acoust. Soc. Am., 55, 620–630.
Kiang, N. Y. S., Watanabe, T., Thomas, E. C., & Clark, L. F. (1965). Discharge patterns of single fibers in the cat's auditory nerve. Cambridge, MA: MIT Press.
Liberman, M. C. (1978). Auditory-nerve response from cats raised in a low-noise chamber. J. Acoust. Soc. Am., 63, 442–455.
Liberman, M. C. (1980). Morphological differences among radial afferent fibers in the cat cochlea: An electron microscopic study of serial sections. Hear. Res., 3, 45–63.
Liberman, M. C., Dodds, L. W., & Pierce, S. (1990). Afferent and efferent innervation of the cat cochlea: Quantitative analysis with light and electron microscopy. J. Comp. Neurol., 301, 443–460.
Lin, T., & Goldstein, J. L. (1995). Quantifying 2-factor phase relations in nonlinear responses from low characteristic-frequency auditory-nerve fibers. Hear. Res., 90, 126–138.
May, B. J., & Sachs, M. B. (1992). Dynamic range of neural rate responses in the ventral cochlear nucleus of awake cats. J. Neurophysiol., 68, 1589–1602.
McGill, W. J., & Goldberg, J. P. (1968). A study of the near-miss involving Weber's law and pure-tone intensity discrimination. Perception & Psychophysics, 4, 105–109.
Miller, G. A. (1947). Sensitivity to changes in the intensity of white noise and its relation to masking and loudness. J. Acoust. Soc. Am., 19, 606–619.
Miller, M. I., Barta, P. E., & Sachs, M. B. (1987). Strategies for the representation of a tone in background noise in the temporal aspects of the discharge patterns of auditory-nerve fibers. J. Acoust. Soc. Am., 81, 665–679.
Miller, R. L., Schilling, J. R., Franck, K. R., & Young, E. D. (1997). Effects of acoustic trauma on the representation of the vowel /e/ in cat auditory nerve fibers. J. Acoust. Soc. Am., 101, 3602–3616.
Moore, B. C. J. (1973). Frequency difference limens for short-duration tones. J. Acoust. Soc. Am., 54, 610–619.
Moore, B. C. J. (1989). An introduction to the psychology of hearing. New York: Academic Press.
Moore, B. C. J. (1995). Perceptual consequences of cochlear damage. New York: Oxford University Press.
Moore, B. C. J., & Glasberg, B. R. (1989). Mechanisms underlying the frequency discrimination of pulsed tones and the detection of frequency modulation. J. Acoust. Soc. Am., 86, 1722–1732.
Moore, B. C. J., & Oxenham, A. J. (1998). Psychoacoustic consequences of compression in the peripheral auditory system. Psychol. Rev., 105, 108–124.
Mountain, D. C., & Hubbard, A. E. (1996). Computational analysis of hair cell and auditory nerve processes. In H. L. Hawkins, T. A. McMullen, A. N. Popper, & R. R. Fay (Eds.), Auditory computation (pp. 121–156). New York: Springer-Verlag.
Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Annu. Rev. Neurosci., 21, 227–277.
Parzen, E. (1962). Stochastic processes. San Francisco: Holden-Day.
Patterson, R. D., Allerhand, M. H., & Giguère, C. (1995). Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. J. Acoust. Soc. Am., 98, 1890–1894.
Patterson, R. D., Nimmo-Smith, I., Holdsworth, J., & Rice, P. (1987, Dec. 14–15). An efficient auditory filterbank based on the gammatone function. Paper presented at a meeting of the IOC Speech Group on Auditory Modeling at RSRE.
Patuzzi, R., & Robertson, D. (1988). Tuning in the mammalian cochlea. Physiol. Rev., 68, 1009–1082.
Payton, K. L. (1988). Vowel processing by a model of the auditory periphery: A comparison to eighth-nerve responses. J. Acoust. Soc. Am., 83, 145–162.
Pickles, J. O. (1988). An introduction to the physiology of hearing. New York: Academic Press.
Rabinowitz, W. M., Lim, J. S., Braida, L. D., & Durlach, N. I. (1976). Intensity perception VI: Summary of recent data on deviations from Weber's law for 1000-Hz tone pulses. J. Acoust. Soc. Am., 59, 1506–1509.
Rasmussen, G. L. (1940). Studies of the VIIIth cranial nerve in man. Laryngoscope, 50, 67–83.
Recio, A., Rich, N. C., Narayan, S. S., & Ruggero, M. A. (1998). Basilar-membrane responses to clicks at the base of the chinchilla cochlea. J. Acoust. Soc. Am., 103, 1972–1989.
Rhode, W. S. (1971). Observations of the vibration of the basilar membrane in squirrel monkeys using the Mössbauer technique. J. Acoust. Soc. Am., 49, 1218–1231.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Robert, A., & Eriksson, J. L. (1999). A composite model of the auditory periphery for simulating responses to complex sounds. J. Acoust. Soc. Am., 106, 1852–1864.
Ruggero, M. A. (1992). Physiology and coding of sound in the auditory nerve. In A. N. Popper & R. R. Fay (Eds.), The mammalian auditory pathway: Neurophysiology (pp. 34–93). New York: Springer-Verlag.
Ruggero, M. A., Rich, N. C., Recio, A., Narayan, S. S., & Robles, L. (1997). Basilar-membrane responses to tones at the base of the chinchilla cochlea. J. Acoust. Soc. Am., 101, 2151–2163.
Ruggero, M. A., Robles, L., & Rich, N. C. (1992). Two-tone suppression in the basilar membrane of the cochlea: Mechanical basis of auditory-nerve rate suppression. J. Neurophysiol., 68, 1087–1099.
Ryugo, D. K. (1992). The auditory nerve: Peripheral innervation, cell body morphology, and central projections. In D. B. Webster, A. N. Popper, & R. R. Fay (Eds.), The mammalian auditory pathway: Neuroanatomy (pp. 23–65). New York: Springer-Verlag.
Sachs, M. B., & Kiang, N. Y. S. (1968). Two-tone inhibition in auditory-nerve fibers. J. Acoust. Soc. Am., 43, 1120–1128.
Schouten, J. F. (1940). The residue and the mechanism of hearing. Proc. Kon. Nedel. Akad. Wetenschap, 43, 991–999.
Siebert, W. M. (1965). Some implications of the stochastic behavior of primary auditory neurons. Kybernetik, 2, 206–215.
Siebert, W. M. (1968). Stimulus transformation in the peripheral auditory system. In P. A. Kolers & M. Eden (Eds.), Recognizing patterns (pp. 104–133). Cambridge, MA: MIT Press.
Siebert, W. M. (1970). Frequency discrimination in the auditory system: Place or periodicity mechanisms? Proc. IEEE, 58, 723–730.
Snyder, D. L., & Miller, M. I. (1991). Random point processes in time and space. New York: Springer-Verlag.
Spoendlin, H. (1969). Innervation patterns in the organ of Corti of the cat. Acta Otolaryngol. (Stockh.), 67, 239–254.
Teich, M. C., & Khanna, S. M. (1985). Pulse-number distribution for the neural spike train in the cat's auditory nerve. J. Acoust. Soc. Am., 77, 1110–1128.
Teich, M. C., & Lachs, G. (1979). A neural-counting model incorporating refractoriness and spread of excitation. I. Application to intensity discrimination. J. Acoust. Soc. Am., 66, 1738–1749.
van Trees, H. L. (1968). Detection, estimation, and modulation theory: Part I. New York: Wiley.
Viemeister, N. F. (1974). Intensity discrimination of noise in the presence of band-reject noise. J. Acoust. Soc. Am., 56, 1594–1600.
Viemeister, N. F. (1983). Auditory intensity discrimination at high frequencies in the presence of noise. Science, 221, 1206–1208.
Viemeister, N. F. (1988a). Psychophysical aspects of auditory intensity coding. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Auditory function: Neurobiological bases of hearing (pp. 213–241). New York: Wiley.
Viemeister, N. F. (1988b). Intensity coding and the dynamic range problem. Hear. Res., 34, 267–274.
Viemeister, N. F., & Bacon, S. P. (1988). Intensity discrimination, increment detection, and magnitude estimation for 1-kHz tones. J. Acoust. Soc. Am., 84, 172–178.
von Helmholtz, H. L. F. (1863). Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik. Braunschweig, Germany: F. Vieweg und Sohn. Translated by A. J. Ellis from the 4th German edition (1877) as On the sensations of tone as a physiological basis for the theory of music, Longmans, London, 1885 (reprinted by Dover, New York, 1954).
Wakefield, G. H., & Nelson, D. A. (1985). Extension of a temporal model of frequency discrimination: Intensity effects in normal and hearing-impaired listeners. J. Acoust. Soc. Am., 77, 613–619.
Weiss, T. F., & Rose, C. (1988). A comparison of synchronization filters in different auditory receptor organs. Hear. Res., 33, 175–180.
Westerman, L. A., & Smith, R. L. (1988). A diffusion model of the transient response of the cochlear inner hair cell synapse. J. Acoust. Soc. Am., 83, 2266–2276.
Wever, E. G. (1949). Theory of hearing. New York: Wiley.
Wier, C. C., Jesteadt, W., & Green, D. M. (1977). Frequency discrimination as a function of frequency and sensation level. J. Acoust. Soc. Am., 61, 178–184.
Winslow, R. L., & Sachs, M. B. (1988). Single-tone intensity discrimination based on auditory-nerve rate responses in backgrounds of quiet, noise, and with stimulation of the crossed olivocochlear bundle. Hear. Res., 35, 165–190.
Winter, I. M., & Palmer, A. R. (1991). Intensity coding in low-frequency auditory-nerve fibers of the guinea pig. J. Acoust. Soc. Am., 90, 1958–1967.
Young, E. D., & Barta, P. E. (1986). Rate responses of auditory-nerve fibers to tones in noise near masked threshold. J. Acoust. Soc. Am., 79, 426–442.
Young, E. D., & Sachs, M. B. (1989). Auditory nerve fibers do not discharge independently when responding to broadband noise. Abstracts of the 12th Midwinter Meeting of the Association for Research in Otolaryngology, 121.
Zhang, X., Heinz, M. G., Bruce, I. C., & Carney, L. H. (2001). A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. J. Acoust. Soc. Am., 109, 648–670.

Received January 11, 2000; accepted January 3, 2001.
LETTER
Communicated by John Lazzaro
Evaluating Auditory Performance Limits: II. One-Parameter Discrimination with Random-Level Variation

Michael G. Heinz
Speech and Hearing Sciences Program, Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A., and Hearing Research Center, Biomedical Engineering Department, Boston University, Boston, MA 02215, U.S.A.

H. Steven Colburn
Laurel H. Carney
Hearing Research Center, Biomedical Engineering Department, Boston University, Boston, MA 02215, U.S.A.

Previous studies have combined analytical models of stochastic neural responses with signal detection theory (SDT) to predict psychophysical performance limits; however, these studies have typically been limited to simple models and simple psychophysical tasks. A companion article in this issue ("Evaluating Auditory Performance Limits: I") describes an extension of the SDT approach that allows the use of computational models, which provide more accurate descriptions of neural responses. This article describes an extension to more complex psychophysical tasks. A general method is presented for evaluating psychophysical performance limits in discrimination tasks in which one stimulus parameter is randomly varied. Psychophysical experiments often randomly vary a single parameter in order to restrict the cues that are available to the subject. The method is demonstrated for the auditory task of random-level frequency discrimination using a computational auditory nerve (AN) model. Performance limits based on AN discharge times (all-information) are compared to performance limits based only on discharge counts (rate-place). Both decision models successfully predict that random-level variation has no effect on performance in quiet, which is the typical result in psychophysical tasks with random-level variation.
The distribution of information across the AN population provides insight into how different types of AN information can be used to avoid the influence of random-level variation. The rate-place model relies on comparisons between fibers above and below the tone frequency (i.e., the population response), while the all-information model does not require such across-fiber comparisons. Frequency discrimination with random-level variation in the presence of high-frequency noise is also simulated. No effect is predicted for all-information, consistent with the small effect in human performance; however, a large effect is predicted for rate-place in noise with random-level variation.

Neural Computation 13, 2317–2339 (2001) © 2001 Massachusetts Institute of Technology

1 Introduction
The use of signal detection theory (SDT) combined with stochastic models of neural responses has provided much insight into the neural encoding of sensory stimuli (e.g., Fitzhugh, 1958; Siebert, 1965, 1968, 1970; Colburn, 1969, 1973, 1977a, 1977b, 1981; Goldstein & Srulovicz, 1977; Delgutte, 1987; Erell, 1988; see Delgutte, 1996, and Parker & Newsome, 1998, for reviews). These studies evaluated psychophysical performance limits based on the stochastic behavior of neural responses. However, the application of this approach has been limited to simple psychophysical tasks by the use of simple analytical models and by the restriction of SDT, when combined with these models, to deterministic stimuli. Computational neural models can describe physiological responses more accurately, and for a much wider range of stimuli, than analytical models. Previous methods that have combined SDT and computational auditory models to predict psychophysical performance have either not included physiological (internal) noise (e.g., Gresham & Collins, 1998; Huettel & Collins, 1999) or have used arbitrary internal noise that was not directly related to physiological variability (e.g., Dau, Püschel, & Kohlrausch, 1996a, 1996b; Dau, Kollmeier, & Kohlrausch, 1997a, 1997b). Our companion article in this issue ("Evaluating Auditory Performance Limits: I") describes a general method that extends previous studies, which quantified the effects of physiological noise on psychophysical performance using analytical auditory nerve (AN) models (e.g., Siebert, 1968, 1970; Colburn, 1969, 1973), to incorporate the use of computational models; however, the SDT analysis in the companion article was limited to deterministic discrimination experiments. This study extends the general SDT approach described in the companion article to allow more complicated psychophysical tasks to be evaluated, specifically discrimination tasks in which one parameter is randomly varied from trial to trial.
Many psychophysical experiments have used random variation of certain stimulus parameters in order to limit the cues available to the subject. For example, McKee, Silverman, and Nakayama (1986) observed that human visual velocity discrimination was unaffected by random variation of either contrast or temporal frequency and concluded that performance was mediated by sensing velocity. In the auditory system, this method has been used in profile-analysis experiments to demonstrate that level discrimination of a single component within a tone complex, or detection of a tone in noise, is not affected by randomization of overall level, and thus that these tasks can be performed without relying on an absolute energy cue (Green, Kidd, & Picardi, 1983; Kidd, Mason, Brantley, & Owen, 1989; see Green, 1988, for a review). While psychophysical performance in the presence of overall-level randomization is typically unchanged, performance based on
single neurons would likely be severely degraded. Therefore, it is important to evaluate quantitatively how physiological responses can account for behavior in these types of psychophysical tasks. Durlach, Braida, and Ito (1986) described a quantitative model for profile-analysis tasks based on across-frequency level comparisons, which included the effects of both external (stimulus) variations and internal processing noise. However, the internal noise used in their model was not directly related to the known physiological noise that exists in AN fibers and thus was somewhat arbitrary. Huettel and Collins (1999) evaluated the information loss in physiological auditory models that resulted from randomization of phase in a tone-in-noise detection experiment; however, their analysis did not include internal noise. Our study here extends the analysis in the companion article, which quantified performance limits for deterministic discrimination tasks based on Poisson neural discharges, to include the effect of random stimulus variation in a single parameter on psychophysical performance limits. In order to demonstrate this method, predictions for a random-level frequency discrimination task are evaluated using the same computational AN model used in the companion study. Predictions are made for both rate-place (based on discharge counts) and all-information (based on discharge times) encoding schemes, as was done in the companion study of pure-tone frequency and level discrimination. It is often stated that listeners use rate-place information to encode frequency at high frequencies (e.g., Wever, 1949; Moore, 1973, 1989; Dye & Hafter, 1980; Wakefield & Nelson, 1985; Javel & Mott, 1988; Pickles, 1988; Moore & Glasberg, 1989), because AN phase locking rapidly degrades above 2 to 3 kHz (Johnson, 1980; Joris, Carney, Smith, & Yin, 1994; or see Figure 1c in the companion article).
This analysis illustrates quantitatively the relation between random-level frequency discrimination and fixed-level frequency and level discrimination in terms of average-rate and temporal information. Random-level variation has been used in several auditory frequency discrimination experiments to test rate-place models for frequency encoding (Henning, 1966; Verschuure & van Meeteren, 1975; Emmerich, Ellermeier, & Butensky, 1989; Moore & Glasberg, 1989). In this task, the listener is asked to discriminate the frequency of two tones whose levels are varied randomly and independently from trial to trial. Frequency discrimination in quiet could hypothetically be performed by observing the average discharge rate of a single frequency channel tuned either above or below the tone frequency (e.g., Zwicker, 1956, 1970; Henning, 1967). Such single-channel rate-based models would be expected to be affected greatly by random-level variation, because changes in level could not be discriminated from changes in frequency. Conversely, temporal models would not be expected to be affected by random-level variation. Several studies have observed an effect of random-level variation on frequency discrimination (e.g., Henning, 1966; Emmerich et al., 1989); however, the observed effect of random-level variation is likely to be largely due to
pitch shifts associated with the changes in level over the broad range of level variation used in these studies (Verschuure & van Meeteren, 1975; Emmerich et al., 1989). Moore and Glasberg (1989) measured frequency discrimination as a function of frequency with a smaller range of random-level variation and observed virtually no effect. Moore and Glasberg also measured frequency discrimination in the presence of high-frequency noise that was used to mask the characteristic frequencies (CF, the most sensitive frequency of an AN fiber) above the tone frequency. This experiment was designed to test the idea that rate-place models could avoid the effect of random-level variation by comparing information from CFs above and below the tone frequency. A small but significant effect of adding the high-frequency noise was observed, with a slightly larger effect when the noise was added to the random-level condition. Despite a similar magnitude of effect at all tone frequencies, Moore and Glasberg (1989) concluded that their results were consistent with the "duplex theory" of frequency encoding (Wever, 1949), that is, that rate-place information is used at high frequencies and temporal information at low frequencies. The results in this study question that conclusion by showing quantitatively that there is insufficient rate-place information in the AN model to account for human performance with random-level variation, under the assumption that the high-frequency noise masks all CFs above the tone frequency. In addition to the specific results on the auditory task of random-level frequency discrimination, the current study also presents general methods for using SDT and computational neural models to address quantitatively questions of general interest to theoretical neuroscience. These methods are applicable to any sensory system for which there are statistical descriptions of neural responses as a function of the relevant stimulus parameters.
2 General Methods

2.1 Computational Auditory-Nerve Model. The computational AN model used in this study was the same model described in the companion study.1 This AN model was a simplified version of a previous nonlinear AN model (Carney, 1993) and was used in order to simplify the verification of the computational SDT methods. The major components of the AN model are summarized below (see the companion article for a detailed description, and Ruggero, 1992, for a review of basic AN responses). The initial model stage was a linear fourth-order gamma-tone filter bank, which was used to represent the frequency selectivity of AN fibers. Model filter bandwidths were based on psychophysical estimates of human bandwidths from Glasberg and Moore (1990). Each bandpass filter was followed
1 Code for the AN model used in the study is available online at http://earlab.bu.edu/.
by a memoryless, asymmetric, saturating nonlinearity, which represents the mechano-electric transduction of the inner hair cell (IHC). All AN model fibers had a rate threshold of roughly 0 dB SPL, a spontaneous rate of 50 spikes per second, and a maximum sustained rate of roughly 200 spikes per second. The model dynamic range for sustained rate was roughly 20–30 dB, and the dynamic range for onset rate was much larger. The high-spontaneous-rate (HSR), low-threshold fibers described by the AN model represent the majority (61%) of the total AN population (Liberman, 1978). Medium- (23%) and low-SR (16%) fibers, which have higher thresholds and larger dynamic ranges, are not included in the model. The rolloff in phase locking was chosen to be consistent with all species discussed in Weiss and Rose (1988), and the cutoff frequency matched data from cat (Johnson, 1980). Neural adaptation was introduced through a simple three-stage diffusion model of the IHC-AN synapse based on data from Westerman and Smith (1988). The output of the ith AN model fiber represents the instantaneous discharge rate r_i(t, f, L) of an individual high-spontaneous-rate, low-threshold AN fiber in response to an arbitrary stimulus. The AN discharges are assumed to be produced by a population of conditionally independent, nonstationary Poisson processes with rate functions described by r_i(t, f, L) (see the companion article).

2.2 Signal Detection Theory. The application of SDT in this study is extended from the companion study to evaluate psychophysical tasks in which a single parameter is randomly varied. Performance limits are evaluated for both rate-place and all-information models and compared to data from human listeners. Rate-place predictions are based on the assumption that the population of AN-fiber discharge counts {K_i} over the duration of the stimulus is the only information used by the listener.
In contrast, the all-information predictions are based on the assumption that the listener uses the population of discharge times and counts {t_i^1, ..., t_i^{K_i}}, where t_i^j represents the jth discharge from the ith AN fiber. The contribution of temporal information in the responses can be inferred by comparing the predictions of the rate-place and all-information models. The all-information model does not assume any specific form of temporal processing, such as calculating synchrony coefficients or creating interval histograms, and thus provides an absolute limit on achievable performance given the total information available in the AN.

3 General Analytical Results: One-Parameter Discrimination with One Unwanted Random Parameter

3.1 Overview of Basic Result. The use of SDT with stochastic models of neural responses has never been applied to the class of psychophysical tasks that randomly vary one parameter in order to restrict the cues that are
available to the subject. The SDT analysis of a general one-parameter discrimination experiment with one unwanted random parameter is a straightforward extension of the Cramér-Rao-bound analysis described in the companion article to the multiple-parameter case, as described in section 3.2. In this study, the analysis is described in terms of a random-level frequency-discrimination task. A performance limit for the just-noticeable difference in frequency for this task can be calculated as
\[
\Delta f_{\mathrm{JND}} = \left\{ E_L\!\left[\, \sum_i \bigl( d'_f[\mathrm{CF}_i] \bigr)^2 \right] - \frac{\Bigl( E_L\!\left[\, \sum_i d'_{Lf}[\mathrm{CF}_i] \right] \Bigr)^2}{E_L\!\left[\, \sum_i \bigl( d'_L[\mathrm{CF}_i] \bigr)^2 \right] + \mathrm{API}_L} \right\}^{-1/2}, \tag{3.1}
\]
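As a purely illustrative numerical sketch (our own, not code from the article), equation 3.1 maps directly onto array reductions once the per-fiber sensitivities and cross terms have been tabulated at each level in the random-level range; the function name, array layout, and default values below are assumptions:

```python
import numpy as np

def delta_f_jnd(dprime_f, dprime_L, dprime_Lf, api_L=0.0):
    """Bound on the frequency JND with random-level variation (equation 3.1).

    Each d' argument has shape (n_levels, n_fibers): normalized sensitivities
    of every fiber, tabulated at every level in the random-level range.
    api_L is the a priori information about level.
    """
    info_f = np.mean(np.sum(dprime_f ** 2, axis=1))           # E_L{ sum_i (d'_f)^2 }
    cross = np.mean(np.sum(dprime_Lf, axis=1)) ** 2           # { E_L( sum_i d'_Lf ) }^2
    info_L = np.mean(np.sum(dprime_L ** 2, axis=1)) + api_L   # E_L{ sum_i (d'_L)^2 } + API_L
    return (info_f - cross / info_L) ** -0.5
```

With a zero cross term the bound reduces to the fixed-level result; a nonzero cross term subtracts information and can only raise the predicted JND.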
where E_L denotes the expected value over the random-level range; (d'_f[CF_i])^2 and (d'_L[CF_i])^2 represent, for the ith fiber, the normalized sensitivities to changes in frequency and level, respectively; API_L represents the a priori information about level (e.g., the random-level range); and d'_{Lf}[CF_i] represents the cross-interaction between changes in level and frequency on the ith AN fiber (see section 3.2 and the companion article). The normalized sensitivities and the cross-interaction terms in equation 3.1 can be evaluated in terms of the time-varying discharge rate r_i(t, f, L) for each AN fiber. Thus, equation 3.1 can be used with any AN model (e.g., analytical or computational) and is applicable to any single-parameter discrimination experiment with one randomized parameter. This analysis is also applicable to any sensory system for which the statistics of the neural responses can be described as a function of the stimulus parameters of interest. This general form of the relation between the performance limit and the relevant information quantities provides insight into the influence of random-level variation on the ability to perform frequency discrimination. Equation 3.1 illustrates that the neural information available about changes in frequency in the random-level experiment is equal to the average (over level) of the information available for fixed-level frequency discrimination, E_L{ Σ_i (d'_f[CF_i])^2 }, minus the amount of information that is lost due to the random-level variation. The numerator of the second term, { E_L( Σ_i d'_{Lf}[CF_i] ) }^2, represents the square of the average total correlation between the effect of changes in frequency and changes in level on the neural observations. If changes in level influence the observations in the same way as changes in frequency, then the information about changes in frequency is reduced in the presence of random-level variation.
The denominator of the second term, E_L{Σ_i (d'_L[CF_i])²} + API_L, is a normalization factor that represents the total information available about changes in level. The first term in the denominator is the average information about level from the AN observations, while the second term is the a priori information available about level from the limited random-level range. Thus, if the frequency-level cross term (the numerator) is comparable to the level information
Auditory Performance Limits with Random Level
2323
(the denominator), then the random-level variation term (the subtracted term in equation 3.1) could be comparable to the first term and influence frequency-discrimination performance. The analysis thus illustrates the relation between the information for random-level frequency discrimination and the information for fixed-level frequency and level discrimination.

3.2 Mathematical Analysis. In this section, the analysis leading to equation 3.1 is presented. In a random-level frequency discrimination experiment, the observations (Poisson discharge times on M AN fibers, T = {t^i_1, ..., t^i_{K_i}; i = 1, ..., M}, where t^i_j represents the jth discharge on the ith fiber) are influenced by both the nonrandom, unknown frequency f and the random level L. The form of an optimal processor in this case can be theoretically derived using a likelihood-ratio test (LRT), in which the likelihood of the observations given frequency is integrated over the uncertainty in level for both hypotheses, f and f + Δf (van Trees, 1968). However, the integrals over level (of the joint probability density of the conditionally independent AN fibers) in both the numerator and the denominator of the likelihood ratio prohibit simplification of the decision variable into a form for which performance can be evaluated analytically. An alternative to the LRT for evaluating psychophysical performance limits is the Cramér-Rao bound from estimation theory (Siebert, 1968, 1970; see the companion article), which provides a lower bound on the variance of any unbiased estimator. We demonstrated in the companion article that for deterministic discrimination experiments, where the independence of AN fibers permits simplification of the decision variable derived from the LRT, the Cramér-Rao bound on performance was met with equality by the LRT processor.
Thus, the Cramér-Rao bound was used in the study presented here to calculate performance limits, because performance evaluation of the optimal processor derived from the LRT is not easily carried out analytically for a one-parameter discrimination task in which an unwanted parameter is randomly varied. It is possible to evaluate the performance of the LRT processor using numerical simulations (e.g., Gresham & Collins, 1998; Huettel & Collins, 1999); however, in the case here, for which performance based on the total population of 30,000 AN fibers is desired, such simulations would be arduous given the across-fiber correlations created by the random stimulus variability. The Cramér-Rao bound for estimating a vector of random parameters (N = 2 in this case) can be used in this case (Cramér, 1951; see van Trees, 1968, pp. 84–85). In general, the information available to an observer to estimate a random parameter is the sum of the a priori information based on the known distribution of the parameter and the information available from the data. In order to treat the frequency parameter as nonrandom but unknown, the a priori information for frequency is set to zero. The Cramér-Rao bound provides a lower bound on the variance of any unbiased estimate of f, in
2324
M. G. Heinz, H. S. Colburn, and L. H. Carney
the presence of randomized level L, and is given by

$$
\frac{1}{\sigma_{\hat f}^{2}} \le
E_{L,T}\!\left\{\left[\frac{\partial}{\partial f}\ln p\bigl(T\mid L;f\bigr)\right]^{2}\right\}
- \frac{\left(E_{L,T}\!\left\{\left[\frac{\partial}{\partial f}\ln p\bigl(T\mid L;f\bigr)\right]\!\left[\frac{\partial}{\partial L}\ln p\bigl(T\mid L;f\bigr)\right]\right\}\right)^{2}}
{E_{L,T}\!\left\{\left[\frac{\partial}{\partial L}\ln p\bigl(T\mid L;f\bigr)\right]^{2}\right\}
+ E_{L}\!\left\{\left[\frac{\partial}{\partial L}\ln p(L)\right]^{2}\right\}},
\tag{3.2}
$$
where E_{L,T} indicates the expectation over both the random level L and the random observations T, and p(L) is the probability density that specifies the random-level distribution and determines the a priori information for level. Equation 3.2 can be written in a more useful form using iterated expectations:

$$
\frac{1}{\sigma_{\hat f}^{2}} \le
E_{L}\!\left[E_{T\mid L}\!\left\{\left[\frac{\partial}{\partial f}\ln p\bigl(T\mid L;f\bigr)\right]^{2}\right\}\right]
- \frac{\left(E_{L}\!\left[E_{T\mid L}\!\left\{\left[\frac{\partial}{\partial f}\ln p\bigl(T\mid L;f\bigr)\right]\!\left[\frac{\partial}{\partial L}\ln p\bigl(T\mid L;f\bigr)\right]\right\}\right]\right)^{2}}
{E_{L}\!\left[E_{T\mid L}\!\left\{\left[\frac{\partial}{\partial L}\ln p\bigl(T\mid L;f\bigr)\right]^{2}\right\}\right]
+ E_{L}\!\left\{\left[\frac{\partial}{\partial L}\ln p(L)\right]^{2}\right\}}.
\tag{3.3}
$$
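The structure shared by equations 3.2 and 3.3 is that of the (f, f) element of an inverted 2 × 2 Fisher information matrix, with the a priori level information added only to the level entry (frequency being treated as nonrandom but unknown). The following sketch illustrates this equivalence with arbitrary placeholder information values; none of the numbers come from the AN model.

```python
import numpy as np

# Placeholder information values (arbitrary units), not model outputs:
J_ff, J_fL, J_LL, api_L = 4.0, 1.5, 3.0, 0.5

# Fisher information matrix for (f, L); the a priori term is added only
# to the level entry because frequency is treated as nonrandom.
J = np.array([[J_ff, J_fL],
              [J_fL, J_LL + api_L]])

# Cramer-Rao bound: var(f_hat) >= [J^{-1}]_{ff}
var_bound = np.linalg.inv(J)[0, 0]

# Closed form with the same structure as the right side of eq. 3.2:
# frequency information minus the squared cross information,
# normalized by the total level information.
inv_var = J_ff - J_fL**2 / (J_LL + api_L)
# 1 / var_bound equals inv_var (up to rounding)
```

The matrix-inversion route and the closed form agree, which is why the bound can be written directly in the subtracted form used throughout this section.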
The probability density of the observed Poisson discharge times on all fibers is the product of the densities for individual fibers, assuming each fiber is conditionally independent given L (see Parzen, 1962; Snyder & Miller, 1991; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Conditional independence results from the assumption that each AN fiber has an independent discharge-generating mechanism and that correlation between AN fibers results only from a common stimulus drive (see the companion article). The conditional expectations with respect to T are of a form that has been previously evaluated for Poisson observations (Siebert, 1970; see the companion article).

A lower bound on the just-noticeable difference (JND) Δf_JND is equal to the minimum standard deviation of any estimator based on the observations, where threshold is defined as 75% correct in a two-interval, two-alternative forced-choice task (Green & Swets, 1966; Siebert, 1968, 1970). This threshold definition corresponds to d′ = 1, where d′ = Δf_JND / σ_f̂. Thus, a performance limit on the JND based on the population of AN fibers is given by

$$
\frac{1}{(\Delta f_{\mathrm{JND}})^{2}}
= E_{L}\!\left\{\sum_{i}\int_{0}^{T}\frac{1}{r_{i}(t,f,L)}\left[\frac{\partial r_{i}(t,f,L)}{\partial f}\right]^{2}dt\right\}
- \frac{\left(E_{L}\!\left\{\sum_{i}\int_{0}^{T}\frac{1}{r_{i}(t,f,L)}\,\frac{\partial r_{i}(t,f,L)}{\partial L}\,\frac{\partial r_{i}(t,f,L)}{\partial f}\,dt\right\}\right)^{2}}
{E_{L}\!\left\{\sum_{i}\int_{0}^{T}\frac{1}{r_{i}(t,f,L)}\left[\frac{\partial r_{i}(t,f,L)}{\partial L}\right]^{2}dt\right\}
+ E_{L}\!\left\{\left[\frac{\partial \ln p_{L}(L)}{\partial L}\right]^{2}\right\}},
\tag{3.4}
$$
where the conditional expectations in equation 3.3 were evaluated by using the Poisson probability densities, as described in the companion article. Equation 3.4 describes a performance limit for discriminating frequency in the presence of random-level variation, based on the AN population response, in terms of the time-varying discharge rates r_i(t, f, L). Following the notation used in the companion article,

$$
\left(d'_{f}[\mathrm{CF}_{i}]\right)^{2} \triangleq
\int_{0}^{T}\frac{1}{r_{i}(t,f,L)}\left[\frac{\partial r_{i}(t,f,L)}{\partial f}\right]^{2}dt,
\tag{3.5}
$$

and

$$
\left(d'_{L}[\mathrm{CF}_{i}]\right)^{2} \triangleq
\int_{0}^{T}\frac{1}{r_{i}(t,f,L)}\left[\frac{\partial r_{i}(t,f,L)}{\partial L}\right]^{2}dt.
\tag{3.6}
$$
The quantities (d'_f[CF_i])² and (d'_L[CF_i])² represent the information available on the ith fiber about frequency f and level L, respectively, and are shown in the companion article to represent normalized sensitivities, defined as the sensitivity d′ per unit f or L (see also Durlach & Braida, 1969; Braida & Durlach, 1988). Similarly, the cross-interaction term is defined as

$$
d'_{Lf}[\mathrm{CF}_{i}] \triangleq
\int_{0}^{T}\frac{1}{r_{i}(t,f,L)}\,\frac{\partial r_{i}(t,f,L)}{\partial L}\,\frac{\partial r_{i}(t,f,L)}{\partial f}\,dt,
\tag{3.7}
$$
and represents the correlation between changes in level and changes in frequency on the ith fiber. Based on this notation, equation 3.4 is equivalent to equation 3.1, where API_L represents the a priori information available about level (e.g., from the range of levels used in the random variation of level).²

4 Computational Methods: Use of Auditory Nerve Models
All predictions in this study were made using the computational AN model in the identical manner used in the companion article. Briefly, predictions are based on the total population of high-spontaneous-rate (HSR) AN fibers, which are simulated using 60 model CFs ranging from 100 Hz to 10 kHz. The model CFs are uniformly spaced in location according to a human cochlear map (Greenwood, 1990). In order to account for the total number

² A gaussian distribution was used to calculate API_L because of the analytical difficulty, for a uniform distribution, that results from the undefined derivative with respect to L at the edges of the uniform range. The a priori information for level was calculated to be API_L = 2π/R² for a gaussian distribution with a variance of R²/(2π), where R represents the random-level range in dB. A gaussian distribution with variance R²/(2π) has the same equivalent-rectangular width as a uniform distribution with random-level range R.
of AN fibers in the HSR population, predictions are based on a population of 12,200 total HSR fibers, where each of the 60 model responses represents roughly 200 conditionally independent AN fibers. Tone frequencies were always chosen to be equal to one of the 60 model CFs. Stimulus duration was defined as the duration between half-amplitude points on the stimulus envelope. All rise/fall ramps were generated from raised-cosine functions. The temporal window in the all-information analysis included the model response beginning at stimulus onset and ending 25 ms after stimulus offset, in order to allow for the response delay and the transient onset and offset responses associated with AN fibers over the range of CFs and stimulus parameters used in this study. Predictions for the rate-place encoding scheme were based on the average discharge rate across the entire temporal analysis window (i.e., including the extra 25 ms after the nominal offset of the stimulus).

5 Computational Results

5.1 Random-Level Frequency Discrimination in Quiet. Predictions of performance limits for the random-level frequency discrimination task were calculated using equation 3.1 for the same values of frequency, level, and duration we used in the companion article. The random level was uniformly distributed over a range centered around the nominal level for each condition. For all conditions, there was no effect of random-level variation for the rate-place or all-information schemes for either a 6 dB or a 20 dB random-level range (neither shown).³ This result is consistent with human performance measured by Moore and Glasberg (1989), in which no effect of random-level variation was observed with a 6 dB range. Conversely, Emmerich et al.
(1989) observed a factor-of-three degradation in human performance when measured with a 20 dB random-level range; however, they showed that much of this effect was due to the confounding role of level-dependent shifts in pitch (Verschuure & van Meeteren, 1975), which are not likely to be produced by the simplified AN model used in this study. The results for both rate-place and all-information models thus support the idea that there is no effect of random-level variation on frequency discrimination when a small enough random-level range is used to avoid the influence of level-dependent pitch effects (Moore & Glasberg, 1989). In order to illustrate how each encoding scheme can discriminate frequency accurately in the presence of random-level variation, the distribu-

³ The effect of random-level variation was evaluated by comparing random-level performance (given by equation 3.1) to average (across level) fixed-level performance (determined by the first term in equation 3.1, E_L{Σ_i (d'_f[CF_i])²}). This comparison avoids any potential effect of variation in performance across the levels within the random-level range, which could result in a difference between fixed-level performance and average fixed-level performance that was not truly an effect of random-level variation.
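The bound in equation 3.1, and the comparison described in footnote 3, can be sketched numerically. Everything below is illustrative: the per-fiber sensitivities are random placeholders rather than AN-model outputs; only the gaussian API_L = 2π/R² formula from footnote 2 is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_levels, n_cfs = 7, 60            # sampled levels; model CFs

# Hypothetical per-fiber sensitivities at each sampled level (placeholders):
d_f  = rng.uniform(0.5, 1.5, (n_levels, n_cfs))    # d'_f[CF_i]
d_L  = rng.uniform(0.5, 1.5, (n_levels, n_cfs))    # d'_L[CF_i]
d_Lf = rng.uniform(-0.5, 0.5, (n_levels, n_cfs))   # cross term d'_Lf[CF_i]

# A priori level information for a 6 dB range (footnote 2): API_L = 2*pi/R^2,
# the Fisher information 1/var of a gaussian with variance R^2/(2*pi).
R = 6.0
api_L = 2 * np.pi / R**2

# Equation 3.1: random-level information = average fixed-level information
# minus the information lost to random-level variation.
fixed_info = np.mean(np.sum(d_f**2, axis=1))           # E_L{ sum_i (d'_f)^2 }
lost_info  = (np.mean(np.sum(d_Lf, axis=1))**2
              / (np.mean(np.sum(d_L**2, axis=1)) + api_L))
inv_jnd_sq = fixed_info - lost_info
jnd_random = 1.0 / np.sqrt(inv_jnd_sq)                 # random-level JND bound
jnd_fixed  = 1.0 / np.sqrt(fixed_info)                 # average fixed-level bound
```

As in footnote 3, the effect of random-level variation is the ratio of `jnd_random` to `jnd_fixed`, which can never fall below one because the lost-information term is nonnegative.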
Figure 1: Information responsible for performance limits for a random-level frequency discrimination task for f = 970 Hz (indicated by arrows), L = 40 dB SPL, T = 200 ms (20 ms rise/fall), and a 6 dB random-level range. The left and right columns illustrate information from the all-information and rate-place encoding schemes, respectively. The top panel represents the average level information E_L{(d'_L[CF])²} available for each AN model fiber. The middle panel shows the average cross-interaction between level and frequency for each AN fiber, E_L(d'_Lf[CF]). In the bottom panel, the average information available for fixed-level frequency discrimination, E_L{(d'_f[CF])²}, is shown by the solid line. The dashed line illustrates the information for estimating frequency based on individual AN fibers that is lost due to the random-level variation (the second term in equation 3.1). The amount of lost information (a positive quantity) is plotted so that CFs that have negative correlation between changes in level and changes in frequency have a negative value (for illustrative purposes only).
tion of information quantities in equation 3.1 across the population of AN fibers is shown in Figure 1. The first row shows the average information about level available for each CF, E_L{(d'_L[CF])²}. The second row shows the
average cross-interaction term as a function of CF, E_L(d'_Lf[CF]). The average fixed-level information about frequency for each AN fiber, E_L{(d'_f[CF])²}, is shown by the solid line in the bottom row. The dashed line in the bottom row illustrates the average frequency information that is lost due to the random-level variation based on estimating frequency from single AN fibers (i.e., the second term in equation 3.1 evaluated for single CFs, and with API_L = 0). The amount of lost information is plotted so that CFs that have a negative correlation between changes in level and frequency have a negative value. This is solely for illustrative purposes, as the second term in equation 3.1 is always positive because it is the ratio of a squared value to a positive information quantity.

The curves in the bottom panel of Figure 1 illustrate how both encoding schemes overcome the influence of random-level variation. In the all-information scheme, each fiber possesses significantly more information for estimating frequency than is lost due to the random-level variation (compare the solid and dashed lines in Figure 1, bottom left). However, the situation with single-fiber rate-place information is very different. The average amount of rate-place information on a single fiber that is lost due to random-level variation is equal to the average amount of information available for estimating frequency with fixed level (compare the solid and dashed lines in Figure 1, bottom right). However, when the population response is considered, there is no effect of random-level variation for the rate-place scheme, due to the opposite polarity of the cross-interaction term above and below the frequency of the tone (see the middle panel of Figure 1). The lack of an effect for the rate-place population response can be seen (see equation 3.1) to result from the summation of the cross-interaction term d'_Lf[CF] over all CFs prior to squaring the total interaction.
The rate-place cross-interaction profile is an odd function around the frequency of the tone (see Figure 1, middle panel), and thus the positive interaction above the frequency of the tone cancels the negative interaction below the frequency of the tone, so that the overall effect of random-level variation is negligible. The shape of the rate-place cross-interaction profile results directly from the frequency tuning associated with AN fibers and quantifies the significance of the fundamental relation between frequency and level discrimination discussed by Siebert (1968). In summary, single AN fibers in the all-information scheme can perform random-level frequency discrimination as well as fixed-level frequency discrimination. In contrast, it is not possible to discriminate frequency in the presence of random-level variation based on rate-place information from a single AN fiber. However, a rate-place model that compared information in CFs above and below the frequency of the tone could make use of the opposite interaction to separate the effect of changes in level from the effect of changes in frequency, and thereby discriminate frequency in the presence of random-level variation.
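Both the single-fiber and the population observations above can be reproduced with a toy rate-place population. The Gaussian tuning curves and all parameter values below are illustrative assumptions, not the AN model of section 4.

```python
import numpy as np

f0, L0, T = 1000.0, 40.0, 0.2          # tone frequency (Hz), level (dB), duration (s)
cfs = np.linspace(600.0, 1400.0, 41)   # hypothetical CFs straddling the tone
bw = 150.0                             # made-up tuning bandwidth (Hz)

def rate(f, L, cf):
    # Toy time-invariant discharge rate: Gaussian frequency tuning,
    # rate growing linearly with level (purely illustrative).
    return 50.0 * np.exp(-((f - cf) ** 2) / (2 * bw ** 2)) * (1 + 0.05 * L)

df, dL = 1e-4, 1e-4                    # steps for central-difference derivatives
r    = rate(f0, L0, cfs)
dr_f = (rate(f0 + df, L0, cfs) - rate(f0 - df, L0, cfs)) / (2 * df)
dr_L = (rate(f0, L0 + dL, cfs) - rate(f0, L0 - dL, cfs)) / (2 * dL)

# With a time-invariant rate, the integrals in eqs. 3.5-3.7 reduce to T * (...):
d_f_sq = T * dr_f**2 / r               # (d'_f[CF])^2
d_L_sq = T * dr_L**2 / r               # (d'_L[CF])^2
d_Lf   = T * dr_L * dr_f / r           # d'_Lf[CF]

# Single fiber: with a time-invariant rate the cross term squared equals the
# product of the two informations, so the lost information (second term of
# eq. 3.1 with API_L = 0) equals the fixed-level information on every fiber.
single_lost = d_Lf**2 / d_L_sq         # matches d_f_sq element-by-element

# Population: d'_Lf is odd around f0, so summing over CFs before squaring
# cancels the interaction; random-level variation then costs almost nothing.
population_cross = d_Lf.sum()          # near zero relative to sum of |d'_Lf|
```

The per-fiber equality is exactly the bottom-right panel of Figure 1 (all single-fiber rate-place information is lost), while the near-zero summed cross term is the cancellation argument for the population.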
5.2 Random-Level Frequency Discrimination in Noise. Moore and Glasberg (1989) measured human frequency discrimination performance in four conditions in order to test the duplex theory of frequency coding, that is, that rate-place information is used at high frequencies while temporal information is used at low frequencies. They reported performance as a function of frequency for four conditions:
1. Fixed-level frequency discrimination in quiet

2. Random-level frequency discrimination in quiet

3. Fixed-level frequency discrimination with a high-frequency noise masker that spanned from 1.1f to 1.4f

4. Random-level frequency discrimination in the presence of the high-frequency noise

Moore and Glasberg suggested that if the high-frequency noise were assumed to mask completely all CFs above the frequency of the tone, then the performance of a rate-place model should be significantly affected by randomizing the level of the tone in the presence of the noise. The random-level frequency discrimination analysis in this study was used to simulate all four conditions from Moore and Glasberg (1989). The first two conditions have been described above, while the conditions that included the high-frequency noise were simulated by considering the information from CFs only below the frequency of the tone. While this simulated effect of the noise masker is extreme (e.g., given the effects of suppression; Sachs & Kiang, 1968; Delgutte, 1990; Ruggero, Robles, & Rich, 1992), this simulation directly mimics the assumption that was made by Moore and Glasberg (1989) and that has often been used to interpret psychophysical experiments with noise maskers (e.g., Viemeister, 1983). Thus, performance limits predicted with this simulation are based on information in model CFs at and below the tone frequency, which are the only source of information under the assumption used by Moore and Glasberg (1989) to interpret their psychophysical experiment. The group mean results from Moore and Glasberg (1989) are shown in Figure 2a, the all-information predictions in Figure 2b, and the rate-place predictions in Figure 2c. Psychophysical performance is plotted in terms of the normalized JND in frequency, Δf/f, where higher JNDs correspond to worse performance.
(Comparisons between human performance and predicted performance limits are made in terms of both absolute values and trends, as discussed in section 2.2 of the companion article.) If the performance limits are uniformly better than human performance (i.e., parallel to human performance across the entire range of stimulus parameters), then it can be hypothesized that AN information is used in a (uniformly) inefficient manner. On the other hand, if the performance limits exceed human performance in a nonuniform way, and there is no realistic processor
Figure 2: Comparison between human performance and predicted performance limits for the random-level frequency-discrimination conditions reported by Moore and Glasberg (1989). The Weber fraction Δf/f is plotted as a function of frequency in each panel. (a) Human performance. (b) All-information predictions. (c) Rate-place predictions. Condition 1 (open circles): fixed-level frequency discrimination in quiet. Condition 2 (filled squares): random-level frequency discrimination in quiet. Condition 3 (filled upward triangles): fixed-level frequency discrimination in the presence of high-frequency noise. Condition 4 (filled downward triangles): random-level frequency discrimination in high-frequency noise. Human data are for L = 70 dB SPL, T = 200 ms (10 ms rise/fall), a 6 dB random-level range, and an overall noise level of 75 dB SPL (Moore & Glasberg, 1989). Model predictions are for L = 40 dB SPL, T = 200 ms (20 ms rise/fall), and a 6 dB random-level range.
that would be nonuniformly inefficient in the required manner, then a fairly strong hypothesis can be made that the information provided in the AN is sufficient but not likely to account solely for human performance. Finally, if human performance is superior to the performance limits, then there is insufficient information represented by the model of the peripheral transformations to account for human performance. In general, predicted rate-place performance limits (see Figure 2c) are closer to the absolute values of human performance (see Figure 2a) but do not match the trends in human performance as a function of frequency, particularly at high frequencies. In contrast, predicted all-information per-
Table 1: Average Ratio Between Thresholds in Conditions 2–4 Relative to Condition 1 (Frequency Discrimination in Quiet).

                                         Human:                    Model:            Model:
Condition                                Moore & Glasberg (1989)   All-Information   Rate-Place
Random level (condition 2)               1.15a                     1.00              1.00
In noise (condition 3)                   1.37                      1.41              1.26
Random level in noise (condition 4)      1.65                      1.40              9.68

Notes: Conditions are defined in the text and the caption to Figure 2.
a Not statistically significant.
formance limits (see Figure 2b) match the trends in human performance versus frequency but are significantly better than the absolute values of human performance. The effects of the different conditions used by Moore and Glasberg (1989) on human performance were relatively small, especially compared with the factor-of-five degradation in performance between 2 and 6.5 kHz seen in all four conditions (see Figure 2a). Moore and Glasberg reported that there was no statistically significant interaction between condition and frequency, as indicated by the roughly parallel shifts of the curves for the different conditions in Figure 2a. Table 1 shows the average factor across frequency by which thresholds for each condition became worse relative to frequency discrimination in quiet (condition 1), for human and model performance. Moore and Glasberg reported that there was not a statistically significant difference between frequency discrimination in quiet and with a random level (between conditions 1 and 2). Human thresholds became worse by an average factor of 1.37 (statistically significant) for the in-noise condition, while the most significant effect was seen for the random-level-in-noise condition, in which performance became worse by an average factor of 1.65. Both all-information (see Figure 2b) and rate-place predictions (see Figure 2c) in quiet were unaffected by random-level variation, as described above. The simulated effect of adding the high-frequency noise for the fixed-level task resulted in similar increases in predicted thresholds for both the rate-place and all-information encoding schemes (see Figures 2b and 2c). The increase is consistent with the removal of roughly one half of the available information (see the solid curve in the bottom panels of Figure 1) and the resultant increase in threshold by a factor of √2, and is similar to the size of the observed effect in the human data.
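The √2 factor follows directly from the form of the bound: the JND scales as one over the square root of the summed information, so halving the information multiplies the threshold by √2 ≈ 1.41. A minimal check of that arithmetic, with an arbitrary placeholder information value:

```python
import math

info_quiet = 1.0                  # summed d'^2 information in quiet (arbitrary units)
info_noise = info_quiet / 2.0     # masker assumed to remove CFs above the tone,
                                  # i.e., roughly half of the information

# Delta-f_JND is proportional to 1/sqrt(information) (d' = 1 at threshold):
threshold_ratio = math.sqrt(info_quiet / info_noise)
# threshold_ratio = sqrt(2) ~ 1.414, close to the in-noise factors in Table 1
```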
The only relative difference between the rate-place and all-information predictions was for the random-level in-noise condition. All-information thresholds were worse than condition 1 by a factor of 1.40; there was no effect of imposing random-level variation in the presence of high-frequency noise (conditions 3 and 4 were identical;
see Figure 2b and Table 1). In contrast, the rate-place thresholds increased by a factor of 9.68 compared to frequency discrimination in quiet (see Figure 2c and Table 1). The large predicted increase in rate-place thresholds for the random-level-in-noise condition compared to the noise-alone condition is due to the inability of the rate-place model to compare CFs above and below the tone frequency (see Figure 1). This large effect is inconsistent with the small effect observed in human performance at all frequencies.

6 Discussion
This study describes an extension of SDT analysis of stochastic neural models to psychophysical tasks in which one stimulus parameter is randomly varied in order to restrict the cues available to the subject. Frequency-discrimination performance limits based on either the population of discharge counts (rate-place) or discharge times (all-information) of high-spontaneous-rate, low-threshold AN fibers were unaffected by random-level variation, consistent with human performance. The distributions of frequency and level information across the AN population demonstrated how both rate-place and all-information encoding schemes avoid the effect of random-level variation in quiet. Predictions were also made for random-level frequency discrimination in the presence of high-frequency noise, based on the simplified assumption that the noise masker acts to eliminate all information above the frequency of the tone. When this simple model of the effect of noise masking is used, the predictions for the random-level frequency discrimination in noise experiment (see Figure 2) are inconsistent with the human data if rate-place coding of frequency is assumed. The predicted effect on rate-place performance of adding random-level variation in the presence of high-frequency noise (see Figure 2c) is much larger than the small effect observed in human performance (see Figure 2a; Moore & Glasberg, 1989). In fact, there is insufficient rate-place information in the total AN model population to account for human performance in the random-level-in-noise condition. The cause of the large reduction in information in the random-level-in-noise condition is the inability of the rate-place model to compare CFs above and below the tone frequency.
Medium- and low-spontaneous-rate (SR) AN fibers, which have higher thresholds and broader dynamic ranges than the high-SR, low-threshold fibers included in our AN model, would not be expected to alter significantly the fundamental relation between frequency and level discrimination that underlies the behavior of the rate-place predictions. Neither an increase in the total number of AN fibers nor an alteration of the innervation density across CF would reduce the large discrepancy (a factor of 9.68) between rate-place JNDs for the in-quiet and random-level-in-noise conditions. The implications of less easily quantified alterations to the AN model for the assumed effect of the masking noise are discussed below.
The inconsistency between the effect of random-level variation with a noise masker for the rate-place model and human performance occurs at all frequencies. This finding, combined with the finding by Moore and Glasberg (1989) that there was no statistically significant interaction between the effects of conditions 1 through 4 (see Figure 2) and frequency, suggests that a single encoding scheme is responsible for performance at all frequencies, rather than the duplex theory often invoked to explain frequency encoding (e.g., Wever, 1949; Moore, 1973; Dye & Hafter, 1980; Wakefield & Nelson, 1985; Moore & Glasberg, 1989). As we discussed in the companion article, rate-place performance in quiet is closer to human performance than all-information performance in terms of absolute performance level. However, the trends in rate-place performance versus frequency are inconsistent with human performance at high frequencies, contrary to the duplex theory. This strong discrepancy between the trends in rate-place and human performance is shown in this study to exist for all four frequency-discrimination conditions described by Moore and Glasberg (1989). In contrast to rate-place, all-information performance limits match the trends in human performance across all frequencies and all four conditions. Notably, and contrary to general beliefs (e.g., Moore, 1973, 1989; Dye & Hafter, 1980; Wakefield & Nelson, 1985; Javel & Mott, 1988; Pickles, 1988; Moore & Glasberg, 1989), there is significant temporal information in the AN at high frequencies for all four conditions in this study. Also, this study shows that all-information performance limits are unaffected by random-level variation in quiet and in the presence of high-frequency noise (see Figure 2), consistent with the small effects on human performance. The computational AN model used in this study did not include several important aspects of AN responses that could affect the predictions for the masking conditions.
The absence of suppression (i.e., of nonlinear interactions between different CFs; Sachs & Kiang, 1968; Delgutte, 1990; Ruggero et al., 1992) in our model prohibits the accurate simulation of AN responses to complex stimuli (e.g., noise stimuli). Suppression could potentially produce effects that contradict the assumption that the high-frequency noise acts to mask completely all AN fibers with CF above the tone frequency. Another potentially significant limitation of the current AN model is the exclusion of high-threshold AN fibers with low and medium spontaneous rates (Liberman, 1978). Low- and medium-spontaneous-rate AN fibers (16% and 23% of the AN population, respectively) have been suggested to contribute to level encoding at higher levels (e.g., Colburn, 1981; Delgutte, 1987; Winter & Palmer, 1991) and therefore should be included in future AN models to quantify the effects of masking noise. In addition, a significant extension of the SDT analysis is necessary in order to evaluate accurately psychophysical tasks with complex random stimuli, such as noise (Heinz, 2000). Future studies using more complex AN models and SDT analyses will evaluate the validity of common assumptions regarding the effects of masking noise on auditory information.
7 Conclusion
1. Signal detection theory can be used to quantify the effects of both physiological noise and stimulus variation on psychophysical performance limits in discrimination experiments with random variation in one stimulus parameter.

2. Frequency-discrimination performance limits based on the population of AN discharge counts (rate-place) or based on discharge times on individual AN fibers are unaffected by random-level variation in quiet.

3. There is insufficient rate-place information in this AN model to account for human performance in a random-level frequency-discrimination task with high-frequency noise, based on a common psychophysical assumption for the effects of noise maskers.

4. All-information performance limits with high-frequency noise are unaffected by random-level variation, consistent with human performance.

The primary goal of this study was to demonstrate a method for relating stochastic neural responses to behavior in psychophysical tasks that include random variation of one stimulus parameter. This method was demonstrated for the auditory task of random-level frequency discrimination but is applicable to any psychophysical discrimination experiment in which one parameter is randomly varied. The analysis presented applies to any sensory system for which there are models that describe the statistical properties of neural responses to the relevant stimuli. Equation 3.1 is valid for any neural model, while some of the analysis in section 3.2 is specific to neural responses that are well described statistically by conditionally independent, nonstationary Poisson processes. Thus, the study describes a general modeling approach for quantitatively relating physiological responses to behavior in complex psychophysical tasks that are often used to test neural encoding hypotheses.

Acknowledgments
We thank Bertrand Delgutte, Christopher Long, Christine Mason, Martin McKinney, Susan Moscynski, Bill Peake, Tom Person, Timothy Wilson, Xuedong Zhang, and Ling Zheng for providing valuable comments on an earlier version of this article. We greatly appreciate the constructive comments and suggestions provided by three anonymous reviewers. We thank Brian Moore for providing the human data shown in Figure 2a. This study was part of a graduate dissertation in the Speech and Hearing Sciences Program of the Harvard-MIT Division of Health Sciences and Technology (Heinz, 2000). Portions of this work were presented at the 22nd Midwinter Meeting of the
Association for Research in Otolaryngology. The simulations in this study were performed on computers provided by the Scientific Computing and Visualization group at Boston University. This work was supported in part by the National Institutes of Health, Grants T32DC00038 and R01DC00100, as well as by the Office of Naval Research, Grant Agmt Z883402.

References

Braida, L. D., & Durlach, N. I. (1988). Peripheral and central factors in intensity perception. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Auditory function: Neurobiological bases of hearing (pp. 559–583). New York: Wiley.
Carney, L. H. (1993). A model for the responses of low-frequency auditory-nerve fibers in cat. J. Acoust. Soc. Am., 93, 401–417.
Colburn, H. S. (1969). Some physiological limitations on binaural performance. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Colburn, H. S. (1973). Theory of binaural interaction based on auditory-nerve data. I. General strategy and preliminary results on interaural discrimination. J. Acoust. Soc. Am., 54, 1458–1470.
Colburn, H. S. (1977a). Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise. J. Acoust. Soc. Am., 61, 525–533.
Colburn, H. S. (1977b). Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise. Supplementary material (AIP Document No. PAPS JASMA-61-525-98). Available online at: http://www.aip.org/epaps/howorder.html.
Colburn, H. S. (1981). Intensity perception: Relation of intensity discrimination to auditory-nerve firing patterns (Internal Memorandum). Cambridge, MA: Research Laboratory of Electronics, Massachusetts Institute of Technology.
Cramér, H. (1951). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997a). Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am., 102, 2892–2905.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997b). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J. Acoust. Soc. Am., 102, 2906–2919.
Dau, T., Püschel, D., & Kohlrausch, A. (1996a). A quantitative model of the "effective" signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am., 99, 3615–3622.
Dau, T., Püschel, D., & Kohlrausch, A. (1996b). A quantitative model of the "effective" signal processing in the auditory system. II. Simulations and measurements. J. Acoust. Soc. Am., 99, 3623–3631.
Delgutte, B. (1987). Peripheral auditory processing of speech information: Implications from a physiological study of intensity discrimination. In M. E. H. Schouten (Ed.), The psychophysics of speech perception (pp. 333–353). Dordrecht: Nijhoff.
2336
M. G. Heinz, H. S. Colburn, and L. H. Carney
Delgutte, B. (1990). Two-tone rate suppression in auditory-nerve fibers: Dependence on suppressor frequency and level. Hear. Res., 49, 225–246.
Delgutte, B. (1996). Physiological models for basic auditory percepts. In H. L. Hawkins, T. A. McMullen, A. N. Popper, & R. R. Fay (Eds.), Auditory computation (pp. 157–220). New York: Springer-Verlag.
Durlach, N. I., & Braida, L. D. (1969). Intensity perception I: Preliminary theory of intensity resolution. J. Acoust. Soc. Am., 46, 372–383.
Durlach, N. I., Braida, L. D., & Ito, Y. (1986). Towards a model for discrimination of broadband signals. J. Acoust. Soc. Am., 80, 63–72.
Dye, R. H., & Hafter, E. R. (1980). Just-noticeable differences of frequency for masked tones. J. Acoust. Soc. Am., 67, 1746–1753.
Emmerich, D. S., Ellermeier, W., & Butensky, B. (1989). A reexamination of the frequency discrimination of random-amplitude tones, and a test of Henning's modified energy-detector model. J. Acoust. Soc. Am., 85, 1653–1659.
Erell, A. (1988). Rate coding model for discrimination of simple tones in the presence of noise. J. Acoust. Soc. Am., 84, 204–214.
Fitzhugh, R. (1958). A statistical analyser for optic nerve messages. J. Gen. Physiol., 41, 675–692.
Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of auditory filter shapes from notched-noise data. Hear. Res., 47, 103–138.
Goldstein, J. L., & Srulovicz, P. (1977). Auditory-nerve spike intervals as an adequate basis for aural spectrum analysis. In E. F. Evans & J. P. Wilson (Eds.), Psychophysics and physiology of hearing (pp. 337–347). New York: Academic Press.
Green, D. M. (1988). Profile analysis: Auditory intensity discrimination. New York: Oxford University Press.
Green, D. M., Kidd, G., Jr., & Picardi, M. C. (1983). Successive versus simultaneous comparison in auditory intensity discrimination. J. Acoust. Soc. Am., 73, 639–643.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwood, D. D. (1990). A cochlear frequency-position function for several species—29 years later. J. Acoust. Soc. Am., 87, 2592–2605.
Gresham, L. C., & Collins, L. M. (1998). Analysis of the performance of a model-based optimal auditory signal processor. J. Acoust. Soc. Am., 103, 2520–2529.
Heinz, M. G. (2000). Quantifying the effects of the cochlear amplifier on temporal and average-rate information in the auditory nerve. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Henning, G. B. (1966). Frequency discrimination of random amplitude tones. J. Acoust. Soc. Am., 39, 336–339.
Henning, G. B. (1967). A model for auditory discrimination and detection. J. Acoust. Soc. Am., 42, 1325–1334.
Huettel, L. G., & Collins, L. M. (1999). Using computational auditory models to predict simultaneous masking data: Model comparison. IEEE Trans. Biomed. Eng., 46, 1432–1440.
Javel, E., & Mott, J. B. (1988). Physiological and psychophysical correlates of temporal processing in hearing. Hear. Res., 34, 275–294.
Auditory Performance Limits with Random Level
2337
Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J. Acoust. Soc. Am., 68, 1115–1122.
Joris, P. X., Carney, L. H., Smith, P. H., & Yin, T. C. T. (1994). Enhancement of neural synchrony in the anteroventral cochlear nucleus. I. Responses to tones at the characteristic frequency. J. Neurophysiol., 71, 1022–1036.
Kidd, G., Jr., Mason, C. R., Brantley, M. A., & Owen, G. A. (1989). Roving-level tone-in-noise detection. J. Acoust. Soc. Am., 86, 1310–1317.
Liberman, M. C. (1978). Auditory-nerve response from cats raised in a low-noise chamber. J. Acoust. Soc. Am., 63, 442–455.
McKee, S. P., Silverman, G. H., & Nakayama, K. (1986). Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Res., 26, 609–619.
Moore, B. C. J. (1973). Frequency difference limens for short-duration tones. J. Acoust. Soc. Am., 54, 610–619.
Moore, B. C. J. (1989). An introduction to the psychology of hearing. New York: Academic Press.
Moore, B. C. J., & Glasberg, B. R. (1989). Mechanisms underlying the frequency discrimination of pulsed tones and the detection of frequency modulation. J. Acoust. Soc. Am., 86, 1722–1732.
Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Annu. Rev. Neurosci., 21, 227–277.
Parzen, E. (1962). Stochastic processes. San Francisco: Holden-Day.
Pickles, J. O. (1988). An introduction to the physiology of hearing. New York: Academic Press.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Ruggero, M. A. (1992). Physiology and coding of sound in the auditory nerve. In A. N. Popper & R. R. Fay (Eds.), The mammalian auditory pathway: Neurophysiology (pp. 34–93). New York: Springer-Verlag.
Ruggero, M. A., Robles, L., & Rich, N. C. (1992). Two-tone suppression in the basilar membrane of the cochlea: Mechanical basis of auditory-nerve rate suppression. J. Neurophysiol., 68, 1087–1099.
Sachs, M. B., & Kiang, N. Y. S. (1968). Two-tone inhibition in auditory-nerve fibers. J. Acoust. Soc. Am., 43, 1120–1128.
Siebert, W. M. (1965). Some implications of the stochastic behavior of primary auditory neurons. Kybernetik, 2, 206–215.
Siebert, W. M. (1968). Stimulus transformation in the peripheral auditory system. In P. A. Kolers & M. Eden (Eds.), Recognizing patterns (pp. 104–133). Cambridge, MA: MIT Press.
Siebert, W. M. (1970). Frequency discrimination in the auditory system: Place or periodicity mechanisms? Proc. IEEE, 58, 723–730.
Snyder, D. L., & Miller, M. I. (1991). Random point processes in time and space. New York: Springer-Verlag.
van Trees, H. L. (1968). Detection, estimation, and modulation theory: Part I. New York: Wiley.
Verschuure, J., & van Meeteren, A. A. (1975). The effect of intensity on pitch. Acustica, 32, 33–44.
Viemeister, N. F. (1983). Auditory intensity discrimination at high frequencies in the presence of noise. Science, 221, 1206–1208.
Wakefield, G. H., & Nelson, D. A. (1985). Extension of a temporal model of frequency discrimination: Intensity effects in normal and hearing-impaired listeners. J. Acoust. Soc. Am., 77, 613–619.
Weiss, T. F., & Rose, C. (1988). A comparison of synchronization filters in different auditory receptor organs. Hear. Res., 33, 175–180.
Westerman, L. A., & Smith, R. L. (1988). A diffusion model of the transient response of the cochlear inner hair cell synapse. J. Acoust. Soc. Am., 83, 2266–2276.
Wever, E. G. (1949). Theory of hearing. New York: Wiley.
Winter, I. M., & Palmer, A. R. (1991). Intensity coding in low-frequency auditory-nerve fibers of the guinea pig. J. Acoust. Soc. Am., 90, 1958–1967.
Zwicker, E. (1956). Die elementaren Grundlagen zur Bestimmung der Informationskapazität des Gehörs. Acustica, 6, 365–381.
Zwicker, E. (1970). Masking and psychological excitation as consequences of the ear's frequency analysis. In R. Plomp & G. F. Smoorenburg (Eds.), Frequency analysis and periodicity detection in hearing. Leiden: Sijthoff.

Received January 11, 2000; accepted January 3, 2001.
LETTER
Communicated by Terrence J. Sejnowski
Adaptive Algorithm for Blind Separation from Noisy Time-Varying Mixtures

V. Koivunen
M. Enescu
Signal Processing Laboratory, Helsinki University of Technology, FIN-02015 HUT, Finland

E. Oja
Laboratory of Computer and Information Science, Helsinki University of Technology, FIN-02015 HUT, Finland

This article addresses the problem of blind source separation from time-varying noisy mixtures using a state-variable model and recursive estimation. An estimate of each source signal is produced in real time at the arrival of a new observed mixture vector. The goal is to perform the separation and attenuate noise simultaneously, as well as to adapt to changes that occur in the mixing system. The observed data are projected along the eigenvectors in the signal subspace. The subspace is tracked in real time. Source signals are modeled using low-order AR (autoregressive) models, and noise is attenuated by trading off between the model and the information provided by measurements. The type of zero-memory nonlinearity needed in separation is determined on-line. Predictor-corrector filter structures are proposed, and their performance is investigated in simulation using biomedical and communications signals at different noise levels and a time-varying mixing system. In quantitative comparison to other widely used methods, significant improvement in output signal-to-noise ratio is achieved.

1 Introduction
Neural Computation 13, 2339–2357 (2001) © 2001 Massachusetts Institute of Technology

In blind source separation (BSS), a collection of observed linear combinations (mixtures) of unknown source signals is processed in order to find the unobservable sources. BSS has important applications in speech, array, biomedical, and communications signal processing. The word blind refers to the fact that we do not know either the source signals or the structure of the mixing system. Only some statistical properties of the sources are exploited in the separation. The separation is typically based on the assumption that the sources are statistically independent. Commonly, BSS algorithms assume that no noise is present in the data samples (Hyvärinen & Oja, 1998). This is not a reasonable assumption if the
data originate from physical measurements. In this article, we address the problem of real-time separation of sources from noisy mixtures. A signal estimation approach whereby unobservable source signals are estimated from observed noisy mixture signals is employed. An adaptive predictor-corrector filter structure is developed where a filtered estimate of the source signal vector s(k) and an estimate of the separating matrix are provided at the arrival of a new measurement vector y(k) at time k. A few dominant eigenvectors of the spatial covariance matrix are tracked simultaneously. They are used to project measurements along the signal subspace eigenvectors in order to reduce dimension and attenuate noise. The resulting vector components are uncorrelated, a necessary condition for independence. Moreover, the component variances are normalized to unity. By employing adaptive filter structures, source signals may be estimated reliably even if the signal statistics or mixing coefficients are time varying. The mixing matrix is time varying, for example, if the signal sources or receiving sensor array are mobile or if the number of source signals varies over time. Some noise always remains even if the noise subspace is discarded. In order to attenuate the noise, we assume that the original source signals have some structure, which is a reasonable assumption if the signal is information bearing. Commonly used signal models are an autoregressive (AR) model, a narrowband model, a constant envelope, and the finite alphabet property in digital communications. Noise attenuation is achieved here by modeling the structure of the signal on-line and then trading off between the model and the new information provided by measurements. The BSS problem is defined in section 2. In section 3, an algorithm for adaptive separation from noisy mixtures is proposed. First, subspace tracking is considered. Then the BSS problem is described using a basic state-variable model.
A predictor-corrector filter structure related to Kalman filtering is developed for estimating the source signals. In section 4, examples of source separation using biomedical and communications signals in additive gaussian noise are presented. Both quantitative and qualitative results of noise attenuation are given. The performance is quantitatively compared to a maximum likelihood type of method (Hyvärinen, 1997) and to the JADE algorithm (Cardoso, 1999). The adaptivity of the proposed approach is demonstrated in simulations where the mixing matrix changes either rapidly or slowly. Changes in the number of sources are considered as well.

2 Blind Source Separation Model
There is plenty of recent work on BSS in the signal processing, communications, and neural network research communities (Amari & Cichocki, 1998; Cardoso, 1989, 1998; Bell & Sejnowski, 1995; Comon, 1994; Jutten & Hérault, 1991; Karhunen, Cichocki, Kasprzak, & Pajunen, 1997; Karhunen & Pajunen, 1997; Oja, 1997; Attias, 1999).
The unobservable source signals and the observed noisy mixtures are related by

    y(k) = A s(k) + v(k),    (2.1)

where A is an n × m matrix of unknown mixing coefficients, n ≥ m, s is a column vector of m source signals, y is a column vector of n mixtures, v is an additive noise vector, and k is the time index. The mixing here is assumed to be instantaneous, and matrix A is assumed to be of full rank. Source signals are typically assumed to be zero mean and stationary. The separation task at hand is to estimate the original source signals with high fidelity from the noisy mixtures. Estimates of a separating matrix, denoted by W, or a mixing matrix, H, are needed as well. They may be considered nuisance parameters if one is interested only in the sources. Observed mixtures are spatially decorrelated, and signal powers are normalized to unity prior to separation. In addition, by projecting the input data into an m-dimensional signal subspace, the problem becomes easier to solve because n = m and the separating matrix will be an orthogonal matrix, that is, W = H^{-1} = H^T. An estimate x of the unknown sources s is then given by

    ŝ = x = Ĥ^T y.    (2.2)
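As a concrete illustration of the mixing model (2.1) and the whitening that precedes separation, the sketch below (sizes, source signals, and noise level are illustrative, not taken from the article) generates noisy mixtures and projects them onto the dominant signal subspace so that the residual demixing matrix is approximately orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 4, 3, 2000                     # sensors, sources, samples (illustrative)

# Hypothetical sources: two heavy-tailed signals and one sinusoid.
s = np.vstack([rng.laplace(size=T),
               rng.laplace(size=T),
               np.sin(2 * np.pi * 0.05 * np.arange(T))])
A = rng.standard_normal((n, m))          # unknown full-rank mixing matrix
v = 0.1 * rng.standard_normal((n, T))    # additive noise
y = A @ s + v                            # noisy mixtures, eq. (2.1)

# Decorrelate and normalize: project onto the m dominant eigenvectors of
# the mixture covariance and scale each component to unit variance.  After
# this step the remaining mixing is (approximately) orthogonal, so a
# separating matrix satisfies W = H^{-1} = H^T as in eq. (2.2).
C = np.cov(y)
eigval, U = np.linalg.eigh(C)
order = np.argsort(eigval)[::-1][:m]     # m largest eigenpairs
z = np.diag(eigval[order] ** -0.5) @ U[:, order].T @ y
print(np.round(np.cov(z), 2))            # close to the identity matrix
```

The printed covariance of the projected data is the identity up to rounding, confirming the decorrelation and unit-power normalization described above.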
The estimate of the sources can be obtained only up to a permutation of s; that is, the order and sign of the sources may change. If n > m and the signal-to-noise ratio (SNR) is sufficiently high, the dimension of the signal subspace can easily be estimated from the eigenvalue spectrum of the spatial covariance matrix. At low SNR one has to resort to information-theoretic criteria such as the minimum description length (MDL) or the Akaike information criterion (AIC) (see Wax & Kailath, 1985). In many practical BSS applications, observations are noisy, the mixing system or source statistics may be time varying, and source signals may appear and disappear randomly. Moreover, the delays associated with batch processing may be intolerable. Hence, it is desirable to develop recursive BSS algorithms that take into account both noise and the time-varying nature of the problem. In this article a state-variable model and an adaptive filter structure related to the Kalman filter are employed in blind separation. The use of a state-variable model in BSS was proposed in Gharbi and Salam (1997) and has been further investigated in Everson and Roberts (1999), Koivunen and Oja (1999), and Zhang and Cichocki (1999). In each case, the model is constructed differently. In Zhang and Cichocki (1999) very general mixing and demixing models covering both nonlinear and linear systems are proposed. The blind deconvolution problem is divided into separation and state estimation problems. A learning algorithm is used in updating the matrices in the model, and a Kalman filter is employed in updating the states. In
Koivunen and Oja (1999), the separated sources are the states, and the time-varying mixing system appears in the measurement equations and maps measured mixtures to states. Updated estimates of the states and the unknown matrices in the state-variable model are obtained using separate innovation expressions as each new measurement becomes available. The measurements are projected into the signal subspace. A time-varying mixing system is also considered in Everson and Roberts (1999). The mixing system is assumed to obey a first-order Markov model. The elements of the mixing matrix are the states, and the goal is to determine the probability density function (pdf) of the state given the observations. A generalized exponential model is used for modeling the sources. There are many different contrast functions that may be used in order to perform the separation (see Cardoso, 1998). In this article, higher-order statistics are used implicitly through a zero-memory nonlinearity. An appropriate nonlinearity is selected on-line depending on the source statistics (Douglas, Cichocki, & Amari, 1997).

3 Predictor-Corrector Structure
The goal of the adaptive source separation algorithm presented in this section is to estimate the unknown sources in the presence of noise. The algorithm consists of two parts: a signal subspace tracker and a recursive predictor-corrector filter structure. It is related to the well-known Kalman filter formulation (Haykin, 1996). In conventional Kalman filtering, the matrices in the state-variable model are assumed to be known. In blind separation, these matrices, including the separating matrix, have to be estimated as well. An appropriate zero-memory nonlinearity needed for separation is selected on-line based on the source statistics. Prediction of the next state is here based on a low-order AR model. The model is employed in noise attenuation by trading off between the acquired system model and the information provided by the measurement. This approach allows for instantaneous estimation of the source signals and the separating matrix at time k. If some extra delay can be tolerated, fixed-lag smoothing may be used instead of filtering in order to improve performance further.

3.1 Signal Subspace Tracking. Commonly, blind separation is preceded by a decorrelating transform. It is obtained by finding the eigenvectors U and eigenvalues Λ of the covariance matrix of y(k) and applying the transformation Λ^{-1/2} U^T. If we have m sources, the n-dimensional data (n > m) may be projected onto the subspace spanned by the eigenvectors corresponding to the m largest eigenvalues. As a result, some noise is attenuated, and no matrix inversion is needed in finding the separating matrix because it will be orthogonal (H^{-1} = H^T). The m dominant eigenvectors and eigenvalues are tracked adaptively (Karasalo, 1986). One SVD is required in the process. The dimension of the
matrix to be decomposed depends on the number of signals, however. An upper bound for the number of sources is assumed, and n > m. Substantial computational savings are obtained in particular if the dimension of the signal subspace is small. The adaptation of the original algorithm (Karasalo, 1986) is such that it initially relies heavily on the observed data and in later stages more on the computed eigen-decomposition and less on new data. The method per se tracks slowly varying mixing systems reliably, but it takes quite a long time to adapt to a drastic change or changes in the rank (number of sources). In order to recover the track quickly after an abrupt change, we monitor whether drastic changes in the mixing matrix have taken place; if they have, we force the tracking system to rely more on the recent measurements. Drastic changes may be detected, for example, by simultaneously computing the sample covariance matrix Ĉ in a relatively small processing window and comparing it to the covariance matrix constructed by the subspace tracker,

    Ĉ_tr = U [Λ_tr − diag(σ_tr²)] U^T + σ_tr² I,

where Λ_tr contains the eigenvalues and U the eigenvectors estimated by the subspace tracker. The expression [Λ_tr − diag(σ_tr²)] above estimates the eigenvalues of the noise-free signal covariance matrix, R̂ = σ_tr² I estimates the measurement noise covariance matrix, and diag(σ_tr²) forms a diagonal matrix with σ_tr² on the diagonal. The estimate of the noise covariance matrix is also needed in the state-variable model. The sample covariance matrix at time k in a window of N samples is computed recursively by

    Ĉ_k = Ĉ_{k−1} + (1/N) y_k y_k^T − (1/N) y_{k−N} y_{k−N}^T.

A matrix R̂ of correlation coefficients is formed from each of the covariance matrices Ĉ_k and Ĉ_tr. The following dimensionless expression,

    ‖R̂_tr − R̂_k‖_F / ‖R̂_tr‖_F,    (3.1)

is compared to a threshold value. Since the expression is dimensionless, the selection of a threshold value describing the relative change in covariance structure is simple. If it is exceeded, the weighting used in the subspace tracking is changed back to the initial weights so that recent measurements are trusted more.

3.2 State-Variable Model. The BSS problem is here formulated as a state-variable model. The basic model is given as follows:

    x(k) = F(k | k−1) x(k−1) + G(k) w(k−1)    (3.2)
    y(k) = H(k) x(k) + v(k),    (3.3)
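A minimal numerical sketch of this model and of the standard predictor-corrector recursion it leads to (equations 3.4 and 3.5) is given below. Unlike the BSS setting, the sketch assumes F, H, Q, and R are known; all matrix values and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, T = 2, 3, 500                       # states, measurements, time steps

# Illustrative model matrices: in the BSS setting F and H are unknown and
# must be estimated on-line; they are fixed here only to expose the recursion.
F = 0.95 * np.eye(m)                      # state transition, eq. (3.2)
H = rng.standard_normal((n, m))           # "mixing" matrix, eq. (3.3)
Q = 0.05 * np.eye(m)                      # state noise covariance
R = 0.10 * np.eye(n)                      # measurement noise covariance

x_true = np.zeros(m)
x_hat, P = np.zeros(m), np.eye(m)
for _ in range(T):
    # Simulate the model, eqs. (3.2) and (3.3), with G = I.
    x_true = F @ x_true + rng.multivariate_normal(np.zeros(m), Q)
    y = H @ x_true + rng.multivariate_normal(np.zeros(n), R)
    # Prediction, eq. (3.4), plus the prediction error covariance update.
    x_pred = F @ x_hat
    P_pred = F @ P @ F.T + Q
    # Correction, eq. (3.5): the gain multiplies the innovation y - H x_pred.
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_hat = x_pred + K @ (y - H @ x_pred)
    P = (np.eye(m) - K @ H) @ P_pred
print(np.trace(P))
```

The trace of the estimation error covariance settles rapidly to a small steady-state value, the behavior reported for the proposed structure in section 4.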
where w and v are mutually uncorrelated white noise sequences with covariance matrices Q(k) and R(k), y are the measurements, and x are the states to be estimated. The state noise weighting matrix G(k) is assumed to be a diagonal (identity) matrix here because the sources are assumed to be statistically independent. The matrix sequences F(k | k−1) and H(k) are commonly assumed to be known, which is not the case in BSS. In the BSS context, the measurements y are the observed mixtures, the states x are the estimated source signals, and H is the mixing matrix. The well-known Kalman filter may be presented in a predictor-corrector form as follows. The predicted state estimate x̂(k | k−1) is given by

    x̂(k | k−1) = F(k | k−1) x̂(k−1 | k−1),    (3.4)

and the prediction error covariance matrix P(k | k−1) is updated as well (see Haykin, 1996). The correction equations update the predicted state estimate to a filtered state estimate based on the new information conveyed by the measurement. The estimate x̂(k | k) of the source is obtained by

    x̂(k | k) = x̂(k | k−1) + K(k) [y(k) − H x̂(k | k−1)],    (3.5)
where K(k) is the Kalman gain, which specifies the amount by which the innovation ỹ(k | k−1) = y(k) − H(k) x̂(k | k−1) must be multiplied to get the correction of the old state estimate. The estimation error covariance matrix P(k | k) and the gain are updated as well.

3.2.1 Extensions for Solving the BSS Problem. In the BSS problem, the standard Kalman filter cannot be applied directly, because the matrices H and F are not known. Matrix H describes the mixing system, whereas matrix F models how the sources evolve over time. The matrix F is needed in noise attenuation because the Kalman filter determines whether the system model or the measurement is to be trusted more and, consequently, which is to be emphasized. An estimate of the noise covariance matrix R needed in the Kalman gain K is provided by the subspace tracker. The Kalman filter equations need to be modified in order to estimate H and F simultaneously. Optimality in the mean square sense cannot be claimed because the states (sources) are nongaussian, and only estimates of the matrices in the state-variable model are available. These uncertainties are partly modeled by the state noise, however. In addition, the performance cannot be evaluated in advance because off-line computation of the error covariance matrices and the gains cannot be performed prior to making any measurements. In estimating the mixing matrix H, separate expressions for the innovation and the gain for correcting the elements of H are required. The "innovation" in estimating H is

    ỹ_H(k) = y(k) − H(k−1) x̂_H(k),
where x̂_H(k) = g(u(k)), u(k) = H^T y(k), g(u(k)) = [g_1(u_1(k)) ··· g_m(u_m(k))]^T, and the g_i(·) are nonlinear contrast functions. The gain K_H used in estimating the mixing matrix is

    K_H = P x̂_H / (x̂_H^T P x̂_H + 1).

Finally, the update for H is given by

    H(k) = H(k−1) + ỹ_H(k) K_H^T.
The adaptation rate for H should be slow compared to the adaptation rate for the actual state variables in order to have better stability. The unknown state transition matrix F describes how the state sequence evolves over time. In the correction stage, the algorithm produces a filtered state estimate, which is a trade-off between the measurement and the predictions produced by the state-variable model. In order to make the prediction more accurate, F is augmented to contain a low-order AR model instead of the first-order model in regular Kalman filtering. The order of the AR model may be estimated, for example, by AIC (see Hayes, 1996). A severe model order mismatch leads to performance similar to that of methods that do not take noise into account. In the case of an AR model of order p, the prediction for the ith component of the state vector is performed using the vector f_i of AR coefficients given by

    f_i = [a_{i,1} a_{i,2} ··· a_{i,p−1} a_{i,p}],

and the ith component of the state variable evolves as

    x_i(k) = f_i x_i(k−1) + w_i,

where

    x_i(k) = [x_i(k) x_i(k−1) ··· x_i(k−p+1)]^T

is a vector of the p past states and w_i = [w_i 0 ··· 0]^T has p elements. In x_i(k), we are interested only in the predicted state x_i(k) at time k. Predictions of the past states may be discarded. This models how the states evolve over time and allows for noise attenuation. The changes in state prediction cause some changes in the update for the prediction error covariance matrix. The expression for P_i(k | k−1) of the ith component of the state variable becomes

    P_i(k | k−1) = f_i P_i(k−1 | k−1) f_i^T + σ_{w,i}²,

where the P_i are scalars, p is the order of the AR model, and σ_{w,i}² is the state noise variance from the (i,i) element of Q. Each P_i(k | k−1) is inserted into the (i,i) element of the matrix P(k | k−1) in the Kalman filter equations. Then
the update for the estimation error covariance P(k | k) is identical to that of the conventional Kalman filter. Various techniques can be used for adaptively modifying the AR coefficients in the f_i vectors, for example, Kalman, recursive least squares (RLS), or least mean squares. In order to model potentially nonstationary sources effectively, the sliding-window RLS algorithm is employed (see Hayes, 1996). Unlike growing-memory and exponentially weighted RLS, it has a finite memory and completely discards past data points after a while.

3.3 Selecting an Appropriate Nonlinearity. If prior information on the sign of the kurtosis of the sources is available, a fixed nonlinear contrast function g(·) may be selected in advance. Since no such information is necessarily available, the statistics of each output of the separation system are recursively tracked, and an appropriate nonlinearity for each channel is selected from two choices. The selection principle stems from the method presented in Amari, Chen, and Cichocki (1997). In order to choose the nonlinearity for each channel, we recursively estimate the following statistics at each time step k:
    σ_i²(k) = (1 − μ) σ_i²(k−1) + μ |x_i(k)|²
    κ_{i,r}(k) = (1 − μ) κ_{i,r}(k−1) + μ g'_r(x_i(k))
    ρ_{i,r}(k) = (1 − μ) ρ_{i,r}(k−1) + μ x_i(k) g_r(x_i(k)),    (3.6)

where 1 ≤ i ≤ m, r ∈ {1, 2} refers to the type of nonlinearity, and μ is a positive constant such that 0 < μ ≪ 1. Let us denote K_1 = σ_i²(k) κ_{i,1}(k) − ρ_{i,1}(k) and K_2 = σ_i²(k) κ_{i,2}(k) − ρ_{i,2}(k). The nonlinear function for the ith source at time k is selected as follows:

    g_{ik} = g_1(x), if K_1 − K_2 < 0;
             g_2(x), otherwise.    (3.7)

The functions g_1(·) and g_2(·) are the appropriate nonlinearities for sub- and supergaussian sources (see Cardoso, 1998; Karhunen, Oja, Wang, Vigario, & Joutsensalo, 1997). In this article, g_{i,1}(u) = tanh(a_1 u) and g_{i,2}(u) = a_2 u³ provide adequate separation capabilities, where a_1 and a_2 are constants. Thus, at each time k, the component g_i of g(u(k)) is either g_1(·) or g_2(·), based on criterion 3.7.

4 Examples
In this section, the performance of the proposed predictor-corrector structure for BSS is investigated in simulation using biomedical and communications source signals. In all the examples, the type of nonlinearity needed in separation is selected automatically based on the data.
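The automatic selection just mentioned, that is, the recursive statistics of equation 3.6 and criterion 3.7, can be sketched as follows. The constants a_1 = 1 and a_2 = 1/3 match the first example below; the sparse-spike and binary test signals are illustrative stand-ins for super- and subgaussian sources:

```python
import numpy as np

def g1(u):  return np.tanh(u)            # g_{i,1}(u) = tanh(a1 u), a1 = 1
def dg1(u): return 1.0 - np.tanh(u) ** 2
def g2(u):  return u ** 3 / 3.0          # g_{i,2}(u) = a2 u^3, a2 = 1/3
def dg2(u): return u ** 2

def select_nonlinearity(x, mu=0.01):
    """Track the eq. (3.6) statistics for one channel, then apply (3.7)."""
    s2 = k1 = k2 = r1 = r2 = 0.0
    for xi in x:
        s2 = (1 - mu) * s2 + mu * xi ** 2          # sigma_i^2
        k1 = (1 - mu) * k1 + mu * dg1(xi)          # kappa_{i,1}
        k2 = (1 - mu) * k2 + mu * dg2(xi)          # kappa_{i,2}
        r1 = (1 - mu) * r1 + mu * xi * g1(xi)      # rho_{i,1}
        r2 = (1 - mu) * r2 + mu * xi * g2(xi)      # rho_{i,2}
    K1, K2 = s2 * k1 - r1, s2 * k2 - r2
    return g1 if K1 - K2 < 0 else g2               # criterion (3.7)

rng = np.random.default_rng(3)
# Sparse spikes: unit variance, kurtosis 10 (supergaussian).
supergaussian = np.sqrt(10) * rng.choice([-1.0, 0.0, 1.0],
                                         size=5000, p=[0.05, 0.9, 0.05])
# Binary signal: unit variance, negative excess kurtosis (subgaussian).
subgaussian = np.sign(rng.standard_normal(5000))
print(select_nonlinearity(supergaussian) is g2,
      select_nonlinearity(subgaussian) is g1)
```

With these test signals the criterion assigns the cubic contrast to the spiky (supergaussian) channel and tanh to the binary (subgaussian) one.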
In the first example, 2000 samples of three source signals and four mixtures are used. The initial sources are two positive-kurtotic electrocardiogram (ECG) signals, representing maternal and fetal heartbeats at frequencies slightly above 1 Hz and below 3 Hz, respectively, and an interfering 50 Hz sinusoid (negative kurtotic). In practice, this problem may be encountered if the electrocardiograph is disturbed by some unwanted interference due to grounding problems or by the contraction of some muscle other than the heart during the measurements. The mixing coefficient matrix A is randomly generated, and the observed mixtures are contaminated with additive gaussian noise (input SNR = 20 dB). The input SNR is computed by considering the fact (Attias, 1999) that the signal level in sensor i is E{(Σ_j A_ij s_j)²} = Σ_j (A_ij)², where E{·} denotes the expectation operator, E{s s^T} = I, and A is the initial mixing matrix. The corresponding noise level is E{v_i²} = R_ii. Averaging over the sensors, we obtain

    SNR_in = (1/n) Σ_{i=1}^{n} [ Σ_{j=1}^{m} (A_ij)² ] / R_ii.    (4.1)
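Equation 4.1 is straightforward to evaluate. In the sketch below, the identity mixing matrix and per-sensor noise variance of 0.01 are illustrative values chosen so that the result lands exactly at 20 dB:

```python
import numpy as np

def input_snr(A, R):
    """Eq. (4.1): average over sensors of signal power / noise power,
    assuming unit-power, uncorrelated sources (E{s s^T} = I)."""
    per_sensor = np.sum(A ** 2, axis=1) / np.diag(R)
    return np.mean(per_sensor)

A = np.eye(3)                 # illustrative mixing matrix (unit-norm rows)
R = 0.01 * np.eye(3)          # noise covariance: variance 0.01 per sensor
snr = input_snr(A, R)
print(10 * np.log10(snr))     # -> 20.0 (dB)
```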
The constants used in the contrast functions are a_1 = 1 and a_2 = 1/3. In order to select the appropriate nonlinearity, the criterion given in equation 3.7 is used with μ = 0.01. An example of separation from noisy mixtures is given in Figure 1. The measurements are projected in the direction of the signal subspace eigenvectors found through adaptive subspace tracking. The performance of the tracking method is studied by finding the maximum canonical angle between the estimated signal subspace eigenvectors and the true ones. If the maximum canonical angle is small, the subspaces are close to each other. We consider û and u to be the eigenvectors spanning the estimated and true signal subspaces, and we compute the singular value decomposition of û^T u. Let γ_1 > γ_2 > ··· > γ_m be the singular values of û^T u. The canonical angles between the basis vectors are θ_i(û, u) = cos^{−1} γ_i. The maximum canonical angles for the mixtures used in Figure 1 are shown in Figure 2. The results indicate that the subspace tracker needs fewer than 100 observations to converge close to the true signal subspace. The convergence of the predictor-corrector structure is studied using the prediction and estimation error covariance matrices. Typical plots of the traces of the error covariance matrices are given in Figure 3. The traces of the covariance matrices appear to settle rapidly to steady-state values. The noise attenuation capability is investigated quantitatively using the mean squared error (MSE) criterion. Computation of the MSE between the recovered and original sources over a large number of realizations is somewhat tedious because the order and sign ambiguities have to be resolved and the powers of the sources normalized. Then one has to find matching pairs of estimated and original sources and compute the MSE between matching
V. Koivunen, M. Enescu, and E. Oja
Figure 1: An example of blind separation from noisy mixtures. (a) Noise-free source signals as a function of time. (b) Noisy mixtures with zero-mean additive gaussian noise (SNR_in = 20 dB). (c) Separation result obtained by the proposed adaptive predictor-corrector structure.
Adaptive Algorithm for Blind Separation
Figure 2: Maximum canonical angle between estimated and true signal subspace basis vectors as a function of time index. The estimated basis converges to the true one rapidly.
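The canonical-angle diagnostic can be reproduced with a few lines of linear algebra. A minimal sketch, assuming `U_hat` and `U` hold orthonormal basis vectors in their columns; the function name is ours.

```python
import numpy as np

def max_canonical_angle(U_hat, U):
    """Largest canonical angle (in radians) between the subspaces spanned by
    the orthonormal columns of U_hat and U, via the SVD of U_hat' U."""
    gammas = np.linalg.svd(U_hat.T @ U, compute_uv=False)  # cosines of the angles
    gammas = np.clip(gammas, -1.0, 1.0)                    # guard against round-off
    return float(np.arccos(gammas.min()))                  # smallest cosine -> largest angle
```

Identical subspaces give an angle of 0, and orthogonal subspaces give pi/2, which is why the traces in Figure 2 start near pi/2 and fall toward 0 as the tracker converges.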
Figure 3: Typical trace of prediction error covariance matrix (upper) and estimation error covariance matrix (lower).
pairs yielding the minimum error. The output SNR between the true and estimated source signals $s_i$ and $y_i$ is given in terms of $\mathrm{MSE}(s_i, y_i)$ as follows:

$$ \mathrm{SNR}_{\mathrm{out}}(s_i, y_i) = -10 \log_{10} \mathrm{MSE}(s_i, y_i). \tag{4.2} $$
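Equation 4.2 presupposes that the sign ambiguity has been resolved and both signals normalized to unit power. A hedged sketch for a single matched source/estimate pair (the search over matching pairs described above is omitted); the function name is ours.

```python
import numpy as np

def output_snr_db(s, y):
    """Output SNR of equation 4.2 for one matched pair, after resolving the
    sign ambiguity and normalizing both signals to unit power.

    s, y : 1-D arrays holding the true and recovered source.
    """
    s = s / np.sqrt(np.mean(s ** 2))   # unit-power normalization
    y = y / np.sqrt(np.mean(y ** 2))
    if np.dot(s, y) < 0:               # resolve the sign ambiguity
        y = -y
    mse = np.mean((s - y) ** 2)
    return -10.0 * np.log10(mse)
```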
In order to characterize the convergence of the estimates as a function of sample size N, the output SNR for different N is given in Figure 4. Mixtures
Figure 4: Improvement in output SNR (dB) as a function of sample size. The results are averaged over 10 realizations and over all sources.
Figure 5: Output SNR (dB) as a function of input SNR (dB). The results are averaged over 20 independent trials and over all sources. The sources are (left) ECG signals and tone interference and (right) AR models of order 4.
of unit-amplitude AR(2) signals are observed. Similarly, the AR(2) model is employed in the state prediction matrix F. AR models are useful in a wide variety of applications, but the appropriate model class obviously depends on the problem at hand. The noise attenuation performance in terms of output SNR is studied at different input SNRs in Figure 5. The performance of the proposed predictor-corrector (P-C) structure is quantitatively compared to the Fast-ICA algorithm (Hyvärinen, 1997), where the score function can be obtained using the maximum likelihood principle, and to JADE (Cardoso, 1999). In the simulation presented in Figure 5 (left) the
Figure 6: Separated sources in the case of drastic change in the mixing system. Track of the sources is recovered rapidly.
initial sources are two ECG signals (supergaussian) and one sinusoidal tone interferer, and in Figure 5 (right) AR signals of order 4 are used. The output SNRs are computed using only the last 1500 samples (out of 2000) to illustrate the steady-state performance. An improvement of 2 to 3 dB in output SNR is obtained consistently in comparison to JADE and Fast-ICA. The AR model order in the F matrix is p = 3 in the first case and p = 4 in the second case. The order of the AR model could be estimated using AIC, for example. Our experiments indicate that the method is not sensitive to small errors in the model order and that if there is a severe model-order mismatch, the performance of the proposed method degrades to that of noise-free ICA methods.

4.1 Time-Varying Mixing Matrix. Tracking of a time-varying mixing system is considered next. Both a drastic change and a slowly time-varying mixing system are used in the simulations. In the first example, the sample size is N = 2000, and an abrupt change of the mixing matrix takes place at sample k = 1000. This is achieved by simply generating a new random mixing matrix. The sources are the same as in Figure 1a. The separated source signals are depicted in Figure 6. A subspace tracker not employing expression 3.1 reacts slowly to a drastic change in the mixing system, as illustrated in Figure 7a. The value of expression 3.1 reveals the fast change, as can be seen from Figures 6 and 7b. Tracking is recovered much faster because the weighting in the algorithm is changed so that the most recent observations are trusted more. In this example, fewer than 100 observations were needed to restore the track. The length of the sliding window used for calculating the sample covariance matrix $\hat{C}$ is 100 observations. The dimensionless threshold value of expression 3.1 describing the relative change in the cor-
Figure 7: (a) The value of equation 3.1 at different time indices when no “action” is taken. (b) The value of equation 3.1 when the weighting in the subspace tracker is reinitialized so that recent observations are trusted more. (c) Maximum canonical angle between estimated signal subspace basis vectors and theoretical ones.
relation matrix is set to 0.2 in all experiments. In the case of a drastic change, it is exceeded significantly, and if the mixing system changes slowly, the measured value remains well below the threshold. In blind separation, it is difficult to employ a theoretically sound statistical test strategy for selecting the threshold because the forms of the pdfs are unknown. Moreover, most such criteria would require additional computation for an eigenvector decomposition. In order to demonstrate the rapid convergence of the subspace tracker employing expression 3.1, the maximum canonical angle between the estimated and theoretical basis vectors is shown in Figure 7. The case of a slowly time-varying mixing matrix is also considered. A rotation matrix $T(\theta(k))$ is applied to the mixing matrix $A$ at time $k$:

$$ T(\theta(k)) = \begin{bmatrix} \cos(\theta(k)) & \sin(\theta(k)) & 0 & 0 \\ -\sin(\theta(k)) & \cos(\theta(k)) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. $$

In this simulation the same ECG and sinusoidal ($m = 3$) source signals of 2000 samples and $n = 4$ mixtures are used. They are contaminated with zero-
Figure 8: An example of blind separation using a slowly time-varying mixing system after time index k D 1000. (a) Separated source signals as a function of time using adaptive predictor-corrector structure. (b) Separation result obtained by using Fast-ICA.
mean white gaussian noise (SNR_in = 20 dB). A fixed, randomly generated mixing matrix $A$ was used for the first 1000 samples. For the next 1000 samples, the angle $\theta(k)$ in $T(\theta(k))$ is changed linearly from the initial value $\theta(1000) = 0$ to the final value $\theta(2000) = \pi/3$. The proposed predictor-corrector structure tracks the time-varying mixing system well, as can be seen in Figure 8a. The threshold value 0.2 of equation 3.1 is not exceeded; hence, subspace tracking works in its normal mode. The
Figure 9: Recovering the original sources with high fidelity: points 1430–1780 from the original signals (continuous line) and the recovered sources (dash-dot line).
results were compared to the Fast-ICA algorithm. It cannot separate the signals after the changes in the mixing system take place, as can be seen in Figure 8b. In order to illustrate qualitatively the recovery of the waveform of the ECG signals with high fidelity, the original noise-free sources and the separated sources are depicted in Figure 9.

4.2 Varying the Number of Signals. In many application areas, including communications, it is important to estimate the number of signals. For example, if a user turns a mobile phone on or off, a source may appear in or disappear from the system. We simulated this situation by turning one of the sources on and off. At high SNR, there is a clear pattern in the eigenvalues of the mixture covariance matrix. For low-SNR scenarios, information-theoretic criteria such as MDL give consistent estimates of the number of sources. This requires an additional singular value decomposition computation. One low-complexity solution is to use equation 3.1 in order to detect the change in the system. Changes in the number of sources (rank) cause large differences between the two covariance matrices, and expression 3.1 exceeds the threshold value significantly. Only then is the MDL criterion used to estimate the number of signals. In the simulations, MDL estimated the number of sources correctly and consistently even when SNR_in was only slightly larger than 0 dB. The performance of the proposed method is investigated using digital communications signals. Two white 4-quadrature amplitude modulation (QAM) signals and one 16-QAM signal
Figure 10: Communication signal separation result illustrated as QAM constellation points when one source is leaving. The first row shows three communications signals successfully separated, the second row shows the separation result after one source has disappeared, and the third row shows the separation result after a new source has appeared.
are randomly mixed and contaminated with gaussian noise (SNR_in = 15 dB). No intersymbol interference (ISI) is assumed. The matrix F is here an identity matrix. The total number of samples is 7000, and one of the 4-QAM sources is turned off in the interval 2500–5000. The number of receivers is four and is not changed during the simulation. From the signal-space diagram of the mixture signals, no QAM constellation can be distinguished. The constellation points are clearly visible for the separated signals. The result of the separation is presented in Figure 10. The number of samples needed to achieve separation depends, for example, on the SNR, the modulation types, and the number of sources. In this simulation, about 400 samples were needed.
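The data generation for this on/off experiment can be set up in a few lines. A sketch under stated assumptions: square QAM grids, real-valued random mixing, and an arbitrary noise level; it reproduces only the source and mixture generation, not the subspace tracking or the MDL-based detection.

```python
import numpy as np

rng = np.random.default_rng(0)

def qam_symbols(levels, n):
    """Random symbols from a square QAM grid (levels=2 -> 4-QAM, levels=4 -> 16-QAM)."""
    pts = np.arange(levels) * 2 - (levels - 1)   # e.g. [-1, 1] or [-3, -1, 1, 3]
    return rng.choice(pts, n) + 1j * rng.choice(pts, n)

N = 7000
S = np.vstack([qam_symbols(2, N), qam_symbols(2, N), qam_symbols(4, N)])
S[1, 2500:5000] = 0.0                  # one 4-QAM source switched off in this interval
A = rng.standard_normal((4, 3))        # four receivers, random (real) mixing
noise = rng.standard_normal((4, N)) + 1j * rng.standard_normal((4, N))
X = A @ S + 0.05 * noise               # observed mixtures
# While the source is off, the noise-free mixture covariance has rank 2 instead of 3,
# which is the eigenvalue pattern the rank-change detection relies on.
```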
5 Conclusion
In this article an adaptive algorithm for real-time separation of noisy time-varying sources is proposed. A state variable model and a predictor-corrector filter structure are used. The observed data are decorrelated adaptively by using a subspace tracking algorithm. Noise attenuation is achieved by modeling how the states (sources) evolve and trading off between the model and the new information conveyed by the measurements. A significant improvement (2–3 dB) in output SNR is achieved in the examples in comparison to two widely used batch algorithms. The type of nonlinearity used in separation is selected on-line based on the statistics of the estimated source signals. The adaptivity of the proposed algorithm is demonstrated in simulations where the mixing matrix changes rapidly or varies slowly in time. The simulation results indicate that the algorithm converges relatively quickly and has reliable performance in nonstationary and noisy environments.

References

Amari, S.-I., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351.
Amari, S.-I., & Cichocki, A. (1998). Adaptive blind signal processing—neural network approaches. Proceedings of the IEEE, 86(10), 2026–2048.
Attias, H. (1999). Independent factor analysis. Neural Computation, 11, 803–851.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Cardoso, J.-F. (1989). Source separation using higher order moments. In Proc. of the Intl. Conf. on Acoustics, Speech, and Signal Processing (pp. 2109–2112).
Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE, 86(10), 2009–2025.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36(3), 287–314.
Douglas, S. C., Cichocki, A., & Amari, S.-I. (1997). Multichannel blind separation and deconvolution of sources with arbitrary distributions. In IEEE Workshop on Neural Networks for Signal Processing (pp. 436–445).
Everson, R., & Roberts, J. S. (1999). Non-stationary independent component analysis. In International Conference on Artificial Neural Networks (pp. 503–508).
Gharbi, A. B., & Salam, F. M. (1997). Algorithms for blind signal separation and recovery in static and dynamic environments. IEEE International Symposium on Circuits and Systems, 1, 713–716.
Hayes, M. (1996). Statistical digital signal processing and modeling. New York: Wiley.
Haykin, S. (1996). Adaptive filter theory (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Hyvärinen, A. (1997). A family of fixed-point algorithms for independent component analysis. In Proc. of the Intl. Conf. on Acoustics, Speech, and Signal Processing (pp. 3917–3920).
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general non-linear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Karasalo, I. (1986). Estimating the covariance matrix by signal subspace averaging. IEEE T. Acoustics, Speech, and Signal Processing, 34(1), 8–12.
Karhunen, J., Cichocki, A., Kasprzak, W., & Pajunen, P. (1997). On neural blind separation with noise suppression and redundancy reduction. Int. Journal of Neural Systems, 8(2), 219–237.
Karhunen, J., Oja, E., Wang, L., Vigário, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE T. on Neural Networks, 8(3), 486–504.
Karhunen, J., & Pajunen, P. (1997). Blind source separation using least-squares type adaptive algorithms. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 3361–3364.
Koivunen, V., & Oja, E. (1999). Predictor-corrector structure for real-time blind separation from noisy mixtures. In ICA '99 (pp. 479–484).
Oja, E. (1997). The nonlinear PCA learning rule and independent component analysis—mathematical analysis. Neurocomputing, 17, 25–45.
Wax, M., & Kailath, T. (1985). Detection of signals by information theoretic criteria. IEEE T. Acoustics, Speech, and Signal Processing, 33(2), 387–392.
Zhang, L., & Cichocki, A. (1999). Blind separation of filtered sources using state-space approach. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 648–654). Cambridge, MA: MIT Press.

Received December 15, 1999; accepted November 22, 2000.
LETTER
Communicated by Michael Jordan
Robust Full Bayesian Learning for Radial Basis Networks

Christophe Andrieu*
Cambridge University Engineering Department, Cambridge CB2 1PZ, England

Nando de Freitas
Computer Science Division, University of California, Berkeley, CA 94720-1776, U.S.A.

Arnaud Doucet
Cambridge University Engineering Department, Cambridge CB2 1PZ, England

We propose a hierarchical full Bayesian model for radial basis networks. This model treats the model dimension (number of neurons), model parameters, regularization parameters, and noise parameters as unknown random variables. We develop a reversible-jump Markov chain Monte Carlo (MCMC) method to perform the Bayesian computation. We find that the results obtained using this method are not only better than the ones reported previously but also appear to be robust with respect to the prior specification. In addition, we propose a novel and computationally efficient reversible-jump MCMC simulated annealing algorithm to optimize neural networks. This algorithm enables us to maximize the joint posterior distribution of the network parameters and the number of basis functions. It performs a global search in the joint space of the parameters and number of parameters, thereby surmounting the problem of local minima to a large extent. We show that by calibrating the full hierarchical Bayesian prior, we can obtain the classical Akaike information criterion, Bayesian information criterion, and minimum description length model selection criteria within a penalized likelihood framework. Finally, we present a geometric convergence theorem for the algorithm with homogeneous transition kernel and a convergence theorem for the reversible-jump MCMC simulated annealing method.

1 Introduction
Buntine and Weigend (1991) and Mackay (1992) showed that a principled Bayesian learning approach to neural networks can lead to many improvements. In particular, Mackay showed that by approximating the distributions of the weights with gaussians and adopting smoothing priors, it is
* Authorship based on alphabetical order.
Neural Computation 13, 2359–2407 (2001) © 2001 Massachusetts Institute of Technology
C. Andrieu, N. de Freitas, and A. Doucet
possible to obtain estimates of the weights and output variances and to set the regularization coefficients automatically. Neal (1996) cast the net much further by introducing advanced Bayesian simulation methods, specifically the hybrid Monte Carlo method (Brass, Pendleton, Chen, & Robson, 1993), into the analysis of neural networks. Theoretically, he also proved that certain classes of priors for neural networks, whose number of hidden neurons tends to infinity, converge to gaussian processes. Bayesian sequential Monte Carlo methods have also been shown to provide good training results, especially in time-varying scenarios (de Freitas, Niranjan, Gee, & Doucet, 2000). An essential requirement of neural network training is the correct selection of the number of neurons. There have been three main approaches to this problem: penalized likelihood, predictive assessment, and growing and pruning techniques. In the penalized likelihood context, a penalty term is added to the likelihood function so as to limit the number of neurons, thereby avoiding overfitting. Classical examples of penalty terms include the well-known Akaike information criterion (AIC), Bayesian information criterion (BIC), and minimum description length (MDL) (Akaike, 1974; Schwarz, 1985; Rissanen, 1987). Penalized likelihood has also been used extensively to impose smoothing constraints by weight-decay priors (Hinton, 1987; Mackay, 1992) or functional regularizers that penalize high-frequency signal components (Girosi, Jones, & Poggio, 1995). In the predictive assessment approach, the data are split into a training set, a validation set, and possibly a test set. The key idea is to balance the bias and variance of the predictor by choosing the number of neurons so that the errors in each data set are of the same magnitude. The problem with the previous approaches, known as the model adequacy problem, is that they assume one knows which models to test.
To overcome this difficulty, various authors have proposed model selection methods whereby the number of neurons is set by growing and pruning algorithms. Examples of this class of algorithms include the upstart algorithm (Frean, 1990), cascade correlation (Fahlman & Lebiere, 1988), optimal brain damage (Le Cun, Denker, & Solla, 1990), and the resource allocating network (RAN) (Platt, 1991). A major shortcoming of these methods is that they lack robustness in that the results depend on several heuristically set thresholds. For argument's sake, let us consider the case of the RAN algorithm. A new radial basis function is added to the hidden layer each time an input in a novel region of the input space is found. Unfortunately, novelty is assessed in terms of two heuristically set thresholds. The center of the gaussian basis function is then placed at the location of the novel input, while its width depends on the distance between the novel input and the stored patterns. For improved efficiency, the amplitudes of the gaussians may be estimated with an extended Kalman filter (Kadirkamanathan & Niranjan, 1993). Yingwei, Sundararajan, and Saratchandran (1997) have extended the approach by proposing a simple pruning technique. Their strategy is to monitor the
Robust Full Bayesian Learning for Radial Basis
outputs of the gaussian basis functions continuously and compare them to a threshold. If a particular output remains below the threshold over a number of consecutive inputs, then the corresponding basis function is removed. Recently, Rios Insua and Müller (1998), Marrs (1998), and Holmes and Mallick (1998) have addressed the issue of selecting the number of hidden neurons with growing and pruning algorithms from a Bayesian perspective. In particular, they apply the reversible-jump Markov chain Monte Carlo (MCMC) algorithm of Green (1995; Richardson & Green, 1997) to feedforward sigmoidal networks and radial basis function (RBF) networks to obtain joint estimates of the number of neurons and the weights. Once again, their results indicate that it is advantageous to adopt the Bayesian framework and MCMC methods to perform model order selection. In this article, we also apply the reversible-jump MCMC simulation algorithm to RBF networks so as to compute the joint posterior distribution of the radial basis parameters and the number of basis functions. We advance this area of research in three important directions: We propose a hierarchical prior for RBF networks. That is, we adopt a full Bayesian model, which accounts for model order uncertainty and regularization, and show that the results appear to be robust with respect to the prior specification. We propose an automated growing and pruning reversible-jump MCMC optimization algorithm to choose the model order using the classical AIC, BIC, and MDL criteria. This algorithm estimates the maximum of the joint likelihood function of the radial basis parameters and the number of bases using a reversible-jump MCMC simulated annealing approach. It has the advantage of being computationally more efficient than the reversible-jump MCMC algorithm used to perform the integrations with the hierarchical full Bayesian model.
We derive a geometric convergence theorem for the homogeneous reversible-jump MCMC algorithm and a convergence theorem for the annealed reversible-jump MCMC algorithm. In section 2, we present the approximation model. In section 3, we formalize the Bayesian model and specify the prior distributions. Section 4 is devoted to Bayesian computation. We first propose an MCMC sampler to perform Bayesian inference when the number of basis functions is given. Subsequently, a reversible-jump MCMC algorithm is derived to deal with the case where the number of basis functions is unknown. A reversible-jump MCMC simulated annealing algorithm to perform stochastic optimization using the AIC, BIC, and MDL criteria is proposed in section 5. The convergence of the algorithms is established in section 6. The performance of the proposed algorithms is illustrated by computer simulations in section 7. Finally, some conclusions are drawn in section 8. Appendix A defines the notation used, and the proofs of convergence are given in appendix B.
2 Problem Statement
Many physical processes may be described by the following nonlinear, multivariate input-output mapping:

$$ y_t = f(x_t) + n_t, $$
where $x_t \in \mathbb{R}^d$ corresponds to a group of input variables, $y_t \in \mathbb{R}^c$ to the target variables, $n_t \in \mathbb{R}^c$ to an unknown noise process, and $t = \{1, 2, \ldots\}$ is an index variable over the data. In this context, the learning problem involves computing an approximation to the function $f$ and estimating the characteristics of the noise process given a set of $N$ input-output observations $O = \{x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N\}$. Typical examples include regression, where $y_{1:N,1:c}$^1 is continuous; classification, where $y$ corresponds to a group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration. When the exact nonlinear structure of the multivariate function $f$ cannot be established a priori, it may be synthesized as a combination of parameterized basis functions, that is,
$$ \hat{f}(x, \theta) = G_k\!\left( \theta_k;\, \left( \cdots \sum_j G_j\!\left( \theta_j;\, \sum_i G_i(\theta_i; x) \right) \cdots \right) \right), \tag{2.1} $$
where $G_i(x, \theta_i)$ denotes a multivariate basis function. These multivariate basis functions may be generated from univariate basis functions using radial basis, tensor product, or ridge construction methods. This type of modeling is often referred to as nonparametric regression because the number of basis functions is typically very large. Equation 2.1 encompasses a large number of nonlinear estimation methods, including projection pursuit regression, Volterra series, fuzzy inference systems, multivariate adaptive regression splines (MARS), and many artificial neural network paradigms such as functional link networks, multilayer perceptrons (MLPs), RBF networks, wavelet networks, and hinging hyperplanes (see, e.g., Cheng & Titterington, 1994; de Freitas, 1999; Denison, Mallick, & Smith, 1998; Holmes & Mallick, 2000). For the purposes of this article, we adopt the approximation scheme of Holmes and Mallick (1998), consisting of a mixture of $k$ RBFs and a linear^1
1. $y_{1:N,1:c}$ is an $N \times c$ matrix, where $N$ is the number of data and $c$ the number of outputs. We adopt the notation $y_{1:N,j} \triangleq (y_{1,j}, y_{2,j}, \ldots, y_{N,j})'$ to denote all the observations corresponding to the $j$th output (the $j$th column of $y$). To simplify the notation, $y_t$ is equivalent to $y_{t,1:c}$. That is, if one index does not appear, it is implied that we are referring to all of its possible values. Similarly, $y$ is equivalent to $y_{1:N,1:c}$. We favor the shorter notation but invoke the longer notation to avoid ambiguities and emphasize certain dependencies.
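This slicing convention maps directly onto array indexing; for instance, in NumPy terms (our illustration, not part of the paper):

```python
import numpy as np

# y is an N x c target matrix, mirroring the paper's y_{1:N,1:c} convention.
N, c = 5, 2
y = np.arange(N * c, dtype=float).reshape(N, c)

y_col1 = y[:, 0]   # y_{1:N,1}: all N observations of the first output
y_t = y[2, :]      # y_{t,1:c} for t = 3 (0-based index 2): one observation, all outputs
```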
Figure 1: Approximation model with three RBFs, two inputs, and two outputs. The solid lines indicate weighted connections.
regression term. However, the work can be straightforwardly extended to other regression models. More precisely, our model $\mathcal{M}$ is:

$$ \mathcal{M}_0: \quad y_t = b + \beta' x_t + n_t, \qquad k = 0 $$
$$ \mathcal{M}_k: \quad y_t = \sum_{j=1}^{k} a_j\, \varphi(\|x_t - \mu_j\|) + b + \beta' x_t + n_t, \qquad k \geq 1, $$
where $\|\cdot\|$ denotes a distance metric (usually Euclidean or Mahalanobis), $\mu_j \in \mathbb{R}^d$ denotes the $j$th RBF center for a model with $k$ RBFs, $a_j \in \mathbb{R}^c$ the $j$th RBF amplitude, and $b \in \mathbb{R}^c$ and $\beta \in \mathbb{R}^d \times \mathbb{R}^c$ the linear regression parameters. The noise sequence $n_t \in \mathbb{R}^c$ is assumed to be zero-mean white gaussian. Although we have not explicitly indicated the dependency of $b$, $\beta$, and $n_t$ on $k$, these parameters are indeed affected by the value of $k$. Figure 1 depicts the approximation model for $k = 3$, $c = 2$, and $d = 2$ ($c$ is the number of outputs, $d$ is the number of inputs, and $k$ is the number of basis functions). Depending on our a priori knowledge about the smoothness of the mapping, we can choose different types of basis functions (Girosi et al., 1995). The most common choices are:

Linear: $\varphi(\varrho) = \varrho$
Cubic: $\varphi(\varrho) = \varrho^3$
Thin plate spline: $\varphi(\varrho) = \varrho^2 \ln(\varrho)$
Multiquadric: $\varphi(\varrho) = (\varrho^2 + \lambda^2)^{1/2}$
Gaussian: $\varphi(\varrho) = \exp(-\lambda \varrho^2)$
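These choices are easy to collect behind a common interface. A small sketch; `make_basis` is our naming, `lam` plays the role of the user-set parameter λ, and the guard in the thin-plate case avoids evaluating log at zero:

```python
import numpy as np

def make_basis(kind, lam=1.0):
    """Common RBF choices; rho is the distance ||x - mu||, lam a user parameter."""
    return {
        "linear": lambda rho: rho,
        "cubic": lambda rho: rho ** 3,
        # rho^2 ln(rho), with the removable singularity at rho = 0 set to 0
        "thin_plate": lambda rho: np.where(
            rho > 0, rho ** 2 * np.log(np.maximum(rho, 1e-300)), 0.0),
        "multiquadric": lambda rho: np.sqrt(rho ** 2 + lam ** 2),
        "gaussian": lambda rho: np.exp(-lam * rho ** 2),
    }[kind]
```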
For the last two choices of basis functions, we treat $\lambda$ as a user-set parameter. For convenience, we express our approximation model in vector-matrix form:

$$ y = D\!\left( \mu_{1:k,1:d},\, x_{1:N,1:d} \right) \alpha_{1:(1+d+k),1:c} + n_t, \tag{2.2} $$

that is:

$$ \begin{bmatrix} y_{1,1} & \cdots & y_{1,c} \\ y_{2,1} & \cdots & y_{2,c} \\ \vdots & & \vdots \\ y_{N,1} & \cdots & y_{N,c} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} & \varphi(x_1, \mu_1) & \cdots & \varphi(x_1, \mu_k) \\ 1 & x_{2,1} & \cdots & x_{2,d} & \varphi(x_2, \mu_1) & \cdots & \varphi(x_2, \mu_k) \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & x_{N,1} & \cdots & x_{N,d} & \varphi(x_N, \mu_1) & \cdots & \varphi(x_N, \mu_k) \end{bmatrix} \begin{bmatrix} b_1 & \cdots & b_c \\ \beta_{1,1} & \cdots & \beta_{1,c} \\ \vdots & & \vdots \\ \beta_{d,1} & \cdots & \beta_{d,c} \\ a_{1,1} & \cdots & a_{1,c} \\ \vdots & & \vdots \\ a_{k,1} & \cdots & a_{k,c} \end{bmatrix} + n_{1:N}, $$
where the noise process is assumed to be normally distributed as follows: $n_t \sim \mathcal{N}(0_{c \times 1}, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_c^2))$. Once again, we stress that $\sigma^2$ depends implicitly on the model order $k$. We assume here that the number $k$ of basis functions and their parameters $\theta \triangleq \{\alpha_{1:m,1:c}, \mu_{1:k,1:d}, \sigma_{1:c}^2\}$, with $m = 1 + d + k$, are unknown. Given the data set $\{x, y\}$, our objective is to estimate $k$ and $\theta \in \Theta_k$.

3 Bayesian Model and Aims
We follow a Bayesian approach where the unknowns $k$ and $\theta$ are regarded as being drawn from appropriate prior distributions. These priors reflect our degree of belief in the relevant values of these quantities (Bernardo & Smith, 1994). Furthermore, we adopt a hierarchical prior structure that enables us to treat the priors' parameters (hyperparameters) as random variables drawn from suitable distributions (hyperpriors). That is, instead of fixing the hyperparameters arbitrarily, we acknowledge that there is an inherent uncertainty in what we think their values should be. By devising probabilistic models that deal with this uncertainty, we are able to implement estimation techniques that are robust to the specification of the hyperpriors.
The remainder of the section is organized as follows. First, we propose a hierarchical model prior that defines a probability distribution over the space of possible structures of the data. Subsequently, we specify the estimation and inference aims. Finally, we exploit the analytical properties of the model to obtain an expression, up to a normalizing constant, of the joint posterior distribution of the basis centers and their number.

3.1 Prior Distributions. The overall parameter space $\Theta \times \Psi$ can be written as a finite union of subspaces $\Theta \times \Psi = \left( \cup_{k=0}^{k_{\max}} \{k\} \times \Theta_k \right) \times \Psi$, where $\Theta_0 = (\mathbb{R}^{d+1})^c \times (\mathbb{R}^+)^c$ and $\Theta_k = (\mathbb{R}^{d+1+k})^c \times (\mathbb{R}^+)^c \times \Omega_k$ for $k \in \{1, \ldots, k_{\max}\}$. That is, $\alpha \in (\mathbb{R}^{d+1+k})^c$, $\sigma \in (\mathbb{R}^+)^c$, and $\mu \in \Omega_k$. The hyperparameter space $\Psi = (\mathbb{R}^+)^{c+1}$, with elements $\psi \triangleq \{\Lambda, \delta^2\}$, will be discussed at the end of this section. The space $\Omega_k$ of the radial basis centers is defined as a compact set that encompasses the input data: $\Omega_k \triangleq \{\mu;\; \mu_{1:k,i} \in [\min(x_{1:N,i}) - \iota\Xi_i,\, \max(x_{1:N,i}) + \iota\Xi_i]^k$ for $i = 1, \ldots, d$, with $\mu_{j,i} \neq \mu_{l,i}$ for $j \neq l\}$. $\Xi_i = \|\max(x_{1:N,i}) - \min(x_{1:N,i})\|$ denotes the Euclidean distance for the $i$th dimension of the input, and $\iota$ is a user-specified parameter that we need to consider only if we wish to place basis functions outside the region where the input data lie. That is, we allow $\Omega_k$ to include the space of the input data and extend it by a factor proportional to the spread of the input data. Typically, researchers either set $\iota$ to zero and choose the basis centers from the input data (Holmes & Mallick, 1998; Kadirkamanathan & Niranjan, 1993) or compute the basis centers using clustering algorithms (Moody & Darken, 1988). This strategy is also exploited within the support vector paradigm (Vapnik, 1995). The premise here is that it is better to place the basis functions where the data are dense, not in regions of extrapolation.
Moreover, if the input space is very large, then placing basis functions where the data lie reduces the space over which one has to sample the basis locations. However, when one adopts global basis functions, it is no longer clear that this is the case. In fact, if there are outliers, it might be a bad strategy to place basis functions where the data lie. After some experimentation and taking these trade-offs into consideration, we chose to sample the basis centers from the space $\Omega_k$, whose hypervolume is $\mathcal{V} \triangleq \left( \prod_{i=1}^{d} (1 + 2\iota)\Xi_i \right)^k$. Figure 2 shows this space for a two-dimensional input. The maximum number of basis functions is defined as $k_{\max} \triangleq N - (d + 1)$.^2 We also define $\Omega \triangleq \cup_{k=0}^{k_{\max}} \{k\} \times \Omega_k$, with $\Omega_0 \triangleq \emptyset$. There is a natural hierarchical structure to this setup (Richardson & Green, 1997), which we formalize by modeling the joint distribution of all variables as
2. The constraint $k \leq N - (d + 1)$ is added because otherwise the columns of $D(\mu_{1:k}, x)$ are linearly dependent and the parameters $\theta$ may not be uniquely estimated from the data (see equation 2.2).
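The design matrix $D(\mu_{1:k}, x)$ referred to above stacks a constant column, the raw inputs, and one basis column per center (equation 2.2). A sketch with illustrative names; `phi` is any of the basis functions listed in section 2:

```python
import numpy as np

def design_matrix(mu, x, phi):
    """Build D(mu, x): [1, x, phi(||x_t - mu_j||)] row by row.

    x  : (N, d) inputs, mu : (k, d) centres, phi : scalar/array basis function.
    """
    dists = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=2)  # (N, k) distances
    return np.hstack([np.ones((x.shape[0], 1)), x, phi(dists)])
```

Given `D`, the amplitudes for a fixed set of centres could be fitted by least squares (`np.linalg.lstsq(D, y)`), which is also the quantity regularized by the g-prior introduced in section 3.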
C. Andrieu, N. de Freitas, and A. Doucet
Figure 2: RBF centers space for a two-dimensional input. The circles represent the input data.
$$p(k, \theta, \psi, y \mid x) = p(y \mid k, \theta, \psi, x)\, p(\theta \mid k, \psi, x)\, p(k, \psi \mid x),$$
where $p(k, \psi \mid x)$ is the joint model order and hyperparameters' probability, $p(\theta \mid k, \psi, x)$ is the parameters' prior, and $p(y \mid k, \theta, \psi, x)$ is the likelihood. Under the assumption of independent outputs given $(k, \theta)$, the likelihood for the approximation model described in the previous section is:
$$p(y \mid k, \theta, \psi, x) = \prod_{i=1}^{c} p(y_{1:N,i} \mid k, \alpha_{1:m,i}, \mu_{1:k}, \sigma_i^2, x)$$
$$= \prod_{i=1}^{c} (2\pi\sigma_i^2)^{-N/2} \exp\left(-\frac{1}{2\sigma_i^2}\bigl(y_{1:N,i} - D(\mu_{1:k}, x)\,\alpha_{1:m,i}\bigr)'\bigl(y_{1:N,i} - D(\mu_{1:k}, x)\,\alpha_{1:m,i}\bigr)\right).$$
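The likelihood above can be evaluated directly once the design matrix is formed. The sketch below assumes a Gaussian radial basis with a fixed width, plus bias and linear columns, for $D(\mu_{1:k}, x)$; the basis type, the width, and the single output dimension are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def design_matrix(mu, x, width=1.0):
    """D(mu_{1:k}, x): bias and linear columns plus one radial basis column
    per center; the Gaussian basis and fixed width are illustrative choices."""
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.ones((len(x), 1)), x, np.exp(-0.5 * d2 / width ** 2)])

def log_likelihood(y, D, alpha, sigma2):
    """log p(y_{1:N,i} | k, alpha, mu, sigma_i^2, x) for one output."""
    r = y - D @ alpha
    return -0.5 * len(y) * np.log(2.0 * np.pi * sigma2) - 0.5 * (r @ r) / sigma2

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(20, 2))
mu = rng.uniform(0, 1, size=(3, 2))
D = design_matrix(mu, x)                   # N x m with m = 1 + d + k = 6
alpha = rng.normal(size=6)
y = D @ alpha                              # noise-free targets for the check
ll_true = log_likelihood(y, D, alpha, sigma2=0.1)
ll_bad = log_likelihood(y, D, alpha + 1.0, sigma2=0.1)
```

As a sanity check, the log-likelihood is larger at the generating coefficients than at a perturbed value.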
We assume the following structure for the prior distribution:
$$p(k, \theta, \psi) = p(\alpha_{1:m} \mid k, \mu_{1:k}, \sigma^2, \Lambda, \delta^2)\, p(\mu_{1:k} \mid k, \sigma^2, \Lambda, \delta^2)\, p(k \mid \sigma^2, \Lambda, \delta^2)\, p(\sigma^2 \mid \Lambda, \delta^2)\, p(\Lambda, \delta^2)$$
$$= p(\alpha_{1:m} \mid k, \mu_{1:k}, \sigma^2, \delta^2)\, p(\mu_{1:k} \mid k)\, p(k \mid \Lambda)\, p(\sigma^2)\, p(\Lambda)\, p(\delta^2), \qquad (3.1)$$
where the scale parameters $\sigma_i^2$, $i = 1, \ldots, c$, are assumed to be independent of the hyperparameters (i.e., $p(\sigma^2 \mid \Lambda, \delta^2) = p(\sigma^2)$), independent of each other ($p(\sigma^2) = \prod_{i=1}^{c} p(\sigma_i^2)$), and distributed according to conjugate inverse-gamma prior distributions $\sigma_i^2 \sim \mathcal{IG}(\frac{\nu_0}{2}, \frac{\gamma_0}{2})$. When $\nu_0 = 0$ and $\gamma_0 = 0$,
Robust Full Bayesian Learning for Radial Basis
we obtain Jeffreys's uninformative prior $p(\sigma_i^2) \propto 1/\sigma_i^2$ (Bernardo & Smith, 1994). Given $\sigma^2$, we introduce the following prior distribution:
$$p(k, \alpha_{1:m}, \mu_{1:k} \mid \sigma^2, \Lambda, \delta^2) = p(\alpha_{1:m} \mid k, \mu_{1:k}, \sigma^2, \delta^2)\, p(\mu_{1:k} \mid k)\, p(k \mid \Lambda)$$
$$= \left[\prod_{i=1}^{c} |2\pi\sigma_i^2 \Sigma_i|^{-1/2} \exp\left(-\frac{1}{2\sigma_i^2}\,\alpha'_{1:m,i}\,\Sigma_i^{-1}\,\alpha_{1:m,i}\right)\right] \times \frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\Im^k} \left[\frac{\Lambda^k/k!}{\sum_{j=0}^{k_{\max}} \Lambda^j/j!}\right],$$
where $\Sigma_i^{-1} = \delta_i^{-2}\, D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x)$ and $\mathbb{I}_{\Omega}(k, \mu_{1:k})$ is the indicator function of the set $\Omega$ (1 if $(k, \mu_{1:k}) \in \Omega$, 0 otherwise). The prior model order distribution $p(k \mid \Lambda)$ is a truncated Poisson distribution. Conditional on $k$, the RBF centers are uniformly distributed. Finally, conditional on $(k, \mu_{1:k})$, the coefficients $\alpha_{1:m,i}$ are assumed to be zero-mean gaussian with variance $\sigma_i^2 \Sigma_i$. The terms $\delta^2 \in (\mathbb{R}^+)^c$ and $\Lambda \in \mathbb{R}^+$ can be respectively interpreted as the expected signal-to-noise ratios and the expected number of radial basis functions. The prior for the coefficients has been previously advocated by various authors (George & Foster, 1997; Smith & Kohn, 1996). It corresponds to the popular g-prior distribution (Zellner, 1986) and can be derived using a maximum entropy approach (Andrieu, 1998). An important property of this prior is that it penalizes basis functions for being too close since, in this situation, the determinant of $\Sigma_i^{-1}$ tends to zero. We now turn our attention to the hyperparameters, which allow us to accomplish our goal of designing robust model selection schemes. We assume that they are independent of each other, that is, $p(\Lambda, \delta^2) = p(\Lambda)\, p(\delta^2)$. Moreover, $p(\delta^2) = \prod_{i=1}^{c} p(\delta_i^2)$. As $\delta^2$ is a scale parameter, we ascribe a vague conjugate prior density to it: $\delta_i^2 \sim \mathcal{IG}(\alpha_{\delta^2}, \beta_{\delta^2})$ for $i = 1, \ldots, c$, with $\alpha_{\delta^2} = 2$ and $\beta_{\delta^2} > 0$. The variance of this hyperprior with $\alpha_{\delta^2} = 2$ is infinite. We apply the same method to $\Lambda$ by setting an uninformative conjugate prior (Bernardo & Smith, 1994): $\Lambda \sim \mathcal{G}a(\frac{1}{2} + \varepsilon_1, \varepsilon_2)$ ($\varepsilon_i \ll 1$, $i = 1, 2$). We can visualize our hierarchical prior (see equation 3.1) with a directed acyclic graphical model (DAG), as shown in Figure 3.

3.2 Estimation and Inference Aims. The Bayesian inference of $k$, $\theta$, and $\psi$ is based on the joint posterior distribution $p(k, \theta, \psi \mid x, y)$ obtained from Bayes's theorem.
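Viewed generatively, the hierarchical prior of section 3.1 can be sampled ancestrally. A minimal single-output sketch follows; the Gaussian basis, its width, and the hyperparameter values $\nu_0 = \gamma_0 = 2$ are illustrative assumptions (the Jeffreys limit $\nu_0 = \gamma_0 = 0$ cannot be sampled directly):

```python
import math
import numpy as np

def design(mu, x, width=0.3):
    """Bias + linear + Gaussian RBF columns (illustrative basis choice)."""
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.ones((len(x), 1)), x, np.exp(-0.5 * d2 / width ** 2)])

def sample_prior(x, kmax, rng, eps1=0.01, eps2=0.01,
                 a_d2=2.0, b_d2=10.0, nu0=2.0, gamma0=2.0):
    """Ancestral draw (Lambda, k, mu, delta^2, sigma^2, alpha) for one output."""
    lam = rng.gamma(0.5 + eps1, 1.0 / eps2)            # Lambda ~ Ga(1/2+e1, e2)
    ks = np.arange(kmax + 1)                           # truncated Poisson on k
    logw = ks * math.log(lam) - np.array([math.lgamma(j + 1.0) for j in ks])
    w = np.exp(logw - logw.max()); w = w / w.sum()
    k = int(rng.choice(ks, p=w))
    lo, hi = x.min(axis=0), x.max(axis=0)
    mu = rng.uniform(lo, hi, size=(k, x.shape[1]))     # centers uniform on Omega_k
    delta2 = 1.0 / rng.gamma(a_d2, 1.0 / b_d2)         # IG draw via reciprocal Gamma
    sigma2 = 1.0 / rng.gamma(nu0 / 2.0, 2.0 / gamma0)
    D = design(mu, x)
    Sigma = delta2 * np.linalg.inv(D.T @ D)            # g-prior covariance
    alpha = rng.multivariate_normal(np.zeros(D.shape[1]), sigma2 * Sigma)
    return k, mu, delta2, sigma2, alpha

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(30, 2))
k, mu, delta2, sigma2, alpha = sample_prior(x, kmax=10, rng=rng)
```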
Our aim is to estimate this joint distribution from which, by standard probability marginalization and transformation techniques, one can “theoretically” obtain all posterior features of interest. For instance, we might wish to perform inference with the predictive density:
$$p(y_{N+1} \mid x_{1:N+1}, y_{1:N}) = \int_{\Theta \times \Psi} p(y_{N+1} \mid k, \theta, \psi, x_{N+1})\, p(k, \theta, \psi \mid x_{1:N}, y_{1:N})\, dk\, d\theta\, d\psi,$$
Figure 3: Directed acyclic graphical model for our prior.
and consequently make predictions, such as:
$$\mathbb{E}(y_{N+1} \mid x_{1:N+1}, y_{1:N}) = \int_{\Theta \times \Psi} D(\mu_{1:k}, x_{N+1})\,\alpha_{1:m}\, p(k, \theta, \psi \mid x_{1:N}, y_{1:N})\, dk\, d\theta\, d\psi.$$
We might also be interested in evaluating the posterior model probabilities $p(k \mid x, y)$, which can be used to perform model selection by selecting the model order as $\arg\max_{k \in \{0, \ldots, k_{\max}\}} p(k \mid x, y)$. In addition, it allows us to perform parameter estimation by computing, for example, the conditional expectation $\mathbb{E}(\theta \mid k, x, y)$. However, it is not possible to obtain these quantities analytically, as this requires the evaluation of high-dimensional integrals of nonlinear functions in the parameters, as we shall see in the following section. We propose here to use an MCMC method to perform Bayesian computation. MCMC techniques were introduced in the mid-1950s in statistical physics and started appearing in the fields of applied statistics, signal processing, and neural networks in the 1980s and 1990s (Holmes & Mallick, 1998; Neal, 1996; Rios Insua & Müller, 1998; Robert & Casella, 1999; Tierney, 1994). The key idea is to build an ergodic Markov chain $(k^{(i)}, \theta^{(i)}, \psi^{(i)})_{i \in \mathbb{N}}$ whose equilibrium distribution is the desired posterior distribution. Under weak additional assumptions, the $P \gg 1$ samples generated by the Markov chain are asymptotically distributed according to the posterior distribution and thus allow
easy evaluation of all posterior features of interest—for example:
$$\hat{p}(k = j \mid x, y) = \frac{1}{P}\sum_{i=1}^{P} \mathbb{I}_{\{j\}}(k^{(i)})$$
and
$$\hat{\mathbb{E}}(\theta \mid k = j, x, y) = \frac{\sum_{i=1}^{P} \theta^{(i)}\, \mathbb{I}_{\{j\}}(k^{(i)})}{\sum_{i=1}^{P} \mathbb{I}_{\{j\}}(k^{(i)})}. \qquad (3.2)$$
In addition, we can obtain predictions, such as:
$$\hat{\mathbb{E}}(y_{N+1} \mid x_{1:N+1}, y_{1:N}) = \frac{1}{P}\sum_{i=1}^{P} D(\mu_{1:k}^{(i)}, x_{N+1})\,\alpha_{1:m}^{(i)}.$$
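These estimators are simple averages over the chain output. A sketch with a toy chain of model orders and parameter draws (the numbers are made up for illustration):

```python
import numpy as np

def posterior_k_probs(k_chain, kmax):
    """hat p(k = j | x, y): empirical frequencies of the sampled model orders."""
    counts = np.bincount(np.asarray(k_chain), minlength=kmax + 1)
    return counts / len(k_chain)

def conditional_mean(theta_chain, k_chain, j):
    """hat E(theta | k = j, x, y): average of theta over sweeps with k = j."""
    sel = [t for t, k in zip(theta_chain, k_chain) if k == j]
    return np.mean(sel, axis=0)

k_chain = [2, 3, 2, 2, 3]
theta_chain = [np.array([1.0]), np.array([9.0]), np.array([2.0]),
               np.array([3.0]), np.array([7.0])]
probs = posterior_k_probs(k_chain, kmax=4)
m2 = conditional_mean(theta_chain, k_chain, j=2)
```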
3.3 Integration of the Nuisance Parameters. The proposed Bayesian model allows for the integration of the so-called nuisance parameters, $\alpha_{1:m}$ and $\sigma^2$, and hence for obtaining an expression for $p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)$ up to a normalizing constant. By applying Bayes's theorem, multiplying the exponential terms of the likelihood and coefficients prior, and completing squares, we obtain the following expression for the full posterior distribution:
$$p(k, \alpha_{1:m}, \mu_{1:k}, \sigma^2, \Lambda, \delta^2 \mid x, y) \propto \left[\prod_{i=1}^{c} (2\pi\sigma_i^2)^{-N/2} \exp\left(-\frac{1}{2\sigma_i^2}\, y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}\right)\right]$$
$$\times \left[\prod_{i=1}^{c} |2\pi\sigma_i^2 \Sigma_i|^{-1/2} \exp\left(-\frac{1}{2\sigma_i^2}\bigl(\alpha_{1:m,i} - h_{i,k}\bigr)'\, M_{i,k}^{-1}\, \bigl(\alpha_{1:m,i} - h_{i,k}\bigr)\right)\right]$$
$$\times \frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\Im^k} \left[\frac{\Lambda^k/k!}{\sum_{j=0}^{k_{\max}} \Lambda^j/j!}\right] \left[\prod_{i=1}^{c} (\sigma_i^2)^{-(\frac{\nu_0}{2}+1)} \exp\left(-\frac{\gamma_0}{2\sigma_i^2}\right)\right]$$
$$\times \left[\prod_{i=1}^{c} (\delta_i^2)^{-(\alpha_{\delta^2}+1)} \exp\left(-\frac{\beta_{\delta^2}}{\delta_i^2}\right)\right] \Lambda^{(\varepsilon_1 - 1/2)} \exp(-\varepsilon_2 \Lambda),$$
where
$$M_{i,k}^{-1} = D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x) + \Sigma_i^{-1}, \qquad h_{i,k} = M_{i,k}\, D'(\mu_{1:k}, x)\, y_{1:N,i},$$
$$P_{i,k} = I_N - D(\mu_{1:k}, x)\, M_{i,k}\, D'(\mu_{1:k}, x).$$
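The matrices $M_{i,k}$, $h_{i,k}$, and $P_{i,k}$ recur throughout the algorithms below. A sketch computing them for one output under the g-prior $\Sigma_i^{-1} = \delta_i^{-2} D'D$ (so that $M_{i,k}^{-1} = (1 + 1/\delta_i^2)\, D'D$); the design matrix here is a random stand-in:

```python
import numpy as np

def posterior_matrices(D, y, delta2):
    """M_{i,k}, h_{i,k}, P_{i,k} for one output under the g-prior
    Sigma_i^{-1} = delta_i^{-2} D'D, so M^{-1} = (1 + 1/delta2) D'D."""
    M = np.linalg.inv(D.T @ D * (1.0 + 1.0 / delta2))
    h = M @ D.T @ y
    P = np.eye(len(y)) - D @ M @ D.T
    return M, h, P

rng = np.random.default_rng(4)
D = rng.normal(size=(15, 4))                 # random stand-in design matrix
y = rng.normal(size=15)
M, h, P = posterior_matrices(D, y, delta2=100.0)
quad = y @ P @ y                             # y' P_{i,k} y, used throughout
```

Unlike the least-squares projection, $P_{i,k}$ here is a shrunk version, so the quadratic form $y' P_{i,k}\, y$ stays strictly positive.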
We can now integrate with respect to $\alpha_{1:m}$ (gaussian distribution) and $\sigma_i^2$ (inverse gamma distribution) to obtain the following expression for the posterior:
$$p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y) \propto \left[\prod_{i=1}^{c} (1 + \delta_i^2)^{-m/2} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{2}\right)^{-\frac{N+\nu_0}{2}}\right]$$
$$\times \frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\Im^k} \left[\frac{\Lambda^k/k!}{\sum_{j=0}^{k_{\max}} \Lambda^j/j!}\right] \left[\prod_{i=1}^{c} (\delta_i^2)^{-(\alpha_{\delta^2}+1)} \exp\left(-\frac{\beta_{\delta^2}}{\delta_i^2}\right)\right]$$
$$\times \Lambda^{(\varepsilon_1 - 1/2)} \exp(-\varepsilon_2 \Lambda). \qquad (3.3)$$
It is worth noticing that the posterior distribution is highly nonlinear in the RBF centers $\mu_{1:k}$ and that an expression for $p(k \mid x, y)$ cannot be obtained in closed form.

4 Bayesian Computation
For clarity, we first assume that $k$ is given. After dealing with this fixed-dimension scenario, we present an algorithm where $k$ is treated as an unknown random variable.

4.1 MCMC Sampler for Fixed Dimension. We propose the following hybrid MCMC sampler, which combines Gibbs steps and Metropolis-Hastings (MH) steps (Gilks, Richardson, & Spiegelhalter, 1996; Tierney, 1994):

Fixed-Dimension MCMC Algorithm
1. Initialization. Fix the value of $k$, set $(\theta^{(0)}, \psi^{(0)})$, and set $i = 1$.
2. Iteration $i$.
   For $j = 1, \ldots, k$:
   — Sample $u \sim \mathcal{U}_{[0,1]}$.
   — If $u < \varpi$, perform an MH step admitting $p(\mu_{j,1:d} \mid x, y, \mu_{-j,1:d}^{(i)})$ as invariant distribution and $q_1(\mu^\star_{j,1:d} \mid \mu_{j,1:d}^{(i)})$ as proposal distribution.
   — Else perform an MH step using $p(\mu_{j,1:d} \mid x, y, \mu_{-j,1:d}^{(i)})$ as invariant distribution and $q_2(\mu^\star_{j,1:d} \mid \mu_{j,1:d}^{(i)})$ as proposal distribution.
   End For.
   Sample the nuisance parameters $(\alpha_{1:m}^{(i)}, \sigma^{2(i)})$ using equations 4.1 and 4.2.
   Sample the hyperparameters $(\Lambda^{(i)}, \delta^{2(i)})$ using equations 4.3 and 4.4.
3. $i \leftarrow i + 1$ and go to 2.
The simulation parameter $\varpi$ is a real number satisfying $0 < \varpi < 1$. Its value indicates our belief about which proposal distribution leads to faster convergence. If we have no preference for a particular proposal, we can set it to 0.5. The various steps of the algorithm are detailed in the following sections. In order to simplify the notation, we drop the superscript $(i)$ from all variables at iteration $i$.

4.1.1 Updating the RBF Centers. Sampling the RBF centers is difficult because the distribution is nonlinear in these parameters. We have chosen here to sample them one at a time using a mixture of MH steps. An MH step of invariant distribution, say $\pi(z)$, and proposal distribution, say $q(z^\star \mid z)$, involves sampling a candidate value $z^\star$ given the current value $z$ according to $q(z^\star \mid z)$. The Markov chain then moves toward $z^\star$ with probability $\mathcal{A}(z, z^\star) = \min\{1, [\pi(z)\, q(z^\star \mid z)]^{-1}\, \pi(z^\star)\, q(z \mid z^\star)\}$; otherwise, it remains equal to $z$. This algorithm is very general, but to perform well in practice, it is necessary to use "clever" proposal distributions to avoid rejecting too many candidates. According to equation 3.3, the target distribution is the full conditional distribution of a basis center:
$$p(\mu_{j,1:d} \mid x, y, \mu_{-j,1:d}) \propto \left[\prod_{i=1}^{c} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{2}\right)^{-\frac{N+\nu_0}{2}}\right] \mathbb{I}_{\Omega}(k, \mu_{1:k}),$$
where $\mu_{-j,1:d}$ denotes $\{\mu_{1,1:d}, \mu_{2,1:d}, \ldots, \mu_{j-1,1:d}, \mu_{j+1,1:d}, \ldots, \mu_{k,1:d}\}$. With probability $0 < \varpi < 1$, the proposal $q_1(\mu^\star_{j,1:d} \mid \mu_{j,1:d})$ corresponds to sampling a basis center uniformly at random from the interval $[\min(x_{1:N,i}) - \iota\,\Xi_i, \max(x_{1:N,i}) + \iota\,\Xi_i]$ for $i = 1, \ldots, d$. The motivation for using such a proposal distribution is that the regions where the data are dense are reached quickly. Alternatively, with probability $1 - \varpi$, we perform an MH step with proposal distribution $q_2(\mu^\star_{j,1:d} \mid \mu_{j,1:d})$:
$$\mu^\star_{j,1:d} \mid \mu_{j,1:d} \sim \mathcal{N}(\mu_{j,1:d}, \sigma^2_{\mathrm{RW}} I_d).$$
This proposal distribution yields a candidate $\mu^\star_{j,1:d}$ that is a perturbation of the current center. The perturbation is a zero-mean gaussian random variable with variance $\sigma^2_{\mathrm{RW}} I_d$. This random walk is introduced to perform a
local exploration of the posterior distribution. In both cases, the acceptance probability is given by:
$$\mathcal{A}(\mu_{j,1:d}, \mu^\star_{j,1:d}) = \min\left\{1, \left[\prod_{i=1}^{c} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{\gamma_0 + y'_{1:N,i}\, P^\star_{i,k}\, y_{1:N,i}}\right)^{\frac{N+\nu_0}{2}}\right] \mathbb{I}_{\Omega}(k, \mu^\star_{1:k})\right\},$$
where $P^\star_{i,k}$ and $M^\star_{i,k}$ are similar to $P_{i,k}$ and $M_{i,k}$ with $\mu_{1:k,1:d}$ replaced by $\{\mu_{1,1:d}, \mu_{2,1:d}, \ldots, \mu_{j-1,1:d}, \mu^\star_{j,1:d}, \mu_{j+1,1:d}, \ldots, \mu_{k,1:d}\}$. We have found that the combination of these proposal distributions works well in practice.

4.1.2 Sampling the Nuisance Parameters. In section 3.3, we derived an expression for $p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)$ from the full posterior distribution $p(k, \alpha_{1:m}, \mu_{1:k}, \sigma^2, \Lambda, \delta^2 \mid x, y)$ by performing some algebraic manipulations and integrating with respect to $\alpha_{1:m}$ (gaussian distribution) and $\sigma^2$ (inverse gamma distribution). As a result, if we take into consideration that
$$p(k, \alpha_{1:m}, \mu_{1:k}, \sigma^2, \Lambda, \delta^2 \mid x, y) = p(\alpha_{1:m} \mid k, \mu_{1:k}, \sigma^2, \Lambda, \delta^2, x, y)\, p(k, \mu_{1:k}, \sigma^2, \Lambda, \delta^2 \mid x, y)$$
$$= p(\alpha_{1:m} \mid k, \mu_{1:k}, \sigma^2, \Lambda, \delta^2, x, y)\, p(\sigma^2 \mid k, \mu_{1:k}, \Lambda, \delta^2, x, y)\, p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y),$$
it follows that, for $i = 1, \ldots, c$, $\alpha_{1:m,i}$ and $\sigma_i^2$ are distributed according to:
$$\sigma_i^2 \mid (k, \mu_{1:k}, \delta^2, x, y) \sim \mathcal{IG}\left(\frac{\nu_0 + N}{2}, \frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{2}\right), \qquad (4.1)$$
$$\alpha_{1:m,i} \mid (k, \mu_{1:k}, \sigma^2, \delta^2, x, y) \sim \mathcal{N}(h_{i,k}, \sigma_i^2 M_{i,k}). \qquad (4.2)$$
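Equations 4.1 and 4.2 give exact Gibbs draws. A single-output sketch, again assuming the g-prior $\Sigma_i^{-1} = \delta_i^{-2} D'D$ and drawing the inverse gamma as the reciprocal of a gamma variate:

```python
import numpy as np

def sample_nuisance(D, y, delta2, rng, nu0=0.0, gamma0=0.0):
    """One Gibbs draw of (sigma_i^2, alpha_{1:m,i}) per equations 4.1-4.2
    for a single output, with the g-prior Sigma_i^{-1} = delta_i^{-2} D'D."""
    N = D.shape[0]
    M = np.linalg.inv(D.T @ D * (1.0 + 1.0 / delta2))  # (D'D + Sigma^{-1})^{-1}
    h = M @ D.T @ y
    P = np.eye(N) - D @ M @ D.T
    shape = (nu0 + N) / 2.0
    scale = (gamma0 + y @ P @ y) / 2.0
    sigma2 = scale / rng.gamma(shape)                  # IG draw via 1/Gamma
    alpha = rng.multivariate_normal(h, sigma2 * M)     # alpha ~ N(h, sigma^2 M)
    return sigma2, alpha

rng = np.random.default_rng(6)
D = rng.normal(size=(25, 3))
y = D @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=25)
sigma2, alpha = sample_nuisance(D, y, delta2=1e4, rng=rng)
```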
4.1.3 Sampling the Hyperparameters. By considering $p(k, \alpha_{1:m}, \mu_{1:k}, \sigma^2, \Lambda, \delta^2 \mid x, y)$, we can clearly see that the hyperparameters $\delta_i^2$ (for $i = 1, \ldots, c$) can be simulated from the full conditional distribution:
$$\delta_i^2 \mid (k, \alpha_{1:m}, \mu_{1:k}, \sigma_i^2, x, y) \sim \mathcal{IG}\left(\alpha_{\delta^2} + \frac{m}{2},\ \beta_{\delta^2} + \frac{1}{2\sigma_i^2}\,\alpha'_{1:m,i}\, D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x)\,\alpha_{1:m,i}\right). \qquad (4.3)$$
On the other hand, an expression for the posterior distribution of $\Lambda$ is not so straightforward because the prior for $k$ is a truncated Poisson distribution. $\Lambda$ can be simulated using the MH algorithm with a proposal corresponding to the full conditional that would be obtained if the prior for $k$ were an infinite Poisson distribution. That is, we can use the following gamma proposal for $\Lambda$,
$$q(\Lambda^\star) \propto \Lambda^{\star\,(k + \varepsilon_1 - 1/2)} \exp\bigl(-(1 + \varepsilon_2)\,\Lambda^\star\bigr), \qquad (4.4)$$
that is, a $\mathcal{G}a(\frac{1}{2} + \varepsilon_1 + k, 1 + \varepsilon_2)$ distribution, and subsequently perform an MH step with the full conditional distribution $p(\Lambda \mid k, \mu_{1:k}, \delta^2, x, y)$ as invariant distribution.
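A sketch of this MH step for $\Lambda$: the gamma proposal of equation 4.4 is an independence proposal, and the target is the $\mathcal{G}a(\frac{1}{2}+\varepsilon_1, \varepsilon_2)$ prior times the truncated Poisson term. The $\varepsilon$ values are illustrative:

```python
import math
import numpy as np

def log_post_lambda(lam, k, kmax, eps1, eps2):
    """log p(Lambda | k, ...) up to a constant: the Ga(1/2 + e1, e2) prior
    times the truncated-Poisson prior of the current model order k."""
    log_trunc = math.log(sum(lam ** j / math.factorial(j) for j in range(kmax + 1)))
    return (eps1 - 0.5) * math.log(lam) - eps2 * lam + k * math.log(lam) - log_trunc

def mh_step_lambda(lam, k, kmax, rng, eps1=0.01, eps2=0.01):
    """One MH update for Lambda with the Ga(1/2 + e1 + k, 1 + e2) proposal
    of equation 4.4 (an independence proposal)."""
    a, b = 0.5 + eps1 + k, 1.0 + eps2
    prop = rng.gamma(a, 1.0 / b)
    log_q = lambda z: (a - 1.0) * math.log(z) - b * z   # unnormalized Ga log-pdf
    log_ratio = (log_post_lambda(prop, k, kmax, eps1, eps2)
                 - log_post_lambda(lam, k, kmax, eps1, eps2)
                 + log_q(lam) - log_q(prop))
    return prop if rng.uniform() < math.exp(min(0.0, log_ratio)) else lam

rng = np.random.default_rng(8)
lam = 1.0
for _ in range(100):
    lam = mh_step_lambda(lam, k=3, kmax=10, rng=rng)
```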
and subsequently perform an MH step with the full conditional distribution p ( L | k, ¹1:k, ± 2 , x , y ) as invariant distribution. 4.2 MCMC Sampler for Unknown Dimension. Now let us consider the case where k is unknown. Here, the Bayesian computation for the estimation of the joint posterior distribution p ( k, µ , Ã | x , y ) is even more complex. One obvious solution would consist of upper bounding k by, say, kmax and running kmax C 1 independent MCMC samplers, each being associated to a xed number k D 0, . . . , kmax . However, this approach suffers from severe drawbacks. First, it is computationally very expensive since kmax can be large. Second, the same computational effort is attributed to each value of k. In fact, some of these values are of no interest in practice because they have a very weak posterior model probability p ( k | x , y ) . Another solution would be to construct an MCMC sampler that would be able to sample directly from the joint distribution on £ £ ª D ( [kkmax fkg £ £ k ) £ D0 ª . Standard MCMC methods are not able to “jump” between subspaces £ k of different dimensions. However, Green (1995) has introduced a new, exible class of MCMC samplers, the so-called reversible-jump MCMC, that are capable of jumping between subspaces of different dimensions. This is a general state-space MH algorithm (see Andrieu, DjuriÂc, & Doucet, in press, for an introduction). One proposes candidates according to a set of proposal distributions. These candidates are randomly accepted according to an acceptance ratio, which ensures reversibility and thus invariance of the Markov chain with respect to the posterior distribution. Here, the chain must move across subspaces of different dimensions, and therefore the proposal distributions are more complex (see Green, 1995 and Richardson & Green, 1997, for details). For our problem, the following moves have been selected:
1. Birth of a new basis, that is, proposing a new basis function whose center is drawn at random from the interval $[\min(x_{1:N,i}) - \iota\,\Xi_i, \max(x_{1:N,i}) + \iota\,\Xi_i]$ for $i = 1, \ldots, d$.
2. Death of an existing basis, that is, removing a basis function chosen randomly. 3. Merge a randomly chosen basis function and its closest neighbor into a single basis function. 4. Split a randomly chosen basis function into two neighbor basis functions, such that the distance between them is shorter than the distance between the proposed basis function and any other existing basis function. This distance constraint ensures reversibility. 5. Update the RBF centers.
2374
C. Andrieu, N. de Freitas, and A. Doucet
These moves are defined by heuristic considerations, the only condition to be fulfilled being that the correct invariant distribution is maintained. A particular choice will influence only the convergence rate of the algorithm. The birth and death moves allow the network to grow from $k$ to $k+1$ and to shrink from $k$ to $k-1$, respectively. The split and merge moves also perform dimension changes from $k$ to $k+1$ and from $k$ to $k-1$. The merge move serves to avoid the problem of placing too many basis functions in the same neighborhood. That is, when the amplitudes of many basis functions in a close neighborhood add up to the amplitude that would be obtained with fewer basis functions, the merge move combines some of these basis functions. On the other hand, the split move is useful in regions of the data where there are close components. Other moves may be proposed, but we have found that the ones suggested here lead to satisfactory results. In particular, we started with the update, birth, and death moves and noticed that incorporating split and merge moves improved the estimation results. The resulting transition kernel of the simulated Markov chain is then a mixture of the different transition kernels associated with the moves described above. This means that at each iteration, one of the candidate moves—birth, death, merge, split, or update—is randomly chosen. The probabilities for choosing these moves are $b_k$, $d_k$, $m_k$, $s_k$, and $u_k$, respectively, such that $b_k + d_k + m_k + s_k + u_k = 1$ for all $0 \leq k \leq k_{\max}$. A move is performed if the algorithm accepts it. For $k = 0$, the death, split, and merge moves are impossible, so that $d_0 = 0$, $s_0 = 0$, and $m_0 = 0$. The merge move is also not permitted for $k = 1$, that is, $m_1 = 0$. For $k = k_{\max}$, the birth and split moves are not allowed, and therefore $b_{k_{\max}} = 0$ and $s_{k_{\max}} = 0$. Except in the cases described above, we adopt the following probabilities:
$$b_k = c^\star \min\left\{1, \frac{p(k+1 \mid \Lambda)}{p(k \mid \Lambda)}\right\}, \qquad d_{k+1} = c^\star \min\left\{1, \frac{p(k \mid \Lambda)}{p(k+1 \mid \Lambda)}\right\}, \qquad (4.5)$$
where $p(k \mid \Lambda)$ is the prior probability of model $\mathcal{M}_k$ and $c^\star$ is a parameter that tunes the proportion of dimension and update moves. As Green (1995) pointed out, this choice ensures that $b_k\, p(k \mid \Lambda)\, [d_{k+1}\, p(k+1 \mid \Lambda)]^{-1} = 1$, which means that an MH algorithm in a single dimension, with no observations, would have 1 as acceptance probability. We take $c^\star = 0.25$, and then $b_k + d_k + m_k + s_k \in [0.25, 1]$ for all $k$ (Green, 1995). In addition, we choose $m_k = d_k$ and $s_k = b_k$. We can now describe the main steps of the algorithm as follows:

Reversible-Jump MCMC Algorithm
1. Initialization: set $(k^{(0)}, \theta^{(0)}, \psi^{(0)}) \in \Theta \times \Psi$.
2. Iteration $i$.
   Sample $u \sim \mathcal{U}_{[0,1]}$.
   — If $u \leq b_{k^{(i)}}$, then perform a birth move (see section 4.2.1).
   — Else if $u \leq b_{k^{(i)}} + d_{k^{(i)}}$, then perform a death move (see section 4.2.1).
   — Else if $u \leq b_{k^{(i)}} + d_{k^{(i)}} + s_{k^{(i)}}$, then perform a split move (see section 4.2.2).
   — Else if $u \leq b_{k^{(i)}} + d_{k^{(i)}} + s_{k^{(i)}} + m_{k^{(i)}}$, then perform a merge move (see section 4.2.2).
   — Else update the RBF centers (see section 4.2.3).
   End If.
   Sample the nuisance parameters $(\alpha_{1:m}^{(i)}, \sigma_k^{2(i)})$ using equations 4.1 and 4.2.
   Sample the hyperparameters $(\Lambda^{(i)}, \delta^{2(i)})$ using equations 4.3 and 4.4.
3. $i \leftarrow i + 1$ and go to 2.
We expand on these different moves in the following sections. Once again, in order to simplify the notation, we drop the superscript $(i)$ from all variables at iteration $i$.

4.2.1 Birth and Death Moves. Suppose that the current state of the Markov chain is in $\{k\} \times \Theta_k \times \Psi$. Then the algorithm for the birth and death moves is as follows:

Birth Move
1. Propose a new basis location at random from the interval $[\min(x_{1:N,i}) - \iota\,\Xi_i, \max(x_{1:N,i}) + \iota\,\Xi_i]$ for $i = 1, \ldots, d$.
2. Evaluate $\mathcal{A}_{\mathrm{birth}}$ (see equation 4.7), and sample $u \sim \mathcal{U}_{[0,1]}$.
3. If $u \leq \mathcal{A}_{\mathrm{birth}}$, then the state of the Markov chain becomes $(k+1, \mu_{1:k+1})$; else it remains equal to $(k, \mu_{1:k})$.

Death Move
1. Choose the basis center to be deleted at random among the $k$ existing basis functions.
2. Evaluate $\mathcal{A}_{\mathrm{death}}$ (see equation 4.7), and sample $u \sim \mathcal{U}_{[0,1]}$.
3. If $u \leq \mathcal{A}_{\mathrm{death}}$, then the state of the Markov chain becomes $(k-1, \mu_{1:k-1})$; else it remains equal to $(k, \mu_{1:k})$.
The acceptance ratio for the proposed birth move is deduced from the following expression (Green, 1995):
$$r_{\mathrm{birth}} = (\text{posterior distributions ratio}) \times (\text{proposal ratio}) \times (\text{Jacobian})$$
$$= \frac{p(k+1, \mu_{1:k+1}, \Lambda, \delta^2 \mid x, y)}{p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)} \times \frac{d_{k+1}/(k+1)}{b_k/\Im} \times \left|\frac{\partial(\mu_{1:k+1})}{\partial(\mu_{1:k}, \mu^\star)}\right|.$$
Clearly, the Jacobian is equal to 1, and after simplifications we obtain:
$$r_{\mathrm{birth}} = \left[\prod_{i=1}^{c} \frac{1}{(1+\delta_i^2)^{1/2}} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{\gamma_0 + y'_{1:N,i}\, P_{i,k+1}\, y_{1:N,i}}\right)^{\frac{N+\nu_0}{2}}\right] \frac{1}{k+1}.$$
Similarly, for the death move:
$$r_{\mathrm{death}} = \frac{p(k-1, \mu_{1:k-1}, \Lambda, \delta^2 \mid x, y)}{p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)} \times \frac{b_{k-1}/\Im}{d_k/k} \times \left|\frac{\partial(\mu_{1:k-1}, \mu^\star)}{\partial(\mu_{1:k})}\right|$$
$$= \left[\prod_{i=1}^{c} (1+\delta_i^2)^{1/2} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{\gamma_0 + y'_{1:N,i}\, P_{i,k-1}\, y_{1:N,i}}\right)^{\frac{N+\nu_0}{2}}\right] k. \qquad (4.6)$$
The acceptance probabilities corresponding to the described moves are:
$$\mathcal{A}_{\mathrm{birth}} = \min\{1, r_{\mathrm{birth}}\}, \qquad \mathcal{A}_{\mathrm{death}} = \min\{1, r_{\mathrm{death}}\}. \qquad (4.7)$$
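A sketch of the birth ratio for a single output ($c = 1$) under the Jeffreys settings $\nu_0 = \gamma_0 = 0$ and the g-prior; the design matrices, basis column, and data below are made up to illustrate that a basis function explaining the data yields a ratio greater than 1:

```python
import numpy as np

def quad_form(D, y, delta2):
    """gamma_0 + y' P_{i,k} y in the Jeffreys case gamma_0 = 0, with the
    g-prior M^{-1} = (1 + 1/delta2) D'D; guard for an empty design matrix."""
    if D.shape[1] == 0:
        return y @ y
    M = np.linalg.inv(D.T @ D * (1.0 + 1.0 / delta2))
    return y @ y - y @ D @ M @ D.T @ y

def r_birth(D_k, D_k1, y, delta2, k):
    """Birth ratio: Jacobian 1, leaving the fit ratio to the power N/2,
    one factor (1 + delta^2)^{-1/2}, and 1/(k+1)."""
    N = len(y)
    ratio = quad_form(D_k, y, delta2) / quad_form(D_k1, y, delta2)
    return ratio ** (N / 2.0) / ((1.0 + delta2) ** 0.5 * (k + 1))

rng = np.random.default_rng(10)
x = np.linspace(0.0, 1.0, 40)[:, None]
D_k = np.hstack([np.ones((40, 1)), x])              # bias + linear columns (k = 0)
new_col = np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)     # candidate RBF column
D_k1 = np.hstack([D_k, new_col])
y = 2.0 * new_col[:, 0] + 0.05 * rng.normal(size=40)
r_good = r_birth(D_k, D_k1, y, delta2=100.0, k=0)   # data favor the new basis
```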
4.2.2 Split and Merge Moves. The merge move involves randomly selecting a basis function ($\mu_1$) and then combining it with its closest neighbor ($\mu_2$) into a single basis function $\mu$, whose new location is
$$\mu = \frac{\mu_1 + \mu_2}{2}. \qquad (4.8)$$
The corresponding split move that guarantees reversibility is
$$\mu_1 = \mu - u_{\mathrm{ms}}\,\varsigma^\star, \qquad \mu_2 = \mu + u_{\mathrm{ms}}\,\varsigma^\star, \qquad (4.9)$$
where $\varsigma^\star$ is a simulation parameter and $u_{\mathrm{ms}} \sim \mathcal{U}_{[0,1]}$. To ensure reversibility, we perform the merge move only if $\|\mu_1 - \mu_2\| < 2\varsigma^\star$. Suppose now that the current state of the Markov chain is in $\{k\} \times \Theta_k \times \Psi$. Then the algorithm for the split and merge moves is as follows:

Split Move
1. Randomly choose an existing RBF center.
2. Substitute it with two neighboring basis functions, whose centers are obtained using equation 4.9. The new centers must be bound to lie in the space $\Omega_k$, and the distance (typically Euclidean) between them has to be shorter than the distance between the proposed basis function and any other existing basis function.
3. Evaluate $\mathcal{A}_{\mathrm{split}}$ (see equation 4.10), and sample $u \sim \mathcal{U}_{[0,1]}$.
4. If $u \leq \mathcal{A}_{\mathrm{split}}$, then the state of the Markov chain becomes $(k+1, \mu_{1:k+1})$; else it remains equal to $(k, \mu_{1:k})$.

Merge Move
1. Choose a basis center at random among the $k$ existing basis functions. Then find the closest basis function to it by applying some distance metric, for example, Euclidean.
2. If $\|\mu_1 - \mu_2\| < 2\varsigma^\star$, substitute the two basis functions with a single basis function in accordance with equation 4.8.
3. Evaluate $\mathcal{A}_{\mathrm{merge}}$ (see equation 4.10), and sample $u \sim \mathcal{U}_{[0,1]}$.
4. If $u \leq \mathcal{A}_{\mathrm{merge}}$, then the state of the Markov chain becomes $(k-1, \mu_{1:k-1})$; else it remains equal to $(k, \mu_{1:k})$.
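The deterministic maps of equations 4.8 and 4.9 are easy to state in code; here $\varsigma^\star$ is taken as a scalar applied per coordinate, an assumption for illustration:

```python
import numpy as np

def merge_centers(mu1, mu2):
    """Equation 4.8: replace a pair of centers by their midpoint."""
    return 0.5 * (mu1 + mu2)

def split_center(mu, varsigma, rng):
    """Equation 4.9: split one center into two, displaced by u_ms * varsigma
    per coordinate, with u_ms ~ U[0,1]."""
    u_ms = rng.uniform()
    return mu - u_ms * varsigma, mu + u_ms * varsigma, u_ms

rng = np.random.default_rng(11)
mu = np.array([0.3, 0.7])
mu1, mu2, u_ms = split_center(mu, varsigma=0.1, rng=rng)
back = merge_centers(mu1, mu2)
# reversibility: merging the split pair recovers the original center,
# and the pair stays within the 2*varsigma merge window
```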
The acceptance ratio for the proposed split move is given by:
$$r_{\mathrm{split}} = \frac{p(k+1, \mu_{1:k+1}, \Lambda, \delta^2 \mid x, y)}{p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)} \times \frac{m_{k+1}/(k+1)}{p(u_{\mathrm{ms}})\, s_k/k} \times \left|\frac{\partial(\mu_1, \mu_2)}{\partial(\mu, u_{\mathrm{ms}})}\right|.$$
In this case, the Jacobian is equal to:
$$J_{\mathrm{split}} = \left|\frac{\partial(\mu_1, \mu_2)}{\partial(\mu, u_{\mathrm{ms}})}\right| = \begin{vmatrix} 1 & -\varsigma^\star \\ 1 & \varsigma^\star \end{vmatrix} = 2\varsigma^\star,$$
and, after simplifications, we obtain:
$$r_{\mathrm{split}} = \left[\prod_{i=1}^{c} \frac{1}{(1+\delta_i^2)^{1/2}} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{\gamma_0 + y'_{1:N,i}\, P_{i,k+1}\, y_{1:N,i}}\right)^{\frac{N+\nu_0}{2}}\right] \frac{k\,\varsigma^\star}{\Im\,(k+1)}.$$
Similarly, for the merge move:
$$r_{\mathrm{merge}} = \frac{p(k-1, \mu_{1:k-1}, \Lambda, \delta^2 \mid x, y)}{p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)} \times \frac{s_{k-1}/(k-1)}{m_k/k} \times \left|\frac{\partial(\mu, u_{\mathrm{ms}})}{\partial(\mu_1, \mu_2)}\right|,$$
and, since $J_{\mathrm{merge}} = 1/(2\varsigma^\star)$, it follows that
$$r_{\mathrm{merge}} = \left[\prod_{i=1}^{c} (1+\delta_i^2)^{1/2} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{\gamma_0 + y'_{1:N,i}\, P_{i,k-1}\, y_{1:N,i}}\right)^{\frac{N+\nu_0}{2}}\right] \frac{k\,\Im}{(k-1)\,\varsigma^\star}.$$
The acceptance probabilities for the split and merge moves are:
$$\mathcal{A}_{\mathrm{split}} = \min\{1, r_{\mathrm{split}}\}, \qquad \mathcal{A}_{\mathrm{merge}} = \min\{1, r_{\mathrm{merge}}\}. \qquad (4.10)$$
4.2.3 Update Move. The update move does not involve changing the dimension of the model. It requires an iteration of the fixed-dimension MCMC sampler presented in section 4.1.1. The method presented so far can be very accurate, yet it can be computationally demanding. In the following section, we present a method that requires optimization instead of integration to obtain estimates of the parameters and model dimension. This method, although less accurate, as shown in section 7, is less computationally demanding. The choice of one method over the other should ultimately depend on the modeling constraints and specifications.

5 Reversible-Jump Simulated Annealing
In this section, we show that traditional model selection criteria within a penalized likelihood framework, such as AIC, BIC, and MDL (Akaike, 1974; Schwarz, 1985; Rissanen, 1987), correspond to particular hyperparameter choices in our hierarchical Bayesian formulation. (As pointed out by one of the reviewers, AIC is not restricted to the situation where the true model belongs to the set of model candidates.) That is, we can calibrate the prior choices so that the problem of model selection within the penalized likelihood context can be mapped exactly to a problem of model selection via posterior probabilities. This technique has been previously applied to the problem of variable selection (George & Foster, 1997). After resolving the calibration problem, we perform maximum likelihood estimation, with the model selection criteria, by maximizing the calibrated posterior distribution. To accomplish this goal, we adopt an MCMC simulated annealing algorithm, which makes use of the homogeneous reversible-jump MCMC kernel as proposal distribution. This approach has the advantage that we can start with an arbitrary model order, and the algorithm will perform dimension jumps until it finds the "true" model order. That is, we do not have to resort to the more expensive task of running a fixed-dimension algorithm for each possible model order and subsequently selecting the best model.
5.1 Penalized Likelihood Model Selection. Traditionally, penalized likelihood model order selection strategies, based on standard information criteria, require the evaluation of the maximum likelihood (ML) estimates for each model order. The number of required evaluations can be prohibitively expensive unless appropriate heuristics are available. Subsequently, a particular model $\mathcal{M}_s$ is selected if it minimizes the sum of the log-likelihood and a penalty term that depends on the model dimension (Djurić, 1998; Gelfand & Dey, 1997). In mathematical terms, this estimate is given by:
$$\mathcal{M}_s = \arg\min_{\mathcal{M}_k:\ k \in \{0, \ldots, k_{\max}\}} \Bigl\{ -\log\bigl(p(y \mid k, \hat{\theta}, x)\bigr) + \mathcal{P} \Bigr\}, \qquad (5.1)$$
where $\hat{\theta} = (\hat{\alpha}_{1:m}, \hat{\mu}_{1:k}, \hat{\sigma}_k^2)$ is the ML estimate of $\theta$ for model $\mathcal{M}_k$, and $\mathcal{P}$ is a penalty term that depends on the model order. Examples of ML penalties include the well-known AIC, BIC, and MDL information criteria (Akaike, 1974; Schwarz, 1985; Rissanen, 1987). The expressions for these in the case of gaussian observation noise are
$$\mathcal{P}_{\mathrm{AIC}} = j \qquad \text{and} \qquad \mathcal{P}_{\mathrm{BIC}} = \mathcal{P}_{\mathrm{MDL}} = \frac{j}{2}\log(N),$$
where $j$ denotes the number of model parameters ($k(c+1) + c(1+d)$ in the case of an RBF network). These criteria are motivated by different factors: AIC is based on expected information, BIC is an asymptotic Bayes factor, and MDL involves evaluating the minimum information required to transmit some data and a model, which describes the data, over a communications channel. Using the conventional estimate of the variance for gaussian distributions,
$$\hat{\sigma}_i^2 = \frac{1}{N}\bigl(y_{1:N,i} - D(\hat{\mu}_{1:k}, x)\,\hat{\alpha}_{1:m,i}\bigr)'\bigl(y_{1:N,i} - D(\hat{\mu}_{1:k}, x)\,\hat{\alpha}_{1:m,i}\bigr) = \frac{1}{N}\, y'_{1:N,i}\, P^*_{i,k}\, y_{1:N,i},$$
where $P^*_{i,k}$ is the least-squares orthogonal projection matrix,
$$P^*_{i,k} = I_N - D(\hat{\mu}_{1:k}, x)\bigl[D'(\hat{\mu}_{1:k}, x)\, D(\hat{\mu}_{1:k}, x)\bigr]^{-1} D'(\hat{\mu}_{1:k}, x),$$
we can expand equation 5.1 as follows:
$$\mathcal{M}_s = \arg\min_{\mathcal{M}_k:\ k \in \{0, \ldots, k_{\max}\}} \left\{ -\log\left[\prod_{i=1}^{c} (2\pi\hat{\sigma}_i^2)^{-N/2} \exp\left(-\frac{1}{2\hat{\sigma}_i^2}\bigl(y_{1:N,i} - D(\hat{\mu}_{1:k}, x)\,\hat{\alpha}_{1:m,i}\bigr)'\bigl(y_{1:N,i} - D(\hat{\mu}_{1:k}, x)\,\hat{\alpha}_{1:m,i}\bigr)\right)\right] + \mathcal{P} \right\}$$
$$= \arg\max_{\mathcal{M}_k:\ k \in \{0, \ldots, k_{\max}\}} \left\{ \left[\prod_{i=1}^{c} \bigl(y'_{1:N,i}\, P^*_{i,k}\, y_{1:N,i}\bigr)^{-N/2}\right] \exp(-\mathcal{P}) \right\}. \qquad (5.2)$$
In the following section, we show that calibrating the priors in our hierarchical Bayesian model leads to the expression given by equation 5.2.

5.2 Calibration. It is useful and elucidating to impose some restrictions on the Bayesian hierarchical prior (see equation 3.3) to obtain the AIC and MDL criteria. We begin by assuming that the hyperparameter $\delta^2$ is fixed to a particular value, say $\bar{\delta}^2$, and that we no longer have a definite expression for the model prior $p(k)$, so that
$$p(k, \mu_{1:k} \mid x, y) \propto \left[\prod_{i=1}^{c} (1 + \bar{\delta}_i^2)^{-m/2} \left(\frac{\gamma_0 + y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}}{2}\right)^{-\frac{N+\nu_0}{2}}\right] \frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\Im^k}\, p(k).$$
Furthermore, we set $\nu_0 = 0$ and $\gamma_0 = 0$ to obtain Jeffreys's uninformative prior $p(\sigma_i^2) \propto 1/\sigma_i^2$. Consequently, we obtain the following expression:
$$p(k, \mu_{1:k} \mid x, y) \propto \left[\prod_{i=1}^{c} (1 + \bar{\delta}_i^2)^{-k/2} \bigl(y'_{1:N,i}\, P_{i,k}\, y_{1:N,i}\bigr)^{-\frac{N}{2}}\right] \frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\Im^k}\, p(k),$$
where $M_{i,k}^{-1} = (1 + \bar{\delta}_i^{-2})\, D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x)$, $h_{i,k} = M_{i,k}\, D'(\mu_{1:k}, x)\, y_{1:N,i}$, and $P_{i,k} = I_N - D(\mu_{1:k}, x)\, M_{i,k}\, D'(\mu_{1:k}, x)$. Finally, we can select $\bar{\delta}_i^2$ and $p(k)$ such that
$$\left[\prod_{i=1}^{c} (1 + \bar{\delta}_i^2)^{-k/2}\right] \frac{\mathbb{I}_{\Omega}(k, \mu_{1:k})}{\Im^k}\, p(k) = \exp(-\mathcal{P}) \propto \exp(-Ck),$$
thereby ensuring that the expression for the calibrated posterior distribution $p(k, \mu_{1:k} \mid x, y)$ corresponds to the term that needs to be maximized in the penalized likelihood framework (see equation 5.2). Note that for the purposes of optimization, we need only the proportionality condition, with $C = c + 1$ for the AIC criterion and $C = (c+1)\log(N)/2$ for the MDL and
BIC criteria. We could, for example, satisfy the proportionality by remaining in the compact set $\Omega$ and choosing the prior $p(k) = \Lambda^k / \sum_{j=0}^{k_{\max}} \Lambda^j$, with the following fixed value for $\Lambda$:
$$\Lambda = \left[\prod_{i=1}^{c} (1 + \bar{\delta}_i^2)^{1/2}\right] \Im\, \exp(-C). \qquad (5.3)$$
In addition, we have to let $\bar{\delta}^2 \to \infty$ so that $P_{i,k} \to P^*_{i,k}$. We have thus shown that by calibrating the priors in the hierarchical Bayesian formulation, in particular by treating $\Lambda$ and $\delta^2$ as fixed quantities instead of as random variables, letting $\bar{\delta}^2 \to \infty$, choosing an uninformative Jeffreys's prior for $\sigma^2$, and setting $\Lambda$ as in equation 5.3, we can obtain the expression that needs to be maximized in the classical penalized likelihood formulation with AIC, MDL, and BIC model selection criteria. Consequently, we can interpret the penalized likelihood framework as a problem of maximizing the joint posterior distribution $p(k, \mu_{1:k} \mid x, y)$. Effectively, we can obtain this MAP estimate as follows:
$$(k, \mu_{1:k})_{\mathrm{MAP}} = \arg\max_{(k, \mu_{1:k}) \in \Omega} p(k, \mu_{1:k} \mid x, y) = \arg\max_{(k, \mu_{1:k}) \in \Omega} \left\{\left[\prod_{i=1}^{c} \bigl(y'_{1:N,i}\, P^*_{i,k}\, y_{1:N,i}\bigr)^{-N/2}\right] \exp(-\mathcal{P})\right\}.$$
The sufficient conditions that need to be satisfied so that the distribution $p(k, \mu_{1:k} \mid x, y)$ is proper are not overly restrictive. First, $\Omega$ has to be a compact set, which is not a problem in our setting. Second, $y'_{1:N,i}\, P^*_{i,k}\, y_{1:N,i}$ has to be larger than zero for $i = 1, \ldots, c$. In appendix B, lemma 1, we show that this is the case unless $y_{1:N,i}$ lies in the span of the columns of $D(\mu_{1:k}, x)$, in which case $y'_{1:N,i}\, P^*_{i,k}\, y_{1:N,i} = 0$. This event has probability zero.

5.3 Reversible-Jump Simulated Annealing. From an MCMC perspective, we can solve the stochastic optimization problem posed in the previous section by adopting a simulated annealing strategy (Geman & Geman, 1984; Van Laarhoven & Arts, 1987). The simulated annealing method involves simulating a nonhomogeneous Markov chain whose invariant distribution at iteration $i$ is no longer equal to $\pi(z)$, but to $\pi_i(z) \propto \pi^{1/T_i}(z)$, where $T_i$ is a decreasing cooling schedule with $\lim_{i \to +\infty} T_i = 0$. The reason for doing this is that, under weak regularity assumptions on $\pi(z)$, the limiting distribution $\pi_{\infty}(z)$ is a probability density that concentrates itself on the set of global maxima of $\pi(z)$. As with the MH method, the simulated annealing method with target distribution $\pi(z)$ and proposal distribution $q(z^\star \mid z)$ involves sampling a candidate value $z^\star$ given the current value $z$ according to $q(z^\star \mid z)$. The Markov chain
2382
C. Andrieu, N. de Freitas, and A. Doucet
moves toward the candidate $z^\star$ with probability

$$ A_{\mathrm{SA}}(z, z^\star) = \min\left\{ 1, \frac{p^{1/T_i}(z^\star)\, q(z \mid z^\star)}{p^{1/T_i}(z)\, q(z^\star \mid z)} \right\}; $$

otherwise, it remains equal to $z$. If we choose the homogeneous transition kernel $K(z, z^\star)$ of the reversible-jump algorithm as the proposal distribution and use the reversibility property, $p(z^\star) K(z^\star, z) = p(z) K(z, z^\star)$, it follows that

$$ A_{\mathrm{RJSA}} = \min\left\{ 1, \frac{p^{(1/T_i - 1)}(z^\star)}{p^{(1/T_i - 1)}(z)} \right\}. \tag{5.4} $$
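The annealed acceptance mechanism is easy to exercise on a toy problem. The sketch below anneals a plain random-walk MH sampler on an illustrative one-dimensional bimodal density; the target, the proposal scale, and the linear cooling schedule are assumptions chosen for illustration and are not the paper's posterior or kernel:

```python
import math
import random

def log_target(z):
    # Illustrative bimodal density: a minor mode near z = -2 and the
    # global mode near z = 3 (the 1e-300 guard avoids log(0) underflow).
    return math.log(0.3 * math.exp(-0.5 * (z + 2.0) ** 2)
                    + 0.7 * math.exp(-0.5 * (z - 3.0) ** 2) + 1e-300)

def simulated_annealing_mh(n_iter=5000, seed=0):
    rng = random.Random(seed)
    z = 0.0
    for i in range(1, n_iter + 1):
        T = max(1.0 - i / n_iter, 1e-5)    # linear cooling with a small floor
        z_prop = z + rng.gauss(0.0, 3.0)   # symmetric random-walk proposal
        # Raising the target to 1/T sharpens its modes, so cold (late)
        # iterations essentially only accept moves toward a maximum.
        if math.log(rng.random()) < (log_target(z_prop) - log_target(z)) / T:
            z = z_prop
    return z

z_map = simulated_annealing_mh()  # ends close to the global maximum near z = 3
```

With a symmetric proposal the $q$ terms cancel, so the acceptance reduces to the annealed target ratio, which is the same mechanism equation 5.4 applies to the reversible-jump kernel.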
Consequently, the following algorithm, with $b_k = d_k = m_k = s_k = u_k = 0.2$, can find the joint MAP estimate of $\mu_{1:k}$ and $k$:

Reversible-Jump Simulated Annealing

1. Initialization: set $(k^{(0)}, \theta^{(0)}) \in \Theta$.
2. Iteration $i$. Sample $u \sim \mathcal{U}_{[0,1]}$, and set the temperature according to the cooling schedule.
   - If $u \leq b_{k^{(i)}}$, then birth move (see equation 5.6).
   - Else if $u \leq b_{k^{(i)}} + d_{k^{(i)}}$, then death move (see equation 5.6).
   - Else if $u \leq b_{k^{(i)}} + d_{k^{(i)}} + s_{k^{(i)}}$, then split move (see equation 5.7).
   - Else if $u \leq b_{k^{(i)}} + d_{k^{(i)}} + s_{k^{(i)}} + m_{k^{(i)}}$, then merge move (see equation 5.7).
   - Else update the RBF centers (see equation 5.5).
   End If. Perform an MH step with the annealed acceptance ratio (see equation 5.4).
3. $i \leftarrow i + 1$ and go to 2.
4. Compute the coefficients $\alpha_{1:m}$ by least squares (optimal in this case):

$$ \hat{\alpha}_{1:m,i} = \bigl[ D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x) \bigr]^{-1} D'(\mu_{1:k}, x)\, y_{1:N,i}. $$

We explain the simulated annealing moves in the following sections. To simplify the notation, we drop the superscript $\cdot^{(i)}$ from all variables at iteration $i$.
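Step 4 can be sketched with a standard least-squares solver. The gaussian basis, its width, and the added offset/linear columns below are illustrative assumptions about the design matrix $D(\mu_{1:k}, x)$, not the paper's exact construction:

```python
import numpy as np

def design_matrix(mu, x, width=0.1):
    # One column per gaussian basis centered at mu_j, plus offset and
    # linear columns (basis type and width are illustrative choices).
    basis = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * width ** 2))
    return np.hstack([basis, np.ones((len(x), 1)), x[:, None]])

def ls_coefficients(mu, x, y):
    # Step 4: alpha = (D'D)^{-1} D'y, computed with a stable least-squares
    # solver rather than an explicit matrix inverse.
    D = design_matrix(mu, x)
    alpha, *_ = np.linalg.lstsq(D, y, rcond=None)
    return D, alpha

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(50)
D, alpha = ls_coefficients(np.array([0.25, 0.75]), x, y)
```

The least-squares solution satisfies the normal equations, so the residual $y - D\hat{\alpha}$ is orthogonal to the columns of $D$, which is what makes this step optimal for fixed centers.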
5.4 Moves. We sample the radial basis centers in the same way as explained in section 4.1.1. However, the target distribution is given by

$$ p(\mu_{j,1:d} \mid x, y, \mu_{-j,1:d}) \propto \left[ \prod_{i=1}^{c} \bigl( y'_{1:N,i} P^*_{i,k}\, y_{1:N,i} \bigr)^{-N/2} \right] \exp(-\mathcal{P}), $$
and, consequently, the acceptance probability is

$$ A_{\mathrm{update}}(\mu_{j,1:d}, \mu^\star_{j,1:d}) = \min\left\{ 1, \prod_{i=1}^{c} \left( \frac{y'_{1:N,i} P^*_{i,k}\, y_{1:N,i}}{y'_{1:N,i} P^\star_{i,k}\, y_{1:N,i}} \right)^{N/2} \right\}, \tag{5.5} $$
where $P^\star_{i,k}$ is similar to $P^*_{i,k}$ with $\mu_{1:k,1:d}$ replaced by $\{\mu_{1,1:d}, \mu_{2,1:d}, \ldots, \mu_{j-1,1:d}, \mu^\star_{j,1:d}, \mu_{j+1,1:d}, \ldots, \mu_{k,1:d}\}$. The birth and death moves are similar to the ones proposed in section 4.2.1, except that the expressions for $r_{\mathrm{birth}}$ and $r_{\mathrm{death}}$ (with $b_k = d_k = 0.2$) become

$$ r_{\mathrm{birth}} = \left[ \prod_{i=1}^{c} \left( \frac{y'_{1:N,i} P^*_{i,k}\, y_{1:N,i}}{y'_{1:N,i} P^*_{i,k+1}\, y_{1:N,i}} \right)^{N/2} \right] \frac{\exp(-\mathcal{C})}{k+1}. $$

Similarly,

$$ r_{\mathrm{death}} = \left[ \prod_{i=1}^{c} \left( \frac{y'_{1:N,i} P^*_{i,k}\, y_{1:N,i}}{y'_{1:N,i} P^*_{i,k-1}\, y_{1:N,i}} \right)^{N/2} \right] k \exp(\mathcal{C}). $$

Hence, the acceptance probabilities corresponding to the described moves are

$$ A_{\mathrm{birth}} = \min\{1, r_{\mathrm{birth}}\}, \qquad A_{\mathrm{death}} = \min\{1, r_{\mathrm{death}}\}. \tag{5.6} $$
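Since $r_{\mathrm{birth}}$ and $r_{\mathrm{death}}$ depend on the data only through quadratic forms $y' P^* y$, and $P^*$ is the projection onto the orthogonal complement of the columns of the design matrix, these ratios can be computed from least-squares residuals. A sketch, with the penalty increment treated as a given constant (all names and the test matrices are illustrative):

```python
import numpy as np

def quad_form(D, y):
    # y' P* y where P* = I - D (D'D)^{-1} D' projects onto the orthogonal
    # complement of span(D): it equals the squared least-squares residual.
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    return float(resid @ resid)

def log_r_birth(D_k, D_k1, y, k, penalty, N):
    # log of r_birth = (y'P*_k y / y'P*_{k+1} y)^{N/2} * exp(-penalty) / (k+1),
    # with D_k1 the design matrix after adding one basis function.
    return (N / 2) * (np.log(quad_form(D_k, y)) - np.log(quad_form(D_k1, y))) \
        - penalty - np.log(k + 1)
```

Because the models are nested, $y' P^*_{k+1} y \leq y' P^*_k y$, so a birth is favored exactly when the residual reduction is large enough to beat the penalty and the $1/(k+1)$ factor.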
Similarly, the split and merge moves are analogous to the ones proposed in section 4.2.2, except that the expressions for $r_{\mathrm{split}}$ and $r_{\mathrm{merge}}$ (with $m_k = s_k = 0.2$) become

$$ r_{\mathrm{split}} = \left[ \prod_{i=1}^{c} \left( \frac{y'_{1:N,i} P^*_{i,k}\, y_{1:N,i}}{y'_{1:N,i} P^*_{i,k+1}\, y_{1:N,i}} \right)^{N/2} \right] \frac{k \varsigma^\star \exp(-\mathcal{C})}{k+1} $$

and

$$ r_{\mathrm{merge}} = \left[ \prod_{i=1}^{c} \left( \frac{y'_{1:N,i} P^*_{i,k}\, y_{1:N,i}}{y'_{1:N,i} P^*_{i,k-1}\, y_{1:N,i}} \right)^{N/2} \right] \frac{k \exp(\mathcal{C})}{\varsigma^\star (k-1)}. $$

The acceptance probabilities for these moves are

$$ A_{\mathrm{split}} = \min\{1, r_{\mathrm{split}}\}, \qquad A_{\mathrm{merge}} = \min\{1, r_{\mathrm{merge}}\}. \tag{5.7} $$
6 Convergence Results
It is easy to prove that the reversible-jump MCMC algorithm applied to the full Bayesian model converges; in other words, the Markov chain $(k^{(i)}, \mu^{(i)}_{1:k}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ is ergodic. We prove here a stronger result by showing that $(k^{(i)}, \mu^{(i)}_{1:k}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ converges to the required posterior distribution at a geometric rate. For the homogeneous kernel, we have the following result:

Theorem 1. Let $(k^{(i)}, \mu^{(i)}_{1:k}, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ be the Markov chain whose transition kernel has been described in section 3. This Markov chain converges to the probability distribution $p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)$. Furthermore, this convergence occurs at a geometric rate; that is, for $p(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y)$-almost every initial point $(k^{(0)}, \mu^{(0)}_{1:k}, \Lambda^{(0)}, \delta^{2(0)}) \in \Theta \times \Psi$, there exist a function $C(k^{(0)}, \mu^{(0)}_{1:k}, \Lambda^{(0)}, \delta^{2(0)}) > 0$ and a constant $r \in [0, 1)$ such that

$$ \bigl\| p^{(i)}\bigl(k, \mu_{1:k}, \Lambda, \delta^2\bigr) - p\bigl(k, \mu_{1:k}, \Lambda, \delta^2 \mid x, y\bigr) \bigr\|_{\mathrm{TV}} \leq C\bigl(k^{(0)}, \mu^{(0)}_{1:k}, \Lambda^{(0)}, \delta^{2(0)}\bigr)\, r^{\lfloor i / k_{\max} \rfloor}, \tag{6.1} $$

where $p^{(i)}(k, \mu_{1:k}, \Lambda, \delta^2)$ is the distribution of $(k^{(i)}, \mu^{(i)}_{1:k}, \Lambda^{(i)}, \delta^{2(i)})$ and $\|\cdot\|_{\mathrm{TV}}$ is the total variation norm (Tierney, 1994).

Proof. See appendix B.
Corollary 1. Since at each iteration $i$ one simulates the nuisance parameters $(\alpha_{1:m}, \sigma^2_k)$, the distribution of the series $(k^{(i)}, \alpha^{(i)}_{1:m}, \mu^{(i)}_{1:k}, \sigma^{2(i)}_k, \Lambda^{(i)}, \delta^{2(i)})_{i \in \mathbb{N}}$ converges geometrically toward $p(k, \alpha_{1:m}, \mu_{1:k}, \sigma^2_k, \Lambda, \delta^2 \mid x, y)$ at the same rate $r$.

In other words, the distribution of the Markov chain converges at least at a geometric rate, dependent on the initial state, to the required equilibrium distribution $p(k, \theta, \psi \mid x, y)$.
Robust Full Bayesian Learning for Radial Basis
2385
Remark 1. In practice, one cannot evaluate $r$, but theorem 1 proves its existence. This type of convergence ensures that a central limit theorem for ergodic averages is valid (Meyn & Tweedie, 1993; Tierney, 1994). Moreover, in practice, there is empirical evidence that the Markov chain converges quickly.
We have the following convergence theorem for the reversible-jump MCMC simulated annealing algorithm:

Theorem 2. Under certain assumptions found in Andrieu, Breyer, and Doucet (1999), the series $(\theta^{(i)}, k^{(i)})$ converges in probability to the set of global maxima $(\theta^{\max}, k^{\max})$; that is, for any $\varepsilon > 0$, it follows that

$$ \lim_{i \to \infty} \Pr\left( \frac{p(\theta^{(i)}, k^{(i)})}{p(\theta^{\max}, k^{\max})} \geq 1 - \varepsilon \right) = 1. $$
Proof. If we follow the same steps as in proposition 1 of appendix B, with $\delta^2$ and $\Lambda$ fixed, it is easy to show that the transition kernels for each temperature are uniformly geometrically ergodic. Hence, as a corollary of Andrieu et al. (1999, theorem 1), the convergence result for the simulated annealing MCMC algorithm follows.
7 Experiments
When implementing the reversible-jump MCMC algorithm discussed in section 4, one might encounter problems of ill conditioning, in particular for high-dimensional parameter spaces. There are two satisfactory ways of overcoming this problem.³ First, we can introduce a ridge regression component, so that the expression for $M^{-1}_{i,k}$ in section 3.3 becomes

$$ M^{-1}_{i,k} = D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x) + \Sigma^{-1}_i + \varpi I_m, $$

where $\varpi$ is a small number. Alternatively, we can introduce a slight modification of the prior for $\alpha_{1:m}$:

$$ p(\alpha_{1:m} \mid k, \mu_{1:k}, \sigma^2, \Lambda, \delta^2) = \prod_{i=1}^{c} |2\pi \sigma^2_i \delta^2_i I_m|^{-1/2} \exp\left( -\frac{1}{2\sigma^2_i \delta^2_i}\, \alpha'_{1:m,i}\, \alpha_{1:m,i} \right). $$

³ The software is available online at http://www.cs.berkeley.edu/~jfgf.
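A minimal numerical illustration of the first strategy, assuming a generic design matrix and taking $\Sigma^{-1}_i = 0$ for simplicity (the jitter value is an arbitrary small number, not a recommendation from the text):

```python
import numpy as np

def posterior_precision(D, Sigma_inv, jitter=1e-6):
    # M^{-1} = D'D + Sigma^{-1} + jitter * I. The small ridge term bounds
    # the smallest eigenvalue away from zero when the columns of D are
    # nearly collinear, which is the ill-conditioning discussed above.
    m = D.shape[1]
    return D.T @ D + Sigma_inv + jitter * np.eye(m)

# Nearly collinear design: without the ridge term, D'D here is singular
# to machine precision.
x = np.linspace(0.0, 1.0, 20)
D = np.column_stack([x, x + 1e-12])
M_inv = posterior_precision(D, np.zeros((2, 2)), jitter=1e-6)
```

The trade-off noted below is that this introduces a tuning parameter, whereas the modified prior achieves a similar regularization without one.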
We have found that although both strategies can deal with the problem of limited numerical precision, the second approach seems to be more stable. In addition, the second approach does not oblige us to select a value for the additional ridge regularization parameter.

7.1 Experiment 1: Signal Detection. The problem of detecting signal components in noisy signals has occupied the minds of many researchers for a long time (Djurić, 1996). Here, we consider the rather simple toy problem of detecting gaussian components in a noisy signal. Our aim is to compare the performance of the hierarchical Bayesian model selection scheme and the penalized likelihood model selection criteria (AIC, MDL) when the amount of noise in the signal varies. The data were generated from the following univariate function, using 50 covariate points uniformly on $[-2, 2]$:
$$ y = x + 2 \exp(-16 x^2) + 2 \exp\bigl(-16 (x - 0.7)^2\bigr) + n, \qquad n \sim \mathcal{N}(0, \sigma^2). $$

The data were then rescaled to make the input data lie in the interval $[0, 1]$. We used the reversible-jump MCMC algorithm, for the full Bayesian model, and the simulated annealing algorithms to estimate the number of components in the signal for different levels of noise. We repeated the experiment 100 times for each noise level. We chose gaussian radial basis functions with the same variance as the gaussian signal components. For the simulated annealing method, we adopted a linear cooling schedule, $T_i = a - b i$, where $a, b \in \mathbb{R}^+$ and $T_i > 0$ for $i = 1, 2, 3, \ldots$ In particular, we set the initial and final temperatures to 1 and $1 \times 10^{-5}$. For the Bayesian model, we selected diffuse priors ($a_{\delta^2} = 2$, $b_{\delta^2} = 10$ [see experiment 2], $\upsilon_0 = 0$, $\gamma_0 = 0$, $\varepsilon_1 = 0.001$, and $\varepsilon_2 = 0.0001$). Finally, we set the simulation parameters $k_{\max}$, $i$, $\sigma^2_{\mathrm{RW}}$, and $\varsigma^\star$ to 20, 0.1, 0.001, and 0.1.

Figure 4 shows the typical fits that were obtained for training and validation data sets. By varying the variance of the noise $\sigma^2$, we estimated the main mode and fractions of unexplained variance. For the AIC and BIC/MDL criteria, the main mode corresponds to the one for which the posterior is maximized, while for the Bayesian approach, the main mode corresponds to the MAP of the model order probabilities $\hat{p}(k \mid x, y)$, computed as suggested in section 3.2. The fractions of unexplained variance (fv) were computed as follows:
$$ \mathrm{fv} = \frac{1}{100} \sum_{i=1}^{100} \frac{\sum_{t=1}^{50} (\hat{y}_{t,i} - y_{t,i})^2}{\sum_{t=1}^{50} (y_{t,i} - \bar{y}_i)^2}, $$

where $\hat{y}_{t,i}$ denotes the $t$th prediction for the $i$th trial and $\bar{y}_i$ is the estimated mean of $y_i$. The normalization in the fv error measure makes it independent of the size of the data set. If the estimated mean was to be used as the
Figure 4: Performance of the reversible-jump MCMC algorithm on the signal detection problem. Despite the large noise variance, the estimates of the true function and noise process are very accurate, thereby leading to good generalisation (no overfitting).

Table 1: Fraction of Unexplained Variance for Different Values of the Noise Variance, Averaged over 100 Test Sets.

  $\sigma^2$   AIC      BIC/MDL   Bayes
  0.01         0.0070   0.0076    0.0069
  0.1          0.0690   0.0732    0.0657
  1            0.6083   0.4846    0.5105
predictor of the data, the fv would be equal to 1. The results obtained are shown in Figure 5 and Table 1. The fv for each model selection approach are very similar. This result is expected since the problem under consideration is rather simple, and the error variations could possibly be attributed to the fact that we use only 100 realizations of the noise process for each $\sigma^2$. What is important is that even in this scenario, it is clear that the full Bayesian model provides more accurate estimates of the model order.
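The generating function and the fv measure of experiment 1 can be sketched as follows. The random design on $[-2, 2]$ and the min-max rescaling of the inputs are our assumptions about details the text leaves open:

```python
import numpy as np

def make_signal(n=50, noise_std=0.1, seed=0):
    # Experiment 1 test function:
    #   y = x + 2 exp(-16 x^2) + 2 exp(-16 (x - 0.7)^2) + n,
    # with n ~ N(0, noise_std^2) and covariates on [-2, 2].
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, n)
    y = (x + 2 * np.exp(-16 * x ** 2)
         + 2 * np.exp(-16 * (x - 0.7) ** 2)
         + noise_std * rng.standard_normal(n))
    x01 = (x - x.min()) / (x.max() - x.min())  # rescale inputs to [0, 1]
    return x01, y

def fraction_unexplained_variance(y_true, y_pred):
    # fv for one trial: sum_t (yhat_t - y_t)^2 / sum_t (y_t - ybar)^2.
    # Table 1 reports this quantity averaged over the 100 trials.
    y_true = np.asarray(y_true, dtype=float)
    resid = float(np.sum((np.asarray(y_pred) - y_true) ** 2))
    total = float(np.sum((y_true - y_true.mean()) ** 2))
    return resid / total

x, y = make_signal()
# Sanity checks matching the text: the trivial mean predictor gives
# fv = 1, and a perfect predictor gives fv = 0.
fv_mean = fraction_unexplained_variance(y, np.full_like(y, y.mean()))
fv_perfect = fraction_unexplained_variance(y, y)
```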
Figure 5: Histograms of the main mode $\hat{p}(k \mid x, y)$ for 100 trials of each noise level in experiment 1. The Bayes solution provides a better estimate of the true number of basis functions (2) than the MDL/BIC and AIC criteria.

7.2 Experiment 2: Robot Arm Data. This data set is often used as a benchmark to compare neural network algorithms.⁴ It involves implementing a model to map the joint angle of a robot arm, $(x_1, x_2)$, to the position of the end of the arm, $(y_1, y_2)$. The data were generated from the following model:
$$ y_1 = 2.0 \cos(x_1) + 1.3 \cos(x_1 + x_2) + \epsilon_1 $$
$$ y_2 = 2.0 \sin(x_1) + 1.3 \sin(x_1 + x_2) + \epsilon_2, \tag{7.1} $$
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, $\sigma = 0.05$. We use the first 200 observations of the data set to train our models and the last 200 observations to test them. First, we assessed the performance of the reversible-jump algorithm with the Bayesian model. In all the simulations, we chose to use cubic basis functions. Figure 6 shows the three-dimensional plots of the training data and the contours of the training and test data. The contour plots also include

⁴ The data set is available online at http://wol.ra.phy.cam.ac.uk/mackay/.
Figure 6: (Top) Training data surfaces corresponding to each coordinate of the robot arm’s position. (Middle, bottom) Training and validation data (solid lines) and respective RBF network mappings (dotted lines).
the typical approximations that were obtained using the algorithm. To assess convergence, we plotted the probabilities of each model order, $\hat{p}(k \mid x, y)$, in the chain (using equation 3.2) for 50,000 iterations, as shown in Figure 7. As the model orders begin to stabilize after 30,000 iterations, we decided to run the Markov chains for 50,000 iterations with a burn-in of 30,000 iterations. It is possible to design more complex convergence diagnostic tools; however, this topic is beyond the scope of this article. We chose uninformative priors for all the parameters and hyperparameters. In particular, we used the values shown in Table 2. To demonstrate the robustness of our model, we chose different values for $b_{\delta^2}$ (the only critical hyperparameter, as it quantifies the mean of the spread $\delta$ of $\alpha_k$). The obtained mean square errors (see Table 2) and probabilities for $\delta_1$, $\delta_2$, $\sigma^2_{1,k}$, $\sigma^2_{2,k}$, and $k$, shown in Figure 8, clearly indicate that our model is robust with respect to prior specification. As shown in Table 3, our mean square errors are slightly better than the ones reported by other researchers (Holmes & Mallick, 1998; Mackay, 1992; Neal, 1996; Rios Insua & Müller, 1998). Yet the main point we are trying to make is that our model exhibits the important quality of being robust to the
Figure 7: Convergence of the reversible-jump MCMC algorithm for RBF networks. The plot shows the probability of each model order given the data. The model orders begin to stabilize after 30,000 iterations.
Table 2: Simulation Parameters and Mean Square Errors for the Robot Arm Data (Test Set) Using the Reversible-Jump MCMC Algorithm and the Bayesian Model.

  $a_{\delta^2}$   $b_{\delta^2}$   $\upsilon_0$   $\gamma_0$   $\varepsilon_1$   $\varepsilon_2$   Mean Square Error
  2                0.1              0              0            0.0001            0.0001            0.00505
  2                10               0              0            0.0001            0.0001            0.00503
  2                100              0              0            0.0001            0.0001            0.00502
prior specification and statistically significant. Moreover, it leads to more parsimonious models than the ones previously reported. We also tested the reversible-jump simulated annealing algorithms with the AIC and MDL criteria on this problem. The results for the MDL criterion are depicted in Figure 9. We note that the posterior increases stochastically with the number of iterations and eventually converges to a maximum. The figure also illustrates the convergence of the train and test set errors for each
Figure 8: Histograms of smoothness constraints for each output ($\delta^2_1$ and $\delta^2_2$), noise variances ($\sigma^2_{1,k}$ and $\sigma^2_{2,k}$), and model order ($k$) for the robot arm data simulation, using three different values for $b_{\delta^2}$. The plots confirm that the model is robust to the setting of $b_{\delta^2}$.
network in the Markov chain. For the final network, we chose the one that maximized the posterior (the MAP estimate). This network consisted of 12 basis functions and incurred an error of 0.00512 on the test set. Following the same procedure, the AIC network consisted of 27 basis functions and incurred an error of 0.00520 on the test set. These results indicate that the full Bayesian model, with model averaging, provides more accurate results. Moreover, it seems that the information criteria, in particular the AIC, can lead to overfitting of the data. These results confirm the well-known fact that suboptimal techniques, such as the simulated annealing method with information criteria penalty terms and a rapid cooling schedule, can allow for faster computation at the expense of accuracy.
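The flavor of this comparison can be reproduced in a few lines: generate data from equation 7.1 and score two nested least-squares fits with penalized criteria. The joint-angle sampling, the candidate feature sets, and the textbook AIC/BIC forms below are illustrative assumptions, not the paper's exact penalty term or the benchmark data set itself (which is available at the URL given above):

```python
import numpy as np

def robot_arm_data(n=400, noise_std=0.05, seed=0):
    # Equation 7.1: y1 = 2.0 cos(x1) + 1.3 cos(x1 + x2) + e1,
    #               y2 = 2.0 sin(x1) + 1.3 sin(x1 + x2) + e2.
    # The joint-angle ranges are illustrative choices.
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-2.0, 2.0, n)
    x2 = rng.uniform(0.0, 3.0, n)
    e = noise_std * rng.standard_normal((n, 2))
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + e[:, 0]
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + e[:, 1]
    return np.column_stack([x1, x2]), np.column_stack([y1, y2])

def aic_bic(D, y):
    # Textbook penalized scores for a least-squares fit (smaller is better):
    #   AIC = N log(RSS/N) + 2 p,   BIC/MDL = N log(RSS/N) + p log N.
    N, p = D.shape
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    rss = float(np.sum((y - D @ coef) ** 2))
    return N * np.log(rss / N) + 2 * p, N * np.log(rss / N) + p * np.log(N)

X, Y = robot_arm_data()
y = Y[:200, 0]                      # first 200 observations for training
ones = np.ones((200, 1))
D_small = np.hstack([ones, X[:200]])                       # linear model
D_big = np.hstack([D_small, np.cos(X[:200, :1]),
                   np.cos(X[:200, :1] + X[:200, 1:2])])    # adds true features
scores = [aic_bic(D, y) for D in (D_small, D_big)]
```

Because AIC charges only 2 per parameter while BIC/MDL charges $\log N$, AIC tolerates larger models, which is consistent with the 27-basis AIC network versus the 12-basis MDL network reported above.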
Table 3: Mean Square Errors and Number of Basis Functions for the Robot Arm Data.

  Method                                                           Mean Square Error
  Mackay's (1992) gaussian approximation with highest evidence     0.00573
  Mackay's (1992) gaussian approximation with lowest test error    0.00557
  Neal's (1996) hybrid Monte Carlo                                 0.00554
  Neal's (1996) hybrid Monte Carlo with ARD                        0.00549
  Rios Insua and Müller's (1998) MLP with reversible-jump MCMC     0.00620
  Holmes and Mallick's (1998) RBF with reversible-jump MCMC        0.00535
  Reversible-jump MCMC with Bayesian model                         0.00502
  Reversible-jump MCMC with MDL                                    0.00512
  Reversible-jump MCMC with AIC                                    0.00520
Figure 9: Performance of the reversible-jump simulated annealing algorithm for 200 iterations on the robot arm data, with the MDL criterion.

7.3 Experiment 3: Classification with Discriminants. Here we consider an interesting nonlinear classification data set⁵ collected as part of a study to identify patients with muscle tremor (Roberts, Penny, & Pillot, 1996; Spyers-Ashby, Bain, & Roberts, 1998). The data were gathered from a group of patients (nine with, primarily, Parkinson's disease or multiple sclerosis) and from a control group (not exhibiting the disease). Arm muscle tremor was measured with a 3D mouse and a movement tracker in three linear and three angular directions. The time series of the measurements were parameterized using a set of autoregressive models. The number of features was then reduced to two (Roberts et al., 1996). Figure 10 shows a plot of these features for patient and control groups. The figure also shows the classification boundaries and confidence intervals obtained with our model, using thin-plate spline hidden neurons and an output linear neuron. We should point out, however, that having an output linear neuron leads to a classification framework based on discriminants. An alternative approach, which we do not pursue here, is to use a logistic output neuron so that the classification scheme is based on probabilities of class membership. It is, however, possible to extend our approach to this probabilistic classification setting by adopting the generalized linear models framework with logistic, probit,

⁵ The data are available online at http://www.ee.ic.ac.uk/hp/staff/sroberts.html.

Figure 10: Classification boundaries (solid line) and confidence intervals (dashed line) for the RBF classifier. The circles indicate patients, and the crosses represent the control group.
Figure 11: Receiver operating characteristic (ROC) of the classifier for the tremor data. The solid line is the ROC curve for the posterior mean classifier, and the dotted lines correspond to the curves obtained for various classifiers in the Markov chain.
or softmax link functions (Gelman, Carlin, Stern, & Rubin, 1995; Holmes & Mallick, 1999; Nabney, 1999). The size of the confidence intervals for the decision boundary is given by the noise variance ($\sigma^2$). These intervals are a measure of uncertainty on the threshold that we apply to the linear output neuron. Our confidence of correctly classifying a sample occurring within these intervals should be very low. The receiver operating characteristic (ROC) curve, shown in Figure 11, indicates that, using the Neyman-Pearson criterion, we can expect to detect patients with 69% confidence and without making any mistakes (Hand, 1997). In our application, the ROC curve was obtained by averaging all the predictions for each classifier in the Markov chain. The percentage of classification errors in the test set was found to be 14.60. This error is of the same magnitude as previously reported results (de Freitas, Niranjan, & Gee, 1998; Roberts & Penny, 1998). Finally, the estimated probabilities of the signal-to-noise ratio ($\delta^2$), noise variance ($\sigma^2$), and model order ($k$) for this application are depicted in Figure 12.
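The ROC construction described above can be sketched by sweeping the decision threshold over the discriminant scores. The toy scores below, and the choice to average raw linear-neuron outputs over the chain before thresholding, are illustrative assumptions:

```python
import numpy as np

def roc_points(scores, labels):
    # Sweep the threshold over the scores in decreasing order and record
    # the (false positive rate, true positive rate) at each cut.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels == 1) / max(1, (labels == 1).sum())
    fpr = np.cumsum(labels == 0) / max(1, (labels == 0).sum())
    return fpr, tpr

# "Posterior mean classifier": average the linear-neuron outputs of the
# classifiers kept in the Markov chain, then threshold the average.
chain_scores = [np.array([0.9, 0.8, 0.3, 0.1]),
                np.array([0.7, 0.9, 0.2, 0.2])]
mean_scores = np.mean(chain_scores, axis=0)
fpr, tpr = roc_points(mean_scores, [1, 1, 0, 0])
```

In this toy example the classes separate perfectly, so the curve reaches a true positive rate of 1 at a false positive rate of 0; the 69% figure quoted above corresponds to the height of the real curve at zero false positives.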
Figure 12: Estimated probabilities of the signal-to-noise ratio ($\delta^2$), noise variance ($\sigma^2$), and model order ($k$) for the classification example.
8 Conclusions
We have proposed a robust full Bayesian model for estimating, jointly, the noise variance, parameters, and number of parameters of an RBF model. We also considered the problem of stochastic optimization for model order selection and proposed a solution that makes use of a reversible-jump simulated annealing algorithm and classical information criteria. Moreover, we gave proofs of geometric convergence for the reversible-jump algorithm for the full Bayesian model and of convergence for the simulated annealing algorithm. Contrary to previously reported results, our experiments suggest that our Bayesian model is robust with respect to the specification of the prior. In addition, we obtained more parsimonious RBF networks and slightly better approximation errors than the ones previously reported in the literature. We also presented a comparison between Bayesian model averaging and penalized likelihood model selection with the AIC and MDL criteria. We found that the Bayesian strategy led to more accurate results. Yet the optimization strategy using the AIC and MDL criteria and a reversible-jump simulated annealing algorithm was shown to converge faster for a specific cooling schedule.
There are many avenues for further research. These include estimating the type of basis functions required for a particular task, performing input variable selection, considering other noise models, adopting Bernoulli and multinomial output distributions for probabilistic classification by incorporating ideas from the generalized linear models field, and extending the framework to sequential scenarios. A solution to the first problem can be easily formulated using the reversible-jump MCMC framework presented here. Variable selection schemes can also be implemented via the reversible-jump MCMC algorithm. Finally, we are working on a sequential version of the algorithm that allows us to perform model selection in nonstationary environments (Andrieu, de Freitas, & Doucet, 1999a, 1999b). We also believe that the algorithms need to be tested on additional real-world problems. For this purpose, we have made the software available online at http://www.cs.berkeley.edu/~jfgf.

Appendix A: Notation

$A_{i,j}$: entry of the matrix $A$ in the $i$th row and $j$th column.

$A'$: transpose of matrix $A$.
$|A|$: determinant of matrix $A$.

If $z = (z_1, \ldots, z_{j-1}, z_j, z_{j+1}, \ldots, z_k)'$, then $z_{-j} = (z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_k)'$.

$I_n$: identity matrix of dimension $n \times n$.

$\mathbb{I}_E(z)$: indicator function of the set $E$ (1 if $z \in E$, 0 otherwise).

$\lfloor z \rfloor$: highest integer strictly less than $z$.

$z \sim p(z)$: $z$ is distributed according to $p(z)$.

$z \mid y \sim p(z)$: the conditional distribution of $z$ given $y$ is $p(z)$.

  Probability Distribution   Notation                     Density $f(\cdot)$
  Inverse gamma              $\mathcal{IG}(\alpha,\beta)$   $\frac{\beta^\alpha}{\Gamma(\alpha)} z^{-\alpha-1} \exp(-\beta/z)\, \mathbb{I}_{[0,+\infty)}(z)$
  Gamma                      $\mathcal{G}a(\alpha,\beta)$   $\frac{\beta^\alpha}{\Gamma(\alpha)} z^{\alpha-1} \exp(-\beta z)\, \mathbb{I}_{[0,+\infty)}(z)$
  Gaussian                   $\mathcal{N}(m,\Sigma)$        $|2\pi\Sigma|^{-1/2} \exp\bigl(-\frac{1}{2}(z-m)'\Sigma^{-1}(z-m)\bigr)$
  Poisson                    $\mathcal{P}n(\lambda)$        $\frac{\lambda^z}{z!} \exp(-\lambda)\, \mathbb{I}_{\mathbb{N}}(z)$
  Uniform                    $\mathcal{U}_A$                $\bigl[\int_A dz\bigr]^{-1} \mathbb{I}_A(z)$
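As a quick sanity check on the inverse gamma parameterization in the table, the density can be integrated numerically. The parameter values and grid below are arbitrary choices, wide enough to capture essentially all of the mass:

```python
import numpy as np
from math import gamma

def inv_gamma_pdf(z, a, b):
    # IG(a, b) density from the table:
    #   b^a / Gamma(a) * z^(-a-1) * exp(-b / z), supported on z > 0.
    return b**a / gamma(a) * z**(-a - 1.0) * np.exp(-b / z)

# Riemann-sum check that the density integrates to one for a = 3, b = 2
# (mean b/(a-1) = 1, so [1e-4, 60] holds virtually all the mass).
z = np.linspace(1e-4, 60.0, 400_000)
mass = float(np.sum(inv_gamma_pdf(z, 3.0, 2.0)) * (z[1] - z[0]))
```

Note that in this parameterization $\beta$ is a scale-like rate on $1/z$: if $z \sim \mathcal{IG}(\alpha, \beta)$, then $1/z \sim \mathcal{G}a(\alpha, \beta)$.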
Appendix B: Proof of Theorem 1
The proof of theorem 1 relies on the following theorem, which is a result of theorems 14.0.1 and 15.0.1 in Meyn and Tweedie (1993):
Theorem 3. Suppose that a Markovian transition kernel $P$ on a space $\mathcal{Z}$:

1. Is a $\varphi$-irreducible (for some measure $\varphi$) aperiodic Markov transition kernel with invariant distribution $\pi$.
2. Has geometric drift toward a small set $C$ with drift function $V : \mathcal{Z} \to [1, +\infty)$; that is, there exist $0 < \lambda < 1$, $b > 0$, $k_0$, and an integrable measure $\nu$ such that:

$$ PV(z) \leq \lambda V(z) + b\, \mathbb{I}_C(z) \tag{B.1} $$

$$ P^{k_0}(z, dz') \geq \mathbb{I}_C(z)\, \nu(dz'). \tag{B.2} $$

Then for $\pi$-almost all $z_0$, some constants $r < 1$ and $R < +\infty$, we have:

$$ \|P^n(z_0, \cdot) - \pi(\cdot)\|_{\mathrm{TV}} \leq R\, V(z_0)\, r^n. \tag{B.3} $$
That is, $P$ is geometrically ergodic.

We need to prove five lemmas that will allow us to prove the different conditions required to apply theorem 3. These lemmas will enable us to prove proposition 1, which will establish the minorization condition, equation B.2, for some $k_0$ and measure $\varphi$ (to be described). The $\varphi$-irreducibility and aperiodicity of the Markov chain are then proved in corollary 3, thereby ensuring the simple convergence of the Markov chain. To complete the proof, proposition 2 will establish the drift condition, equation B.1. To simplify the presentation, we consider only one network output. The proof for multiple outputs follows trivially.

Before presenting the various lemmas and their respective proofs, we need to introduce some notation. Let $K(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_2}, d\delta^2_{k_2}, k_2, d\mu_{1:k_2})$ denote the transition kernel of the Markov chain.⁶ Thus, for fixed $(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}) \in \mathbb{R}^{+2} \times \Theta_{k_1}$, we have:

$$ \Pr\bigl( (\Lambda_{k_2}, \delta^2_{k_2}, k_2, \mu_{1:k_2}) \in A_{k_2} \mid (\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}) \bigr) = \int_{A_{k_2}} K(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_2}, d\delta^2_{k_2}, k_2, d\mu_{1:k_2}), $$

where $A_{k_2} \in \mathcal{B}(\mathbb{R}^{+2} \times \{k_2\} \times \Theta_{k_2})$. This transition kernel is by construction (see section 4.2) a mixture of transition kernels. Hence:

⁶ We will use the notation $\Lambda_k$, $\delta^2_k$, when necessary, for ease of presentation. This does not mean that these variables depend on the dimension $k$.
$$ \begin{aligned} K(\Lambda_{k_1}, &\delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_2}, d\delta^2_{k_2}, k_2, d\mu_{1:k_2}) \\ = \Bigl(\, & b_{k_1} K_{\mathrm{birth}}(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_1}, d\delta^2_{k_1}, k_1 + 1, d\mu_{1:k_1+1}) \\ & + d_{k_1} K_{\mathrm{death}}(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_1}, d\delta^2_{k_1}, k_1 - 1, d\mu_{1:k_1-1}) \\ & + s_{k_1} K_{\mathrm{split}}(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_1}, d\delta^2_{k_1}, k_1 + 1, d\mu_{1:k_1+1}) \\ & + m_{k_1} K_{\mathrm{merge}}(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_1}, d\delta^2_{k_1}, k_1 - 1, d\mu_{1:k_1-1}) \\ & + (1 - b_{k_1} - d_{k_1} - s_{k_1} - m_{k_1})\, K_{\mathrm{update}}(\Lambda_{k_1}, \delta^2_{k_1}, k_1, \mu_{1:k_1}; d\Lambda_{k_1}, d\delta^2_{k_1}, k_1, d\mu^*_{1:k_1}) \Bigr) \\ & \times p(\delta^2_{k_2} \mid \delta^2_{k_1}, k_2, \mu_{1:k_2}, x, y)\, p(\Lambda_{k_2} \mid k_2)\, d\Lambda_{k_2}\, d\delta^2_{k_2}, \end{aligned} $$

where $K_{\mathrm{birth}}$ and $K_{\mathrm{death}}$ correspond to the reversible jumps described in section 4.2.1, $K_{\mathrm{split}}$ and $K_{\mathrm{merge}}$ to the reversible jumps described in section 4.2.2, and $K_{\mathrm{update}}$ is described in section 4.2.3. The different steps for sampling the parameters $\delta^2_{k_2}$ and $\Lambda_{k_2}$ are described in section 4.1.3.
Lemma 1. We denote by $P^*_k$ the matrix $P_k$ for which $\delta^2 \to +\infty$. Let $v \in \mathbb{R}^N$; then $v' P^*_k v = 0$ if and only if $v$ belongs to the space spanned by the columns of $D(\mu_{1:k}, x)$, with $\mu_{1:k} \in \Theta_k$.

Then, noting that

$$ y' P_k\, y = \frac{1}{1 + \delta^2}\, y' y + \frac{\delta^2}{1 + \delta^2}\, y' P^*_k\, y, $$

we obtain the following corollary:

Corollary 2. If the observed data $y$ are really noisy, that is, cannot be described as the sum of $k$ basis functions and a linear mapping, then there exists a number $\varepsilon > 0$ such that for all $k \leq k_{\max}$, $\delta^2 \in \mathbb{R}^+$, and $\mu_{1:k} \in \Theta_k$, we have $y' P_k\, y \geq \varepsilon > 0$.
Lemma 2. For all $k \leq k_{\max}$, $\delta^2 \in \mathbb{R}^+$, and $\mu_{1:k} \in \Theta_k$, we have $y' P_k\, y \leq y' y$.
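The decomposition stated after lemma 1 and the bound in lemma 2 can be checked numerically. The sketch below assumes $M_k^{-1} = (1 + 1/\delta^2)\, D'(\mu_{1:k}, x) D(\mu_{1:k}, x)$, the form implied by the proof of lemma 5, and a random matrix stands in for $D(\mu_{1:k}, x)$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, delta2 = 30, 4, 5.0
D = rng.standard_normal((N, m))   # stand-in for D(mu_{1:k}, x)
y = rng.standard_normal(N)

# With M_k^{-1} = (1 + 1/delta^2) D'D, we get
# D M_k D' = delta^2/(1+delta^2) * H, where H projects onto span(D).
H = D @ np.linalg.solve(D.T @ D, D.T)
P_star = np.eye(N) - H                          # the delta^2 -> +inf limit
P_k = np.eye(N) - (delta2 / (1.0 + delta2)) * H

# Identity from lemma 1's corollary:
lhs = y @ P_k @ y
rhs = (y @ y) / (1.0 + delta2) + (delta2 / (1.0 + delta2)) * (y @ P_star @ y)

# Lemma 2 bound: y' P_k y <= y' y.
bound_ok = lhs <= y @ y
```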
Lemma 3. Let $K_1$ be the transition kernel corresponding to $K$ such that $\Lambda$ and $\delta^2$ are kept fixed. Then there exists $M_1 > 0$ such that for any $M_2$ sufficiently large, any $\delta^2 \in \mathbb{R}^+$, and $k_1 = 1, \ldots, k_{\max}$:

$$ K_1(\Lambda, \delta^2, k_1, \mu_{1:k_1}; \Lambda, \delta^2, k_1 - 1, d\mu_{1:k_1-1}) \geq \frac{c^\star\, \mathbb{I}_{\{\Lambda : \Lambda < M_2\}}(\Lambda)}{M_1 M_2 k_1}\, dS_{\mu_{1:k_1}}(d\mu_{1:k_1-1}), $$

with $c^\star > 0$ as defined in equation 4.5.
Proof. According to the definition of the transition kernel, for all variables $((k_1, \mu_{1:k_1}), (k_2, \mu_{1:k_2})) \in \Theta^2$, one has the following inequality:

$$ K_1(\Lambda, \delta^2, k_1, \mu_{1:k_1}; \Lambda, \delta^2, k_2, d\mu_{1:k_2}) \geq \min\{1, r_{\mathrm{death}}\}\, d_{k_1}\, \frac{dS_{\mu_{1:k_1}}(d\mu_{1:k_2})}{k_1}, $$

where $1/k_1$ is the probability of choosing one of the basis functions for the purpose of removing it and $S_{\mu_{1:k_1}} \triangleq \{\mu' \in \Theta_{k_1-1} : \exists\, l \in \{1, \ldots, k_1\} \text{ such that } \mu' = \mu_{-l}\}$. Then from equation 4.6 and for all $k_1 = 1, \ldots, k_{\max}$, we have

$$ r^{-1}_{\mathrm{death}} = \left( \frac{\gamma_0 + y'_{1:N} P_{k_1-1}\, y_{1:N}}{\gamma_0 + y'_{1:N} P_{k_1}\, y_{1:N}} \right)^{\frac{N + 2\upsilon_0}{2}} \frac{1}{k_1 (1 + \delta^2)^{1/2}}. $$

As a result, we can use lemmas 1 and 2 to obtain $\varepsilon$ and $M_1$ such that

$$ r^{-1}_{\mathrm{death}} \leq \left( \frac{\gamma_0 + y'_{1:N}\, y_{1:N}}{\varepsilon} \right)^{\frac{N + 2\upsilon_0}{2}} \frac{1}{k_1 (1 + \delta^2)^{1/2}} < M_1 < +\infty. $$

Thus, there exists $M_1$ sufficiently large such that for any $M_2$ sufficiently large (from equation 4.5), $\delta^2 \in \mathbb{R}^+$, $1 \leq k_1 \leq k_{\max}$, and $\mu_{1:k_1} \in \Theta_{k_1}$,

$$ K_1(\Lambda, \delta^2, k_1, \mu_{1:k_1}; \Lambda, \delta^2, k_1 - 1, d\mu_{1:k_1-1}) \geq \mathbb{I}_{\{\Lambda : \Lambda < M_2\}}(\Lambda)\, \frac{c^\star}{M_2 M_1 k_1}\, dS_{\mu_{1:k_1}}(d\mu_{1:k_1-1}). \tag{B.4} $$
Lemma 4. The transition kernel $K$ satisfies the following inequality for $k = 0$:

$$ K(\Lambda_0, \delta^2_0, 0, \mu_0; d\Lambda^*_0, d\delta^{*2}_0, 0, d\mu_0) \geq f\, Q(\delta^{*2}_0 \mid 0)\, p(\Lambda^*_0 \mid 0)\, d\delta^{*2}_0\, d\Lambda^*_0, \tag{B.5} $$

with $f > 0$ and $Q$ a probability density.

Proof. From the definition of the transition kernel $K$, we have:⁷

$$ K(\Lambda_0, \delta^2_0, 0, \mu_0; d\Lambda^*_0, d\delta^{*2}_0, 0, d\mu_0) \geq u_0\, p(\delta^{*2}_0 \mid \delta^2_0, 0, \mu_0, x, y)\, p(\Lambda^*_0 \mid 0)\, d\delta^{*2}_0\, d\Lambda^*_0 \geq (1 - c^\star)\, p(\delta^{*2}_0 \mid \delta^2_0, 0, \mu_0, x, y)\, p(\Lambda^*_0 \mid 0)\, d\delta^{*2}_0\, d\Lambda^*_0, $$

as $0 < 1 - c^\star \leq u_0 \leq 1$, and we adopt the notation $Q(\delta^{*2} \mid 0) \triangleq p(\delta^{*2} \mid \delta^2, 0, \mu_0, x, y)$.

⁷ When $k = 0$, we keep for notational convenience the same notation for the transition kernel even if $\mu_0$ does not exist.
Lemma 5. There exist a constant $\xi > 0$ and a probability density $Q$ such that for all $\delta^2 \in \mathbb{R}^+$, $0 \leq k \leq k_{\max}$, and $\mu_{1:k} \in \Theta_k$, one obtains:

$$ p(\delta^{*2} \mid \delta^2, k, \mu_{1:k}, x, y) \geq \xi\, Q(\delta^{*2} \mid k). \tag{B.6} $$
Proof. From section 4.1.2, to update $\delta^2$ at each iteration, one draws from the distribution $p(\alpha_{1:m}, \sigma^2 \mid \delta^2, k, \mu_{1:k}, x, y)$; that is, one draws $\sigma^2$ from

$$ p(\sigma^2 \mid \delta^2, k, \mu_{1:k}, x, y) = \frac{\left( \frac{\gamma_0 + y'_{1:N} P_k\, y_{1:N}}{2} \right)^{\frac{N + \upsilon_0}{2}}}{\Gamma\!\left( \frac{N + \upsilon_0}{2} \right)}\, (\sigma^2)^{-\left( \frac{N + \upsilon_0}{2} + 1 \right)} \exp\left( \frac{-1}{2\sigma^2}\, \bigl( \gamma_0 + y'_{1:N} P_k\, y_{1:N} \bigr) \right), $$

then $\alpha_{1:m}$ from

$$ p(\alpha_{1:m} \mid \delta^2, k, \mu_{1:k}, \sigma^2, x, y) = |2\pi\sigma^2 M_k|^{-1/2} \exp\left( \frac{-1}{2\sigma^2}\, (\alpha_{1:m} - h_k)' M_k^{-1} (\alpha_{1:m} - h_k) \right), $$

and finally one draws $\delta^{*2}$ according to

$$ p(\delta^{*2} \mid \delta^2, k, \theta, x, y) = \frac{\left( \frac{\alpha'_{1:m} D'(\mu_{1:k}, x) D(\mu_{1:k}, x)\, \alpha_{1:m}}{2\sigma^2} + b_{\delta^2} \right)^{m/2 + a_{\delta^2}}}{\Gamma(m/2 + a_{\delta^2})\, (\delta^{*2})^{m/2 + a_{\delta^2} + 1}} \exp\left( \frac{-1}{\delta^{*2}} \left( \frac{\alpha'_{1:m} D'(\mu_{1:k}, x) D(\mu_{1:k}, x)\, \alpha_{1:m}}{2\sigma^2} + b_{\delta^2} \right) \right). $$

Consequently:

$$ \begin{aligned} & p(\delta^{*2} \mid \delta^2, k, \theta, x, y)\, p(\alpha_{1:m}, \sigma^2 \mid \delta^2, k, \mu_{1:k}, x, y) \\ & \quad = \frac{\left( \frac{\gamma_0 + y'_{1:N} P_k\, y_{1:N}}{2} \right)^{\frac{N + \upsilon_0}{2}} \left( \frac{\alpha'_{1:m} D'(\mu_{1:k}, x) D(\mu_{1:k}, x)\, \alpha_{1:m}}{2\sigma^2} + b_{\delta^2} \right)^{m/2 + a_{\delta^2}}}{\Gamma\!\left( \frac{N + \upsilon_0}{2} \right) \Gamma(m/2 + a_{\delta^2})\, (2\pi)^{m/2}\, (\sigma^2)^{(N + \upsilon_0 + m)/2 + 1}\, (\delta^{*2})^{m/2 + a_{\delta^2} + 1}\, |M_k|^{1/2}} \\ & \qquad \times \exp\left( \frac{-1}{2\sigma^2} \left[ (\alpha_{1:m} - h_k)' M_k^{-1} (\alpha_{1:m} - h_k) + \gamma_0 + y'_{1:N} P_k\, y_{1:N} + \frac{\alpha'_{1:m} D'(\mu_{1:k}, x) D(\mu_{1:k}, x)\, \alpha_{1:m}}{\delta^{*2}} \right] - \frac{b_{\delta^2}}{\delta^{*2}} \right). \end{aligned} $$
We can obtain the minorization condition, given by equation B.6, by integrating with respect to the nuisance parameters $\alpha_{1:m}$ and $\sigma^2$. To accomplish this, we need to perform some algebraic manipulations to obtain the following relation,

$$ (\alpha_{1:m} - h_k)' M_k^{-1} (\alpha_{1:m} - h_k) + \gamma_0 + y'_{1:N} P_k\, y_{1:N} + \frac{\alpha'_{1:m} D'(\mu_{1:k}, x) D(\mu_{1:k}, x)\, \alpha_{1:m}}{\delta^{*2}} = (\alpha_{1:m} - \bar{h}_k)' \bar{M}_k^{-1} (\alpha_{1:m} - \bar{h}_k) + \gamma_0 + y'_{1:N} \bar{P}_k\, y_{1:N}, $$

with:

$$ \bar{M}_k^{-1} = \left( 1 + \frac{1}{\delta^2} + \frac{1}{\delta^{*2}} \right) D'(\mu_{1:k}, x)\, D(\mu_{1:k}, x), \qquad \bar{h}_k = \bar{M}_k\, D'(\mu_{1:k}, x)\, y_{1:N}, \qquad \bar{P}_k = I_N - D(\mu_{1:k}, x)\, \bar{M}_k\, D'(\mu_{1:k}, x), $$

where the bars distinguish these quantities from $M_k$, $h_k$, and $P_k$.
We can now integrate with respect to $\alpha_{1:m}$ (gaussian distribution) and $\sigma^2$ (inverse gamma distribution) to obtain the minorization condition for $\delta^{*2}$:

$$p(\delta^{*2} \mid \delta^2, k, \mu_{1:k}, x, y) \ge \int_{\mathbb{R}^m \times \mathbb{R}^+} \frac{\left(\frac{c_0 + y_{1:N}' P_k y_{1:N}}{2}\right)^{\frac{N+\upsilon_0}{2}} (b_{\delta^2})^{m/2 + a_{\delta^2}}}{\Gamma\!\left(\frac{N+\upsilon_0}{2}\right) \Gamma(m/2 + a_{\delta^2})\, (2\pi)^{m/2}\, (\sigma^2)^{(N+\upsilon_0+m)/2+1}\, (\delta^{*2})^{m/2+a_{\delta^2}+1}\, |M_k|^{1/2}}$$
$$\times \exp\!\left(-\frac{(\alpha_{1:m} - \tilde h_k)' \tilde M_k^{-1} (\alpha_{1:m} - \tilde h_k) + c_0 + y_{1:N}' \tilde P_k y_{1:N}}{2\sigma^2} - \frac{b_{\delta^2}}{\delta^{*2}}\right) d\alpha_{1:m}\, d\sigma^2$$

$$= \frac{|\tilde M_k|^{1/2}}{|M_k|^{1/2}} \left(\frac{c_0 + y_{1:N}' P_k y_{1:N}}{c_0 + y_{1:N}' \tilde P_k y_{1:N}}\right)^{\frac{N+\upsilon_0}{2}} \frac{(b_{\delta^2})^{m/2 + a_{\delta^2}}}{\Gamma(m/2 + a_{\delta^2})}\, \frac{\exp\!\left(-\frac{b_{\delta^2}}{\delta^{*2}}\right)}{(\delta^{*2})^{m/2 + a_{\delta^2} + 1}}$$

$$\ge \left(1 + \frac{1}{\delta^{*2}}\right)^{-m/2} \left(\frac{c_0}{c_0 + y_{1:N}' y_{1:N}}\right)^{\frac{N+\upsilon_0}{2}} \frac{(b_{\delta^2})^{m/2 + a_{\delta^2}}}{\Gamma(m/2 + a_{\delta^2})}\, \frac{\exp\!\left(-\frac{b_{\delta^2}}{\delta^{*2}}\right)}{(\delta^{*2})^{m/2 + a_{\delta^2} + 1}}$$

$$\ge \frac{c_0^{\frac{N+\upsilon_0}{2}}\, \min_{k \in \{0,\ldots,k_{\max}\}} (b_{\delta^2})^{m/2 + a_{\delta^2}}}{(c_0 + y_{1:N}' y_{1:N})^{\frac{N+\upsilon_0}{2}}\, \Gamma\!\left(\frac{k_{\max}+d+1}{2} + a_{\delta^2}\right)} \left(1 + \frac{1}{\delta^{*2}}\right)^{-\frac{k_{\max}+d+1}{2}} \frac{\exp\!\left(-\frac{b_{\delta^2}}{\delta^{*2}}\right)}{(\delta^{*2})^{\frac{k_{\max}+d+1}{2}+a_{\delta^2}+1}},$$
where we have made use of lemma 1, its corollary, and lemma 2.

Proposition 1. For any $M_2$ sufficiently large, there exists an $\eta_{M_2} > 0$ such that for all $((\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1}), (\Lambda_{k_2}, \delta_{k_2}^2, k_2, \mu_{1:k_2})) \in (\mathbb{R}^{+2} \times \Theta)^2$:

$$K^{(k_{\max})}(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1};\, d\Lambda_{k_2}, d\delta_{k_2}^2, k_2, d\mu_{1:k_2}) \ge \mathbb{I}_{\{\Lambda : \Lambda < M_2\}}(\Lambda_{k_1})\, \eta_{M_2}\, \varphi(d\Lambda_{k_2}, d\delta_{k_2}^2, k_2, d\mu_{1:k_2}),$$

where

$$\varphi(d\Lambda, d\delta^2, k, d\mu_{1:k}) \triangleq p(\Lambda \mid k)\, d\Lambda\, Q(\delta^2 \mid k)\, d\delta^2\, \mathbb{I}_{\{0\}}(k)\, \delta_{\{\mu_0\}}(d\mu_{1:k}).$$

Proof.
From lemmas 3 and 5, one obtains for $k_1 = 1, \ldots, k_{\max}$:

$$K(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1};\, d\Lambda_{k_1-1}, d\delta_{k_1-1}^2, k_1 - 1, d\mu_{1:k_1-1})$$
$$\ge \mathbb{I}_{\{\Lambda : \Lambda < M_2\}}(\Lambda_{k_1})\, \frac{c^\star \xi}{M_2 M_1 k_1}\, p(\Lambda_{k_1-1} \mid k_1 - 1)\, d\Lambda_{k_1-1}\, Q(\delta_{k_1-1}^2 \mid k_1 - 1)\, d\delta_{k_1-1}^2\, \delta_{S_{\mu_{1:k_1}}}(d\mu_{1:k_1-1}).$$

Consequently, for $k_1 = 1, \ldots, k_{\max}$, when one iterates the kernel $k_{\max}$ times, the resulting transition kernel, denoted $K^{(k_{\max})}$, satisfies:

$$K^{(k_{\max})}(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1};\, d\Lambda_0^*, d\delta_0^{*2}, 0, d\mu_0^*)$$
$$= \int_{\mathbb{R}^+ \times \mathbb{R}^+ \times \Theta} K^{(k_1)}(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1};\, d\Lambda_l, d\delta_l^2, l, d\mu_{1:l})\, K^{(k_{\max}-k_1)}(\Lambda_l, \delta_l^2, l, \mu_{1:l};\, d\Lambda_0^*, d\delta_0^{*2}, 0, d\mu_0^*)$$
$$\ge \int_{\mathbb{R}^+ \times \mathbb{R}^+ \times \{0\}} K^{(k_1)}(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1};\, d\Lambda_0, d\delta_0^2, 0, d\mu_0)\, K^{(k_{\max}-k_1)}(\Lambda_0, \delta_0^2, 0, \mu_0;\, d\Lambda_0^*, d\delta_0^{*2}, 0, d\mu_0^*)$$
$$\ge \mathbb{I}_{\{\Lambda : \Lambda < M_2\}}(\Lambda_{k_1})\, M_3^{k_1-1} \left(\frac{\xi c^\star}{M_1 M_2}\right)^{k_1} \frac{1}{k_1!}\, \varsigma^{k_{\max}-k_1}\, \varphi(d\Lambda_0^*, d\delta_0^{*2}, 0, d\mu_0^*),$$

where we have used lemma 4 and $M_3 \triangleq \min_{k=1,\ldots,k_{\max}} \int_{\{\Lambda : \Lambda < M_2\}} p(\Lambda \mid k)\, d\Lambda > 0$. The conclusion follows with

$$\eta_{M_2} \triangleq \min\left\{\varsigma^{k_{\max}},\ \min_{k \in \{1,\ldots,k_{\max}\}} M_3^{k-1} \left(\frac{\xi c^\star}{M_1 M_2}\right)^{k} \frac{1}{k!}\, \varsigma^{k_{\max}-k}\right\} > 0.$$
Corollary 3. The transition kernel $K$ is $\varphi$-irreducible. In addition, we know that $p(d\Lambda, d\delta^2, k, d\mu_{1:k} \mid x, y)$ is an invariant distribution of $K$ and the Markov chain is $\varphi$-irreducible; then, according to Tierney (1994, theorem 1*), the Markov chain is $p(d\Lambda, d\delta^2, k, d\mu_{1:k} \mid x, y)$-irreducible. Aperiodicity is straightforward: indeed, there is a nonzero probability of choosing the update move in the empty configuration from equation B.5 and of moving anywhere in $\mathbb{R}^{+2} \times \{0\} \times \{\mu_0\}$. Therefore, the Markov chain admits $p(d\Lambda, d\delta^2, k, d\mu_{1:k} \mid x, y)$ as unique equilibrium distribution (Tierney, 1994, theorem 1*).
We will now prove the drift condition.

Proposition 2. Let $V(\Lambda, \delta^2, k, \mu_{1:k}) \triangleq \max\{1, \Lambda^u\}$ for $u > 0$; then

$$\lim_{\Lambda \to +\infty} \frac{K V(\Lambda, \delta^2, k, \mu_{1:k})}{V(\Lambda, \delta^2, k, \mu_{1:k})} = 0,$$

where, by definition,

$$K V(\Lambda, \delta^2, k, \mu_{1:k}) \triangleq \int_{\mathbb{R}^+ \times \mathbb{R}^+ \times \Theta} K(\Lambda, \delta^2, k, \mu_{1:k};\, d\Lambda^*, d\delta^{*2}, k^*, d\mu_{1:k^*}^*)\, V(\Lambda^*, \delta^{*2}, k^*, \mu_{1:k^*}^*).$$

Proof. The transition kernel of the Markov chain is of the form (we remove some arguments for convenience):

$$K = \left(b_{k_1} K_{\text{birth}} + d_{k_1} K_{\text{death}} + m_{k_1} K_{\text{merge}} + s_{k_1} K_{\text{split}} + (1 - b_{k_1} - d_{k_1} - s_{k_1} - m_{k_1}) K_{\text{update}}\right)$$
$$\times\, p(\delta_{k_2}^2 \mid \delta_{k_1}^2, k_2, \mu_{1:k_2}, x, y)\, p(\Lambda_{k_2} \mid k_2).$$
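The mixture form of the kernel corresponds to the usual reversible-jump move selection: at each iteration one of the birth, death, split, merge, or update moves is chosen at random. The sketch below uses illustrative move probabilities; the paper's actual $b_k$, $d_k$, $s_k$, $m_k$ are defined elsewhere and are not reproduced here.

```python
import random

random.seed(0)

k_max = 10

def move_probabilities(k):
    """Illustrative move probabilities (not the paper's b_k, d_k, s_k, m_k):
    birth/split only possible if k < k_max, death/merge only if k > 0."""
    b = s = 0.2 if k < k_max else 0.0
    d = m = 0.2 if k > 0 else 0.0
    return b, d, s, m

def select_move(k):
    b, d, s, m = move_probabilities(k)
    u = random.random()
    if u < b:
        return "birth"
    elif u < b + d:
        return "death"
    elif u < b + d + s:
        return "split"
    elif u < b + d + s + m:
        return "merge"
    else:
        return "update"  # probability 1 - b - d - s - m

# In the empty configuration (k = 0), only birth, split, or update can occur,
# which is the fact used in the aperiodicity argument of corollary 3.
moves_at_0 = {select_move(0) for _ in range(1000)}
print(moves_at_0 <= {"birth", "split", "update"})  # → True
```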
Now, study the following expression:

$$K V(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1})$$
$$= b_{k_1} \sum_{k_2 \in \{k_1, k_1+1\}} \int_{\Theta_{k_2}} K_{\text{birth}} \int_{\mathbb{R}^+} p(\delta_{k_2}^2 \mid \delta_{k_1}^2, k_2, \mu_{1:k_2}, x, y)\, d\delta_{k_2}^2 \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2}$$
$$+ d_{k_1} \sum_{k_2 \in \{k_1, k_1-1\}} \int_{\Theta_{k_2}} K_{\text{death}} \int_{\mathbb{R}^+} p(\delta_{k_2}^2 \mid \delta_{k_1}^2, k_2, \mu_{1:k_2}, x, y)\, d\delta_{k_2}^2 \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2}$$
$$+ s_{k_1} \sum_{k_2 \in \{k_1, k_1+1\}} \int_{\Theta_{k_2}} K_{\text{split}} \int_{\mathbb{R}^+} p(\delta_{k_2}^2 \mid \delta_{k_1}^2, k_2, \mu_{1:k_2}, x, y)\, d\delta_{k_2}^2 \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2}$$
$$+ m_{k_1} \sum_{k_2 \in \{k_1, k_1-1\}} \int_{\Theta_{k_2}} K_{\text{merge}} \int_{\mathbb{R}^+} p(\delta_{k_2}^2 \mid \delta_{k_1}^2, k_2, \mu_{1:k_2}, x, y)\, d\delta_{k_2}^2 \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2}$$
$$+ (1 - b_{k_1} - d_{k_1} - s_{k_1} - m_{k_1}) \int_{\Theta_{k_1}} K_{\text{update}} \int_{\mathbb{R}^+} p(\delta_{k_1}^{*2} \mid \delta_{k_1}^2, k_1, \mu_{1:k_1}^*, x, y)\, d\delta_{k_1}^{*2} \int_{\mathbb{R}^+} p(\Lambda_{k_1}^* \mid k_1)\, \Lambda_{k_1}^{*u}\, d\Lambda_{k_1}^*$$

$$= b_{k_1} \sum_{k_2 \in \{k_1, k_1+1\}} \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2} + d_{k_1} \sum_{k_2 \in \{k_1, k_1-1\}} \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2}$$
$$+ s_{k_1} \sum_{k_2 \in \{k_1, k_1+1\}} \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2} + m_{k_1} \sum_{k_2 \in \{k_1, k_1-1\}} \int_{\mathbb{R}^+} p(\Lambda_{k_2} \mid k_2)\, \Lambda_{k_2}^u\, d\Lambda_{k_2}$$
$$+ (1 - b_{k_1} - d_{k_1} - s_{k_1} - m_{k_1}) \int_{\mathbb{R}^+} p(\Lambda_{k_1}^* \mid k_1)\, \Lambda_{k_1}^{*u}\, d\Lambda_{k_1}^*.$$

As $p(\Lambda \mid k)$ is a gamma distribution, for any $0 \le k \le k_{\max}$ one obtains the inequality $\int_{\mathbb{R}^+} p(\Lambda \mid k)\, \Lambda^u\, d\Lambda < +\infty$, and then the result follows immediately.

Proof of Theorem 3. By construction, the transition kernel $K(\Lambda_{k_1}, \delta_{k_1}^2, k_1, \mu_{1:k_1};\, d\Lambda_{k_2}, d\delta_{k_2}^2, k_2, d\mu_{1:k_2})$ admits the probability distribution $p(d\Lambda, d\delta^2, k, d\mu_{1:k} \mid x, y)$ as invariant distribution. Proposition 1 proved the $\varphi$-irreducibility and the minorization condition with $k_0 = k_{\max}$, and proposition 2 proved the drift condition. Thus, theorem 3 applies.
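The fact used at the end of the proof is that a gamma distribution has finite moments of every positive order: if $\Lambda \sim \mathcal{G}a(a, b)$ with rate $b$, then $E[\Lambda^u] = \Gamma(a+u) / (\Gamma(a)\, b^u) < +\infty$ for every $u > 0$. A quick check with illustrative hyperparameters (the values of $a$ and $b$ below are placeholders, not the paper's prior settings):

```python
import math

def gamma_u_moment(a, b, u):
    """E[Lambda^u] for Lambda ~ Gamma(shape=a, rate=b): Gamma(a+u) / (Gamma(a) * b^u)."""
    return math.gamma(a + u) / (math.gamma(a) * b ** u)

a, b = 2.0, 0.5  # illustrative prior hyperparameters
for u in (0.5, 1.0, 2.5):
    assert math.isfinite(gamma_u_moment(a, b, u))

# Sanity check: u = 1 recovers the mean a / b.
print(abs(gamma_u_moment(a, b, 1.0) - a / b) < 1e-12)  # → True
```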
Acknowledgments
We thank Mark Coates, Bill Fitzgerald, and Jaco Vermaak (Cambridge University), Chris Holmes (Imperial College of London), David Lowe (Aston University), David Melvin (Cambridge Clinical School), Stephen Roberts, and Will Penny (University of Oxford) for very useful discussions and comments.
References

Akaike, H. (1974). A new look at statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–723.
Andrieu, C. (1998). MCMC methods for the Bayesian analysis of nonlinear parametric regression models. Unpublished doctoral dissertation, University Cergy-Pontoise, Paris.
Andrieu, C., Breyer, L. A., & Doucet, A. (1999). Convergence of simulated annealing using Foster-Lyapunov criteria (Tech. Rep. No. CUED/F-INFENG/TR 346). Cambridge: Cambridge University Engineering Department.
Andrieu, C., de Freitas, J. F. G., & Doucet, A. (1999a). Sequential Bayesian estimation and model selection applied to neural networks (Tech. Rep. No. CUED/F-INFENG/TR 341). Cambridge: Cambridge University Engineering Department.
Andrieu, C., de Freitas, J. F. G., & Doucet, A. (1999b). Sequential MCMC for Bayesian model selection. In IEEE Higher Order Statistics Workshop (pp. 130–134), Caesarea, Israel.
Andrieu, C., Djurić, P. M., & Doucet, A. (in press). Model selection by MCMC computation. Signal Processing.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Brass, A., Pendleton, B. J., Chen, Y., & Robson, B. (1993). Hybrid Monte Carlo simulations theory and initial comparison with molecular dynamics. Biopolymers, 33(8), 1307–1315.
Buntine, W. L., & Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.
Cheng, B., & Titterington, D. M. (1994). Neural networks: A review from a statistical perspective. Statistical Science, 9(1), 2–54.
de Freitas, J. F. G. (1999). Bayesian methods for neural networks. Unpublished doctoral dissertation, Cambridge University, Cambridge, UK.
de Freitas, J. F. G., Niranjan, M., & Gee, A. H. (1998). The EM algorithm and neural networks for nonlinear state space estimation (Tech. Rep. No. CUED/F-INFENG/TR 313). Cambridge: Cambridge University Engineering Department.
de Freitas, J. F. G., Niranjan, M., Gee, A. H., & Doucet, A. (2000). Sequential Monte Carlo methods to train neural network models. Neural Computation, 12(4), 955–993.
Denison, D., Mallick, B. K., & Smith, A. F. M. (1998). Bayesian MARS. Statistics and Computing, 8, 337–346.
Djurić, P. M. (1996). A model selection rule for sinusoids in white gaussian noise. IEEE Transactions on Signal Processing, 44(7), 1744–1751.
Djurić, P. M. (1998). Asymptotic MAP criteria for model selection. IEEE Transactions on Signal Processing, 46(10), 2726–2735.
Fahlman, S. E., & Lebiere, C. (1988). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Proceedings of the Connectionist Models Summer School (Vol. 2, pp. 524–532). San Mateo, CA.
Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2(2), 198–209.
Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society B, 56(3), 501–514.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.
George, E. I., & Foster, D. P. (1997). Calibration and empirical Bayes variable selection. Unpublished manuscript, Department of Management Science and Information Systems, University of Texas.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. London: Chapman and Hall.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2), 219–269.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Hand, D. J. (1997). Construction and assessment of classification rules. New York: Wiley.
Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. W. de Bakker, A. J. Nijman, & P. C. Treleaven (Eds.), Proceedings of the Conference on Parallel Architectures and Languages (pp. 1–13). Berlin.
Holmes, C. C., & Mallick, B. K. (1998). Bayesian radial basis functions of variable dimension. Neural Computation, 10(5), 1217–1233.
Holmes, C. C., & Mallick, B. K. (1999). Generalised nonlinear modelling with multivariate smoothing splines. Unpublished manuscript, Statistics Section, Department of Mathematics, Imperial College of London.
Holmes, C. C., & Mallick, B. K. (2000). Bayesian wavelet networks for nonparametric regression. IEEE Transactions on Neural Networks, 11(1), 27–35.
Kadirkamanathan, V., & Niranjan, M. (1993). A function estimation approach to sequential learning with neural networks. Neural Computation, 5(6), 954–975.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Mackay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
Marrs, A. D. (1998). An application of reversible-jump MCMC to multivariate spherical Gaussian mixtures. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 577–583). Cambridge, MA: MIT Press.
Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. New York: Springer-Verlag.
Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In G. Hinton, T. Sejnowski, & D. Touretzsky (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 133–143). Palo Alto, CA.
Müller, P., & Rios Insua, D. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 571–592.
Nabney, I. T. (1999). Efficient training of RBF networks for classification (Tech. Rep. No. NCRG/99/002). Birmingham, UK: Neural Computing Research Group, Aston University.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Platt, J. (1991). A resource allocating network for function interpolation. Neural Computation, 3, 213–225.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59(4), 731–792.
Rios Insua, D., & Müller, P. (1998). Feedforward neural networks for nonparametric regression. In D. K. Dey, P. Müller, & D. Sinha (Eds.), Practical nonparametric and semiparametric Bayesian statistics (pp. 181–191). New York: Springer-Verlag.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society, 49, 223–239.
Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. New York: Springer-Verlag.
Roberts, S. J., & Penny, W. D. (1998). Bayesian neural networks for classification: How useful is the evidence framework? Neural Networks, 12, 877–892.
Roberts, S. J., Penny, W. D., & Pillot, D. (1996). Novelty, confidence and errors in connectionist systems. IEE Colloquium on Intelligent Sensors and Fault Detection, no. 261, 10/1–10/6.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Smith, M., & Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75(2), 317–343.
Spyers-Ashby, J. M., Bain, P., & Roberts, S. J. (1998). A comparison of fast Fourier transform (FFT) and autoregressive (AR) spectral estimation techniques for the analysis of tremor data. Journal of Neuroscience Methods, 83(1), 35–43.
Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4), 1701–1762.
Van Laarhoven, P. J., & Aarts, E. H. L. (1987). Simulated annealing: Theory and applications. Amsterdam: Reidel.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yingwei, L., Sundararajan, N., & Saratchandran, P. (1997). A sequential learning scheme for function approximation using minimal radial basis function neural networks. Neural Computation, 9, 461–478.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. Goel & A. Zellner (Eds.), Bayesian inference and decision techniques (pp. 233–243). Amsterdam: Elsevier.
Received August 25, 1999; accepted November 17, 2000.
Neural Computation, Volume 13, Number 11

CONTENTS

Article
Predictability, Complexity, and Learning
William Bialek, Ilya Nemenman, and Naftali Tishby .... 2409

Note
Dendritic Subunits Determined by Dendritic Morphology
K. A. Lindsay, J. M. Ogden, and J. R. Rosenberg .... 2465

Letters
Computing the Optimally Fitted Spike Train for a Synapse
Thomas Natschläger and Wolfgang Maass .... 2477

Period Focusing Induced by Network Feedback in Populations of Noisy Integrate-and-Fire Neurons
Francisco B. Rodríguez, Alberto Suárez, and Vicente López .... 2495

A Variational Method for Learning Sparse and Overcomplete Representations
Mark Girolami .... 2517

Random Embedding Machines for Pattern Recognition
Yoram Baram .... 2533

Manifold Stochastic Dynamics for Bayesian Learning
Mark Zlochin and Yoram Baram .... 2549

Resampling Method for Unsupervised Estimation of Cluster Validity
Erel Levine and Eytan Domany .... 2573

The Whitney Reduction Network: A Method for Computing Autoassociative Graphs
D. S. Broomhead and M. J. Kirby .... 2595

Enhanced 3D Shape Recovery Using the Neural-Based Hybrid Reflectance Model
Siu-Yeung Cho and Tommy W. S. Chow .... 2617
ARTICLE

Communicated by Jean-Pierre Nadal

Predictability, Complexity, and Learning

William Bialek
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Ilya Nemenman
NEC Research Institute, Princeton, New Jersey 08540, U.S.A., and Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A.

Naftali Tishby
NEC Research Institute, Princeton, NJ 08540, U.S.A., and School of Computer Science and Engineering and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

We define predictive information Ipred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: Ipred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then Ipred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of Ipred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.

1 Introduction
There is an obvious interest in having practical algorithms for predicting the future, and there is a correspondingly large literature on the problem of time-series extrapolation.1 But prediction is both more and less than extrapolation. We might be able to predict, for example, the chance of rain in the coming week even if we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of its thermodynamic origins, information theory (Shannon, 1948) characterizes the potentialities and limitations of all possible prediction algorithms, as well as unifying the analysis of extrapolation with the more general notion of predictability. Specifically, we can define a quantity—the predictive information—that measures how much our observations of the past can tell us about the future. The predictive information characterizes the world we are observing, and we shall see that this characterization is close to our intuition about the complexity of the underlying dynamics.

Prediction is one of the fundamental problems in neural computation. Much of what we admire in expert human performance is predictive in character: the point guard who passes the basketball to a place where his teammate will arrive in a split second, the chess master who knows how moves made now will influence the end game two hours hence, the investor who buys a stock in anticipation that it will grow in the year to come. More generally, we gather sensory information not for its own sake but in the hope that this information will guide our actions (including our verbal actions). But acting takes time, and sense data can guide us only to the extent that those data inform us about the state of the world at the time of our actions, so the only components of the incoming data that have a chance of being useful are those that are predictive. Put bluntly, nonpredictive information is useless to the organism, and it therefore makes sense to isolate the predictive information. It will turn out that most of the information we collect over a long period of time is nonpredictive, so that isolating the predictive information must go a long way toward separating out those features of the sensory world that are relevant for behavior.

One of the most important examples of prediction is the phenomenon of generalization in learning. Learning is formalized as finding a model that explains or describes a set of observations, but again this is useful only because we expect this model will continue to be valid. In the language of learning theory (see, for example, Vapnik, 1998), an animal can gain selective advantage not from its performance on the training data but only from its performance at generalization. Generalizing—and not "overfitting" the training data—is precisely the problem of isolating those features of the data that have predictive value (see also Bialek and Tishby, in preparation). Furthermore, we know that the success of generalization hinges on controlling the complexity of the models that we are willing to consider as possibilities.

1 The classic papers are by Kolmogoroff (1939, 1941) and Wiener (1949), who essentially solved all the extrapolation problems that could be solved by linear methods. Our understanding of predictability was changed by developments in dynamical systems, which showed that apparently random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous exploration of nonlinear extrapolation algorithms (Abarbanel et al., 1993). For a review comparing different approaches, see the conference proceedings edited by Weigend and Gershenfeld (1994).

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2409–2463 (2001)
Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus, the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity. The notion of complexity arises not only in learning theory, but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent versus laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses versus ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own (for an overview, see Bennett, 1990). In this context several authors have considered complexity measures based on entropy or mutual information, although, as far as we know, no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics. An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A true random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li & Vitányi, 1993; Vitányi & Li, 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory, our intuitive notions of complexity correspond to the statements about complexity of the underlying process, and not directly to the description length or Kolmogorov complexity. Our central result is that the predictive information provides a general measure of complexity, which includes as special cases the relevant concepts from learning theory and dynamical systems.

While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) themselves without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model or, more precisely, the effective dimension of the model class, and this provides a link to known results of Rissanen and others. We also can quantify the complexity of processes that fall outside the conventional finite dimensional models, and we show that these more complex processes are characterized by a power law rather than a logarithmic divergence of the predictive information. By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power-law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams. The
power law or nonparametric class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class. Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series. With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.2 Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

2 A Curious Observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the article draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one-dimensional chain of Ising spins with the Hamiltonian given by

$$H = -\sum_{i,j} J_{ij} s_i s_j, \tag{2.1}$$

where the matrix of interactions $J_{ij}$ is not restricted to nearest neighbors; long-range interactions are also allowed. One may identify spins pointing upward with 1 and downward with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of N digits each, $W_k$, $k = 0, 1, \ldots, 2^N - 1$. There are $2^N$ such words total, and they appear with very different frequencies $n(W_k)$ in the spin chain (see

2 Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek, 1995; Bialek & Tishby, 1999). The central results of this work, however, were at best conjectures in these preliminary accounts.
[Figure: the 17-spin chain 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 1, with its overlapping four-digit words indicated; the word code runs from $W_0 = 0000$ through $W_{15} = 1111$.]

Figure 1: Calculating entropy of words of length 4 in a chain of 17 spins. For this chain, $n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_{12}) = n(W_{14}) = 2$, $n(W_8) = n(W_9) = 1$, and all other frequencies are zero. Thus, $S(4) \approx 2.95$ bits.
Figure 1 for details). If the number of spins is large, then counting these frequencies provides a good empirical estimate of $P_N(W_k)$, the probability distribution of different words of length N. Then one can calculate the entropy $S(N)$ of this probability distribution by the usual formula:

$$S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k) \quad (\text{bits}). \tag{2.2}$$
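Equation 2.2 can be evaluated by directly counting overlapping words. The sketch below reproduces the worked example of Figure 1 (the 17-spin chain, words of length N = 4):

```python
from collections import Counter
from math import log2

def word_entropy(chain, N):
    """S(N): entropy (in bits) of the empirical distribution of overlapping
    N-digit words in a binary string, per equation 2.2."""
    counts = Counter(chain[i:i + N] for i in range(len(chain) - N + 1))
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

chain = "00001110011100001"  # the 17-spin chain of Figure 1
print(round(word_entropy(chain, 4), 2))  # → 2.95
```

The 17 spins yield 14 overlapping words of length 4, with the frequencies listed in the caption of Figure 1, giving S(4) ≈ 2.95 bits.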
Note that this is not the entropy of a finite chain with length N; instead, it is the entropy of words or strings with length N drawn from a much longer chain. Nonetheless, since entropy is an extensive property, $S(N)$ is proportional asymptotically to N for any spin chain, that is, $S(N) \approx S_0 \cdot N$. The usual goal in statistical mechanics is to understand this "thermodynamic limit" $N \to \infty$, and hence to calculate the entropy density $S_0$. Different sets of interactions $J_{ij}$ result in different values of $S_0$, but the qualitative result $S(N) \propto N$ is true for all reasonable $\{J_{ij}\}$. We investigated three different spin chains of 1 billion spins each. As usual in statistical mechanics, the probability of any configuration of spins $\{s_i\}$ is given by the Boltzmann distribution,

$$P[\{s_i\}] \propto \exp(-H / k_B T), \tag{2.3}$$
where to normalize the scale of the $J_{ij}$ we set $k_B T = 1$. For the first chain, only $J_{i,i+1} = 1$ was nonzero, and its value was the same for all i. The second chain was also generated using the nearest-neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now interactions were long-ranged; the variances of coupling constants decreased with the distance between the spins as $\langle J_{ij}^2 \rangle = 1/(i-j)^2$. We plot $S(N)$ for all these cases in Figure 2, and, of course, the asymptotically linear
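A chain of the first type (fixed nearest-neighbor coupling) can be sampled exactly: at $k_B T = 1$, each spin agrees with its predecessor with probability $e^J / (e^J + e^{-J})$. The authors do not describe their sampler, so the construction below is only one standard possibility, with illustrative sizes:

```python
import math
import random

def sample_chain(n_spins, J, seed=0):
    """Sequentially sample a 1D nearest-neighbor Ising chain at k_B T = 1:
    P(s_{i+1} = s_i) = e^J / (e^J + e^{-J})."""
    random.seed(seed)
    p_same = math.exp(J) / (math.exp(J) + math.exp(-J))
    spins = [random.randint(0, 1)]
    for _ in range(n_spins - 1):
        spins.append(spins[-1] if random.random() < p_same else 1 - spins[-1])
    return "".join(map(str, spins))

chain = sample_chain(100_000, J=1.0)

# With J = 1, neighbors should agree with probability e/(e + 1/e) ≈ 0.88.
agree = sum(chain[i] == chain[i + 1] for i in range(len(chain) - 1)) / (len(chain) - 1)
print(len(chain), round(agree, 3))
```

Feeding such a chain to the word-entropy estimate of equation 2.2 reproduces the fixed-J curve of Figure 2 qualitatively; the 10⁹-spin chains of the paper simply reduce the estimation error.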
[Figure: $S(N)$ versus N for the three chains. Legend: fixed J; variable J, short range interactions; variable J's, long range decaying interactions.]

Figure 2: Entropy as a function of the word length for spin chains with different interactions. Notice that all lines start from $S(N) = \log_2 2 = 1$ since at the values of the coupling we investigated, the correlation length is much smaller than the chain length ($1 \cdot 10^9$ spins).
behavior is evident—the extensive entropy shows no qualitative distinction among the three cases we consider. The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write $S(N) = S_0 \cdot N + S_1(N)$, and plot only the sublinear component $S_1(N)$ of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, $S_1$ is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior. What is the significance of these observations? Of course, the differences in the behavior of $S_1(N)$ must be related to the ways we chose J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the N+1st spin, all that really matters is the state of the spin $s_N$; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case, there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger
[Figure: subextensive entropy $S_1(N)$ versus N for the three chains, with fits $S_1 = \text{const}$, $S_1 = \text{const} + \frac{1}{2} \log N$, and $S_1 = \text{const}_1 + \text{const}_2\, N^{0.5}$. Legend: fixed J; variable J, short range interactions; variable J's, long range decaying interactions; fits.]

Figure 3: Subextensive part of the entropy as a function of the word length.
distances. So, intuitively, the qualitatively different behaviors of $S_1(N)$ in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the Introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution: the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.
2416
W. Bialek, I. Nemenman, and N. Tishby
Imagine that we observe a stream of data x(t) over a time interval −T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future | x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

    I_pred(T, T') = ⟨ log_2 [ P(x_future | x_past) / P(x_future) ] ⟩        (3.1)
                  = −⟨log_2 P(x_future)⟩ − ⟨log_2 P(x_past)⟩
                    − [ −⟨log_2 P(x_future, x_past)⟩ ],        (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write −⟨log_2 P(x_past)⟩ = S(T), and by the same argument −⟨log_2 P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that −⟨log_2 P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain
    I_pred(T, T') = S(T) + S(T') − S(T + T').        (3.3)
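Equation 3.3 lends itself to a direct numerical check. The sketch below (a hypothetical illustration; the chain and its parameters are not from the paper) simulates a symmetric two-state Markov chain, estimates block entropies S(T) from word frequencies, and forms I_pred(T, T') = S(T) + S(T') − S(T + T'):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric two-state Markov chain with flip probability 0.1 (hypothetical choice)
p_flip = 0.1
n = 400_000
x = (np.cumsum(rng.random(n) < p_flip) % 2).astype(np.int8)

def block_entropy(x, T):
    """Empirical entropy S(T), in bits, of words of length T."""
    m = len(x) - T + 1
    codes = np.zeros(m, dtype=np.int64)
    for i in range(T):          # encode each length-T window as an integer
        codes = codes * 2 + x[i:i + m]
    p = np.unique(codes, return_counts=True)[1] / m
    return -np.sum(p * np.log2(p))

# Predictive information, eq. 3.3, with T = T' = 2 symbols
i_pred = 2 * block_entropy(x, 2) - block_entropy(x, 4)

# Exact value for a first-order Markov chain: 1 - H_2(p_flip) bits
h2 = -(p_flip * np.log2(p_flip) + (1 - p_flip) * np.log2(1 - p_flip))
```

For a first-order Markov chain S(T) = S(1) + (T − 1)[S(2) − S(1)], so any T, T' ≥ 1 give the same answer, and the estimate should agree with 1 − H_2(0.1) ≈ 0.53 bits to within sampling error.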
It is important to recall that mutual information is a symmetric quantity. Thus, we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T' or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem^3 and can be just as challenging

^3 This is the basis of the popular American television game show Jeopardy! widely viewed as the most "intellectual" of its genre.
as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

    lim_{T→∞} S(T)/T = S_0        (3.4)

exists, and for this to be true we must also have

    lim_{T→∞} S_1(T)/T = 0.        (3.5)

Thus, the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future:

    I_pred(T) = lim_{T'→∞} I_pred(T, T') = S_1(T).        (3.6)
Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound; not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the
future that will unfold from them. In some cases, this statement becomes even stronger. For example, if the subextensive entropy of a long, discontinuous observation of a total length T with a gap of a duration δT ≪ T is equal to S_1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future, but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

    lim_{T→∞} (Predictive information)/(Total information) = lim_{T→∞} I_pred(T)/S(T) = 0.        (3.7)
In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.^4

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1} and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

    ℓ(N) = −⟨log_2 P(x_{N+1} | x_1, x_2, ..., x_N)⟩ bits,        (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, ..., x_N, x_{N+1}). It is easy to see that

    ℓ(N) = S(N + 1) − S(N) ≈ ∂S(N)/∂N.        (3.9)
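The code-length interpretation of equations 3.8 and 3.9 can be illustrated on a simulated chain. The sketch below (hypothetical parameters, not from the paper) estimates ℓ(N) = S(N + 1) − S(N) for a symmetric two-state Markov chain; since such a chain has one-step memory, ℓ(N) drops after a single observed symbol to the entropy rate S_0 and stays there:

```python
import numpy as np

rng = np.random.default_rng(1)
p_flip = 0.2                      # hypothetical flip probability
n = 400_000
x = (np.cumsum(rng.random(n) < p_flip) % 2).astype(np.int8)

def block_entropy(x, T):
    """Empirical entropy S(T), in bits, of words of length T (S(0) = 0)."""
    if T == 0:
        return 0.0
    m = len(x) - T + 1
    codes = np.zeros(m, dtype=np.int64)
    for i in range(T):
        codes = codes * 2 + x[i:i + m]
    p = np.unique(codes, return_counts=True)[1] / m
    return -np.sum(p * np.log2(p))

# Code-word length, eq. 3.9: l(N) = S(N+1) - S(N), in bits per symbol
S = [block_entropy(x, T) for T in range(6)]
ell = [S[N + 1] - S[N] for N in range(5)]
```

Here ell[0] = S(1) ≈ 1 bit (no history seen yet), while ell[N] for N ≥ 1 sits at the entropy rate H_2(0.2) ≈ 0.72 bits: after one observed symbol there is nothing more to learn from deeper history.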
As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or

^4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.
costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, ℓ_ideal = lim_{N→∞} ℓ(N), but this is just another way of defining the extensive component of the entropy S_0. Thus, we can define a learning curve

    L(N) ≡ ℓ(N) − ℓ_ideal        (3.10)
         = S(N + 1) − S(N) − S_0
         = S_1(N + 1) − S_1(N) ≈ ∂S_1(N)/∂N = ∂I_pred(N)/∂N,        (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

    χ²(N) = (1/σ²) ⟨[y − f(x; α)]²⟩ → (2 ln 2) L(N) + 1,        (3.12)
where σ² is the variance of the noise. Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length ℓ(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986).
The universal learning curve L(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990) averaged over a class of allowed models also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write
the joint distribution of the N data points as a product,

    P(x_1, x_2, ..., x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) · · · .        (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n−1}, so that

    P(x_n | {x_{1≤i≤n−1}}) = P(x_n | x_{n−1}),        (3.14)

and hence the predictive information reduces to

    I_pred = ⟨ log_2 [ P(x_n | x_{n−1}) / P(x_n) ] ⟩.        (3.15)
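For a Markov process, equation 3.15 can be evaluated in closed form from the transition matrix alone: it is the entropy of the stationary distribution minus the average entropy of the transition probabilities. A sketch (the transition probabilities are a hypothetical example):

```python
import numpy as np

# Hypothetical two-state transition matrix; row s holds P(x_n | x_{n-1} = s)
P = np.array([[0.95, 0.05],
              [0.20, 0.80]])

def entropy(p):
    """Entropy in bits of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Stationary distribution: left eigenvector of P with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Eq. 3.15: I_pred = H(states) - <entropy of the transition probabilities>
i_pred = entropy(pi) - sum(pi[s] * entropy(P[s]) for s in range(len(P)))
```

Here π = (0.8, 0.2) and I_pred ≈ 0.35 bits, well below the bound log_2 2 = 1 bit; the gap is the average entropy of the transition probabilities, which is why chains with long memory (nearly deterministic transitions) come closer to the bound.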
The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems, the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^a, as for the variable J_ij long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.
4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

    y_n = f(x_n; α) + η_n,        (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

    f(x; α) = Σ_{μ=1}^{K} α_μ φ_μ(x).        (4.2)
The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

    P(x_1, y_1, x_2, y_2, ..., x_N, y_N)
        = ∫ d^K α P(x_1, y_1, x_2, y_2, ..., x_N, y_N | α) P(α),        (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

    P(x_1, y_1, x_2, y_2, ..., x_N, y_N | α) = Π_{i=1}^{N} [ P(x_i) P(y_i | x_i; α) ].        (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

    P(y_i | x_i; α) = (1/√(2πσ²)) exp[ −(1/2σ²) ( y_i − Σ_{μ=1}^{K} α_μ φ_μ(x_i) )² ].        (4.5)
Putting all these factors together,

    P(x_1, y_1, x_2, y_2, ..., x_N, y_N)
        = [ Π_{i=1}^{N} P(x_i) ] (1/√(2πσ²))^N ∫ d^K α P(α) exp[ −(1/2σ²) Σ_{i=1}^{N} y_i² ]
          × exp[ −(N/2) Σ_{μ,ν=1}^{K} A_{μν}({x_i}) α_μ α_ν + N Σ_{μ=1}^{K} B_μ({x_i, y_i}) α_μ ],        (4.6)

where

    A_{μν}({x_i}) = (1/(σ² N)) Σ_{i=1}^{N} φ_μ(x_i) φ_ν(x_i)   and        (4.7)
    B_μ({x_i, y_i}) = (1/(σ² N)) Σ_{i=1}^{N} y_i φ_μ(x_i).        (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

    lim_{N→∞} A_{μν}({x_i}) = A^∞_{μν} = (1/σ²) ∫ dx P(x) φ_μ(x) φ_ν(x),        (4.9)
    lim_{N→∞} B_μ({x_i, y_i}) = B^∞_μ = Σ_{ν=1}^{K} A^∞_{μν} ᾱ_ν,        (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ y_i²,

    lim_{N→∞} Σ_{i=1}^{N} y_i² = N σ² [ Σ_{μ,ν=1}^{K} ᾱ_μ A^∞_{μν} ᾱ_ν + 1 ].        (4.11)
Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.

Putting the different factors together, we obtain

    P(x_1, y_1, x_2, y_2, ..., x_N, y_N)
        →_f [ Π_{i=1}^{N} P(x_i) ] (1/√(2πσ²))^N ∫ d^K α P(α) exp[ −N E_N(α; {x_i, y_i}) ],        (4.12)
where the effective "energy" per sample is given by

    E_N(α; {x_i, y_i}) = 1/2 + (1/2) Σ_{μ,ν=1}^{K} (α_μ − ᾱ_μ) A^∞_{μν} (α_ν − ᾱ_ν).        (4.13)

Here we use the symbol →_f to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

    ∫ d^K α P(α) exp[ −N E_N(α; {x_i, y_i}) ]
        ≈ P(α_cl) exp[ −N E_N(α_cl; {x_i, y_i}) − (K/2) ln (N/2π) − (1/2) ln det F_N + · · · ],        (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

    [ ∂E_N(α; {x_i, y_i}) / ∂α_μ ]_{α = α_cl} = 0,        (4.15)

the matrix F_N consists of the second derivatives of E_N,

    F_N = [ ∂²E_N(α; {x_i, y_i}) / ∂α_μ ∂α_ν ]_{α = α_cl},        (4.16)

and · · · denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞, the matrix N F_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

    P(x_1, y_1, x_2, y_2, ..., x_N, y_N)
        →_f [ Π_{i=1}^{N} P(x_i) ] (1/Z_A) P(ᾱ) exp[ −(N/2) ln(2πeσ²) − (K/2) ln N + · · · ],        (4.17)
where the normalization constant

    Z_A = √( (2π)^K det A^∞ ).        (4.18)
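The saddle-point formula of equation 4.14 is easy to sanity-check numerically in one dimension (K = 1). The sketch below uses the energy of equation 4.13 with A^∞ = 1 and a standard gaussian prior (all parameter values are hypothetical); the crude Riemann-sum integral and the saddle-point expression already agree to a fraction of a percent at N = 200:

```python
import numpy as np

abar = 0.7   # "true" parameter (hypothetical)
N = 200      # number of samples

def energy(a):
    """Energy per sample, eq. 4.13 with K = 1 and A_infinity = 1."""
    return 0.5 + 0.5 * (a - abar) ** 2

def prior(a):
    """Standard gaussian prior P(alpha)."""
    return np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)

# Brute-force integral of P(alpha) exp(-N E_N(alpha)) on a fine grid
a = np.linspace(-5.0, 5.0, 400_001)
da = a[1] - a[0]
exact = np.sum(prior(a) * np.exp(-N * energy(a))) * da

# Saddle point: E_N is minimized at a_cl = abar, with F_N = E_N'' = 1, so
# the correction terms of eq. 4.14 reduce to -(1/2) ln(N / 2 pi)
saddle = prior(abar) * np.exp(-N * energy(abar)) * np.sqrt(2 * np.pi / N)

rel_err = abs(exact / saddle - 1)
```

The residual error is O(1/N): the prior is not constant across the width ~1/√N of the integrand, which is exactly the kind of correction hidden in the "+ · · ·" of equation 4.14.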
Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.^5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). As above, we can evaluate this average in steps. First, we choose a value of the parameters ᾱ, then we average over the samples given these parameters, and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

    S(N) = N [ S_x + (1/2) log_2 (2πeσ²) ] + (K/2) log_2 N + S_α + ⟨log_2 Z_A⟩_α + · · · ,        (4.19)

where ⟨· · ·⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

    S_x = −∫ dx P(x) log_2 P(x),        (4.20)

and similarly for the entropy of the distribution of parameters,

    S_α = −∫ d^K α P(α) log_2 P(α).        (4.21)
^5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.
The different terms in the entropy equation 4.19 have a straightforward interpretation. First, we see that the extensive term in the entropy,

    S_0 = S_x + (1/2) log_2 (2πeσ²),        (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

    S_1(N) → (K/2) log_2 N  (bits).        (4.23)

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning. We are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector x⃗, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {x⃗_i} is drawn independently from the same probability distribution Q(x⃗ | α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(x⃗ | α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).
We begin with the definition of entropy:

    S(N) ≡ S[{x⃗_i}]
         = −∫ dx⃗_1 · · · dx⃗_N P(x⃗_1, x⃗_2, ..., x⃗_N) log_2 P(x⃗_1, x⃗_2, ..., x⃗_N).        (4.24)

By analogy with equation 4.3, we then write

    P(x⃗_1, x⃗_2, ..., x⃗_N) = ∫ d^K α P(α) Π_{i=1}^{N} Q(x⃗_i | α).        (4.25)
Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

    S(N) = −∫ d^K ᾱ P(ᾱ) { ∫ dx⃗_1 · · · dx⃗_N [ Π_{j=1}^{N} Q(x⃗_j | ᾱ) ] log_2 P({x⃗_i}) }.        (4.26)
Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence x⃗_1, ..., x⃗_N with the probability Π_{j=1}^{N} Q(x⃗_j | ᾱ). We need to average log_2 P(x⃗_1, ..., x⃗_N) first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

    P(x⃗_1, ..., x⃗_N) = Π_{j=1}^{N} Q(x⃗_j | ᾱ) ∫ d^K α P(α) Π_{i=1}^{N} [ Q(x⃗_i | α) / Q(x⃗_i | ᾱ) ]
                     = [ Π_{j=1}^{N} Q(x⃗_j | ᾱ) ] ∫ d^K α P(α) exp[ −N E_N(α; {x⃗_i}) ],        (4.27)

    E_N(α; {x⃗_i}) = −(1/N) Σ_{i=1}^{N} ln [ Q(x⃗_i | α) / Q(x⃗_i | ᾱ) ].        (4.28)

Since by our interpretation, ᾱ are the true parameters that gave rise to the particular data {x⃗_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

    E_N(α; {x⃗_i}) = −∫ dx⃗ Q(x⃗ | ᾱ) ln [ Q(x⃗ | α) / Q(x⃗ | ᾱ) ] − ψ(α, ᾱ; {x⃗_i}),        (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.
The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N, we have

    P(x⃗_1, x⃗_2, ..., x⃗_N) →_f [ Π_{j=1}^{N} Q(x⃗_j | ᾱ) ] ∫ d^K α P(α) exp[ −N D_KL(ᾱ‖α) ],        (4.30)

where again the notation →_f reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

    S(N) ≈ S_0 · N + S_1^{(a)}(N),        (4.31)
    S_0 = ∫ d^K α P(α) [ −∫ dx⃗ Q(x⃗ | α) log_2 Q(x⃗ | α) ],   and        (4.32)
    S_1^{(a)}(N) = −∫ d^K ᾱ P(ᾱ) log_2 [ ∫ d^K α P(α) e^{−N D_KL(ᾱ‖α)} ].        (4.33)
Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence x⃗_1, ..., x⃗_N with the disorder, E_N(α; {x⃗_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

    Z(ᾱ; N) = ∫ d^K α P(α) exp[ −N D_KL(ᾱ‖α) ],        (4.34)

and S_1^{(a)}(N) = −⟨log_2 Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are at a KL divergence D away from the target ᾱ,

    ρ(D; ᾱ) = ∫ d^K α P(α) δ[ D − D_KL(ᾱ‖α) ].        (4.35)
Predictability, Complexity, and Learning
2429
This density could be very different for different targets.^6 The density of divergences is normalized because the original distribution over parameter space, P(\alpha), is normalized,

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha) = 1.   (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND].   (4.37)
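Equation 4.37 invites a thermodynamic reading: the weight exp[-ND] concentrates on models ever closer to the target as N grows. A minimal numerical sketch for a one-parameter gaussian-mean family with a uniform prior (all names and grid choices are ours, purely illustrative):

```python
import numpy as np

# Toy family: Q(x|a) = N(a, 1), target a_bar = 0, uniform prior on [-2, 2].
# Then D_KL(a_bar || a) = a**2 / 2.
a = np.linspace(-2.0, 2.0, 40_001)
d = 0.5 * a ** 2                      # KL divergence of each model to the target

def mean_d(n):
    """Average divergence <D> under the weight exp(-N D), i.e. -d ln Z / dN."""
    w = np.exp(-n * d)
    return np.sum(w * d) / np.sum(w)

d10, d100, d1000 = mean_d(10), mean_d(100), mean_d(1000)
print(d10, d100, d1000)   # decreasing: larger N probes models nearer the target
```

For this family the decrease follows \langle D \rangle \approx 1/(2N), the "cooling" of the system as more examples arrive.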
We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE\, \rho(E) \exp[-\beta E],   (4.38)

where \rho(E) is the density of states that have energy E and \beta is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of \rho(D; \bar{\alpha}) as D \to 0.^7

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D \to 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, \rho(D; \bar{\alpha}), at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{KL}(\bar{\alpha}\|\alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) F_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) + \cdots,   (4.39)
^6 If parameter space is compact, then a related description of the space of targets based on metric entropy, also called Kolmogorov's \epsilon-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than \epsilon into which the target space can be partitioned.
^7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.
where N F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low-D limit of the density \rho(D; \bar{\alpha}):

\rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha)\, \delta\left[ D - D_{KL}(\bar{\alpha}\|\alpha) \right]
\approx \int d^K\alpha\, P(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) F_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) \right]
= \int d^K\xi\, P(\bar{\alpha} + U \cdot \xi)\, \delta\left[ D - \frac{1}{2} \sum_\mu \Lambda_\mu \xi_\mu^2 \right],   (4.40)

where U is a matrix that diagonalizes F,

(U^T \cdot F \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)
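The quadratic approximation in equation 4.39 is easy to verify numerically in one dimension. For the scale family Q(x|\sigma) = N(0, \sigma^2), the exact divergence is D_{KL}(\bar{\sigma}\|\sigma) = \ln(\sigma/\bar{\sigma}) + \bar{\sigma}^2/(2\sigma^2) - 1/2 and the Fisher information is F = 2/\bar{\sigma}^2, so the quadratic form predicts D \approx (\sigma - \bar{\sigma})^2/\bar{\sigma}^2. A small sketch of this check (our toy example, not the paper's):

```python
import math

def kl_gauss_scale(sigma_bar, sigma):
    """Exact D_KL between zero-mean gaussians with the given std devs."""
    return math.log(sigma / sigma_bar) + sigma_bar**2 / (2 * sigma**2) - 0.5

sigma_bar = 1.0
fisher = 2.0 / sigma_bar**2        # Fisher information for the scale parameter

for delta in (0.1, 0.01):
    exact = kl_gauss_scale(sigma_bar, sigma_bar + delta)
    quadratic = 0.5 * fisher * delta**2   # equation 4.39 truncated at O(delta^2)
    print(delta, exact, quadratic)
```

The relative error of the quadratic form shrinks as the two distributions approach each other, which is what the perturbation expansion below relies on.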
The delta function restricts the components of \xi in equation 4.40 to be of order \sqrt{D} or less, and so if P(\alpha) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx P(\bar{\alpha}) \cdot \frac{2 \pi^{K/2}}{\Gamma(K/2)} (\det F)^{-1/2} D^{(K-2)/2}.   (4.42)
Here as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(\bar{\alpha}) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)   (4.44)
\to \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)},   (4.45)

where \cdots are finite as N \to \infty. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even
of the prior distribution P(\alpha) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_{VC}), which measures the capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_{VC} is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_{VC} and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_{VC} < \infty also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small-D_{KL} behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density \rho(D; \bar{\alpha}) accumulates a macroscopic weight at some nonzero value of D, so that the small-D behavior is irrelevant for the large-N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_{VC} (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small-D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N \gg d_{VC}. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution
Q(\vec{x}_1, \ldots, \vec{x}_N\,|\,\alpha). Again, \alpha is an unknown parameter vector chosen randomly at the beginning of the series. If \alpha is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) \log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\, Q(\{\vec{x}_i\}\,|\,\alpha) \log_2 Q(\{\vec{x}_i\}\,|\,\alpha) \to N S_0 + S_0^*; \quad S_0^* = O(1),   (4.47)

D_{KL}\left[ Q(\{\vec{x}_i\}\,|\,\bar{\alpha}) \,\|\, Q(\{\vec{x}_i\}\,|\,\alpha) \right] \to N D_{KL}(\bar{\alpha}\|\alpha) + o(N).   (4.48)
Here the quantities S_0, S_0^*, and D_{KL}(\bar{\alpha}\|\alpha) are defined by taking limits N \to \infty in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if \alpha is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_{VC}, one arrives at the result:

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)},   (4.50)
where \cdots stands for terms of order one as N \to \infty. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

^8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.
It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, \ldots, \vec{x}_N\,|\,\alpha). Suppose, for example, that S_0^* = (K^*/2) \log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \quad \text{(bits)}.   (4.51)
We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to \alpha, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* \neq 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K \neq 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations \psi are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar. Clarke and Barron (1990) solved essentially the same problem.
They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_{VC}. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

^9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.
Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j\,|\,\bar{\alpha}) \right]
\times \log_2 \left\{ \left[ \prod_{i=1}^{N} Q(\vec{x}_i\,|\,\bar{\alpha}) \right] \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right\}.   (4.52)
This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)}(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j\,|\,\bar{\alpha}) \right]
\times \log_2 \left[ \frac{1}{Z(\bar{\alpha}; N)} \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right],   (4.54)
where \psi is defined as in equation 4.29, and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S_1^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \leq N \int d^K\alpha \int d^K\bar{\alpha}\, P(\alpha) P(\bar{\alpha})\, D_{KL}(\bar{\alpha}\|\alpha) \equiv \langle N D_{KL}(\bar{\alpha}\|\alpha) \rangle_{\alpha,\bar{\alpha}},   (4.55)
so that if we have a class of models (and a prior P(\alpha)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average
KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that \langle D_{KL}(\bar{\alpha}\|\alpha) \rangle_{\alpha,\bar{\alpha}} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if \psi were zero, and \psi is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_{VC}) of the set of probability densities \{Q(\vec{x}\,|\,\alpha)\} is finite. Then for any reasonable function F(\vec{x}; \beta), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P \left\{ \sup_\beta \frac{ \left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x}\,|\,\bar{\alpha})\, F(\vec{x}; \beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all \beta without loss of generality in view of equation 4.55. In addition, M(\epsilon, N, d_{VC}) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_{VC}. Bounds of this form may have different names in different contexts: Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region \alpha \approx \bar{\alpha} is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of \alpha - \bar{\alpha}, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of \{\vec{x}_i\} with atypically large fluctuations is negligible (atypicality is defined as \psi \geq \delta, where \delta is some small constant independent of N). Since the fluctuations decrease as 1/\sqrt{N} (see equation 4.56) and D_{KL} is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of \psi. Then we realize that the averaging over \bar{\alpha} and \{\vec{x}_i\} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to \epsilon (this brings down an extra factor of N\epsilon). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\text{atypical}} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2 M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N.   (4.57)
This bound lets us focus our attention on the region \alpha \approx \bar{\alpha}. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\text{typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha}) \int d^N\vec{x} \prod_j Q(\vec{x}_j\,|\,\bar{\alpha})\, N \left( B^T A^{-1} B \right);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i\,|\,\bar{\alpha})}{\partial \bar{\alpha}_\mu},

(A)_{\mu\nu} = -\frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i\,|\,\bar{\alpha})}{\partial \bar{\alpha}_\mu \partial \bar{\alpha}_\nu},

\langle B \rangle_{\vec{x}} = 0; \quad \langle A \rangle_{\vec{x}} = F.   (4.58)
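For the gaussian-mean family Q(x|\alpha) = N(\alpha, 1), the objects in equation 4.58 are explicit: \partial \log Q/\partial \alpha = x - \alpha, so B is the centered sample mean, and minus the second derivative is 1 for every sample, so A equals the Fisher information F = 1 identically. A small numerical sketch (our illustration; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_bar, n = 0.3, 100_000
x = rng.normal(alpha_bar, 1.0, size=n)

# First derivative of log Q(x|a) = -(x-a)^2/2 - const, at a = alpha_bar:
b = np.mean(x - alpha_bar)          # B of equation 4.58; <B> = 0

# Minus the second derivative is 1 for every sample, so A is exactly the
# Fisher information F = 1 of this family:
a_mat = np.mean(np.ones_like(x))    # A of equation 4.58; <A> = F = 1

fluct = n * b**2 / a_mat            # N B^T A^{-1} B: order one, not growing with N
print(b, a_mat, fluct)
```

The quantity N B^T A^{-1} B stays of order one as N grows, which is the mechanism behind the boundedness of the typical-sequence contribution derived below.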
Here \langle \cdots \rangle_{\vec{x}} means an averaging with respect to all \vec{x}_i's keeping \bar{\alpha} constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + \delta), where \delta again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i\,|\,\bar{\alpha})}{\partial \bar{\alpha}_\mu} \left( F^{-1} \right)_{\mu\nu} \frac{\partial \log Q(\vec{x}_j\,|\,\bar{\alpha})}{\partial \bar{\alpha}_\nu}   (4.59)

over all \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N. This concludes the proof of controllability of fluctuations for d_{VC} < \infty. One may notice that we never used the specific form of M(\epsilon, N, d_{VC}), which is the only thing dependent on the precise value of the dimension.

Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of \alpha and \bar{\alpha} that lead to violation of the bound. That is, if the prior P(\alpha) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite
structure C of nested subsets C_1 \subset C_2 \subset C_3 \subset \cdots can be imposed onto the set C of all admissible solutions of a learning problem, such that C = \cup C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_{VC} for learning in any C_n, n < \infty, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_{VC}.

In the context of SRM, the role of the prior P(\alpha) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x\,|\,\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (x - \alpha)^2 \right], \quad x \in (-\infty; +\infty);   (4.60)

Q(x\,|\,\alpha) = \frac{ \exp(-\sin \alpha x) }{ \int_0^{2\pi} dx\, \exp(-\sin \alpha x) }, \quad x \in [0; 2\pi).   (4.61)
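One peculiarity of the second family, relevant to footnote 10 below, is that its KL divergence is bounded: since -\sin(\alpha x) \in [-1, 1], the log ratio of any two densities in the family is at most 2 plus a bounded normalization term, so D_{KL} can never exceed roughly 4, no matter how far apart the parameters are (contrast the gaussian family, where D_{KL} = (\bar{\alpha} - \alpha)^2/2 grows without bound). A numerical check by direct quadrature (our sketch; grid sizes are arbitrary):

```python
import numpy as np

x = np.linspace(0.0, 2 * np.pi, 20_001)   # quadrature grid on [0, 2*pi)
dx = x[1] - x[0]

def density(a):
    """Q(x|a) from equation 4.61, normalized on the grid."""
    f = np.exp(-np.sin(a * x))
    return f / (f.sum() * dx)

alpha_bar = 1.0
q_bar = density(alpha_bar)

def kl(a):
    """D_KL(alpha_bar || a) by Riemann-sum quadrature."""
    return float(np.sum(q_bar * np.log(q_bar / density(a))) * dx)

divs = [kl(a) for a in np.arange(1.0, 50.0, 0.5)]
print(min(divs), max(divs))   # bounded above, however large the parameter gets
```

This boundedness is what makes the weight of \rho(D; \bar{\alpha}) pile up at nonzero D as the prior box widens, one of the two pathologies identified above.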
Learning the parameter in the first case is a d_{VC} = 1 problem; in the second case, d_{VC} = \infty. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(\alpha). The second one does not allow this. Suppose that the prior is uniform in a box 0 < \alpha < \alpha_{max} and zero elsewhere, with \alpha_{max} rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of -\sin \alpha x. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than \alpha_{max}.^10 So the fluctuations are large, and the predictive information is

^10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for \alpha_{max} \to \infty the weight in \rho(D; \bar{\alpha}) at small D_{KL} vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of \alpha would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) \log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N \gg d_{VC}, which is never true if d_{VC} = \infty. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_{VC}.

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_{VC} fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which
the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as \rho \sim D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems, there is a class of regularized infinite-dimensional problems in which the density \rho(D \to 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model but that the nature of the essential singularity (\mu) is the same everywhere.

Before providing an explicit example, let us explore the consequences of this behavior. From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND]
\approx A(\bar{\alpha}) \int dD\, \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - ND \right]
\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{\mu + 2}{2\mu + 2} \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu + 1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)},   (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)
Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2} \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where \langle \cdots \rangle_{\bar{\alpha}} denotes an average over all the target models.
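The saddle-point result can be checked by direct numerical integration. For \mu = 1 and B = 1, equation 4.65 gives C = 2\sqrt{B} = 2, so -\ln Z(N) should approach 2\sqrt{N}, up to the slowly varying \ln N prefactor of equation 4.63. A sketch under these assumptions (numeric parameters are ours; A is set to 1):

```python
import numpy as np

mu, big_b, n = 1.0, 1.0, 10_000       # essential-singularity model of eq. 4.62

d = np.linspace(1e-4, 1.0, 2_000_000)
log_integrand = -big_b / d**mu - n * d    # log of rho(D) * exp(-N*D), with A = 1

# Log-sum-exp quadrature to avoid underflow away from the saddle point.
m = log_integrand.max()
log_z = m + np.log(np.sum(np.exp(log_integrand - m)) * (d[1] - d[0]))

c = big_b ** (1 / (mu + 1)) * (mu ** (1 / (mu + 1)) + mu ** (-mu / (mu + 1)))
prediction = c * n ** (mu / (mu + 1))     # leading term of -ln Z, eqs. 4.63/4.65

print(-log_z, prediction)   # close; the gap is the ln N prefactor of eq. 4.63
```

The stretched-exponential decay of Z is what turns into the N^{\mu/(\mu+1)} power law of equation 4.66 after taking the logarithm.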
This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K \sim N^\nu. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) \sim N^\nu \log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes \sim K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries \sim N^{\nu-1} bits. Summing this up over all samples, we find S_1^{(a)} \sim N^\nu, and if we let \nu = \mu/(\mu+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (\nu = \mu/(\mu+1) < 1), which makes sense.

Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously.
This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As \mu increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with \mu. However, if \mu \to \infty, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar
to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models, for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness.
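To build intuition for what such a smoothness prior produces, one can sample a random smooth potential \phi(x) as a Fourier series whose mode variances decay as 1/k^2, loosely mimicking a gradient-squared ("strain energy") penalty, and exponentiate it into a normalized density. This is only a rough illustration of the idea, not the specific prior analyzed in the text; all parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]

# Random smooth potential phi(x): Fourier modes with variance ~ 1/k^2,
# so high-frequency wiggles are strongly suppressed.
phi = np.zeros_like(x)
for k in range(1, 21):
    amp = rng.normal(scale=1.0 / k)
    phase = rng.uniform(0, 2 * np.pi)
    phi += amp * np.cos(2 * np.pi * k * x + phase)

# Exponentiate and normalize: a strictly positive, smooth density on [0, 1].
q = np.exp(-phi)
q /= q.sum() * dx

total = q.sum() * dx
print(total, q.min() > 0)
```

Writing the density as e^{-\phi} guarantees positivity automatically, which is the same device used in the field-theoretic formulation.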
There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints. We write $Q(x) = (1/l_0)\exp[-\phi(x)]$ so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function $\phi(x)$ is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by
\[
P[\phi(x)] = \frac{1}{Z}\exp\left[-\frac{\ell}{2}\int dx\,\left(\frac{\partial\phi}{\partial x}\right)^{2}\right]
\delta\!\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right],
\tag{4.67}
\]
where $Z$ is the normalization constant for $P[\phi]$, the delta function ensures that each distribution $Q(x)$ is normalized, and $\ell$ sets a scale for smoothness. If distributions are chosen from this prior, then the optimal Bayesian estimate of $Q(x)$ from a set of samples $x_1, x_2, \ldots, x_N$ converges to the correct answer, and the distribution at finite $N$ is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the locations of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of $P[\phi(x)]$, have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate $\rho(D, \bar\phi)$ in the model defined by equation 4.67. With $Q(x) = (1/l_0)\exp[-\bar\phi(x)]$, we can write the KL divergence as
\[
D_{\rm KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, e^{-\bar\phi(x)}\,[\phi(x) - \bar\phi(x)].
\tag{4.68}
\]
We want to compute the density,
\[
\rho(D; \bar\phi) = \int [d\phi(x)]\,P[\phi(x)]\,\delta\!\left(D - D_{\rm KL}[\bar\phi(x)\|\phi(x)]\right)
\tag{4.69}
\]
\[
= M \int [d\phi(x)]\,P[\phi(x)]\,\delta\!\left(MD - M D_{\rm KL}[\bar\phi(x)\|\phi(x)]\right),
\tag{4.70}
\]
where we introduce a factor $M$, which we will allow to become large so that we can focus our attention on the interesting limit $D \to 0$. To compute this integral over all functions $\phi(x)$, we introduce a Fourier representation for the delta function and then rearrange the terms:
\[
\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\,\exp(izMD)\int [d\phi(x)]\,P[\phi(x)]\,\exp(-izM D_{\rm KL})
\tag{4.71}
\]
\[
= M \int \frac{dz}{2\pi}\,\exp\!\left(izMD + \frac{izM}{l_0}\int dx\,\bar\phi(x)\,e^{-\bar\phi(x)}\right)
\int [d\phi(x)]\,P[\phi(x)]\,
\exp\!\left(-\frac{izM}{l_0}\int dx\,\phi(x)\,e^{-\bar\phi(x)}\right).
\tag{4.72}
\]
The inner integral over the functions $\phi(x)$ is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that $zM$ is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.
around the saddle point. The result is that
\[
\int [d\phi(x)]\,P[\phi(x)]\,\exp\!\left(-\frac{izM}{l_0}\int dx\,\phi(x)\,e^{-\bar\phi(x)}\right)
= \exp\!\left(-\frac{izM}{l_0}\int dx\,\bar\phi(x)\,e^{-\bar\phi(x)} - S_{\rm eff}[\bar\phi(x); zM]\right),
\tag{4.73}
\]
\[
S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2}\int dx\,\left(\frac{\partial\bar\phi}{\partial x}\right)^{2}
+ \frac{1}{2}\left(\frac{izM}{\ell l_0}\right)^{1/2}\int dx\,\exp[-\bar\phi(x)/2].
\tag{4.74}
\]
Now we can do the integral over $z$, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that $D \to 0$ and $MD^{3/2} \to \infty$; we are interested precisely in the first limit, and we are free to set $M$ as we wish, so this gives us a good approximation for $\rho(D \to 0; \bar\phi)$. This results in
\[
\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\,D^{-3/2}\exp\!\left(-\frac{B[\bar\phi(x)]}{D}\right),
\tag{4.75}
\]
\[
A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell l_0}}\,
\exp\!\left[-\frac{\ell}{2}\int dx\,\left(\partial_x\bar\phi\right)^{2}\right]
\int dx\,\exp[-\bar\phi(x)/2],
\tag{4.76}
\]
\[
B[\bar\phi(x)] = \frac{1}{16 \ell l_0}\left(\int dx\,\exp[-\bar\phi(x)/2]\right)^{2}.
\tag{4.77}
\]
Except for the factor of $D^{-3/2}$, this is exactly the sort of essential singularity that we considered in the previous section, with $\mu = 1$. The $D^{-3/2}$ prefactor does not change the leading large-$N$ behavior of the predictive information, and we find that
\[
S_1^{(a)}(N) \sim \left\langle \frac{1}{2\ln 2}\sqrt{\frac{1}{\ell l_0}}\int dx\,\exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},
\tag{4.78}
\]
where $\langle\cdots\rangle_{\bar\phi}$ denotes an average over the target distributions $\bar\phi(x)$ weighted once again by $P[\bar\phi(x)]$. Notice that if $x$ is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable $x$ range from 0 to $L$, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result
\[
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\left(\frac{L}{\ell}\right)^{1/2} \text{ bits}.
\tag{4.79}
\]
To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of $x$ into bins of local size $\sqrt{\ell/NQ(x)}$ (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of $x$. This also is in direct accord with a comment in the previous section that power-law behavior of the predictive information arises from the number of parameters in the problem depending on the number of samples.

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step $\sim\sqrt{N}$ bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure $C$ on the set of all possible densities, so that the set $C_n$ is formed of all continuous piecewise linear functions that have not more than $n$ kinks. Learning such functions for finite $n$ is a $d_{\rm VC} = n$ problem. Now, as $N$ grows, the elements with higher capacities $n \sim \sqrt{N}$ are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter $\ell$ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of $\ell$ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing $\ell$ to vary with $x$. In this case, the whole theory has no length scale, and there also is no need to confine the variable $x$ to a box (here of size $L$). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying $\sqrt{N}$ in the analog of equation 4.79, but this remains to be shown.
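The adaptive-binning picture can be checked numerically. The sketch below (our own illustration; the density $Q$, the smoothness scale, and all parameter values are arbitrary choices) greedily tiles $[0, L]$ with bins of local width $\sqrt{\ell/(N Q(x))}$ and confirms that the number of bins, and hence the predicted $S_1^{(a)}$, grows as $\sqrt{N}$:

```python
import math

def count_adaptive_bins(Q, N, ell=0.01, L=1.0):
    """Tile [0, L] with bins of local width sqrt(ell / (N * Q(x)))
    and return how many bins fit."""
    x, bins = 0.0, 0
    while x < L:
        x += math.sqrt(ell / (N * Q(x)))
        bins += 1
    return bins

Q = lambda x: 0.5 + x              # a normalized density on [0, 1]
n1 = count_adaptive_bins(Q, N=10_000)
n2 = count_adaptive_bins(Q, N=40_000)
# quadrupling N should roughly double the number of bins
```

For a uniform density on $[0, L]$ the count is simply $\sqrt{NL/\ell}$, reproducing the scaling in equation 4.79.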
In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 $I_{\rm pred}$ as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity.
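Such a balance is easy to illustrate. The sketch below (our own example; the data set, noise level, polynomial family, and the MDL-style penalty of $(k/2)\ln N$ per $k$-parameter model, of the type developed in the work discussed next, are all illustrative assumptions) fits polynomials of increasing order to noisy quadratic data and scores each by badness of fit plus a complexity term:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
x = np.linspace(-1, 1, N)
y = 3 * x**2 - 2 * x + 1 + rng.normal(0, 0.3, N)   # data from a quadratic

def score(order):
    """Negative log-likelihood (goodness of fit) plus an MDL-style
    complexity penalty of (k/2) ln N, with k = number of parameters."""
    coeffs = np.polyfit(x, y, order)
    resid = y - np.polyval(coeffs, x)
    sigma2 = resid.var()
    nll = 0.5 * N * np.log(2 * np.pi * sigma2) + 0.5 * N
    k = order + 1
    return nll + 0.5 * k * np.log(N)

scores = {order: score(order) for order in range(9)}
best = min(scores, key=scores.get)   # the penalty suppresses overfitting
```

Without the penalty term, the score would decrease monotonically with order; the complexity term is what makes the trade-off nontrivial.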
Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model. These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit), but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution $P$ is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream $x_1, x_2, \ldots, x_N$ will require an amount of space equal to the entropy of the joint distribution $P(x_1, x_2, \ldots, x_N)$. The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have $N$ symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986).

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.
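The identification of code length with joint entropy can be checked exactly in the simplest one-parameter ($K = 1$) case: binary sequences from a coin of unknown bias with a uniform prior. The Beta integral gives each sequence with $n_1$ ones probability $n_1!\,n_0!/(N+1)! = 1/[(N+1)\binom{N}{n_1}]$, so $S(N)$ has a closed form, and its subextensive part should grow as $(1/2)\log_2 N$. The sketch below (our own illustration) performs this check:

```python
import math

def joint_entropy(N):
    """Exact entropy (bits) of P(x1..xN) for a coin with a uniform prior on
    its bias: every sequence with n ones has probability 1/((N+1)*C(N,n)),
    so S(N) = log2(N+1) + (1/(N+1)) * sum over n of log2 C(N, n)."""
    log2C = lambda n: (math.lgamma(N + 1) - math.lgamma(n + 1)
                       - math.lgamma(N - n + 1)) / math.log(2)
    return math.log2(N + 1) + sum(log2C(n) for n in range(N + 1)) / (N + 1)

S0 = 1 / (2 * math.log(2))                 # extensive rate: prior-averaged Bernoulli entropy
S1 = lambda N: joint_entropy(N) - N * S0   # subextensive part, ~ (1/2) log2 N + const
```

Quadrupling $N$ should therefore increase $S_1(N)$ by very nearly one bit, in agreement with the $(K/2)\log_2 N$ rule for $K = 1$.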
This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution $P(x_1, x_2, \ldots, x_N)$. As emphasized by Balasubramanian (1997), the entropy of the joint distribution of $N$ points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a $K$-parameter model, both quantities are equal to $(K/2)\log_2 N$ bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that $I_{\rm pred}$ provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below. The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics.
Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), since the complexity should be zero both for purely regular and for purely random systems. New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density $S_0$ and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) $I_{\rm pred}(T \to \infty) = \text{constant}$.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity.
As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.
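The effective measure complexity is easy to estimate for toy sequences. The sketch below (our own illustration; the sequence length and block sizes are arbitrary) computes block entropies $H(n)$ from empirical block frequencies. For a period-2 sequence the entropy rate is zero and $H(n)$ saturates at the excess entropy of 1 bit (the entropy of the unknown phase), whereas for fair coin flips $H(n) = n$ and the subextensive part vanishes:

```python
from collections import Counter
import math

def block_entropy(seq, n):
    """Empirical entropy (bits) of length-n blocks, sliding window."""
    blocks = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * math.log2(c / total) for c in blocks.values())

periodic = [0, 1] * 5000                           # period-2 sequence
H = [block_entropy(periodic, n) for n in (1, 2, 4, 8)]
# entropy rate is 0, so H(n) saturates; its limit is the excess entropy
```

At a period-doubling cascade toward chaos, this saturating behavior gives way to the logarithmic divergence of $H(n) - n h$ noted above.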
5.3 A Unique Measure of Complexity? We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal $x(0 < t < T)$ drawn from a distribution $P[x(t)]$. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal $x(t)$ or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble $P[x(t)]$.

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial.
Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution $P[x(t)]$.
¹³ The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration $T$ of our observations. We leave these algorithmic issues aside for this discussion.
¹⁴ This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.
Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are $N$ equally likely signals, then the measure should be monotonic in $N$; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, $S(T)$. We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal $x$ is continuous, then the entropy is not invariant under transformations of $x$. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function $S(T)$ that depends on the coordinate system for $x$;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term.
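This split can be made explicit with the standard change-of-variables identities (a textbook calculation, included here only for completeness). For an invertible, instantaneous map $y = f(x)$, the density transforms as $P_Y(y) = P_X(x)/|f'(x)|$, so that

```latex
% the entropy picks up the average log Jacobian ...
S[Y] = S[X] + \int dx\, P_X(x)\,\log_2\left|f'(x)\right| ,
% ... while in the mutual information between two segments the
% Jacobian factors cancel between numerator and denominator:
I(Y_1; Y_2) = \int dy_1\, dy_2\, P(y_1, y_2)\,
    \log_2 \frac{P(y_1, y_2)}{P(y_1)\,P(y_2)} = I(X_1; X_2).
```

Since the predictive information is itself a mutual information, it inherits this invariance, while the extensive entropy rate does not.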
Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy. The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large $T$. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

¹⁵ Here we consider instantaneous transformations of $x$, not filtering or other transformations that mix points at different times.
other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream $x(t)$ that mix points within a temporal window of size $\tau$, then for $T \gg \tau$, the entropy $S(T)$ may have subextensive terms that are constant, and these are not invariant under this class of transformations.¹⁶ On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations. So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number $\mathcal{N}$ of walks with $N$ steps, we find that $\mathcal{N}(N) \sim A N^{\gamma} z^{N}$ (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy $S_0 = \log_2 z$, a constant subextensive entropy $\log_2 A$, and a divergent subextensive term $S_1(N) \to \gamma \log_2 N$. Of these three terms, only the divergent subextensive term (related to the critical exponent $\gamma$) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy.
We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process $x(t)$ distributed according to $P[x(t)]$ as
\[
S_{\rm cont} = -\int dx(t)\, P[x(t)]\,\log_2 P[x(t)],
\tag{5.1}
\]
but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution $P[x(t)]$ and some reference distribution $Q[x(t)]$,
\[
S_{\rm rel} = -\int dx(t)\, P[x(t)]\,\log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right),
\tag{5.2}
\]

¹⁶ Throughout this discussion, we assume that the signal $x$ at one point in time is finite dimensional. There are subtleties if we allow $x$ to represent the configuration of a spatially infinite system.
which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution $Q[x(t)]$, and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language, we must make our measure invariant to different choices of the reference distribution $Q[x(t)]$.

The reference distribution $Q[x(t)]$ embodies our expectations for the signal $x(t)$; in particular, $S_{\rm rel}$ measures the extra space needed to encode signals drawn from the distribution $P[x(t)]$ if we use coding strategies that are optimized for $Q[x(t)]$. If $x(t)$ is a written text, two readers who expect different numbers of spelling errors will have different $Q$s, but to the extent that spelling errors can be corrected by reference to the immediately neighboring letters, we insist that any measure of complexity be invariant to these differences in $Q$. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution $Q[x(t)]$ by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution $Q[x(t)]$ from local operators.
Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant. To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have
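The decomposition just described can be written compactly; for any reference distribution Q built from local operators, its contribution over a window T takes the form (the coefficient labels a_Q and c_Q are our notation):

```latex
S^{(Q)}_{\mathrm{rel}}(T) \;=\; a_Q\,T \;+\; c_Q \;+\; o(1), \qquad T \to \infty .
```

Changing Q within the local-operator class shifts only a_Q and c_Q, so any divergent subextensive term in the relative entropy is common to all such observers; that invariant piece is the predictive information.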
Predictability, Complexity, and Learning
2453
seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data? The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past). The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem) we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
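A concrete toy case in which compression of the past is lossless: for a Markov chain, the most recent symbol is a sufficient statistic for the future, so taking x̂_past to be the last symbol already achieves I(x̂_past; x_future) = I(past; future). A minimal numerical check (the two-state transition matrix below is an arbitrary choice of ours, not from the paper):

```python
import numpy as np

# Arbitrary two-state Markov chain (illustrative only).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def mi(pxy):
    """Mutual information (bits) of a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# Joint of (x_t, x_{t+1}) and of ((x_{t-1}, x_t), x_{t+1}).
p2 = pi[:, None] * P
p3 = pi[:, None, None] * P[:, :, None] * P[None, :, :]
p_pair_future = p3.reshape(4, 2)  # lump (x_{t-1}, x_t) into one variable

# Keeping only the last symbol loses none of the predictive information:
print(mi(p2), mi(p_pair_future))
```

By the Markov property p(x_{t+1} | x_{t-1}, x_t) = p(x_{t+1} | x_t), the two printed values agree exactly; for a non-Markov source the second would be strictly larger, and the information bottleneck traces out the full tradeoff curve between the two mutual informations.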
In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out. If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time.
Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps rather than providing an efficient representation of the current state of the world—as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992)—the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much
studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem. It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987).
Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods: What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How
remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.
much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve L(N). The learning curve L(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing L(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments. An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N).
On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters so that I_pred(N) ~ (K/2) log₂ N, the infinite-parameter model has less predictive information for all N smaller than some critical value

N_c ~ [ (K / 2Aν) log₂ (K / 2A) ]^{1/ν}.  (6.1)
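To get a feel for the scale of this crossover, equation 6.1 is easy to evaluate numerically. The sketch below uses the paper's estimates K ~ 5 × 10⁹ synapses and N ~ 10⁹ examples; the (A, ν) pairs are illustrative assumptions of ours, since the argument requires only ν < 1:

```python
import math

def n_critical(K, A, nu):
    """Crossover N_c of equation 6.1: for N below N_c, a K-parameter
    model carries more predictive information than a power-law model
    with I_pred(N) ~ A * N**nu (nu < 1)."""
    return ((K / (2 * A * nu)) * math.log2(K / (2 * A))) ** (1 / nu)

# K ~ 5e9 synapses, at most N ~ 1e9 examples (paper's estimates);
# the (A, nu) values are our illustrative choices.
K, N = 5e9, 1e9
for A, nu in [(1.0, 0.5), (10.0, 0.9), (100.0, 0.9)]:
    Nc = n_critical(K, A, nu)
    print(f"A={A}, nu={nu}: N_c ~ {Nc:.2e}, N < N_c: {N < Nc}")
```

Even for generous values of A, N_c stays above 10⁹ in these examples, consistent with the claim that biological learning plausibly operates in the regime N < N_c.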
In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one. It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models. There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N.
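Shannon's letter-to-guess-number transform is easy to prototype. In the sketch below the human reader is replaced by an in-sample empirical n-gram predictor (our simplification), so the printed numbers only illustrate how the bound on ℓ(N) tightens with context length N; they are not a serious entropy estimate:

```python
import math
from collections import Counter, defaultdict

def guess_rank_entropy(text, N):
    """Entropy (bits/letter) of Shannon's guess-number transform:
    each letter is replaced by the rank a predictor assigns it given
    the previous N letters; return the entropy of the rank distribution."""
    # Empirical N-gram model built from the text itself -- a toy stand-in
    # for Shannon's human readers, so this is in-sample and optimistic.
    counts = defaultdict(Counter)
    for i in range(N, len(text)):
        counts[text[i - N:i]][text[i]] += 1
    ranks = Counter()
    for i in range(N, len(text)):
        ordered = [c for c, _ in counts[text[i - N:i]].most_common()]
        ranks[ordered.index(text[i])] += 1
    total = sum(ranks.values())
    return -sum(n / total * math.log2(n / total) for n in ranks.values())

# A deliberately repetitive toy corpus; real estimates need real text.
sample = "the cat sat on the mat and the cat ran to the mat " * 20
for N in (1, 2, 3):
    print(N, round(guess_rank_entropy(sample, N), 3))
```

Because longer contexts make the predictor more deterministic, the rank distribution concentrates on rank 0 and the entropy falls with N, mirroring the decay of ℓ(N) that Shannon exploited.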
Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*/*, where */* is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado.
Akaike, H. (1974b). A new look at the statistical model identification.
IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999).
Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Scheingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events.
In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248. (In German)
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeny, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk. SSSR Ser. Mat., 5, 3–14. (In Russian) Translation in A. N. Shiryayev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
López-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach.
Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rényi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberg, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Solé, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Predictability, Complexity, and Learning
2463
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254. Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200. Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057. Vapnik, V. (1998). Statistical learning theory. New York: Wiley. VitÂanyi, P., & Li, M. (2000).Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014. Wallace, C. S., & Boulton, D. M. (1968).An information measure for classication. Comp. J., 11, 185–195. Weigend, A. S., & Gershenfeld, N. A., eds. (1994).Time series prediction:Forecasting the future and understanding the past. Reading, MA: Addison–Wesley. Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley. Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57. Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE’s. Ann. Statist., 23, 339–362. Received October 11, 2000; accepted January 31, 2001.
NOTE
Communicated by Idan Segev
Dendritic Subunits Determined by Dendritic Morphology

K. A. Lindsay, J. M. Ogden
Department of Mathematics, University of Glasgow, Glasgow G12 8QQ, Scotland

J. R. Rosenberg
Division of Neuroscience and Biomedical Systems, University of Glasgow, Glasgow G12 8QQ, Scotland

A theoretical framework is presented in which arbitrarily branched dendritic structures with nonhomogeneous membrane properties and nonuniform geometry can be transformed into an equivalent unbranched structure (equivalent cable). Rall's equivalent cylinder is seen to be one part of the equivalent cable in the special case of dendrites satisfying the Rall criteria. The relation between the branched dendrite and its equivalent unbranched representation is uniquely defined by an invertible mapping that connects configurations of inputs on the branched structure with those on the unbranched structure, and conversely. This mapping provides a new definition of dendritic subunit and a mechanism for characterizing local and nonlocal signal processing within dendritic structures.

1 Introduction
Although the relevance of neuron morphology to signal processing within dendrites has been assumed, surprisingly little is known about how shape actually influences the interactions among different synaptic inputs on a dendrite. In an attempt to resolve this situation, the notion of a dendritic subunit has been proposed as a physical domain or region of the dendritic tree within which local signal processing may take place (Mel, 1994). Subunit boundaries have been defined by the criterion that the voltage attenuation occurring between different sites within a subunit is small compared with that experienced between the subunit and the soma, while different subunits are electrically decoupled (Koch, Poggio, & Torre, 1982). This definition of subunit depends on an arbitrary choice of parameter values and is recognized to be of limited use (Koch et al., 1982; Mulloney & Perkel, 1988; Woolf, Shepherd, & Green, 1991; Mel, 1994). Nevertheless, the idea of a functional dendritic subunit is intrinsically sound, but a different approach is required. The equivalent cable (Whitehead & Rosenberg, 1993), which is a generalization of Rall's equivalent cylinder, provides the key to identifying a new definition of subunit.

Neural Computation 13, 2465–2476 (2001) © 2001 Massachusetts Institute of Technology
Rall's equivalent cylinder, with a suitably imposed input structure, describes exactly the effect at the soma of an arbitrary distribution of inputs on the original tree, provided that tree meets the conditions set by Rall (1962, 1964). Nevertheless, the Rall equivalent cylinder remains incomplete. Although the input structure on the original tree defines that on the equivalent cylinder, the converse, as recognized by Rall, is false. This is in part a consequence of the inadequate length of the equivalent cylinder. Indeed, the term equivalent, when applied to the Rall cylinder, more accurately refers to its ability to describe the effect of inputs on the tree at the soma. By definition, a fully equivalent representation of a tree, when it exists, necessarily contains all the morphological and biophysical properties of the original tree. Conversely, these properties of the original tree are determined uniquely from a fully equivalent representation. The desire to reduce the time required to calculate transients at the soma of a compartmental model has led to several empirical generalizations of the Rall equivalent cylinder (Clements & Redman, 1989; Fleshman, Segev, & Burke, 1988), which, although recognized as approximations, are referred to as "equivalent" cables. In common with Rall's equivalent cylinder, these empirical generalizations are again incomplete representations of the original tree. There are two issues to be addressed. The first is the completion of the Rall equivalent cylinder to the equivalent cable for a tree meeting the Rall criteria, and the second is the extension of this methodology to trees of arbitrary geometry. The equivalent cable introduced here fulfills both of these requirements and, in the process of construction, naturally gives rise to dendritic subunits.
To appreciate the nature of the equivalent cable, the first step is to recognize that Rall's equivalent cylinder must be part of the equivalent cable for trees meeting Rall's criteria, since the equivalent cylinder predicts exactly the response at the soma for these trees. Generally, the part of the equivalent cable that influences the soma will henceforth be called the connected cable. The remainder of the equivalent cable, called the disconnected cable, must therefore represent all configurations of inputs on the tree that have no influence on the soma. For representational convenience, these components are displayed separately (e.g., Figure 1B) but are generated as a single object. The process of constructing the equivalent cable described in this article generates both the connected and disconnected cables in sequence and passes smoothly from the connected to the disconnected cable. Furthermore, the disconnected cable itself may be divisible into smaller components, which are still generated sequentially (see Figure 2). The equivalent cable is uniquely determined by the morphology and biophysical properties of the neuron. The construction process is independent of dendritic input and actively partitions these inputs into configurations that are close electrically but not necessarily close geographically. The final shape of the equivalent cable graphically represents the extent to which any group of electrically close configurations is isolated from or interacts with other configurations of input and the soma itself.
Figure 1: Equivalent cables for a Y-junction with limbs of equal length when terminal boundary conditions are different in A and identical in B. Arrows directed into and out of limbs illustrate depolarizing and hyperpolarizing current densities, respectively. In B, the parts of the equivalent cable are illustrated separately for representational convenience, although they are generated sequentially as a single object.
The construction process builds each part of the equivalent cable as a sequence of subunits. Each new subunit connects to its predecessor (the soma for the first subunit) so as to maintain continuity of voltage and conservation of current. The membrane potential within any subunit combines potentials from different geographical regions of the original tree, but has the property that this combination satisfies a cable equation. These requirements determine the conductance of the subunit and which configurations of input from the entire tree are combined in this subunit. The length of any subunit is the longest interval of X for which its associated membrane potential is defined. The properties of branch point and terminal boundary conditions are folded into the structure of the equivalent cable at each stage. Most important, the construction process does not require solutions of the cable equation, but instead takes advantage of its structure and the need to preserve continuity of membrane potential and conservation of current everywhere. The following sections illustrate the fundamental principles of the construction procedure and give examples of subunit structures for simple dendritic morphologies.
Figure 2: Reduction of a symmetric dendritic tree with sealed end boundary conditions to its equivalent cable. The equivalent cable D consists of a connected cable and a disconnected cable, which in this instance can be further subdivided into seven components (represented separately). The dotted lines denote the stem of the dendrite. The circled areas in A and B illustrate the transformation of a Y-junction to a connected and disconnected cable. The equivalent cable D is obtained from B via the intermediate stage C.
2 Construction of Elementary Equivalent Cables

On a uniform cylindrical dendritic segment with perimeter P, cross-sectional area A, constant specific capacitance $C_m$, constant membrane conductance $g_m$, and constant axial conductance $g_a$, it is well known that $V(X, T)$, the deviation of the membrane potential from its equilibrium value, satisfies the cable equation,

$$\frac{\partial V}{\partial T} + V = \frac{\partial^2 V}{\partial X^2} - \frac{1}{\tau g}\, J, \qquad X \in (0, L), \tag{2.1}$$
where X is electrotonic distance, T is nondimensional time, and J is the nondimensional density of exogenous current. Electrotonic distance X and physical distance x are connected by $X = x / \lambda$, while nondimensional time T and physical time t are connected by $T = t / \tau$, where the parameters $\lambda$, $\tau$, and the characteristic conductance g in equation 2.1 are defined by the formulas

$$\lambda = \sqrt{\frac{A\, g_a}{P\, g_m}}, \qquad \tau = \frac{C_m}{g_m}, \qquad g = \sqrt{g_m g_a A P}. \tag{2.2}$$
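As a concrete illustration (not taken from the article), the parameters of equation 2.2 can be evaluated for a right circular cylinder of diameter d, for which $A = \pi d^2/4$ and $P = \pi d$, so that $g = \sqrt{g_m g_a}\,(\pi/2)\, d^{3/2}$; the values below are arbitrary placeholders:

```python
import math

def cable_parameters(Cm, gm, ga, A, P):
    """Nondimensionalizing parameters of eq. 2.2 for a uniform segment.

    Cm: specific membrane capacitance, gm: membrane conductance,
    ga: axial conductance, A: cross-sectional area, P: perimeter.
    """
    lam = math.sqrt((A * ga) / (P * gm))   # electrotonic length scale, X = x / lam
    tau = Cm / gm                          # membrane time constant, T = t / tau
    g = math.sqrt(gm * ga * A * P)         # characteristic conductance
    return lam, tau, g

# cylinders of diameter 1 and 4 (arbitrary units): g scales as d**1.5,
# so quadrupling the diameter multiplies g by 8 and lam by 2
d1, d2 = 1.0, 4.0
lam1, tau1, g1 = cable_parameters(1.0, 1.0, 1.0, math.pi * d1**2 / 4, math.pi * d1)
lam2, tau2, g2 = cable_parameters(1.0, 1.0, 1.0, math.pi * d2**2 / 4, math.pi * d2)
```

This reproduces the three-halves-power dependence of g on dendritic diameter noted in the text.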
Here g is numerically the conductance $g_\infty$ defined by Rall (1989). When particularized to dendritic trees with uniform biophysical properties and segments modeled by right circular cylinders, g is directly proportional to the three-halves power of the dendritic diameter. Suppose now that a parent segment described by potential $V_P(X, T)$ branches at $X = L_P$ into two child segments described by potentials $V_A(X, T)$ and $V_B(X, T)$, respectively; then $V_A$ and $V_B$ satisfy the cable equations

$$\frac{\partial V_A}{\partial T} + V_A = \frac{\partial^2 V_A}{\partial X^2} - \frac{1}{\tau g_A}\, J_A, \qquad \frac{\partial V_B}{\partial T} + V_B = \frac{\partial^2 V_B}{\partial X^2} - \frac{1}{\tau g_B}\, J_B, \tag{2.3}$$

together with the connectivity conditions

$$V_P(L_P, T) = V_A(0, T) = V_B(0, T), \qquad \text{(continuity of potential)}$$
$$I_P = g_P \frac{\partial V_P(L_P, T)}{\partial X} = g_A \frac{\partial V_A(0, T)}{\partial X} + g_B \frac{\partial V_B(0, T)}{\partial X}. \qquad \text{(conservation of current)} \tag{2.4}$$

Define the potential $y_1(X, T)$ and conductance $g_1$ by the formulas

$$y_1(X, T) = \frac{g_A V_A(X, T) + g_B V_B(X, T)}{g_1}, \qquad g_1 = g_A + g_B. \tag{2.5}$$

Then conditions 2.4 indicate that $y_1(X, T)$ satisfies

$$y_1(0, T) = V_P(L_P, T), \qquad I_P = g_P \frac{\partial V_P(L_P, T)}{\partial X} = g_1 \frac{\partial y_1(0, T)}{\partial X}. \tag{2.6}$$

Thus $y_1(X, T)$ maintains continuity of potential and conservation of current at the branch point, thereby suggesting that if it is the potential in a section, then this section must have conductance $g_1$. The superposition property of cable equation solutions yields

$$\frac{\partial y_1}{\partial T} + y_1 = \frac{\partial^2 y_1}{\partial X^2} - \frac{1}{\tau}\left(\frac{J_A + J_B}{g_1}\right), \qquad X \in (0, L), \tag{2.7}$$
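The superposition step behind equation 2.7 can be checked numerically. The sketch below is an illustration, not the authors' construction: it assumes $\tau = 1$, an isolated section with zero-flux ends, and arbitrary conductances $g_A$, $g_B$. It integrates the two child cable equations with an explicit Euler scheme and verifies that the conductance-weighted combination of equation 2.5 evolves exactly as a single cable with current density $J_A + J_B$ and conductance $g_1 = g_A + g_B$.

```python
import numpy as np

def step(V, J, g, dX, dT):
    """One explicit-Euler step of dV/dT + V = d2V/dX2 - J/g
    (nondimensional units, tau = 1), with zero-flux ends."""
    Vp = np.pad(V, 1, mode="edge")                # zero-gradient boundaries
    lap = (Vp[2:] - 2.0 * V + Vp[:-2]) / dX**2    # discrete d2V/dX2
    return V + dT * (lap - V - J / g)

rng = np.random.default_rng(0)
nX, dX, dT = 50, 0.05, 0.001                      # dT/dX**2 = 0.4 < 0.5 (stable)
gA, gB = 2.0, 3.0
g1 = gA + gB                                      # eq. 2.5
JA = rng.normal(size=nX)
JB = rng.normal(size=nX)

VA = np.zeros(nX)
VB = np.zeros(nX)
y1 = np.zeros(nX)
for _ in range(500):
    VA = step(VA, JA, gA, dX, dT)
    VB = step(VB, JB, gB, dX, dT)
    y1 = step(y1, JA + JB, g1, dX, dT)            # eq. 2.7 with J1 = JA + JB

# by linearity, the conductance-weighted combination satisfies eq. 2.7
assert np.allclose((gA * VA + gB * VB) / g1, y1)
```

The check succeeds because each Euler step is affine in V and in J/g, so the weighted average of the child solutions follows the same recurrence as the single-cable solution.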
where $L = \min(L_A, L_B)$. Thus $y_1$ is seen to be the potential in a cable with current density $J_1 = J_A + J_B$. Since the parent branch cannot distinguish between a single connected section of characteristic conductance $g_1$ at potential $y_1(X, T)$ and the two connected child segments with conductances $g_A$ and $g_B$, the parent branch can now be extended by length L; moreover, this extension will be seamless if $g_1 = g_P$. This analysis forms the basis of an algorithm that can be applied repeatedly to a complex structure until a dendritic terminal is reached. Dendritic terminals, once encountered, begin to fashion the shape of the equivalent cable. To appreciate how this happens, it is instructive to build the equivalent cable for a binary Y-junction (denoted Y) with limbs of equal electrotonic length L and various combinations of cut (C) and sealed (S) boundary conditions (see Figure 1). In these examples, the equivalent cables have length 2L (the electrotonic length of Y) and a first section with conductance $g_1 = g_A + g_B$ and length L at potential $y_1(X, T)$. The potential and axial current at $X = L$ (the end of the first section of the equivalent cable) are, respectively,

$$y_1(L, T) = \frac{g_A}{g_1}\, V_A(L, T) + \frac{g_B}{g_1}\, V_B(L, T), \tag{2.8}$$
$$I_1(L, T) = g_1 \frac{\partial y_1(L, T)}{\partial X} = g_A \frac{\partial V_A(L, T)}{\partial X} + g_B \frac{\partial V_B(L, T)}{\partial X}. \tag{2.9}$$
2.1 Terminal Boundary Conditions

2.1.1 Different Terminals. For the combination of terminal boundary conditions shown in Figure 1A, the construction process is continued by observing that the function

$$y_2(X, T) = \frac{g_B}{g_1}\bigl(V_B(X, T) - V_A(X, T)\bigr), \qquad X \in (0, L) \tag{2.10}$$

satisfies the continuity and current balance conditions

$$y_2(L, T) = y_1(L, T), \qquad I_1(L, T) = -g_2 \frac{\partial y_2(L, T)}{\partial X}, \qquad g_2 = \frac{g_1 g_A}{g_B}. \tag{2.11}$$

Equations 2.3 can now be used to verify that $y_2(X, T)$ satisfies the cable equation,

$$\frac{\partial y_2}{\partial T} + y_2 = \frac{\partial^2 y_2}{\partial X^2} - \frac{1}{\tau g_2}\, J_2(X, T), \qquad X \in (0, L),$$

deriving in the process the current density

$$J_2(X, T) = \frac{g_A}{g_B}\, J_B(X, T) - J_A(X, T).$$
Thus, Y has an equivalent cable consisting of two subunits, each of length L. The first subunit has potential $y_1(X, T)$ and current density $J_1(X, T)$, while the second has potential $y_2(X, T)$ and current density $J_2(X, T)$ and ends on a cut terminal since $y_2(0, T) = 0$. The bijective mapping between the input current densities $J_A$ and $J_B$ on Y and the current densities $J_1$ and $J_2$ on the equivalent cable has the matrix representation

$$\begin{bmatrix} J_1(X, T) \\ J_2(X, T) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -1 & g_A / g_B \end{bmatrix} \begin{bmatrix} J_A(X, T) \\ J_B(X, T) \end{bmatrix}. \tag{2.12}$$
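A minimal numerical sketch of the mapping in equation 2.12, with illustrative conductance values: since $\det M = 1 + g_A/g_B > 0$, the matrix is invertible, so densities on the tree can be recovered from densities on the equivalent cable and vice versa.

```python
import numpy as np

gA, gB = 2.0, 3.0            # child conductances (example values)
g1 = gA + gB                 # conductance of the first section (eq. 2.5)
g2 = g1 * gA / gB            # conductance of the second section (eq. 2.11)

# eq. 2.12: map densities (JA, JB) on the tree to (J1, J2) on the cable
M = np.array([[1.0, 1.0],
              [-1.0, gA / gB]])

JA, JB = 0.7, -0.2           # an arbitrary input configuration
J1, J2 = M @ np.array([JA, JB])

# the mapping is bijective: det M = 1 + gA/gB > 0 for any positive
# conductances, so the inverse recovers (JA, JB) from (J1, J2)
Minv = np.linalg.inv(M)
assert np.allclose(Minv @ np.array([J1, J2]), [JA, JB])
```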
2.1.2 Identical Terminals. If the terminals of Y are identical (see Figure 1B), then the potentials $y_1(X, T)$ and $y_2(X, T)$ given in formulas 2.5 and 2.10, respectively, are still valid, as are the expressions for the conductances $g_1$ and $g_2$ and the mapping of current densities given in equation 2.12. However, $y_1$ and $y_2$ are now both cut at $X = L$ whenever Y has two cut terminals and are both sealed at $X = L$ whenever Y has two sealed terminals. Evidently $y_1(X, T)$ is the potential in the connected cable, and $y_2(X, T)$ is the potential in the disconnected cable. Thus, the subunits of the equivalent cable defined by $y_1$ and $y_2$ are electrically isolated. Only the component of inputs on the dendritic tree that maps to the first subunit of the equivalent cable (the connected cable) influences the soma. Those that map to the second subunit (the disconnected cable) are electrically isolated from the soma.

2.1.3 General Remarks. Under all conditions, the potential $y_1(X, T)$ defines the Rall equivalent cylinder (Rall, 1962, 1964). The current mapping (see equation 2.12) illustrates that a single input on Y generates inputs on both the connected cable (the Rall equivalent cylinder) and the disconnected cable. Thus, unique inputs on Y cannot be inferred from inputs on the connected cable alone. This simple example illustrates in Figure 1B how the Rall equivalent cylinder can be completed to give the equivalent cable, and in Figure 1A how the equivalent cable is constructed for a tree that fails to meet one of the Rall conditions. For any dendritic tree, the entire procedure can be mechanized using Householder transformations to generate the equivalent cable and the mapping connecting current densities on the dendrite to those on its equivalent cable (Lindsay, Ogden, Halliday, & Rosenberg, 1999).

2.2 Terminal Boundary Conditions Modify Equivalent Cables.
The procedure for reducing the simple Y-junction in Figure 1 to its equivalent cable can be applied to more complex dendritic structures to obtain their equivalent cables. Figure 2 illustrates the repeated application of these results to a symmetric dendritic tree formed from a number of simple Y-junctions for which the sum of the conductances of the child segments is everywhere that of the parent. All the soma-to-tip paths have the same electrotonic length, and, in the first instance, each terminal satisfies the same boundary condition.
2.2.1 All Boundary Conditions Identical. The reduction procedure starts at the terminal Y-junctions of the tree and passes through a series of equivalent branched structures, simultaneously generating an equivalent branched structure and components of the disconnected cable before finally arriving at the equivalent unbranched representation of the tree. Specifically, each third-order Y-junction is collapsed to form a disconnected cable (which will contribute a component of the final disconnected cable) of length z and a connected cable, also of length z, with the conductance of its parent. This operation produces an equivalent structure consisting of four disconnected components, each of length z, and a branched tree of order two with outermost limbs of length (y + z) (see Figure 2B). This process is repeated twice more to generate the equivalent cable (see Figure 2D) after passing through the first-order equivalent structure (see Figure 2C). The equivalent cable therefore consists of a connected cable of length (x + y + z) and uniform conductance connected to the soma, and a disconnected cable of length (x + 3y + 7z) comprising seven components that are isolated electrically from each other and from the connected cable. Provided symmetry is maintained, the entire process can be repeated for this structure when the sum of child conductances at branch points is not that of the parent; that is, when the Rall "three-halves" rule is violated. The previous construction procedure remains valid except that the connected and disconnected cables now have nonuniform conductances.
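A quick bookkeeping check on these lengths (illustrative values; the limb counts are an assumption consistent with the totals quoted in this and the next subsection): the connected and disconnected cables together always account for the tree's total electrotonic length.

```python
# Assumed structure of the symmetric tree of Figure 2: two first-order
# limbs of length x, four second-order limbs of length y, and eight
# third-order limbs of length z.  Lengths below are arbitrary examples.
x, y, z = 0.3, 0.2, 0.1
total = 2 * x + 4 * y + 8 * z          # total electrotonic length of the tree

# (connected, disconnected) cable lengths for the three boundary-condition
# configurations discussed in sections 2.2.1 and 2.2.2
cases = {
    "all terminals identical": (x + y + z, x + 3 * y + 7 * z),
    "upper half cut, lower half sealed": (2 * (x + y + z), 2 * y + 6 * z),
    "two sealed terminals, rest cut": (x + 2 * y + 2 * z, x + 2 * y + 6 * z),
}
for name, (connected, disconnected) in cases.items():
    # the equivalent cable conserves total electrotonic length
    assert abs(connected + disconnected - total) < 1e-12, name
```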
For example, if the terminal boundary conditions on the upper half-tree are identical (cut ends) but different from those on the lower half-tree (sealed ends), then two successive iterations of the reduction procedure give the structure illustrated in Figure 2C, but with different terminal boundary conditions on the residual branched tree and with the appropriate changes to the boundary conditions of the parts of the disconnected cable generated in the process. The nal reduction to the equivalent cable now gives a nonuniform connected cable of length 2( x C y C z) and a disconnected cable of length ( 2y C 6z) comprising six components, which are isolated electrically from each other and from the connected cable. If one terminal Y-junction in the upper and one in the lower half-tree end on sealed terminals while the remaining terminal Y-junctions end on cut terminals, then the reduction process gives an equivalent cable with a nonuniform connected cable of length ( x C 2y C 2z) and a disconnected cable of length ( x C 2y C 6z). In this instance, the disconnected cable can be further subdivided into a nonuniform component of length ( x C 2y C 2z) with two cut end boundary conditions and four additional components each of length z, two of which have two cut end boundary conditions and the other two with one cut and one sealed end boundary condition. There are other
Figure 3: Morphologically different dendrites with sealed end terminal segments. Although both dendrites have equivalent cables with identical connected cables, the disconnected cable of the dendrite on the left has one component, whereas that on the right has three. The dotted lines on the right of the connected cable indicate a long, narrow portion of cable.
combinations of boundary conditions amenable to the same analysis that will give rise to different equivalent cables. This example shows that terminal boundary conditions affect not only the length and shape of the connected cable, but also the form and shape of the disconnected cable, as well as the number of electrically isolated components into which it may be subdivided.

3 Classification of Dendritic Trees

The equivalent cable allows dendrites to be classified through a comparison of their connected cables. Figure 3 illustrates that two morphologically different dendrites can have the same connected cable but different disconnected cables and, consequently, different equivalent cables. The difference lies in the structure of the disconnected cable and in the mapping. The dendrites are indistinguishable with respect to the soma; any voltage response at the soma generated by one dendrite can be generated by a configuration of activity over the other. By contrast, configurations of electrical activity on either dendrite that act only locally (independent of the soma) are identified through the mapping from the respective disconnected cable to the dendrite. The different number of components in the disconnected cables means that the two trees have different local signal processing capabilities. These local interactions meet the requirement of a dendritic subunit first proposed by Koch et al. (1982) as a tool for investigating the integrative properties of arbitrary dendritic trees. In Steinberg's (1996) analytical approach to solving the cable equation for a branched structure, he notes that "some of the eigenfunctions are spread all over the dendritic tree, whereas others are restricted to the distal parts of the tree. Obviously excitation of the latter eigenfunctions does not affect the voltage of the parent branch of the dendritic tree. . . . To the extent that they occur, they cannot contribute to summations of the signal at the soma." Steinberg further notes that these "isolated eigenfunctions" are present in the analysis of Holmes, Segev, and Rall (1992). These isolated eigenfunctions relate to the disconnected cable and its components, if such exist.

4 Conclusion

Continuity of potential and conservation of current guide the construction of the equivalent cable for a dendrite in a manner that is unique for the dendrite and depends only on its morphology and set of membrane parameters. In particular, if two dendrites are deemed to be "close," then this uniqueness guarantees that the equivalent cables of both dendrites are likewise "close"; that is, the notion of the equivalent cable is robust. The equivalent cable is constructed as a single object but can always be partitioned into two disjoint parts, referred to as the connected cable and the disconnected cable (which can be empty). Dendritic subunits arise naturally in the process of generating the equivalent cable, which may be regarded as an ordered sequence of subunits. Each subunit has an electrotonic length, a conductance (graphically represented by its thickness), and a (bijective) mapping, which associates a unique configuration of inputs on the dendritic tree with a single input on the subunit.
Subunits close to the soma represent configurations of inputs that are electrically close to the soma, whereas subunits more distant from the soma represent configurations of input that have progressively less influence on the soma. Subunits comprising the disconnected cable represent configurations of inputs that are isolated from the soma. Similarly, configurations of inputs on the dendritic tree that fall on subunits of the equivalent cable connected by a region of high conductance are electrically close. By contrast, configurations of inputs on the dendritic tree that fall on subunits of the equivalent cable separated by regions of low conductance (and zero conductance in the extreme case of the disconnected cable) are effectively electrically isolated. The graphical representation of the equivalent cable identifies at a glance the regions of high and low conductance and hence provides an intuitive picture of the interaction between subunits. Zador, Agmon-Snir, and Segev (1995) introduced the morphoelectrotonic transform as a device for representing graphically the attenuation or delay experienced by a single input on the dendrite.
The extension of this procedure to multiple inputs is not straightforward. However, it would seem reasonable to expect that the morphoelectrotonic transform, when applied to a single input on the equivalent cable, would generate an equivalent attenuation and delay for the associated configuration of inputs on the dendrite. Similarly, the moment-based approach introduced by Agmon-Snir (1995) and Agmon-Snir and Segev (1993) could equally well be applied to the equivalent cable. Consequently, morphology and biophysics, expressed through the subunit structure of the equivalent cable, shape the integration of inputs on a dendritic tree.

References

Agmon-Snir, H. (1995). A novel theoretical approach to the analysis of dendritic transients. Biophysical Journal, 69, 1635–1656.
Agmon-Snir, H., & Segev, I. (1993). Signal delay and input synchronization in passive dendritic structures. Journal of Neurophysiology, 70, 2066–2085.
Clements, J. D., & Redman, S. J. (1989). Cable properties of cat spinal motoneurones measured by combining voltage clamp, current clamp and intracellular staining. Journal of Physiology, 409, 63–87.
Fleshman, J. W., Segev, I., & Burke, R. E. (1988). Electrotonic architecture of type-identified alpha-motoneurones in the cat spinal cord. Journal of Neurophysiology, 60, 60–85.
Holmes, W. R., Segev, I., & Rall, W. (1992). Interpretation of time constant and electrotonic length estimates in multi-cylinder or branched neuronal structures. Journal of Neurophysiology, 68, 1401–1420.
Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Phil. Trans. R. Soc. Lond. B, 298, 227–263.
Lindsay, K. A., Ogden, J. M., Halliday, D. M., & Rosenberg, J. R. (1999). An introduction to the principles of neuronal modelling. In U. Windhorst & H. Johannson (Eds.), Modern techniques in neuroscience research (pp. 213–306). Berlin: Springer-Verlag.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mulloney, B., & Perkel, D. H. (1988). The roles of synthetic models in the study of central pattern generators. In A. H. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements in vertebrates (pp. 415–453). New York: Wiley.
Rall, W. (1962). Theory of physiological properties of dendrites. Ann. N.Y. Acad. Sci., 96, 1071–1092.
Rall, W. (1964). Theoretical significance of dendritic trees for neuronal input-output relations. In R. F. Reiss (Ed.), Neural theory and modelling (pp. 73–97). Palo Alto, CA: Stanford University Press.
Rall, W. (1989). Cable theory. In C. Koch & I. Segev (Eds.), Methods in neuronal modelling: From synapses to networks (pp. 9–62). Cambridge, MA: MIT Press.
Steinberg, I. Z. (1996). On the analytical solution of electrotonic spread in branched passive dendritic trees. Journal of Computational Neuroscience, 3, 301–311.
Whitehead, R. R., & Rosenberg, J. R. (1993). On trees as equivalent cables. Proc. R. Soc. Lond., B252, 103–108.
Woolf, T. B., Shepherd, G. M., & Green, C. M. (1991). Local information processing in dendritic trees: Subsets of spines in granule cells of the mammalian olfactory bulb. Journal of Neuroscience, 11, 1837–1854.
Zador, A. M., Agmon-Snir, H., & Segev, I. (1995). The morphoelectrotonic transform: A graphical approach to dendritic function. Journal of Neuroscience, 15, 1669–1682.

Received March 6, 2000; accepted January 18, 2001.
LETTER
Communicated by Laurence Abbott
Computing the Optimally Fitted Spike Train for a Synapse

Thomas Natschläger, Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, Graz, Austria

Experimental data have shown that synapses are heterogeneous: different synapses respond with different sequences of amplitudes of postsynaptic responses to the same spike train. Neither the role of synaptic dynamics itself nor the role of the heterogeneity of synaptic dynamics for computations in neural circuits is well understood. We present in this article two computational methods that make it feasible to compute, for a given synapse with known synaptic parameters, the spike train that is optimally fitted to the synapse in a certain sense. With the help of these methods, one can compute, for example, the temporal pattern of a spike train (with a given number of spikes) that produces the largest sum of postsynaptic responses for a specific synapse. Several other applications are also discussed. To our surprise, we find that most of these optimally fitted spike trains match common firing patterns of specific types of neurons that are discussed in the literature. Hence, our analysis provides a possible functional explanation for the experimentally observed regularity in the combination of specific types of synapses with specific types of neurons in neural circuits.

1 Introduction
A large number of experimental studies have shown that biological synapses have an inherent dynamics, which controls how the pattern of amplitudes of postsynaptic responses depends on the temporal pattern of the incoming spike train (Katz, 1966; Magleby, 1987; Markram & Tsodyks, 1996; Thomson, 1997; Varela et al., 1997; Dobrunz & Stevens, 1999). Various quantitative models have been proposed (Varela et al., 1997; Abbott, Varela, Sen, & Nelson, 1997; Markram, Wang, & Tsodyks, 1998) involving a small number of hidden parameters, which allow us to predict the response of a given synapse to a given spike train once proper values for these hidden synaptic parameters have been found. The analysis of this article is based on the model of Markram et al. (1998), where three parameters U, F, D control the dynamics of a synapse. A fourth parameter A, which corresponds to the synaptic "weight" in static synapse models, scales the absolute sizes of the postsynaptic responses. The resulting model predicts the amplitude $A_m = A \cdot u_m \cdot R_m$ of the postsynaptic response to the $(m+1)$th spike in a

Neural Computation 13, 2477–2494 (2001) © 2001 Massachusetts Institute of Technology
[Figure 1 appears here: panel A (scatter of synapse parameters over axes U, F [sec], and D [sec], with types F1, F2, F3) and panel B (spike trains for synapses $\bar{F}_1$, $\bar{F}_2$, $\bar{F}_3$ over time [sec]).]

Figure 1: (A) Synaptic heterogeneity. Shown is the distribution of values of the parameters U, F, D for inhibitory synapses investigated in Gupta et al. (2000), which can be grouped into three major classes: facilitating (F1), depressing (F2), and recovering (F3). (B) Spike trains that maximize the sum of postsynaptic responses for three given synapses (T = 0.8 sec, n + 1 = 15 spikes). The parameters for synapses $\bar{F}_1$, $\bar{F}_2$, and $\bar{F}_3$ are the mean values for the synapse types F1, F2, and F3 reported in Gupta et al. (2000) (see Figure 2 for numerical values).
spike train with interspike intervals (ISIs) $\Delta_0, \Delta_1, \ldots, \Delta_{m-1}$, through the equations¹

$$u_{k+1} = U + u_k (1 - U) \exp(-\Delta_k / F),$$
$$R_{k+1} = 1 + (R_k - u_k R_k - 1) \exp(-\Delta_k / D), \tag{1.1}$$

which involve two hidden dynamic variables $u \in [0, 1]$ and $R \in [0, 1]$ with the initial conditions $u_0 = U$ and $R_0 = 1$ for the first spike. These dynamic variables evolve in dependence on the synaptic parameters U, F, D and the interspike intervals of the incoming spike train. It should be noted that this deterministic model predicts the cumulative response of a population of stochastic release sites that make up a synaptic connection. Gupta, Wang, and Markram (2000) reported that the synaptic parameters U, F, D are quite heterogeneous, even within a single neural circuit (see Figure 1A). Note that the time constants D and F are in the range of a few hundred ms. The synapses investigated in Gupta et al. (2000) can be grouped into three major classes: facilitating (F1), depressing (F2), and recovering (F3). In this article, we address the question of which temporal pattern of a spike train is optimally fitted, in a certain sense, to a given synapse characterized by the three parameters U, F, D. One possible choice is to look for the temporal pattern of a spike train that produces the largest sum $\sum_{k=0}^{n} A \cdot u_k \cdot R_k$ of postsynaptic responses. In the original formulation of the model (Markram et al., 1998), $A_m = A \cdot u_m \cdot R_m$ describes the amplitude of the excitatory postsynaptic current (or excitatory postsynaptic potential) triggered by the $(m+1)$th spike in a spike train. Hence, the larger the sum $\sum_{k=0}^{n} A \cdot u_k \cdot R_k$, the larger the net effect of the synapse on its target neuron. Another interesting optimality criterion is the maximal integral of synaptic current. Note that in the case where the dendritic integration is approximately linear, the two optimality criteria, maximal integral of synaptic current and maximal sum $\sum_{k=0}^{n} A \cdot u_k \cdot R_k$, are equivalent. We would like to stress that the computational methods we present are not restricted to any particular choice of the optimality criterion. For example, one can use them also to compute the spike train that produces the largest peak of the postsynaptic membrane voltage. We defer the discussion of such alternative optimality criteria to section 4 and focus in sections 2 and 3 on the question of which temporal pattern of a spike train produces the largest sum $\sum_{k=0}^{n} A \cdot u_k \cdot R_k$ of postsynaptic responses (or, equivalently, the largest integral of postsynaptic current). As we will discuss in section 2, there exists an algorithm that computes exactly, for a given synapse, the spike train producing the largest sum $\sum_{k=0}^{n} A \cdot u_k \cdot R_k$. Hence, such an algorithm maps each point (a given synapse) of the parameter space (three dimensions U, F, D in our case) onto a point (a particular spike train) in the space of spike trains (see Figure 1). More precisely, we fix a time interval T, a minimum value $\Delta_{\min}$ for interspike intervals (ISIs), a natural number n, and synaptic parameters U, F, D.

¹ To be precise, the term $u_k R_k$ in eq. 1.1 was erroneously replaced by $u_{k+1} R_k$ in the corresponding equation 2 of Markram et al. (1998). The model that they actually fitted to their data is the model considered in this article.
We then look for that spike train with n + 1 spikes during T and ISIs ≥ Δ_min that maximizes Σ_{k=0}^{n} A·u_k·R_k. Hence we seek a solution—that is, a sequence of ISIs Δ_0, Δ_1, …, Δ_{n−1}—to the optimization problem

maximize Σ_{k=0}^{n} A·u_k·R_k  under  Σ_{k=0}^{n−1} Δ_k ≤ T and Δ_min ≤ Δ_k, k = 0, …, n − 1.  (1.2)
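The objective of equation 1.2 can be evaluated directly from the recurrences of equation 1.1. A minimal sketch (the function name and the default A = 1 are our own illustrative choices, not from the article):

```python
import math

def sum_of_responses(isis, U, F, D, A=1.0):
    """Sum of postsynaptic responses A*u_k*R_k (equation 1.1) for a
    spike train given by its ISIs Delta_0..Delta_{n-1}; the first
    spike uses the initial conditions u_0 = U, R_0 = 1."""
    u, R = U, 1.0
    total = A * u * R                     # response to the first spike
    for dt in isis:
        # right-hand sides use the previous (u, R) simultaneously
        u, R = (U + u * (1.0 - U) * math.exp(-dt / F),
                1.0 + (R - u * R - 1.0) * math.exp(-dt / D))
        total += A * u * R
    return total
```

For example, `sum_of_responses([50.0] * 19, 0.32, 62.0, 144.0)` scores a regular 20-spike, 20 Hz train on the mean F3-type parameters listed in Figure 2.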
In section 2, we present an algorithmic approach based on dynamic programming (DP) that is guaranteed to nd the optimal solution of the optimization problem, equation 1.2 (up to discretization errors), and exhibit for six major types of synapses temporal patterns of spike trains that are 2 current Isyn (t) can be modeled as Isyn (t) D PnIn the linear case, the synaptic t t C (t ), (t) e e D u R ¡ t with exp( 1) (t synaptic time constant). Hence, ts ts R kD 0 k k Pnk R R s
A
Isyn (t) dt D A kD 0 uk Rk e (t ¡ t k ) dt, where tion time window is large enough ( > 5 ¢ ts ).
e (t ¡ tk ) dt is constant if the integra-
Thomas Natschläger and Wolfgang Maass
optimally fitted to these synapses. In section 3 we present a faster heuristic method for computing optimally fitted spike trains and apply it to analyze how their temporal pattern depends on the number n + 1 of allowed spikes during time interval T, that is, on the firing rate. Furthermore, we analyze in section 3 how changes in the synaptic parameters U, F, D and the spiking history affect the temporal pattern of the optimally fitted spike train. In section 4 we discuss the application of the presented computational methods for other optimality criteria than the maximal sum of postsynaptic responses.

2 Computing Optimal Spike Trains for Common Types of Synapses

2.1 Dynamic Programming. For T = 1000 ms and n = 10, there are about 2^100 spike trains among which one wants to find the optimally fitted one. We show that a computationally feasible solution to this complex optimization problem can be achieved via dynamic programming. We refer to Bertsekas (1995) for the mathematical background of this technique, which also underlies the computation of optimal policies in reinforcement learning. We consider the discrete-time dynamic system described by the equation
x_0 = ⟨U, 1, 0⟩ and x_{k+1} = f(x_k, a_k) for k = 0, 1, …, n − 1,  (2.1)
where x_k describes the state of the system at step k and a_k is the "control" or "action" taken at step k. In our case, x_k is the triple ⟨u_k, R_k, t_k⟩ consisting of the values of the dynamic variables u and R used to calculate the amplitude A·u_k·R_k of the (k + 1)th postsynaptic response and the time t_k of the arrival of the (k + 1)th spike at the synapse. The "action" a_k is the length Δ_k ∈ [Δ_min, T − t_k] of the (k + 1)th ISI in the spike train that we construct, where Δ_min is the smallest possible size of an ISI (we have set Δ_min = 5 ms in our computations). As the function f in equation 2.1, we take the function that maps ⟨u_k, R_k, t_k⟩ and Δ_k via equation 1.1 on ⟨u_{k+1}, R_{k+1}, t_{k+1}⟩ for t_{k+1} = t_k + Δ_k. The "reward" for the (k + 1)th spike is A·u_k·R_k, that is, the amplitude of the postsynaptic response for the (k + 1)th spike. Hence, maximizing the total reward J(x_0) = Σ_{k=0}^{n} A·u_k·R_k is equivalent to solving the maximization problem, equation 1.2. The maximal possible value of J_0(x_0) can be computed exactly using the equations

J_n(x_n) = A·u_n·R_n
J_k(x_k) = max_{Δ ∈ [Δ_min, T − t_k]} ( A·u_k·R_k + J_{k+1}(f(x_k, Δ)) )  (2.2)
backward from k = n − 1 to k = 0. Thus, the optimal sequence a_0, …, a_{n−1} of "actions" is the sequence Δ_0, …, Δ_{n−1} of ISIs that achieves the maximal possible value of Σ_{k=0}^{n} A·u_k·R_k. Note that the evaluation of J_k(x_k) for a single value of x_k requires the evaluation of J_{k+1}(x_{k+1}) for many different values of x_{k+1}. When one solves equation 2.2 on a computer, one has to replace the continuous-state variable x_k by a discrete variable x̃_k and round x_{k+1} := f(x̃_k, Δ) to the nearest value of the corresponding discrete variable x̃_{k+1}. This means that instead of applying equation 1.1, one updates the values of the hidden dynamic variables u and R and the auxiliary variable t_k by the equations

ũ_{k+1} = [U + ũ_k (1 − U) exp(−Δ_k / F)]_δ
R̃_{k+1} = [1 + (R̃_k − ũ_k·R̃_k − 1) exp(−Δ_k / D)]_δ
t̃_{k+1} = [t̃_k + Δ_k]_{Δt},  (2.3)

where [z]_ε denotes the nearest discrete value of z with discretization interval ε. Computer simulations (see the appendix) show that for the values of T
and n that are considered in this article, it suffices to choose a discretization where ũ_k ∈ {0, δ, 2δ, …, 1}, R̃_k ∈ {0, δ, 2δ, …, 1} with δ = 1/50 and t̃_k ∈ {0, Δt, …, T} with Δt = 1 ms. For this choice, the trajectory in the resulting discrete dynamic system, equation 2.3, approximates the trajectory in the continuous system, equation 1.1, very well. Therefore, we have based the computations of optimal spike trains for major synapse types that are discussed in the next paragraph on equation 2.3, with δ = 1/50 and Δt = 1 ms.

2.2 Results. We have applied the dynamic programming approach to six major types of synapses reported in Gupta et al. (2000) and Markram et al. (1998). The results are summarized in Figure 2. We refer informally to the temporal pattern of n + 1 spikes that maximizes the response of a particular synapse as the "key to this synapse." It is shown in Figure 3A that the "keys" for the inhibitory synapses (F̄1, F̄2, and F̄3) are rather specific in the sense that they exhibit a substantially smaller postsynaptic response on any other of the major types of inhibitory synapses reported in Gupta et al. (2000). A noteworthy—and possibly interesting—aspect of the "keys" shown in Figure 2 (and in Figures 4 and 5) is that they correspond to common firing patterns (accommodating, nonaccommodating, stuttering, bursting, and regular firing) of neocortical interneurons (reported under controlled conditions in vitro; Gupta et al., 2000, Fig. 5). For example, the optimal spike trains for synapses F̄2 and F̄3 are similar to the output of a stuttering cell. In the same manner, one can classify the optimal spike trains for synapses F̄1 and E3 as accommodating (see also Figure 4).
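The backward recursion of section 2.1 (equations 2.2 and 2.3) can be sketched as follows. This is an illustrative reimplementation with our own function names; a pure-Python run at the full resolution (δ = 1/50, Δt = 1 ms, T = 1000 ms) is slow, so coarse grids are used for quick experiments.

```python
import math
from functools import lru_cache

def dp_optimal_train(U, F, D, T, n, d_min, delta, dt, A=1.0):
    """Backward recursion of equations 2.2-2.3: returns the maximal
    sum of postsynaptic responses and the ISIs Delta_0..Delta_{n-1}
    achieving it, with u, R snapped to multiples of `delta` and spike
    times snapped to multiples of `dt`."""
    def snap(z, step):                    # [z]_eps of equation 2.3
        return round(z / step) * step

    @lru_cache(maxsize=None)
    def J(u, R, t, k):
        reward = A * u * R                # response of the (k+1)th spike
        if k == n:
            return reward, ()
        best, best_tail = float("-inf"), None
        isi = d_min
        while t + isi <= T:               # scan allowed ISIs on the grid
            u2 = snap(U + u * (1.0 - U) * math.exp(-isi / F), delta)
            R2 = snap(1.0 + (R - u * R - 1.0) * math.exp(-isi / D), delta)
            val, tail = J(u2, R2, snap(t + isi, dt), k + 1)
            if val > best:
                best, best_tail = val, (isi,) + tail
            isi += dt
        if best_tail is None:             # remaining spikes do not fit in T
            return float("-inf"), ()
        return reward + best, best_tail

    total, isis = J(snap(U, delta), 1.0, 0.0, 0)
    return total, list(isis)
```

Memoization (`lru_cache`) over the discretized states ⟨ũ, R̃, t̃, k⟩ is what makes the search over the exponentially many spike trains feasible.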
Another interesting effect arises by comparing the optimal values of the sum Σ_{k=0}^{n} u_k·R_k for synapses F̄1, F̄2, and F̄3 (see Figure 3B) with the maximal values of Σ_{k=0}^{n} A·u_k·R_k (see Figure 3C), where we have set A equal to the value of Gmax reported in Gupta et al. (2000, Table 1). Whereas the values of
[Figure 2, left column, synaptic parameters (U, F, D; F and D in ms):
F̄1: U = 0.16, F = 376, D = 45
F̄2: U = 0.25, F = 21, D = 706
F̄3: U = 0.32, F = 62, D = 144
E1: U = 0.03, F = 530, D = 130
E2: U = 0.03, F = 3000, D = 600
E3: U = 0.12, F = 3900, D = 30]

Figure 2: Spike trains that maximize the sum of postsynaptic responses for six common types of synapses (T = 1 sec, 20 spikes). The parameters (D and F in ms) for synapses F̄1, F̄2, and F̄3 are the mean values for the synapse types F1, F2, and F3 reported in Gupta et al. (2000, Table 1). Parameters for synapses E1, E2, and E3 are taken from Markram et al. (1998, Fig. 3).
Figure 3: (A) Specificity of optimal spike trains. The optimal spike trains for synapses F̄1, F̄2, and F̄3 (denoted by K1, K2, and K3, respectively) obtained for T = 1 sec and 10 spikes (see the bottom traces in Figure 4) are tested on the other two types of synapses. Plotted is the value of the sum of the postsynaptic response of each synapse F̄i to spike train Kj with j ≠ i in relation to its response to Ki. The plotted values are normalized to have value 1 for the pairs ⟨Ki, F̄i⟩, i = 1, 2, 3. (B) Absolute values of the sums Σ_{k=0}^{n} u_k·R_k if the spike train Ki is applied to synapse F̄i, i = 1, 2, 3. (C) Same as B except that the value of A·Σ_{k=0}^{n} u_k·R_k is plotted. For A we used the value of Gmax (in nS) reported in Gupta et al. (2000, Table 1). The quotient max/min is 1.3 compared to 2.13 in B.
Gmax vary strongly among different synapse types, the resulting maximal response of a synapse to its proper "key" is almost the same for each synapse. Hence, one may speculate that the system is designed in such a way that each synapse should have an equal influence on the postsynaptic neuron when it receives its optimal spike train. This effect is most evident for a spiking frequency (n + 1)/T of 10 Hz and vanishes for frequencies above 20 Hz.

3 Exploring the Parameter Space

3.1 Sequential Quadratic Programming. The numerical approach for approximately computing optimal spike trains that was used in section 2 is sufficiently fast that an average PC can carry out any of the computations whose results were reported in Figure 2 within a few hours. To be able to address computationally more expensive issues, we used a nonlinear optimization algorithm known as sequential quadratic programming (SQP),³ which is the state-of-the-art approach for heuristically solving constrained optimization problems such as equation 1.2. (See Powell, 1983, for the mathematical background of this technique.) Applying this algorithm requires calculating the partial derivatives dJ/dΔ_i of the objective function J(Δ_0, …, Δ_{n−1}) = Σ_{k=0}^{n} u_k·R_k with respect to Δ_i for i = 0, …, n − 1. The calculations of these partial derivatives are found in the appendix. To compute an optimal spike train, we perform 100 runs of the SQP algorithm with different random initializations and select the result with the highest value of Σ_{k=0}^{n} u_k·R_k. Unfortunately, the SQP algorithm is not guaranteed to find the optimal spike train, even if the numerical precision is sufficiently high, since it could in principle get stuck in local extrema.
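A crude stand-in for this step (we did not reimplement the MATLAB constr routine; the projection scheme, step size, and initialization below are our own illustrative choices, not the authors' method) is projected gradient ascent on the ISIs with numerically estimated derivatives:

```python
import math

def total_response(isis, U, F, D):
    """Objective J = sum_k u_k * R_k of equation 1.2 (with A = 1)."""
    u, R = U, 1.0
    tot = u * R
    for dt in isis:
        u, R = (U + u * (1.0 - U) * math.exp(-dt / F),
                1.0 + (R - u * R - 1.0) * math.exp(-dt / D))
        tot += u * R
    return tot

def project(isis, d_min, T):
    """Enforce Delta_k >= Delta_min and sum Delta_k <= T by clipping,
    then uniformly shrinking the slack above Delta_min."""
    isis = [max(d, d_min) for d in isis]
    slack = [d - d_min for d in isis]
    excess = sum(isis) - T
    if excess > 0 and sum(slack) > 0:
        shrink = 1.0 - excess / sum(slack)
        isis = [d_min + s * shrink for s in slack]
    return isis

def ascend(U, F, D, n, T, d_min=5.0, steps=500, lr=200.0, eps=1e-4):
    """Projected gradient ascent; keeps the best feasible point seen."""
    isis = project([T / n] * n, d_min, T)
    best, best_isis = total_response(isis, U, F, D), isis
    for _ in range(steps):
        grad = []
        for i in range(n):                 # central finite differences
            hi = isis[:]; hi[i] += eps
            lo = isis[:]; lo[i] -= eps
            grad.append((total_response(hi, U, F, D)
                         - total_response(lo, U, F, D)) / (2 * eps))
        isis = project([d + lr * g for d, g in zip(isis, grad)], d_min, T)
        val = total_response(isis, U, F, D)
        if val > best:
            best, best_isis = val, isis
    return best, best_isis
```

Like SQP, this heuristic can get stuck in local extrema, so in practice one would restart it from several random initializations and keep the best result.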
However, in all our tests, SQP has produced within a few minutes of computation time a solution that is very close to the spike train computed by the more rigorous dynamic programming approach (for comparison, see the spike trains marked by a gray background in Figure 4). This observation suggests that SQP tends to find near-optimal solutions to the optimization problem, equation 1.2.⁴

³ We used the implementation (function constr) contained in the MATLAB Optimization Toolbox (see http://www.mathworks.com/products/optimization/).

⁴ The deviations between the solutions of SQP and DP also originate in part from the fact that SQP is applied to the continuous model, equation 1.1, whereas DP has to be applied to the discrete model, equation 2.3. A useful consequence is that if the DP solution is used as initialization for the SQP algorithm, one can sometimes improve the DP solution slightly.

3.2 Optimal Spike Trains for Different Firing Rates. First we used SQP to explore the effect of the spike frequency (n + 1)/T on the temporal pattern of the optimal spike train. For the synapses F̄1, F̄2, and F̄3, we computed the optimal spike trains for frequencies ranging from 10 Hz to 35 Hz. The
Figure 4: Dependence of the optimal spike train on the spike frequency (n + 1)/T (T = 1 sec, n + 1 = 10, …, 35). For each of the synapses F̄1, F̄2, and F̄3, the optimal spike train is plotted for different frequencies (from 10 Hz to 35 Hz). To compare the solutions of SQP and DP, we also show as an example the solution for 20 Hz obtained by DP (spike trains labeled DP), which occurred already in Figure 2.
results are summarized in Figure 4. For synapses F̄1 and F̄2, the characteristic spike pattern (F̄1 … accommodating, F̄2 … stuttering) is the same for all frequencies. In contrast, the optimal spike train for synapse F̄3 has a phase transition from stuttering to nonaccommodating at about 20 Hz.

3.3 The Impact of Individual Synaptic Parameters. We now address the question of how the optimal spike train depends on the individual synaptic parameters U, F, and D. Since the parameters D and F of synapse F̄3 lie between the D- and F-values of synapses F̄1 and F̄2, it is not surprising that the most interesting effects occur when one changes U, F, D in the vicinity of the parameters of synapse F̄3. The results are summarized in Figure 5, which shows that the optimal spike train moves through the same temporal patterns (even in the same order), no matter which of the three parameters is varied. Another interesting observation is that the sum of postsynaptic responses (gray horizontal bars in Figure 5) depends primarily on the temporal pattern of the optimal spike train and not that much on the actual values of U, F, D for which this pattern is optimal. We have marked in Figure 5 the range of parameters for synapses of type F3 reported in Gupta et al. (2000, Table 1) with a black bar (the parameters of synapse F̄3 are the reported mean values for synapses of type F3). Within this parameter range, we find stuttering and nonaccommodating spike patterns, with a phase transition right in the middle.

[Figure 5 panel titles: (A) U variable (from 0.15 to 0.5), D = 144, F = 62. (B) U = 0.32, F = 62, D variable (from 60 to 380 ms). (C) U = 0.32, F variable (from 20 to 100 ms), D = 144.]

Figure 5: Dependence of the optimal spike train on the synaptic parameters U, F, D. Each panel shows how the optimal spike train changes if one parameter is varied. The other two parameters are set to the value corresponding to synapse F̄3 (see Figure 2). The black bar to the left marks the range of values (mean ± std) reported in Gupta et al. (2000) for the parameter that varies. To the right of each spike train we plotted the corresponding value of J = Σ_{k=0}^{n} u_k R_k (gray bars).

We also
calculated the optimal spike trains for the range of parameters U, F, D for synapses of the types F1 and F2 reported in Gupta et al. (2000, Table 1) (results not shown in this article). We found that the temporal pattern of the optimal spike train for the variations within the class of F1 (F2) synapses is always of class accommodating (stuttering).

3.4 Influence of the Spiking History. So far we had implicitly assumed—through the initial conditions u_0 = U and R_0 = 1—that the input spike train follows a long period of complete presynaptic inactivity. However, initial conditions other than u_0 = U and R_0 = 1 can account for an arbitrary spiking history of the synapse under consideration. This is due to the fact that an arbitrary spike train with (m + 1) spikes and ISIs Δ_0, Δ_1, …, Δ_{m−1} yields certain values u_m and R_m of the hidden dynamic variables u and R. Thus, using these values as initial values u_0 and R_0 in our computation of an optimally fitted spike train, we effectively compute the optimally fitted spike train after the synapse had previously received a spike train with ISIs Δ_0, Δ_1, …, Δ_{m−1}.⁵ Figure 6 shows the impact of different initial conditions u_0 and R_0 on the optimally fitted spike train for synapse F̄3. The basic firing pattern (nonaccommodating) does not change, but the onset of firing is delayed. The amount of delay increases with the level of depression (low R_0) and increasing u_0 (fraction of resources used per spike). This also holds for the synapses F̄1 and F̄2 (results not shown). Note that the resulting optimal spike trains for u_0, R_0 < 1 match another family of firing patterns: delayed discharge (see column 3 of Figure 5 in Gupta et al., 2000).

4 Other Optimality Criteria
In this section, we discuss the application of the computational methods we have introduced to the computation of spike trains that are optimally fitted to a given synapse in a sense other than that discussed in the previous sections (largest sum Σ_{k=0}^{n} A·u_k·R_k). A list of alternative criteria with respect to which a spike train can be defined as optimal for a given synapse may include the following:

A. Maximal amplitude of a certain (e.g., the last, i = n) postsynaptic response:

maximize u_i · R_i  (4.1)

B. Maximal amplitude of the largest postsynaptic response:

maximize max_{0 ≤ k ≤ n} {u_k · R_k}  (4.2)
⁵ Due to this notation, the first spike of each optimally fitted spike train should be interpreted as the (m + 1)th spike of the spiking history described by the ISIs Δ_0, Δ_1, …, Δ_{m−1}.
[Figure 6 panel titles: (A) u_0 variable (from 0.3 to 1.0), R_0 = 1.0. (B) u_0 = 0.32, R_0 variable (from 0.0 to 1.0).]

Figure 6: Dependence of the optimal spike train of synapse F̄3 (see Figure 2 for parameter values) on the initial values u_0 and R_0. Each panel shows how the optimal spike train changes if one value is varied. The other value is held constant.
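The initial values u_0, R_0 explored in Figure 6 can be obtained from a concrete spiking history by iterating equation 1.1 over the history's ISIs, as described in section 3.4. A minimal sketch (the function name is ours):

```python
import math

def history_state(history_isis, U, F, D):
    """Run equation 1.1 over a presynaptic spiking history and return
    (u_m, R_m), to be used as the initial conditions u_0, R_0 when
    computing the optimally fitted spike train after that history."""
    u, R = U, 1.0                  # state at the history's first spike
    for dt in history_isis:
        u, R = (U + u * (1.0 - U) * math.exp(-dt / F),
                1.0 + (R - u * R - 1.0) * math.exp(-dt / D))
    return u, R
```

A dense recent history drives R_m down (depression) and u_m up (facilitation), which is the regime in which Figure 6 shows a delayed onset of firing.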
C. Maximal peak of postsynaptic membrane potential V(t)⁶:

maximize max_{0 ≤ t ≤ T} {V(t)}  (4.3)

⁶ The membrane voltage V(t) can be modeled by τ_m dV(t)/dt = −V(t) + I_syn(t), where τ_m = 50 ms is the membrane time constant and I_syn(t) = Σ_{k=0}^{n} A·u_k·R_k·ε(t − t_k) is the synaptic current, with ε(t) = (t/τ_s) e^{−t/τ_s + 1} and τ_s = 2 ms. t_k denotes the time of the (k + 1)th spike in the input spike train.
D. Maximal specificity of several spike trains among several synapses:

maximize Σ_{1 ≤ i ≤ l} c_ii − Σ_{1 ≤ i ≠ j ≤ l} c_ij  (4.4)
where c_ij are the values of a specificity matrix C = [c_ij]_{1 ≤ i, j ≤ l}, an example of which is shown in Figure 3. Note that here l spike trains (one for each synapse) are computed simultaneously. As before, these criteria are considered under the constraints Σ_{k=0}^{n−1} Δ_k ≤ T and Δ_min ≤ Δ_k, k = 0, …, n − 1. In principle, the DP approach would allow computing the exact solution for all of the corresponding constrained optimization problems. However, one has to cope with the problem that the dimension of the state variables x_k (see equation 2.1) has to be enlarged to incorporate additional information (e.g., the current maximum or the current membrane potential). This has the fatal consequence that it is impossible to run the DP algorithm on an average PC. Fortunately, it is rather straightforward to adapt the SQP approach to these other optimization problems. Note, however, that for criteria B and C, an analytical form of the partial derivatives dJ/dΔ_i, i = 0, …, n − 1, of the corresponding objective function J(Δ_0, …, Δ_{n−1}) in general cannot be obtained. But the partial derivatives dJ/dΔ_i are piecewise continuous and can be approximated numerically (by a finite difference quotient) quite well. The optimally fitted spike trains for synapses F̄1, F̄2, and F̄3 for each of the above optimality criteria obtained with the SQP algorithm are shown in Figure 7.

Figure 7: Facing page. Optimally fitted spike trains for synapses F̄1, F̄2, and F̄3 for the four other optimality criteria. (A1) Spike trains for T = 0.5 sec and 10 spikes, which maximize the amplitude of the last synaptic response. The amplitude u_k·R_k is represented by the height of the (k + 1)th vertical line. (A2) Shows (for the same optimization criterion) the dependence of the maximally achievable amplitude of the synaptic response for the last spike on the number of spikes in the spike train. (B1) Same as A1, but spike trains maximize the amplitude of the largest synaptic response. (B2) Same as A2, but for this different optimization criterion: amplitude of the largest synaptic response for different numbers of spikes. (C1) Same as A1, but spike trains maximize the peak of the postsynaptic membrane potential (for a given number of spikes). (C2) Postsynaptic potential V(t) corresponding to the optimal spike trains shown in C1 (vertical scale depends on actual electrophysiological parameters). (D1) The three spike trains shown (K1, K2, and K3) maximize the specificity (see equation 4.4) among the synapses F̄1, F̄2, and F̄3. (D2) Specificity matrix for the spike trains shown in D1. For details, see equation 4.4 and Figure 3.
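The membrane potential model of footnote 6 can be integrated numerically to evaluate criterion C. The forward-Euler sketch below uses footnote 6's τ_m = 50 ms and τ_s = 2 ms; the step size, function name, and the convention of passing the precomputed amplitudes u_k·R_k are our own choices:

```python
import math

def peak_potential(spike_times, u_amp, T, tau_m=50.0, tau_s=2.0,
                   A=1.0, h=0.05):
    """Peak of V(t) on [0, T] for tau_m dV/dt = -V + I_syn(t), with
    I_syn(t) = sum_k A * u_k R_k * eps(t - t_k) and
    eps(t) = (t / tau_s) * exp(-t / tau_s + 1) for t >= 0 (footnote 6).
    `u_amp[k]` holds the product u_k * R_k for the k-th spike."""
    def eps(t):
        return (t / tau_s) * math.exp(-t / tau_s + 1.0) if t > 0 else 0.0

    V, peak = 0.0, 0.0
    steps = int(T / h)
    for s in range(steps + 1):
        t = s * h
        I = sum(A * a * eps(t - tk) for tk, a in zip(spike_times, u_amp))
        V += h * (-V + I) / tau_m          # forward-Euler step
        peak = max(peak, V)
    return peak
```

Because τ_m is much larger than τ_s, temporal summation favors closely spaced spikes, consistent with the bursts found for criterion C (remarks C1 and C2 below).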
In the following, we provide some remarks regarding the results: A1 and A2. Note that the resulting optimal temporal pattern of the
spike trains happens to be similar to the protocol used for determining synapse types in Gupta et al. (2000).
B1 and B2. For synapse F̄1, the last synaptic response also has the
largest amplitude. For F̄2 and F̄3, the second spike always triggers the largest synaptic response.

C1 and C2. For synapse F̄1, a burst with frequency lower than 1/Δ_min yields the largest peak in postsynaptic potential. Note that for F̄2 and F̄3, the spikes occurring after the peak in the membrane voltage are arbitrarily positioned in time.

D1 and D2. A comparison of the spike trains shown in Figure 7, D1, and Figure 2 shows that the qualitative nature of the spike trains is very similar and that the specificity shown in Figure 3 is very similar to the optimal specificity shown in Figure 7, D2.

5 Discussion
We have presented two complementary computational approaches for computing spike trains that optimize a given response criterion for a given synapse. One of these methods is based on dynamic programming (similar to reinforcement learning), the other on sequential quadratic programming. Recent studies (Gupta et al., 2000) indicate that GABAergic synapses in neocortical layers II to IV belong to one of three major classes, called F1-, F2-, and F3-type synapses, respectively. It turns out that the spike trains that maximize the response of these types of synapses are well-known firing patterns (accommodating, nonaccommodating, stuttering, bursting, and regular firing) of specific neuron types. For F1- and F3-type synapses, the optimal spike train agrees with the most often found firing pattern of presynaptic neurons reported in Gupta et al. (2000), whereas for F2-type synapses, there is no such agreement. Note, however, that the firing patterns of interneurons reported in Gupta et al. (2000) were obtained in cortical interneurons in vitro following injection of current steps and may or may not represent how these neurons discharge in vivo. There are also many in vivo studies of discharge characteristics of classes of cortical neurons (Steriade, Timofeev, Dürmüller, & Grenier, 1998; Gray & McCormick, 1996). Steriade et al. (1998) report a class of neurons—so-called fast-rhythmic bursting (FRB) neurons—with discharge patterns (in response to current injection) that have properties similar to the optimal spike train for synapse F̄2 shown in Figure 2: rhythmic bursts with a very high intraburst frequency. If the current injection into FRB neurons is increased, the number of bursts, as well as the number of spikes per burst, increases as well. This is also characteristic for the optimal spike train for synapse F̄2 (see Figure 4) if the spike frequency (n + 1)/T is increased. Hence, a synapse of the F2-type would filter the spike train in such a way that FRB neurons have the maximal possible impact on a target neuron over a broad range of firing frequencies. On the other hand, Gray and McCormick (1996)
Figure 8: Preferential addressing of postsynaptic targets. Due to the specificity of the optimal spike train of a synapse, a presynaptic neuron may address (i.e., evoke a stronger response at) either neuron A or B, depending on the temporal pattern of the spike train (with the same frequency f = (n + 1)/T) it produces (T = 0.8 sec and n + 1 = 15 spikes in this example).
report a class of neurons called "chattering cells" that display discharge of repeated (80 Hz) brief (5 spikes) high-frequency (800 Hz) bursts. We have not yet seen such spiking patterns in our studies (even if we use Δ_min = 1 ms). This is probably due to the fact that the synapse model we used (Markram et al., 1998) was derived and validated for synapses of neurons that show much lower firing frequencies. These observations show that our new computational techniques provide tools for analyzing the possible functional role of the specific combinations of synapse types and neuron types that was recently found in Gupta et al. (2000). Another noteworthy aspect of the optimal spike trains is their specificity for a given synapse (see Figure 3); suitable temporal firing patterns activate preferentially specific types of synapses. One may speculate that due to this feature, a neuron can activate a particular subpopulation of its target neurons by generating a series of action potentials (of a fixed mean frequency) with a suitable temporal pattern (see Figure 8 for an illustration). Recent experiments (Wang & McCormick, 1993; Brumberg, Nowak, & McCormick, 2000) show that neuromodulators can control the firing mode of cortical neurons. Wang and McCormick (1993) show that bursting neurons may switch to regular firing if norepinephrine is applied. Together with the specificity of synapses to certain temporal patterns, these findings point to one possible mechanism by which neuromodulators can change the effective connectivity of a neural circuit. Furthermore, we have shown in section 4 (see Figure 7, D2) that for the types of synapses discussed in Gupta et al. (2000), spike trains that maximize the response of any single one of these synapses have already close to the optimal specificity.
Our analysis provides the platform for a deeper understanding of the specific role of different synaptic parameters, because with the help of the computational techniques that we have introduced, one can now see directly how the temporal structure of the optimal spike train for a synapse depends on the individual synaptic parameters. We believe that this inverse analysis is essential for understanding the computational role of neural circuits.
Figure 9: Impact of discretization for dynamic programming. For different synapses (see Figure 2 for parameters), we plotted the average mean absolute error (mae) between the trajectory of the discrete and the continuous model for various levels of discretization intervals δ = 1/N (for details, see the text).
Appendix A: Impact of Discretization for Dynamic Programming
To assess the influence of the degree of discretization on the trajectory of the synaptic model, we measured the mean absolute error (1/(n + 1)) Σ_{k=0}^{n} |u_k R_k − ũ_k R̃_k| between the trajectory of the discrete model, equation 2.3, and the continuous model, equation 1.1. For each synapse type, we calculated the mean absolute error as an average over 1000 Poisson spike trains consisting of 20 spikes and a mean frequency of 20 Hz. The calculations were performed for various discretization intervals δ = 1/N. The results are summarized in Figure 9. We used N = 50 for the computations in section 3.

Appendix B: Partial Derivatives of the Objective Function for Use with SQP
In this section, we give a calculation of the partial derivatives dJ/dΔ_i, i = 0, …, n − 1, of the objective function J(Δ_0, …, Δ_{n−1}) = Σ_{k=0}^{n} A u_k R_k. Obviously one has to calculate the derivatives d(u_k R_k)/dΔ_i for k = 0, …, n and i = 0, …, n − 1. For k ≤ i, one has d(u_k R_k)/dΔ_i = 0, since Δ_i influences u_k and R_k only if k > i. In that case (k > i) one gets

d(u_k R_k)/dΔ_i = (du_k/dΔ_i) R_k + u_k (dR_k/dΔ_i)
= (du_k/du_{k−1}) · (du_{k−1}/dΔ_i) · R_k + u_k · ( (dR_k/du_{k−1}) · (du_{k−1}/dΔ_i) + (dR_k/dR_{k−1}) · (dR_{k−1}/dΔ_i) )

with

du_k/du_{k−1} = (1 − U) exp(−Δ_{k−1}/F),
dR_k/du_{k−1} = −R_{k−1} exp(−Δ_{k−1}/D), and
dR_k/dR_{k−1} = (1 − u_{k−1}) exp(−Δ_{k−1}/D).

Hence,

du_k/dΔ_i = (du_k/du_{k−1}) · (du_{k−1}/dΔ_i) and
dR_k/dΔ_i = (dR_k/du_{k−1}) · (du_{k−1}/dΔ_i) + (dR_k/dR_{k−1}) · (dR_{k−1}/dΔ_i)

can be calculated from du_{k−1}/dΔ_i and dR_{k−1}/dΔ_i. Hence, starting with the equations

du_{i+1}/dΔ_i = −(1/F) u_i (1 − U) e^{−Δ_i/F} = (U − u_{i+1})/F
dR_{i+1}/dΔ_i = −(1/D) (R_i − u_i R_i − 1) e^{−Δ_i/D} = (1 − R_{i+1})/D,

the calculation of d(u_k R_k)/dΔ_i for k > i is rather straightforward.
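This recursion can be implemented directly and checked against finite differences. A sketch with A = 1 (the function names are ours):

```python
import math

def objective(isis, U, F, D):
    """J = sum_k u_k R_k from equation 1.1 (A = 1)."""
    u, R, tot = U, 1.0, U * 1.0
    for dt in isis:
        u, R = (U + u * (1.0 - U) * math.exp(-dt / F),
                1.0 + (R - u * R - 1.0) * math.exp(-dt / D))
        tot += u * R
    return tot

def gradient(isis, U, F, D):
    """dJ/dDelta_i for i = 0..n-1 via the recursion of Appendix B."""
    n = len(isis)
    u, R = [U], [1.0]
    for dt in isis:                       # forward pass: u_k, R_k
        u.append(U + u[-1] * (1.0 - U) * math.exp(-dt / F))
        R.append(1.0 + (R[-1] - u[-2] * R[-1] - 1.0) * math.exp(-dt / D))
    grad = []
    for i in range(n):
        du = (U - u[i + 1]) / F           # du_{i+1}/dDelta_i
        dR = (1.0 - R[i + 1]) / D         # dR_{i+1}/dDelta_i
        g = du * R[i + 1] + u[i + 1] * dR
        for k in range(i + 2, n + 1):     # propagate the chain rule
            e_f = math.exp(-isis[k - 1] / F)
            e_d = math.exp(-isis[k - 1] / D)
            du, dR = ((1.0 - U) * e_f * du,
                      -R[k - 1] * e_d * du + (1.0 - u[k - 1]) * e_d * dR)
            g += du * R[k] + u[k] * dR
        grad.append(g)
    return grad
```

Terms with k ≤ i are omitted because their derivative vanishes, exactly as stated above; agreement with a central finite-difference quotient is a quick sanity check of the recursion.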
Acknowledgments
We thank Terry Sejnowski, Henry Markram, Misha Tsodyks, and Tony Zador for helpful and stimulating comments on our research. This work was supported by the Salk Sloan Foundation for Theoretical Neuroscience, project P12153 of the Fonds zur Förderung wissenschaftlicher Forschung (FWF), Austria, and the NeuroCOLT project of the EC (ESPRIT Working Group No. 27150).

References

Abbott, L. F., Varela, J. A., Sen, K. A., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. 1). Belmont, MA: Athena Scientific.
Brumberg, J. C., Nowak, L. G., & McCormick, D. A. (2000). Ionic mechanisms underlying repetitive high-frequency burst firing in supragranular cortical neurons. Journal of Neuroscience, 20(1), 4829–4843.
Dobrunz, L., & Stevens, C. F. (1999). Response of hippocampal synapses to natural stimulation patterns. Neuron, 22, 157–166.
Gray, C. M., & McCormick, D. A. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274, 109–113.
Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Katz, B. (1966). Nerve, muscle, and synapse. New York: McGraw-Hill.
Magleby, K. (1987). Short term synaptic plasticity. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic function. New York: Wiley.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci., 95, 5323–5328.
Powell, M. J. D. (1983). Variable metric methods for constrained optimization. In A. Bachem, M. Grötschel, & B. Korte (Eds.), Mathematical programming: The state of the art (pp. 288–311). Berlin: Springer-Verlag.
Steriade, M., Timofeev, I., Dürmüller, N., & Grenier, F. (1998). Dynamic properties of corticothalamic neurons and local cortical interneurons generating fast rhythmic (30–40 Hz) spike bursts. Journal of Neurophysiology, 79, 483–490.
Thomson, A. M. (1997). Activity-dependent properties of synaptic transmission at two classes of connections made by rat neocortical pyramidal axons in vitro. J. Physiol., 502, 131–147.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 220–224.
Wang, Z., & McCormick, D. A. (1993). Control of firing mode of corticotectal and corticopontine layer V burst generating neurons by norepinephrine. Journal of Neuroscience, 13(5), 2199–2216.

Received June 16, 2000; accepted January 23, 2001.
LETTER
Communicated by Carl van Vreeswijk
Period Focusing Induced by Network Feedback in Populations of Noisy Integrate-and-Fire Neurons

Francisco B. Rodríguez
Alberto Suárez
Vicente López
Escuela Técnica Superior de Informática, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain

The population dynamics of an ensemble of nonleaky integrate-and-fire stochastic neurons is studied. The model selected allows for a detailed analysis of situations where noise plays a dominant role. Simulations in a regime with weak to moderate interactions show that a mechanism of excitatory message interchange among the neurons leads to a decrease in the firing period dispersion of the individual units. The dispersion reduction observed is larger than what would be expected from the decrease in the period. This "period focusing" is explained using a mean-field model. It is a dynamical effect that arises from the progressive decrease of the effective firing threshold as a result of the messages received by each unit from the rest of the population. A back-of-the-envelope formula to calculate this nontrivial dispersion reduction and a simple geometrical description of the effect are also provided.

1 Introduction
One of the most remarkable phenomena that take place in natural information processing networks, such as the brain or a biochemical system, is the ability to maintain functionality even in the presence of significant levels of noise. Sources of noise are generally present in biological systems, and despite recent insight (Shimokawa, Rogel, Pakdaman, & Sato, 1999; Frank, Daffertshofer, Beek, & Haken, 1999), the role of noise regarding the processing of information has not yet been clarified. In early studies, it was assumed that fluctuations are an undesired, if unavoidable, element in such systems. Consequently, most models studied in connection with information processing assume that noise is but a small perturbation, whose effect on the functionality of the system is pernicious and therefore should be minimized. A remarkable exception to this trend is the work on stochastic resonance (Gammaitoni, Hänggi, Jung, & Marchesoni, 1998), where noise is a source of stability and organization in the dynamics. The increasing amount of work devoted to characterizing synchronic firing also looks for information processing in spite of noise. The synchronization of neural signals, supported by neural recordings, leads to a paradigm of coherent perception and feature binding that can persist further in a noisy environment than the stimulus-induced activity of the isolated units (von der Malsburg, 1994; Gray, König, Engel, & Singer, 1989; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996). Moreover, synchronic message interchange is a cooperative phenomenon observed in populations of quite different living units, ranging from fireflies (Buck, 1988) to opera theater attendants (Néda, Ravasz, Brechet, Vicsek, & Barabási, 2000). An analysis of cooperative behavior leading to relatively simple dynamical states that arise from interacting complex units is usually undertaken in the context of synchrony search (Pham, Pakdaman, & Vibert, 1998). Theoretical work devoted to the understanding of population synchrony has been growing since the seminal work of Mirollo and Strogatz (1990), where the study of a pulse-coupled population of oscillators was reduced to the analysis of firing mappings. Since then, synchronization of oscillators has been studied following different theoretical approaches: phase models suited for the case of weak interactions (Izhikevich, 1999), spike response models following integral equations (Gerstner, 1995, 2000), spike density models evolving according to Fokker-Planck equations (Kuramoto, 1991), and chaotic maps (Kaneko, 1994). The original firing map approach of Mirollo and Strogatz has also been continued and extended (see, e.g., Ernst, Pawelzik, & Geisel, 1995; Mathar & Mattfeldt, 1996; Bottani, 1996; van Vreeswijk, 1996). This kind of work can be considered as the predecessor of this article, although we focus on a regime where interactions between the units are not strong enough to produce synchronization. One of our goals in this work is to construct a simple discrete-probabilistic model where the effects of noise and interactions can be investigated.

Neural Computation 13, 2495-2516 (2001) © 2001 Massachusetts Institute of Technology
The model system should be able to reproduce a rich behavior with a small number of parameters, so that analytical estimations of relevant properties can be easily derived. It is our belief that in biological information processing systems, fluctuations should play an important role. Hence, the focus is on the dynamics of systems with a large amount of randomness. The use of discrete states and discrete times allows for a complete analysis of population dynamics. Furthermore, computer simulations are feasible and stable even for populations with a large number of units. The system we investigate is a network of integrate-and-fire neurons exchanging excitatory messages. The dynamics of each of these units are stochastic, which implies that the firing period is defined only as an average quantity about which the actual interspike delay fluctuates. Two different regimes are found depending on the degree of interaction among neurons. For strong interaction among network units, the system settles into a regime where units fire regularly in groups (Rodríguez & López, 1999). In this state, which closely resembles the one described in van Vreeswijk and Abbott (1993), the strength of incoming messages is enough to produce synchronization and even to override stochasticity.
The focus of this article is on a regime with smaller interaction, where neurons are observed to fire independently. Simulations show that the messages that arrive at any given neuron cover uniformly the time elapsed between the production of two consecutive spikes by that neuron. Although the system is far from synchronization, every neuron is generating spikes with a period that is more regular than would be expected if the dynamics of messages received from the population were neglected. This narrowing of the period distribution (period focusing) can be accurately described within a mean-field model where the effect of the messages received is represented by a progressive lowering of the unit firing threshold.

2 Dynamics of the Neural Model
The neural model studied in this article is composed of a globally coupled network of nonleaky stochastic integrate-and-fire units. The dynamics are entirely discrete. Each of the units can exist only in one of a finite number of states, labeled l = 1, 2, ..., L − 1, where all states beyond L − 1 are metastable states that deterministically produce the release of a signal to the rest of the units (a spike) in the following time step. Transitions between states can take place only at discrete time steps. The discrete nature of the model makes it possible to carry out a detailed analysis of the phenomena that arise in the network as a result of interactions between the units. Consider the dynamics of an isolated neuron. In the absence of coupling, each unit performs a random walk (Gerstein & Mandelbrot, 1964). At every time step, the state of a unit can be incremented with probability p, or it can remain unchanged with probability (1 − p). The state of the ith unit in the network at time t is given by the variable a_i(t), whose domain is the state labels 1, 2, ..., L − 1. When the state of the neuron reaches the level L, the unit generates a spike with probability one. States with labels L and beyond are identified with the state of label 0. The spike is produced in the deterministic transition from the state with label 0 to the state with label 1. To summarize, the dynamics are those of a stochastic oscillator characterized by a probability p and a threshold L:

a_i(t + 1) = a_i(t) + 1 with probability p, or a_i(t) with probability (1 − p), if a_i(t) < L;
a_i(t + 1) = 1 with probability 1, if a_i(t) ≥ L.   (2.1)
The time evolution of the probability distribution for an ensemble of N independent oscillators starting at the origin (a_i(0) = 1; i = 1, 2, ..., N) at time 0 is given by the binomial distribution

P_B(l, t) = C(t, l − 1) p^(l−1) (1 − p)^(t−l+1),   l = 1, 2, ..., t + 1,   (2.2)

where C(n, k) denotes the binomial coefficient.
The time elapsed between the generation of consecutive spikes follows a negative binomial distribution (Feller, 1971),

P_NB(T; L, p) = C(T − 2, L − 2) p^(L−1) (1 − p)^(T−L),   T > 1.   (2.3)

For this distribution, we can derive the mean interspike delay and its standard deviation:

τ = 1 + (L − 1)/p,
σ = √((L − 1)(1 − p)) / p.   (2.4)
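As a numerical sanity check, equation 2.3 can be evaluated directly and its first two moments compared with equation 2.4. A minimal sketch (helper names are ours):

```python
from math import comb, sqrt

def p_nb(T, L, p):
    """Negative binomial interspike distribution of equation 2.3."""
    if T < L:
        return 0.0               # at least L - 1 advances plus the firing step
    return comb(T - 2, L - 2) * p ** (L - 1) * (1 - p) ** (T - L)

def nb_moments(L, p, t_max=5000):
    """Mean and standard deviation of the (truncated) distribution."""
    mean = sum(T * p_nb(T, L, p) for T in range(2, t_max))
    var = sum((T - mean) ** 2 * p_nb(T, L, p) for T in range(2, t_max))
    return mean, sqrt(var)
```

For L = 31 and p = 0.5 this reproduces τ = 61 and σ = √(30 · 0.5)/0.5 ≈ 7.75, in agreement with equation 2.4.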
This model for the isolated neuron is compatible with many sources of noise. The probabilistic drift given to the unit dynamics through the parameter p could arise from any source (internal or external to the unit or the network), provided that it is Poissonian. We should also point out that the model we use in this work yields spike trains with a small coefficient of variation (C_v ≡ σ/τ ∼ L^(−1/2) as L → ∞), which cannot account for the behavior observed in cortical neurons (C_v ∼ 1). It is understood that to reproduce such a noisy train of spikes, both excitatory and inhibitory inputs need to be considered (Shadlen & Newsome, 1994). Research with an extended model including both types of inputs is now in progress. Although the model can be easily extended to consider continuous time, we stay in the discrete-time domain, since it has two main advantages. First, it allows for a simpler explanation of the observed behavior, and, second, it leads to a simple and coherent procedure to update the dynamics of the population. Complex differential equations are avoided in the model interaction we have selected, which is explained in what follows. The interaction between units is a delayed spike transmission through the neural connection, which produces a change in the state of the receiving unit. More explicitly, a spike of unit j at time t − 1 affects the state of unit i at time t according to the rule

a_i(t) = a_i(t − 1) + ε_ij(t),   (2.5)
where the magnitude ε_ij(t) refers to the connection strength or, in other words, the size of the message received by unit i from unit j. Only integer positive values are allowed for ε_ij. If, after receiving the message associated with a spike, the activity of unit i reaches a state at or above the threshold L, the unit fires and resets its state to 1 in the following time step, regardless of the precise value of a_i(t) before receiving the message.

We focus here on an ensemble of N interacting units with global coupling of equal strength ε and with a fully connected architecture. Four parameters are sufficient to characterize the ensemble: N, the size of the ensemble; ε, the message strength; and L and p, describing the evolution of the isolated units. In this article, we present the results in the limit of small ε. Since we can only reduce the size of ε to 1 in our discrete model, results are given only for the case ε = 1. Nonetheless, ε is not dropped from any expression. In the presence of a nonzero coupling ε ≠ 0, besides the spontaneous evolution of the neural units, given by equation 2.1, there is an additional step to account for interactions among units,

a_i(t + 1) = a*_i(t + 1) + ε s_i(t),   (2.6)

where a*_i is the state of unit i after spontaneous evolution and s_i(t) is the number of units, excluding neuron i, that fired in the previous time step. Physically, the quantity s_i(t) represents the number of messages received by unit i at time t. The maximum value this variable can take is Ω = N − 1. The quantity Ω is also the number of spikes that a test unit receives on average between two consecutive spikes. Therefore, Ωε represents the average total strength of the messages received by a single unit in the period between two consecutive spikes. In order to characterize the strength of interactions among the units, we introduce the parameter

η = (L − 1) / (Ωε),   (2.7)

which measures the number of times each of the units in the ensemble has to send a message, on average, in order to induce a spike in a test neuron, assuming that no spontaneous evolution occurs in the test neuron. The inverse of this parameter represents the fraction of the evolution of a given unit that is accounted for by the messages received from the other neurons. The value of this parameter determines the regime in which the dynamics of the network take place. In the limit η → ∞, the dynamics are dominated by the spontaneous evolution of the units. As η decreases and gets closer to 1, the relative importance of the coupling to the other units increases. We proceed now to examine the different regimes in detail.

3 The Limit of Large η
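To make the update rules concrete, here is a minimal simulation sketch of the coupled ensemble (equations 2.1 and 2.6); function and variable names are our own, and some timing details (e.g., the exact step at which an overshooting unit resets) are simplified choices rather than taken from the article.

```python
import random

def simulate(N, L, p, eps=1, t_steps=20000, seed=1):
    """Sketch of the globally coupled ensemble (equations 2.1 and 2.6).

    Units at or above the threshold L fire and reset to state 1; every
    other unit advances with probability p and then receives eps per
    spike emitted by the rest of the population in the previous step.
    Returns the interspike intervals of unit 0 (a test unit).
    """
    rng = random.Random(seed)
    a = [rng.randrange(1, L) for _ in range(N)]   # random initial phases
    prev_spikes = [False] * N                     # who fired in the last step
    last, intervals = None, []
    for t in range(t_steps):
        n_prev = sum(prev_spikes)
        spikes = [False] * N
        for i in range(N):
            if a[i] >= L:
                a[i] = 1                          # fire and reset (eq. 2.1)
                spikes[i] = True
            else:
                if rng.random() < p:
                    a[i] += 1                     # spontaneous advance
                # messages from units that fired previously, excluding i (eq. 2.6)
                a[i] += eps * (n_prev - (1 if prev_spikes[i] else 0))
        if spikes[0]:
            if last is not None:
                intervals.append(t - last)
            last = t
        prev_spikes = spikes
    return intervals
```

With N = 11, L = 500, p = 0.9 (η = 49.9), the measured mean period comes out near the value 544.3 reported for this case in Table 1.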
The dynamics of the network at the microscopic level (i.e., the level of each of the units) are complex and do not convey much relevant information. Borrowing from statistical mechanics the techniques for the analysis of collective behavior in complex systems, we focus on the mesoscopic level and describe the properties of ensembles of such networks. In particular, we focus on the statistics of spike production in the system. Presumably, the production of spikes and their temporal structure is the mechanism by which information is processed in the brain (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). That is the reason for our focusing on this aspect of the dynamics. Let us denote by T_i the interval between the generation of two consecutive spikes by unit i, which has been arbitrarily singled out in the network. There are several different averages that will be performed. A first average is made over time for each of the units of the network. The mean firing interval for unit i over time is τ_i^t and its standard deviation is σ_i^t. These quantities give us an idea of the distribution of firing intervals of a particular unit in the network. The mean of the time-averaged firing interval over the ensemble of network units and over the different initial configurations for the network ensemble (average over the phases of the stochastic oscillators, assuming random initial phases) is denoted ⟨τ_i^t⟩_{e,φ0}. Similarly, the mean of the standard deviation over the ensemble and initial phases is ⟨σ_i^t⟩_{e,φ0}. Focusing on the regime in which most of the evolution can be accounted for by the spontaneous evolution of the units, η → ∞, we can give an estimate for the mean and standard deviation of the unit firing time by the following argument: In the absence of coupling between units, η = ∞, the average interspike period and its standard deviation are given by equations 2.4. Since, on average, during an interval between two spikes of unit i, all neurons fire once, the strength of the messages received by this unit in this interval is Ωε. Assuming all of these messages are effective in increasing the state of the unit in question, the ensemble should behave as a collection of independent stochastic oscillators with a renormalized threshold,

L̃ = L − Ωε.   (3.1)
Under these assumptions, the estimated values for the mean and standard deviation of the firing period are

τ̂ = 1 + (L − Ωε − 1)/p,
σ̂ = √((L − Ωε − 1)(1 − p)) / p.   (3.2)

Since the stochastic evolution of every unit is affected by the same type of message perturbation from the ensemble, these quantities should be the same for all N units in the system. The approximation given by equations 3.2 is valid only if the spikes from the neurons are well separated from each other. Quantitatively, this means that the average time per spike should be larger than the typical width of the firing-period distribution,

τ̂ / N ≫ σ̂.   (3.3)

In terms of η, N, and p, condition 3.3 reads

N ≪ (η − 1) / (1 − p),   (3.4)
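The estimates of equation 3.2 are straightforward to evaluate; the sketch below (our naming) reproduces the τ̂ and σ̂ columns of Table 1 with Ω = N − 1 and ε = 1.

```python
from math import sqrt

def renormalized_estimates(L, N, p, eps=1):
    """Mean and standard deviation of equation 3.2, with Omega = N - 1."""
    omega_eps = (N - 1) * eps                 # total message strength per period
    tau_hat = 1 + (L - omega_eps - 1) / p
    sigma_hat = sqrt((L - omega_eps - 1) * (1 - p)) / p
    return tau_hat, sigma_hat
```

For L = 1000, N = 11, p = 0.9 this gives τ̂ ≈ 1099.89 and σ̂ ≈ 11.05 (first row of Table 1); for L = 500, N = 101 it gives τ̂ ≈ 444.33 and σ̂ ≈ 7.02 (last row).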
Table 1: Results for Large η.

L      N     η      ⟨τ_i^t⟩_{e,φ0}    ⟨σ_i^t⟩_{e,φ0}   τ̂         σ̂
1000   11    99.9   1099.9 (0.11)     10.9 (0.08)      1099.89    11.05
500    11    49.9   544.3 (0.08)      7.6 (0.07)       544.33     7.77
1000   101   9.99   1000.0 (0.03)     9.5 (0.03)       999.89     10.54
500    101   4.99   444.56 (0.02)     5.7 (0.02)       444.33     7.02

Note: Averages are performed over 100 initial phases, and time evolution is allowed up to 1000L. p = 0.9 in all cases.
or, equivalently,

η ≫ (1 − p) N + 1.   (3.5)

In Table 1 we compare the results of a simulation for different values of η with the theoretical estimates. The results in the fourth and fifth columns of Table 1 are the empirically measured mean firing time and dispersion for a unit. Between parentheses we display the standard deviation of these values when we average the individual firing times over the whole ensemble of units and the initial phases. It is worth remarking that these standard deviations are small, which confirms the hypothesis that a homogeneous state for the ensemble of units obtains. Columns 6 and 7 present the prediction given by equation 3.2. For the cases with N = 11 (where condition 3.5 translates into η ≫ 10), both the mean and the standard deviation predicted are fairly close to the actual ones, which have been extracted from the simulations. For a number of oscillators N = 101, equation 3.5 would require η ≫ 100. This condition is violated for the cases with N = 101 displayed in Table 1, and the simple estimation formulas 3.2 fail to produce a good-quality prediction for the standard deviation. In all cases, although the prediction for the mean firing time is fairly accurate, the standard deviation is systematically overestimated in the renormalized independent oscillators picture. This means that the interplay of the spontaneous evolution and the mechanism of message exchanges leads to a narrower distribution for the firing period than the simple argument leading to the variance estimate of equation 3.2 predicts.

4 The Limit of Intermediate η
The picture of renormalized independent oscillators can hold only in the limit of large η. The failure of this approximation as the relative importance of the interactions with other neurons is increased is made evident by the following observations. First, as η decreases, the percentage of messages that are effective in contributing to the evolution of other units decreases. Second, the interactions between the different units induce correlations in their dynamics, which can no longer be considered as independent. Third, the exchange of messages between the units has a regularizing effect that further reduces the dispersion of the firing period.

Figure 1: Average of Ω_eff ε versus the parameter η. The error bars show the deviation σ of Ω_eff ε. The results are measured on an ensemble of 100 neurons with p = 0.9. Averages are performed over 50 initial phases, and time evolution is allowed up to 1000L.

The first phenomenon is illustrated in Figure 1, which displays the number of messages in a system composed of 100 units that are effective in the evolution of a test unit as a function of η. The amount Ω_eff ε represents the effective number of messages that a test unit receives from the population between two consecutive spikes. Not all messages are effective: messages coming from spikes that occur in the instant preceding the firing of a test unit are effectively dissipated, because they induce a transition that overshoots the threshold. When the interactions are weak (large η), most of the oscillators reach their threshold by spontaneous evolution. The spikes of the units thus occur in isolation, and almost all the messages produced are effective in affecting the evolution of the remaining units under the threshold. As the strength of the interactions increases, some neurons start to fire simultaneously, and it becomes a relatively frequent event that a given oscillator fires only after receiving the messages coming from the spikes of other units. The appearance of correlations in the dynamics can also be observed in the simulations. Intuitively, correlations appear because the firing of a neuron produces messages that increase the state of the remaining neurons and therefore increase the likelihood of neurons' firing immediately after. In order to characterize this effect quantitatively, we measure the correlations
Figure 2: Results of simulations for an ensemble of 11 units, where the parameter η has the value 4.634. (Top plots) The number of messages θ(t) received by a test unit at time t and its deviation σ. (Bottom plots) The correlation function C_θ(10, τ) versus the delay τ. We show results for two probabilities, p = 0.5, 0.9. Averages are performed over 50 initial phases, and time evolution is allowed up to 1000L.
in θ(t), the number of messages received by a test unit at time t, where the test unit is placed in state l = 1 at t = 0. The correlation function of this quantity is

C_θ(t, τ) = ⟨ (θ(t + τ) − θ̄(t + τ)) (θ(t) − θ̄(t)) ⟩,   (4.1)

where the bar (and the outer bracket) indicates an ensemble average. These correlations for t = 10 are presented in Figure 2. The structure of the correlations exhibits two salient features: The positive component at short delays reflects the fact that spikes in a neuron or group of neurons favor the occurrence of other spikes, owing to the production of messages that increase the state of the other neurons. Similarly, a fluctuation where the number of spikes is lower than the average implies that fewer messages are produced, thus decreasing the likelihood of further spikes in the remaining neurons. There is an echo of these correlations centered at a delay equal to the average period of the units, τ̂. The peak is wider due to the fluctuations arising from the spontaneous evolution of the units in that interval. In particular, the echo is less defined for p = 0.5, because the magnitude of the fluctuations is larger. There are also long-range negative correlations. These are a consequence of the approximate conservation of the number of neurons that fire between two consecutive spikes of a given unit,

Σ_{t=0}^{τ̂} θ(t) ≈ Σ_{t=0}^{τ̂} θ̄(t),   (4.2)

which implies

Σ_{τ=0}^{τ̂} C_θ(t, τ) ≈ 0.   (4.3)

Since the correlations during an average firing period should average to approximately zero, there must be a negative contribution at longer delays to compensate the positive contribution of the peak of the correlations at short delays. Given that the direct influence of a given unit firing is short-ranged in time, the negative correlations are constant. The observation of these constant negative long-range correlations, which appear as a consequence of global conservation laws, has been previously reported in the literature (Bussemaker, Ernst, & Dufty, 1995; Suárez, Boon, & Grosfils, 1996). This phenomenon can also be observed in the analysis of the accumulated number of messages received by the test unit in the interval [0, t]:

Θ(t) = Σ_{t′=0}^{t} θ(t′).   (4.4)
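Given message counts θ(t) recorded over many runs, the ensemble estimator of equation 4.1 is a one-liner. The numpy sketch below uses our own naming and expects one row of message counts per run (per initial phase), as in the averages behind Figure 2.

```python
import numpy as np

def corr_theta(theta, t, tau):
    """Ensemble estimate of C_theta(t, tau) from equation 4.1.

    theta has shape (n_runs, T): theta[r, t] is the number of messages
    received by the test unit at time t in run r.
    """
    dev = theta - theta.mean(axis=0)          # fluctuation about bar(theta)(t)
    return float((dev[:, t + tau] * dev[:, t]).mean())
```

On synthetic data with a known dependence between two time columns, the estimator recovers the imposed covariance and returns zero where the columns are independent.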
Figure 3 exhibits the roughly linear increase of Θ(t) with the amount of time elapsed. The slope of this straight line can be easily deduced from the fact that the accumulated number of messages received by the test unit during the mean firing period, τ, should be

Θ(τ) ≈ Ω_eff ε,   (4.5)

where Ω_eff < N − 1, the difference being attributable to the dissipated messages that only contribute to overshooting over the threshold of the unit. The average linear increase observed for the accumulated number of messages received by the test unit shows how far from synchronization the neurons in the ensemble are in the large η regime. Moreover, a simple description of the firing regime reached by the system is that neurons fire independently, covering uniformly the time frame between two consecutive spikes.
Figure 3: Results of simulations for an ensemble of 11 units, where the parameter η has the value 4.634. The plots show the accumulated number of messages Θ(t) received by a test unit in an interval covering one firing period. Averages are performed over 50 initial phases, and time evolution is allowed up to 1000L. We show results for two probabilities, p = 0.5, 0.9.
On average, one neuron from the ensemble fires every (η − 1)/p time units. This result will be understood more precisely with the mean-field approach presented in the next section. The characteristic form of the dispersion (with a narrowing around the firing period) is due to the presence of short-time positive correlations and long-time negative correlations. We emphasize that the focus of this article is on a regime (intermediate and large values of η) where these correlations need not be explicitly considered in estimating the mean and standard deviation of firing periods. For small values of η, correlations are of crucial importance to understand the temporal firing patterns that appear in the system (van Vreeswijk & Abbott, 1993; Rodríguez & López, 1999). An analysis of the strong interaction regime is outside the scope of this work.

Another important phenomenon observed in the simulations is the interplay between the intrinsic fluctuations of the oscillators and the dynamic exchange of messages between the units, which leads to a reduction of the dispersion of the firing period. Qualitatively, this effect can be understood as follows: Assume that, owing to a fluctuation in the spontaneous evolution, one of the units in the network takes longer to fire than average. During the longer spontaneous evolution, this unit will also tend to receive a larger number of messages. Thus, the contribution from the messages received from other units in the ensemble tends to reduce the duration of the interspike interval. In the opposite case, when the spontaneous evolution is faster than the average, which would lead to a shorter firing period, the unit

Figure 4: Comparison between the distributions for the time when the threshold is reached in the model with a renormalized threshold (continuous curve) and the mean-field model (histogram), for L = 31, N = 16, p = 0.5, η = 2. The horizontal line corresponds to the renormalized threshold (see equation 3.1), and the staircase line corresponds to the effective threshold (see equation 5.2) in the mean-field model. A number of typical trajectories of the neuron in the space of activation states are drawn. The mean-field distribution has been obtained in a simulation, where a gaussian smoothing term that reflects the effect of fluctuations of the effective threshold has been used.
receives on average fewer messages from the other neurons, which results in a delay in the production of a spike. The net result is a narrowing of the distribution of firing times. This effect can be studied in a mean-field model of the neural ensemble, which we describe in detail in the following section.

5 Mean-Field Theory of the Oscillator Ensemble

The narrowing of the firing period distribution caused by the interplay between fluctuations in the evolution of the oscillators and the accumulated effect of messages received by a given neuron can be studied in greater detail within the framework of mean-field theory (see Figure 4). Consider a test neuron that is singled out from the ensemble. The spontaneous evolution of this test unit is governed by the dynamics specified in equation 2.1. The effect of the messages from the remaining oscillators can be accounted for by a progressive lowering of the threshold. Assuming that we measure time from the instant that the test neuron has reached the threshold and is about to fire (i.e., a_test(t = 1) = 1), the mathematical expression for the effective threshold for t > 0 is

L̃(t) = L − Θ(t),   (5.1)

where Θ(t) (see equation 4.4) is the accumulated number of messages received by the test unit in the interval [0, t]. Θ(t) is a fluctuating quantity, and, in principle, it depends on the actual state of the test neuron. The results of the simulations presented in Figure 3 indicate that for large and intermediate values of η, the fluctuations of Θ(t) are actually small and can, in a first approximation, be neglected. Furthermore, the increase with time of the accumulated number of messages received by the test neuron is approximately linear, which indicates that the firing probability density of the remaining neurons in the network is roughly constant and independent of the state of the test oscillator (i.e., correlations can be neglected). This is the case because for η ≫ 1, the feedback effect from the network is not sufficient to induce synchronization among the units. These then behave approximately as if they were independent integrate-and-fire oscillators with a threshold that decreases in time at a constant rate a. As η approaches one, the feedback effect of the network becomes stronger, and correlations between the neurons are no longer negligible. In fact, at η = 1, the oscillators fire periodically within synchronization clusters (van Vreeswijk & Abbott, 1993; Rodríguez & López, 1999). The theoretical analysis of such a regime, where the simplifying assumptions we make in the formulation of this mean-field theory are no longer valid, is more involved and outside the scope of this article. In summary, in the regime of intermediate and large η, we can assume that in equation 5.1, the quantity Θ(t) can be replaced by its average Θ̄(t), which increases approximately linearly with time,

Θ̄(t) ≈ at,   L̃(t) ≈ L − Θ̄(t) ≈ L − at.   (5.2)

The speed a of the lowering of the threshold is determined in a self-consistent manner by requiring that, on average, during the period between two consecutive spikes of the test neuron, the remaining neurons have fired. The distribution of firing periods predicted by this model is given by the expression

P_mf(τ) = Σ_{l=L_l}^{L_u} C(τ − 2, l − 1) p^(l−1) q^(τ−1−l) + C(τ − 2, L_l − 2) p^(L_l−1) q^(τ−L_l),   (5.3)

where q = 1 − p.
The quantities L_l and L_u are, respectively, the lower and upper limits of the states that are swept through by the lowering of the threshold at each time step,

L_l = ⌈L̃(τ)⌉,
L_u = ⌊L̃(τ − 1)⌋ if L̃(τ − 1) is not an integer, and L_u = ⌊L̃(τ − 1)⌋ − 1 if L̃(τ − 1) is an integer.   (5.4)
The first term in equation 5.3 represents the contribution from the states that are swept through by the lowering of the threshold. The second term corresponds to the instances in which the oscillator reaches the threshold by spontaneous evolution. From the distribution of the firing period given by equation 5.3, we can estimate the expected period and its standard deviation:

⟨τ⟩_mf = Σ_τ τ P_mf(τ),
σ_mf = [ Σ_τ (τ − ⟨τ⟩_mf)² P_mf(τ) ]^(1/2).   (5.5)
Analytical expressions for these quantities can be obtained if some simplifying assumptions are made. The details of the derivation of the approximate expressions for the mean and the standard deviation of the firing period are given in appendix A. The resulting expressions are

$$\tau_{mf} = \hat{\tau}, \qquad \sigma_{mf} = \frac{p}{p+a}\,\hat{\sigma} = \frac{g-1}{g}\,\hat{\sigma},$$
(5.6)
where $\hat{\tau}$ and $\hat{\sigma}$ are given by equation 3.2, and the mean speed of the lowering of the threshold is

$$a = \frac{V_2}{\tau_{mf} - 1} = \frac{p}{g-1}.$$
(5.7)
It is clear from equation 5.6 that whereas the firing period is the same as in the model with a renormalized threshold, the width of the distribution is reduced by the factor

$$\frac{g-1}{g} < 1.$$
(5.8)
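Equation 5.6 can be checked by direct Monte Carlo simulation of the mean-field model of appendix A: the unit's state advances by a Bernoulli(p) step each time step while the effective threshold falls linearly at speed a. The parameter values below (L = 1000, V₂ = 100, p = 0.9, chosen to match the third row of Table 2) and the closed form used for σ̂ (the negative-binomial dispersion of the walk toward the fixed renormalized threshold L − V₂) are illustrative assumptions of this sketch, not quantities taken from the text:

```python
import numpy as np

# Mean-field model of appendix A: state a(t) = 1 + sum of Bernoulli(p) steps,
# effective threshold L~(t) = L - a*t; the unit fires when the state reaches it.
rng = np.random.default_rng(0)

L, V2, p = 1000, 100, 0.9          # illustrative values (cf. Table 2, third row)
a = V2 * p / (L - V2 - 1)          # self-consistent speed, from A.8 and A.9
tau_mf = 1 + (L - V2 - 1) / p      # predicted mean period, equation A.9

# Renormalized-threshold dispersion (negative-binomial first passage to L - V2),
# assumed here to play the role of sigma-hat in equation 5.6.
sigma_hat = np.sqrt((L - V2 - 1) * (1 - p)) / p
sigma_mf = p / (p + a) * sigma_hat  # predicted dispersion, equation 5.6

trials, max_t = 2000, 1300
steps = rng.random((trials, max_t)) < p
state = 1 + np.cumsum(steps, axis=1)
t = np.arange(1, max_t + 1)
crossed = state >= L - a * t        # d(t) <= 0 in equation A.5
T = crossed.argmax(axis=1) + 1      # first-passage time
tau = T + 1                         # plus the deterministic discharge step

print(f"mean period: sim {tau.mean():.1f}  theory {tau_mf:.1f}")
print(f"dispersion : sim {tau.std():.2f}  theory {sigma_mf:.2f}")
```

With these values the predicted dispersion is close to 9.5, noticeably below the renormalized-threshold value σ̂ ≈ 10.5, in line with the reduction factor (g − 1)/g.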
The results can be given a graphical interpretation (see Figure 6), whose justification can be found in appendix B. A summary of results is presented in Table 2.
Table 2: Comparison Between Mean-Field Theory and Simulation (Exact) Results.

  L     N     g      <tau> (sim)      sigma (sim)   <tau>_mf   sigma_mf
 1000   11   99.90   1099.9  (0.11)   10.9 (0.08)   1099.89    10.94
  500   11   49.90    544.3  (0.08)    7.6 (0.07)    544.33     7.54
 1000  101    9.99   1000.0  (0.03)    9.5 (0.03)    999.92     9.48
  500  101    4.99    444.56 (0.02)    5.7 (0.02)    444.42     5.61

Note: p = 0.9 in all cases.
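The ⟨τ⟩_mf column of the table can be reproduced from equations 5.7 and A.9 alone. The short script below is an illustrative check (with p = 0.9 as stated in the table note); it recovers the four tabulated values to within roughly a tenth of a time step:

```python
# Mean period predicted by the mean-field theory for each row of Table 2:
# a = p/(g - 1) (equation 5.7) and tau_mf = 1 + (L - 1)/(a + p)
# (equation A.7 plus the deterministic discharge step of equation A.9).
p = 0.9
for L, g in [(1000, 99.9), (500, 49.9), (1000, 9.99), (500, 4.99)]:
    a = p / (g - 1)
    tau_mf = 1 + (L - 1) / (a + p)
    print(f"L = {L:4d}, g = {g:5.2f}  ->  tau_mf = {tau_mf:.2f}")
```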
We have observed in our study that the quality of the mean-field approach varies with the stochastic character of the units. The larger the randomness of the spontaneous advance in the unit activity, the better the results the mean-field model gives. That is, the quality of the agreement between values measured in simulations and those given by the model is retained for smaller values of g. In Figure 5 we present results as a function of g for three different values of the randomness parameter p. The reason for choosing the dispersion reduction magnitude of Figure 5 will be made clear in the next section.

6 Discussion
The dynamics of the population of neurons studied in this work give rise to a rather complex structure for the correlations. Nonetheless, in the regime where synchronization does not obtain, these dynamics can be described using a simple mean-field model. In this approach, the effect of the ensemble of neurons on a probe unit is reduced to a "rain" of messages arriving at constant speed. Despite its simplicity, this model accounts for the bulk of the observed decrease of dispersion with respect to the trivial renormalized threshold picture. The results displayed in Table 2 give a first idea of the excellent agreement between period dispersions calculated with the mean-field approach and the ones measured in numerical simulations. Down to g = 9.99, discrepancies between the results for the dispersion are within the variability of simulations that start from different initial conditions or focus on a different probe unit in the ensemble. A combined reading of the results in Tables 1 and 2 gives a quantitative measure of the impact on the dynamics of the simple mechanism included in the mean-field approach at intermediate values of g. For instance, at g = 4.99, the 19% reduction of dispersion with respect to the renormalized threshold model observed in the simulation is correctly computed in the mean-field approach, which predicts a 20% reduction (see Figure 5). It is useful to take the renormalized threshold model as a reference, since it accounts for trivial effects that arise from the interaction among units in the ensemble. The sudden lowering of the threshold included in this
Figure 5: Comparison between the observed percentage of dispersion reduction and the value given by the mean-field approach as a function of g. The percentage is calculated as $100(\hat{\sigma} - \sigma)/\hat{\sigma}$. In the mean-field model (solid line), this quantity follows from $\sigma_{mf}/\hat{\sigma} = (g-1)/g$. Simulation results are for N = 51 and three different values of p (0.9, 0.5, and 0.1).
model leads to a shorter firing period and therefore to a certain reduction of the dispersion. This reduction is explained by the decrease in the number of random steps the neuron takes, on average, before it produces a spike. What is left out of this picture is a dynamical effect that arises from the rhythm at which messages are received by units interacting with the rest of the ensemble. This dynamical effect leads to a sharper focusing of the firing period of the units. Every neuron in the ensemble fires with a period that has a lower dispersion than it would in isolation, even if, in both cases, the unit had the same intrinsic randomness (measured as the average number of random steps it has to perform to reach the firing threshold). Firing dispersion reductions of around 50% have been measured in the performed simulations for values g ≈ 2 (see Figure 5). Up to that amount, the reduction is well explained by the mean-field approach. For values of g smaller than 2, the ensemble dynamics becomes gradually different. When the parameter g is exactly equal to 1, the dynamics are entirely dominated by the interactions (van Vreeswijk & Abbott, 1993; Rodríguez & López, 1999). In summary, the results displayed in Figure 5 clearly show that the main part of the observed dispersion reduction can be explained by a simplified model where the threshold decreases at constant speed as a consequence of the messages received from the ensemble. Although more complex correlations are present in the original model, as discussed in section 5, they have a minor effect for large and intermediate values of g. The mean-field model is realistic, yet simple enough to give a clear picture of the phenomenon at hand. First, it allows for a simple geometrical construct explaining the origin of dispersion reduction and therefore of period focusing. Second, it provides a back-of-the-envelope formula to calculate the amount of dispersion reduction. This formula can be derived using simple geometrical considerations (see appendix B) and, in its more general form, equation B.11, may be useful for analyzing results in several situations. In particular, consider a network in which each neuron receives excitatory signals from various random sources and messages from the other units in the network. As a consequence of the message exchange mechanism, the period of each neuron becomes more sharply focused than in isolation, even with a properly renormalized firing threshold. If a denotes the speed at which the neuron is receiving the messages and p the mean speed at which the unit travels toward the threshold, with a dispersion proportional to $\sqrt{1-p}/p$, then the dispersion reduction factor can be calculated directly as the ratio $p/(p+a)$, as given by equation 5.6.

Appendix A: Mean-Field Formulas
Let us study in detail the mean-field model in which the effect of the ensemble of firing neurons on a test neuron is represented by a smoothly decreasing effective threshold,

$$\tilde{L}(t) = L - at,$$
(A.1)
where a is to be determined in a self-consistent fashion. Explicit formulas for the continuous version of this problem (without the self-consistent condition for a) can be found in Buonocore, Nobile, and Ricciardi (1987). The state of the test neuron at time t is

$$a(t) = 1 + \sum_{\tau=1}^{t} \xi_\tau. \qquad (A.2)$$
The variables $\{\xi_\tau\}_{\tau=1}^{t}$ are independent and identically distributed random variables, corresponding to dichotomic noise,

$$\xi_\tau = \begin{cases} 0 & \text{with probability } 1-p, \\ 1 & \text{with probability } p, \end{cases} \qquad (A.3)$$

with the properties $\langle \xi_\tau \rangle = p$,
$$\langle \xi_\tau^2 \rangle = p, \qquad \hat{\xi}_\tau = \xi_\tau - \langle \xi_\tau \rangle = \xi_\tau - p, \qquad \langle \hat{\xi}_\tau \hat{\xi}_{\tau'} \rangle = p(1-p)\,\delta_{\tau\tau'}.$$
(A.4)
The instantaneous distance between the threshold and the state of the neuron is

$$d(t) = \tilde{L}(t) - a(t) = L - at - 1 - \sum_{\tau=1}^{t} \xi_\tau. \qquad (A.5)$$
The time at which the neuron reaches the threshold is T:

$$0 = L - aT - 1 - \sum_{\tau=1}^{T} \xi_\tau. \qquad (A.6)$$
Taking the ensemble average of this equation, we obtain

$$\langle T \rangle = \frac{L-1}{a+p}.$$
(A.7)
The speed of the lowering of the threshold, a, is such that during the interval $\langle T \rangle$, the test neuron has received messages with a total strength $V_2$:

$$a \langle T \rangle = V_2.$$
(A.8)
Therefore, the average interspike interval in this model is

$$\tau_{mf} \approx 1 + \langle T \rangle = 1 + \frac{L - V_2 - 1}{p},$$
(A.9)
where the extra one corresponds to the deterministic time step in which the neuron is discharged. The dispersion of the period can be found by rewriting equation A.6 using equation A.4 and the definition $\hat{T} = T - \langle T \rangle$:

$$(a+p)\,\hat{T} = -\sum_{\tau=1}^{T} \hat{\xi}_\tau.$$
(A.10)
Taking the square of both sides of the equation, using the properties of the dichotomic noise, equation A.4, and the definition of g, equation 2.7, we obtain, after some manipulation,

$$\sigma_{mf}^2 \approx \langle \hat{T}^2 \rangle = \left( \frac{p}{a+p} \right)^2 \hat{\sigma}^2 = \left( \frac{g-1}{g} \right)^2 \hat{\sigma}^2, \qquad (A.11)$$

where $\hat{\sigma}$ is given by equation 3.2.
Figure 6: Graphical interpretation of the mean-field model, with L = 31, N = 16, p = 0.5. The continuous distribution curve represents the probability density for the time at which the neuron reaches the renormalized threshold. The histogram corresponds to simulations with the mean-field model, including a gaussian smoothing term that reflects the effect of fluctuations of the effective threshold.
Appendix B: Graphical Interpretation of the Mean-Field Theory
The mean-field model presented in section 5 can be derived from a simple graphical construct. The effective threshold is assumed to decrease linearly, according to equation A.1. The random walk of the neuron in the space of excitation states is replaced by a linear increase in the excitation state with gaussian fluctuations around this average evolution. The variance of these fluctuations increases linearly with time:

$$\langle a(t) \rangle \approx pt, \qquad (B.1)$$

$$\sigma_a^2(t) \approx \chi^2 t. \qquad (B.2)$$
This replacement is valid asymptotically, when the binomial distribution describing the evolution of an ensemble of random walkers is well approximated by a normal distribution. In Figure 6, $\chi$ has been chosen so that at a given instant, 90% of the trajectories are within the region delimited by the curves $pt \pm \chi\sqrt{t}$. The time when, on average, the neuron reaches the threshold is the abscissa of point A in Figure 6:

$$\hat{\tau} = \tau_{mf} = \frac{L-1}{a+p}.$$
(B.3)
From the geometrical construct in Figure 6, it is apparent that the ratio between the standard deviation estimated by the effective threshold model (section 3) and that predicted by the mean-field model (section 5) is approximately given by

$$\frac{\sigma_{mf}}{\hat{\sigma}} \approx \frac{t_2' - t_1'}{t_2 - t_1}.$$
(B.4)
The values $t_1$, $t_2$ can be obtained from the abscissas of points B and E, respectively, in Figure 6:

$$p t_1 + \chi \sqrt{t_1} = L - 1 - a\hat{\tau}, \qquad (B.5)$$

$$p t_2 - \chi \sqrt{t_2} = L - 1 - a\hat{\tau}. \qquad (B.6)$$
From these equations, using equation B.3, we obtain

$$p\,(t_2 - t_1) = \chi\,(\sqrt{t_1} + \sqrt{t_2}).$$
(B.7)
Similarly, from the expressions for the abscissas of points C and D and equation B.3,

$$(p+a)\,(t_2' - t_1') = \chi\,(\sqrt{t_1'} + \sqrt{t_2'}). \qquad (B.8)$$

Given that the relation

$$\sqrt{t_1} + \sqrt{t_2} \approx \sqrt{t_1'} + \sqrt{t_2'} \qquad (B.9)$$
holds, at least in an approximate manner, we have

$$p\,(t_2 - t_1) \approx (p+a)\,(t_2' - t_1'),$$
(B.10)
which leads to the desired relation

$$\sigma_{mf} \approx \frac{p}{a+p}\,\hat{\sigma}.$$
(B.11)
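The geometrical construction behind equations B.4 through B.11 can be checked numerically: each crossing condition is a quadratic in u = √t with a single positive root. The sketch below uses the values quoted for Figure 6 (L = 31, p = 0.5) together with an illustrative threshold speed a = 0.5 and the 90% gaussian band width χ = 1.645√(p(1−p)); these last two choices are assumptions made for the example:

```python
import math

def crossing(slope, sign, c, chi):
    # Positive root of slope*u**2 + sign*chi*u - c = 0, returned as t = u**2.
    u = (-sign * chi + math.sqrt(chi * chi + 4 * slope * c)) / (2 * slope)
    return u * u

L, p = 31, 0.5                          # values quoted for Figure 6
a = 0.5                                 # illustrative threshold speed (assumption)
chi = 1.645 * math.sqrt(p * (1 - p))    # central 90% band, equation B.2

tau_hat = (L - 1) / (a + p)             # equation B.3
level = L - 1 - a * tau_hat             # threshold height at tau_hat (B.5, B.6)

t1 = crossing(p, +1, level, chi)        # point B
t2 = crossing(p, -1, level, chi)        # point E
t1p = crossing(p + a, +1, L - 1, chi)   # point C, on the falling threshold (B.8)
t2p = crossing(p + a, -1, L - 1, chi)   # point D

ratio = (t2p - t1p) / (t2 - t1)         # right-hand side of equation B.4
print(f"(t2'-t1')/(t2-t1) = {ratio:.3f}  vs  p/(p+a) = {p / (p + a):.3f}")
```

With these numbers the ratio of widths comes out within one percent of p/(p + a), as equation B.11 predicts.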
Acknowledgments
We gratefully acknowledge financial support from CICyT, grant TIC980247-C02-02.
References

Bottani, S. (1996). Synchronization of integrate and fire oscillators with global coupling. Phys. Rev. E, 54(3), 2334–2350.
Buck, J. (1988). Synchronous rhythmic flashing of fireflies, II. Quart. Rev. Biol., 63, 265–289.
Buonocore, A., Nobile, A. G., & Ricciardi, L. M. (1987). A new integral equation for the evaluation of first-passage-time probability densities. Adv. Appl. Prob., 19, 784–800.
Bussemaker, H. J., Ernst, M. H., & Dufty, J. W. (1995). Generalized Boltzmann equation for lattice gas automata. J. Stat. Phys., 78, 1521–1554.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett., 74, 1570–1573.
Feller, W. (1971). An introduction to probability theory and its applications (2nd ed.). New York: Wiley.
Frank, T. D., Daffertshofer, A., Beek, P. J., & Haken, H. (1999). Impacts of noise on a field theoretical model of the human brain. Physica D, 127, 233–249.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev. Mod. Phys., 70, 223–288.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of single neuron. Biophys. J., 4, 41–68.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Izhikevich, E. M. (1999). Weakly pulse-coupled oscillators, FM interactions, synchronization, and oscillatory associative memory. IEEE Trans. on Neural Networks, 10, 508–526.
Kaneko, K. (1994). Relevance of clustering to biological networks. Physica D, 75, 55–73.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Mathar, R., & Mattfeldt, J. (1996). Pulse-coupled decentral synchronization. SIAM J. Appl. Math., 56, 1094–1106.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50(6), 1645–1662.
Néda, Z., Ravasz, E., Brechet, Y., Vicsek, T., & Barabási, A.-L. (2000). Self-organizing processes: The sound of many hands clapping. Nature, 403, 849–850.
Pham, J., Pakdaman, K., & Vibert, J.-F. (1998). Noise-induced coherent oscillations in randomly connected neural networks. Phys. Rev. E, 58(3), 3610–3622.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rodríguez, F. B., & López, V. (1999). Periodic and synchronic firing in an ensemble of identical stochastic units: Structural stability. In J. Mira & J. V. Sánchez-Andrés (Eds.), Foundations and tools for neural modeling (pp. 367–376). Berlin: Springer-Verlag.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shimokawa, T., Rogel, A., Pakdaman, K., & Sato, S. (1999). Stochastic resonance and spike-timing precision in an ensemble of leaky integrate and fire neuron models. Phys. Rev. E, 59(3), 3461–3470.
Suárez, A., Boon, J. P., & Grosfils, P. (1996). Long-range correlations in nonequilibrium systems: Lattice gas automaton approach. Phys. Rev. E, 54(2), 1208–1224.
van Vreeswijk, C. A. (1996). Partially synchronized states in networks of pulse-coupled neurons. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C. A., & Abbott, L. F. (1993). Self-sustained firing in populations of integrate-and-fire neurons. SIAM J. Appl. Math., 53, 253–264.
von der Malsburg, C. (1994). The correlation theory of brain function. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II. Berlin: Springer-Verlag.

Received May 31, 2000; accepted January 25, 2001.
LETTER
Communicated by Michael Lewicki
A Variational Method for Learning Sparse and Overcomplete Representations

Mark Girolami
Laboratory of Computing and Information Science, Helsinki University of Technology, Finland

An expectation-maximization algorithm for learning sparse and overcomplete data representations is presented. The proposed algorithm exploits a variational approximation to a range of heavy-tailed distributions whose limit is the Laplacian. A rigorous lower bound on the sparse prior distribution is derived, which enables the analytic marginalization of a lower bound on the data likelihood. This lower bound enables the development of an expectation-maximization algorithm for learning the overcomplete basis vectors and inferring the most probable basis coefficients.

1 Introduction
Overcomplete representations have been studied within the field of signal and image processing as a means of providing a compact form of signal representation. The advantages of such a representation are greater flexibility in capturing the inherent structure in the signals, robustness to additive noise (Chen, Donoho, & Saunders, 1996), a representation that displays invariance to translations of the input signal (Simoncelli, Freeman, Adelson, & Heeger, 1992), and the possibility of high-precision coding using low-precision coefficients (Lee, 1996). This compactness suggests a sparse representation, as there may be a large set of specialized basis functions available with which to compose the signal or image, but few of these will be required for any one composition. The proposal of viewing the decomposition of data into an overcomplete basis as a probabilistic model was originally presented in Lewicki and Sejnowski (2000) and Lewicki and Olshausen (1999).¹ There it was claimed that the chief advantage of a probabilistic representation is that it allows straightforward and natural extensions to more general model representations, in addition to adapting the bases to the underlying statistics of the data. In Lewicki and Sejnowski (2000), a sparse prior probability on the basis coefficients, in the form of a Laplacian, is employed to remove the redundancy inherent in an overcomplete representation. Indeed, the imposition of the Laplacian prior is implicit in using the L1 norm of the basis vectors as a constrained objective function in Chen et al. (1996). The key assumption made by Lewicki and Sejnowski (2000) in developing a gradient-based method for learning is that the posterior density is approximately gaussian about the maximum a posteriori (MAP) coefficient value. Therefore, a Laplace approximation to the data likelihood about the MAP is made, a low level of isotropic noise is assumed, and the gradient of this approximate likelihood is employed in adapting the basis vectors. The algorithm developed in Lewicki and Sejnowski (2000) is a generalization of the natural gradient form of independent component analysis (ICA) (Attias, 1999; Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995), and standard noise-free ICA is recovered when the number of coefficients² equals the observation dimensionality. Learning overcomplete representations has also been shown to be one method for performing ICA and blind signal separation (BSS) where the number of sources is greater than the number of mixed observations (Lee, Lewicki, Girolami, & Sejnowski, 1999). In the gradient-based method of Lewicki and Sejnowski (2000), the estimation of the basis vectors requires a nonlinear optimization of the approximated posterior density for each data point to obtain the MAP value of the corresponding basis coefficient, which is then used for each gradient update of the basis vectors. Due to the Laplace approximation that is employed, there is no guarantee that the data likelihood increases at each iteration. Whereas the Laplace approximation for the data likelihood is sufficient for generating gradients for learning, the actual value of the likelihood may be inaccurate, giving unreliable estimates of measures such as coding cost (Lewicki & Sejnowski, 2000).

¹ Despite the publication dates, the methods developed in Lewicki and Sejnowski (2000) in fact predate Lewicki and Olshausen (1999), and the methods of the former were employed in the latter.

Neural Computation 13, 2517–2532 (2001) © 2001 Massachusetts Institute of Technology
The gradient-based algorithm does not consider the estimation of the noise covariance due to the assumption of a low level of isotropic noise, which is manually set during learning. In contrast to gradient-based algorithms, this article presents a method for estimating overcomplete bases and inferring the most probable coefficients based on the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). The intractable nature of the posterior integral is transformed into a gaussian form by representing the prior density as a variational optimization problem. This provides a strict lower bound on the data likelihood, which is optimized within the EM updating procedure. The necessity for a separate nonlinear optimization to obtain the MAP coefficient estimates for each data point is obviated, as the MAP value coincides with the posterior mode, and this is available from the moments of the posterior lower bound. Due to the well-known convergence properties of the EM algorithm and the rigorous lower bound on the true posterior given by the variational approximation, there is a guaranteed increase in the data likelihood that approaches the true likelihood at each EM step. The proposed EM algorithm generalizes the complete case, or regular ICA with additive noise; however, no assumption of a low level of noise is required. The following section presents the linear model for learning overcomplete representations. Section 3 presents the variational EM algorithm for learning overcomplete representations, with some simulations presented in section 4. The final section provides some concluding discussion of the proposed method.

² These are referred to as sources in ICA parlance.

2 Linear Model for an Overcomplete Basis
The observed data $x \in \mathbb{R}^M$ can be described using an overcomplete basis by the following linear model:

$$x = As + \epsilon.$$
(2.1)
The columns of the matrix $A \in \mathbb{R}^{M \times L}$, where $L \geq M$, define the overcomplete basis, and the noise is modeled as gaussian such that $\epsilon \sim \mathcal{N}(0, \Sigma)$, so that $p(x \mid s) = \mathcal{N}_x(As, \Sigma)$. The basis coefficients are assumed independent, such that $p(s) = \prod_{l=1}^{L} p(s_l)$. In addition, each prior $p(s_l)$ assumes sparseness of the latent coefficients, typified by the Laplacian distribution (Lewicki & Sejnowski, 2000). In formulating a method for learning the parameters of the model 2.1, that is, the basis vectors $A$ and the noise covariance $\Sigma$, it is required that the joint density $p(x, s \mid A, \Sigma)$ be marginalized over $s$. In that case, the likelihood of an observation $x$ is given by

$$p(x \mid A, \Sigma) = \int p(x \mid s, A, \Sigma)\, p(s)\, ds.$$
(2.2)
This integral in general is analytically intractable due to the nongaussian form of the prior. An approximation can be made based on a second-order expansion about the MAP coefficient estimate $\hat{s}$ (Lewicki & Sejnowski, 2000; Bermond & Cardoso, 1999), that is, the Laplace estimate. Because the Laplacian prior is unimodal, a gaussian approximation with mode $\hat{s}$ is found to be reasonable in practice for generating gradients for optimization purposes, although the accuracy of the approximation may be inadequate in certain circumstances (Lewicki & Sejnowski, 2000). Based on the Laplace approximation, the approximate conditional moments for the gaussian posterior density estimate are given as follows:

$$\tilde{E}_L\{s \mid x\} = \hat{s} = \arg\max_{s} \log\{p_L(s \mid x)\}, \qquad (2.3)$$

$$\tilde{E}_L\{s s^T \mid x\} = \hat{s}\hat{s}^T + H(\hat{s})^{-1}.$$
(2.4)
The subscript L denotes the moments associated with the Laplace approximation, and the Hessian of the approximate log posterior computed at the MAP value is denoted as $H(\hat{s}) = A^T \Sigma^{-1} A - \nabla\nabla \log\{p(\hat{s})\}$. Numerical optimization is required to estimate $\hat{s}$, and there is no guaranteed monotonic increase in the data likelihood based on the estimation of the model parameters using these approximate moments. In addition, there is no natural way of assessing the goodness of the approximation and the accuracy of the likelihood estimate. By assuming a low level of isotropic noise, which is considered predetermined, a natural-gradient-based method for adapting the basis vectors is developed in Lewicki and Sejnowski (2000). However, for this form of linear model, it is desirable to use the EM framework for estimation and inference, as this is a most natural method for maximizing the data likelihood. The approach advocated in this article is that a variational approximation (Rockafellar, 1970) to the sparse prior, which is a quadratic function of the latent variables s, be employed. An extensive treatment of such methods for machine learning applications can be found in Jaakkola (1997). This approximation will provide a strict lower bound on the posterior probability, which can be maximized using the additional variational parameters introduced into the model. This form of variational approximation was originally proposed and exploited in Bayesian regression and learning in belief networks by Jaakkola and Jordan (1997). The following section presents the EM algorithm for learning sparse and overcomplete representations.

3 A Variational Parameter Estimation Method
In learning an overcomplete basis from a linear model, a sparse prior is placed on the basis coefficients, reflecting the idea that only a small number of the available coefficients will be required to represent the data. In particular, the prior of interest is the factorable Laplacian:

$$p(s) = 2^{-L/2} \prod_{l=1}^{L} \exp(-\sqrt{2}\,|s_l|).$$
(3.1)
This prior renders the marginalization of $p(x, s \mid A, \Sigma)$ analytically intractable, and various approximations for the integral are required (Lewicki & Sejnowski, 2000). An alternative and more general approach is to employ a mixture distribution in modeling the prior (see Attias, 1999, and references therein). Whereas representing the prior as a mixture distribution is powerful in terms of the modeling capability, note that for K mixture components and a D-dimensional prior, such methods scale as $O(K^D)$ (Attias, 1999). In learning a sparse and overcomplete data representation, such generality is not required. Approximation techniques based on the variational method (Rockafellar, 1970) have found extensive application in domains such as physics,
mechanics, statistics, and, somewhat more recently, machine learning. The variational method is based on the ability to represent a convex function in terms of an optimization of the function's dual form over a set of parameters, which are referred to as the variational parameters. The idea is that the dual-form representation of the function will be more suitable for the particular application than the original form. By dropping the optimization, a bound on the original function is provided. The bound will be either a lower or an upper bound depending on whether the original function is convex or concave. (For a detailed treatment of such methods, see Rockafellar, 1970, and for specific application to the machine learning domain, see Jaakkola, 1997.) The sparse Laplacian prior can be represented as an optimization over the additional variational parameter $\boldsymbol{\xi} = [\xi_1, \ldots, \xi_L]^T$ (see appendix A) such that

$$\prod_{l=1}^{L} \exp(-|s_l|) = \sup_{\boldsymbol{\xi}} \left\{ \prod_{l=1}^{L} Q(\xi_l)\; \mathcal{N}_s(0, \Lambda) \right\}. \qquad (3.2)$$
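The scalar version of this identity is easy to probe numerically. The normalizer Q(ξ) = √(2π|ξ|) exp(−|ξ|/2) used below is the form obtained from the tangent bound of the convex function −√u at u = ξ²; the paper derives Q in its appendix A, which is not reproduced in this excerpt, so this expression is an assumption of the sketch:

```python
import numpy as np

def laplace_factor(s):
    return np.exp(-np.abs(s))

def lower_bound(s, xi):
    # Q(xi) * N_s(0, |xi|) simplifies to exp(-s**2/(2|xi|) - |xi|/2)
    # under the assumed normalizer Q(xi) = sqrt(2*pi*|xi|) * exp(-|xi|/2).
    return np.exp(-s**2 / (2 * abs(xi)) - abs(xi) / 2)

s = np.linspace(-4, 4, 801)
for xi in (0.1, 1.0, 2.0, 3.0):            # the values plotted in Figure 1
    gap = laplace_factor(s) - lower_bound(s, xi)
    assert gap.min() > -1e-12              # the bound never exceeds the factor
    touch = abs(s[np.argmin(gap)])
    print(f"xi = {xi}: bound touches the Laplacian at |s| ≈ {touch:.2f}")
```

The touch points land at |s| = |ξ|, matching the equality condition quoted in the Figure 1 caption.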
The notation $\mathcal{N}_s(0, \Lambda)$ denotes a normal distribution computed at s with zero mean and covariance given as $\Lambda = \mathrm{diag}[|\xi_1|, \ldots, |\xi_L|]$. Due to the convexity of the log prior (see appendix A), this provides a strict lower bound on the prior density and hence on the marginal likelihood. Figure 1 demonstrates the lower-bound approximation for various values of $\xi$ in the univariate case. We note that for any value of the variational parameter $\boldsymbol{\xi}$,

$$p(x \mid A, \Sigma) \geq p(x \mid A, \Sigma, \boldsymbol{\xi}) = \int p(x \mid s, A, \Sigma)\, p(s \mid \boldsymbol{\xi})\, ds = \prod_{l=1}^{L} Q(\xi_l) \int \mathcal{N}_x(As, \Sigma)\, \mathcal{N}_s(0, \Lambda)\, ds = \prod_{l=1}^{L} Q(\xi_l)\; \mathcal{N}_x(0, A\Lambda A^T + \Sigma).$$
(3.3)
Rather than using this likelihood to compute gradients with respect to the parameters $A$, $\Sigma$, and $\boldsymbol{\xi}$, an EM method is now developed. As the variational approximation to the prior provides a strict lower bound, the corresponding posterior can be given as

$$p(s \mid x, A, \Sigma) \geq p(s \mid x, A, \Sigma, \boldsymbol{\xi}) = \frac{p(x_n \mid s, A, \Sigma)\, p(s \mid \boldsymbol{\xi})}{\int p(x_n \mid s, A, \Sigma)\, p(s \mid \boldsymbol{\xi})\, ds} = \frac{\mathcal{N}_x(As, \Sigma)\, \mathcal{N}_s(0, \Lambda)}{\int \mathcal{N}_x(As, \Sigma)\, \mathcal{N}_s(0, \Lambda)\, ds}.$$
(3.4)
Figure 1: Lower-bound approximation $\exp(-|s|) \geq Q(\xi)\,\mathcal{N}_s(0, |\xi|)$ for various values of $\xi$. The parameter s runs along the horizontal axis, with the vertical axis giving the value of both $\exp(-|s|)$ and $Q(\xi)\,\mathcal{N}_s(0, |\xi|)$. The plot of $\exp(-|s|)$ is denoted by a dot-dash line, and $Q(\xi)\,\mathcal{N}_s(0, |\xi|)$ is plotted with the solid line. (a) Highlights the lower bound for $|\xi| = 0.1$. (b) Shows $|\xi| = 1.0$, (c) $|\xi| = 2.0$, and (d) $|\xi| = 3.0$. Equality holds only for the points where $|s| = |\xi|$.
The lower bound, equation 3.4, on the posterior $p(s \mid x, A, \Sigma, \boldsymbol{\xi})$ somewhat conveniently admits a normal form with the following moments:

$$E\{s \mid x_n\} = (A^T \Sigma^{-1} A + \Lambda_n^{-1})^{-1} A^T \Sigma^{-1} x_n, \qquad (3.5)$$

$$E\{s s^T \mid x_n\} = (A^T \Sigma^{-1} A + \Lambda_n^{-1})^{-1} + E\{s \mid x_n\}\, E\{s \mid x_n\}^T.$$
(3.6)
Comparing these moments with equations 2.3 and 2.4, the similarity between them should be noted; however, the posterior mean (MAP in this case, as the posterior will be unimodal) estimate now comes naturally from the lower bound on the posterior and does not require a separate nonlinear
optimization to establish this value. Now possessing expressions for the posterior moments, it is straightforward to formulate an EM algorithm in closed form for parameter estimation. Each data sample has a corresponding L-dimensional variational parameter associated with it. For N observations $\{x_n\}_{n=1,\ldots,N}$, a relative likelihood can be formed and solved for the parameters $A$, $\Sigma$, $\{\boldsymbol{\xi}_n\}_{n=1,\ldots,N}$. The standard forms of M-step emerge, with an additional expression for the variational parameters (see appendix B):

$$A^{new} = \left\{ \sum_{n=1}^{N} x_n\, E\{s \mid x_n\}^T \right\} \left\{ \sum_{n=1}^{N} E\{s s^T \mid x_n\} \right\}^{-1}, \qquad (3.7)$$

$$\Sigma^{new} = \frac{1}{N} \left\{ \sum_{n=1}^{N} x_n x_n^T - A^{new} \sum_{n=1}^{N} E\{s \mid x_n\}\, x_n^T \right\}, \qquad (3.8)$$

$$\boldsymbol{\xi}_n \circ \boldsymbol{\xi}_n = \mathrm{diag}\left[ E\{s s^T \mid x_n\} \right].$$
(3.9)
The $\circ$ in equation 3.9 denotes elementwise multiplication. The variational M-step (see equation 3.9) shows that the variational parameters are updated at each M-step by the posterior standard deviation for each marginal term, such that $\xi_{ln} = \sigma_{p(s_l \mid x_n)} = \sqrt{E\{s_l^2 \mid x_n\}}$. Inserting the terms for the posterior moments into equations 3.7 through 3.9 and noting that $(A^T \Sigma^{-1} A + \Lambda_n^{-1})^{-1} = \Lambda_n - \Lambda_n A^T (A \Lambda_n A^T + \Sigma)^{-1} A \Lambda_n$ gives the following updates:

$$A^{new} = \left[ \sum_{n=1}^{N} x_n x_n^T M_n^T A \Lambda_n \right] \left[ \sum_{n=1}^{N} \Lambda_n \{ I - A^T M_n (I - x_n x_n^T M_n^T) A \Lambda_n \} \right]^{-1}, \qquad (3.10)$$

$$\Sigma^{new} = \Sigma_x - N^{-1} A^{new} \sum_{n=1}^{N} \Lambda_n A^T M_n x_n x_n^T, \qquad (3.11)$$

$$\boldsymbol{\xi}_n \circ \boldsymbol{\xi}_n = \mathrm{diag}\left[ \Lambda_n \{ I - A^T M_n (I - x_n x_n^T M_n^T) A \Lambda_n \} \right],$$
(3.12)
where $M_n = (A \Lambda_n A^T + \Sigma)^{-1}$ and the observed data sample covariance matrix is $\Sigma_x = N^{-1} \sum_{n=1}^{N} x_n x_n^T$. Equation 3.12 is used to improve the lower-bound approximation to the likelihood, and the convergence properties of EM (Dempster et al., 1977) ensure that $p(x \mid A, \Sigma) \geq p(x \mid A, \Sigma, \boldsymbol{\xi}^{new}) \geq p(x \mid A, \Sigma, \boldsymbol{\xi}^{old})$; thus, equation 3.12 provides a means of controlling and improving the estimate of the data likelihood in the presence of the prior density approximation.
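The complete procedure (E-step, equations 3.5 and 3.6, followed by the M-steps of equations 3.7 through 3.9) fits in a page of NumPy. The sketch below runs it on synthetic data; the problem sizes, the data-generating prior, and the normalizer Q(ξ) = √(2πξ) exp(−ξ/2) used to evaluate the bound of equation 3.3 are illustrative assumptions, not specifics taken from the paper. Because the joint M-step is exact, the evaluated bound increases monotonically across sweeps, as the text asserts:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic overcomplete problem: M = 2 observed dimensions, 3 sparse sources.
M, Ls, N = 2, 3, 400
S = rng.laplace(size=(Ls, N))
A_true = rng.normal(size=(M, Ls))
X = A_true @ S + 0.1 * rng.normal(size=(M, N))

A = rng.normal(size=(M, Ls))       # basis to be learned
Sig = np.eye(M)                    # noise covariance
Xi = np.ones((Ls, N))              # variational parameters, one xi_ln > 0 each

def log_bound(A, Sig, Xi):
    """Variational log-likelihood bound, equation 3.3, with the assumed
    normalizer Q(xi) = sqrt(2*pi*xi) * exp(-xi/2)."""
    total = 0.0
    for n in range(N):
        C = A @ np.diag(Xi[:, n]) @ A.T + Sig
        x = X[:, n]
        total += 0.5 * np.log(2 * np.pi * Xi[:, n]).sum() - 0.5 * Xi[:, n].sum()
        total -= 0.5 * (M * np.log(2 * np.pi) + np.linalg.slogdet(C)[1]
                        + x @ np.linalg.solve(C, x))
    return total

bounds = []
for sweep in range(15):
    Si = np.linalg.inv(Sig)
    Es = np.zeros((Ls, N))
    Ess_sum = np.zeros((Ls, Ls))
    XEs_sum = np.zeros((M, Ls))
    Ess_diag = np.zeros((Ls, N))
    for n in range(N):
        # E-step: posterior moments, equations 3.5 and 3.6.
        P = np.linalg.inv(A.T @ Si @ A + np.diag(1.0 / Xi[:, n]))
        m = P @ A.T @ Si @ X[:, n]
        Ess = P + np.outer(m, m)
        Es[:, n] = m
        Ess_sum += Ess
        XEs_sum += np.outer(X[:, n], m)
        Ess_diag[:, n] = np.diag(Ess)
    # M-step: equations 3.7 (basis), 3.8 (noise), 3.9 (variational parameters).
    A = XEs_sum @ np.linalg.inv(Ess_sum)
    Sig = (X @ X.T - A @ Es @ X.T) / N
    Sig = 0.5 * (Sig + Sig.T)      # enforce symmetry against round-off
    Xi = np.sqrt(Ess_diag)         # xi_ln = sqrt(E{s_l^2 | x_n})
    bounds.append(log_bound(A, Sig, Xi))

print(f"bound after sweep 1: {bounds[0]:.1f}, after sweep 15: {bounds[-1]:.1f}")
```

On this toy problem the bound climbs steeply over the first few sweeps and then flattens; replacing equations 3.5 and 3.6 with the Woodbury forms 3.10 through 3.12 trades the L × L inversion per sample for an M × M one.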
By taking the Laplace estimate for the posterior, as in Lewicki and Sejnowski (2000), the MAP value for the coefficients must be estimated using a nonlinear optimization routine for each data point at each iterative update of the matrix parameters $A$. This can initially be wasteful of computational resources, as precise MAP estimates are made using poorly estimated parameter matrices. In contrast, the EM algorithm derived here (equations 3.10–3.12) computes the MAP solution for the N observations while simultaneously estimating the new model parameters $A$, $\Sigma$, $\boldsymbol{\xi}$. In addition, a strict lower bound on the likelihood is given that can be optimized using the variational parameters and the associated M-step (see equation 3.12) introduced into the model, as noted above. The variational method replaces the nonlinear optimization required to obtain the MAP coefficient estimates with a single matrix inversion for each data point, whose complexity scales as $O(M^3)$. If we consider optimization using, for example, conjugate gradients, then the complexity scales as $O(M^2 P)$, where P is the number of iterations required to reach an acceptable level of convergence. Due to the nonlinear nature of the optimization, $P \gg M$, indicating a potential computational saving with the proposed method for low-dimensional observations. For situations where M is large (natural image patches), the matrix inversion for each sample may add computational cost. However, there are no step sizes or annealing rates that require setting and adjusting manually, and the data likelihood is guaranteed to increase monotonically at each iteration until a local maximum is reached. The additional computational cost comes from the M-step estimation of the variational parameters (see equation 3.12). However, the ability to improve the lower-bound estimate of the data likelihood comes with this additional computational cost.

4 Simulation
To demonstrate the effectiveness of the proposed lower-bound approximation to the Laplacian prior, two-dimensional data were generated from three independent univariate Laplacians with no additive noise term. The snapshot development of one of the dimensions of the lower-bound prior density during the EM parameter updating procedure is given in Figure 2. It is clear that the parameterized form of the prior approaches the Laplacian limit during the estimation process. The second experiment shows the ability of the proposed variational EM method to learn overcomplete representations by employing a particularly challenging simulation. The blind separation of more sources than observations was considered as a potential application of learning overcomplete representations in Lee et al. (1999). For comparative purposes, and to demonstrate the method visually, the same three sources of natural speech and the same mixing matrix as in Lee et al. (1999) were used in this experiment. The observed data are an overcomplete mixture of sources, the speech sources being chosen because they are particularly sparse naturally occurring signals.
Learning Sparse and Overcomplete Representations
2525
Figure 2: The prior density approximation $Q(\xi)\,N_s(0, |\xi|)$ compared to the Laplacian prior $\exp(-|s|)$ at each EM step. (a) The lower bound compared with the limiting Laplacian at the initial EM step, where the variational parameters have been randomly initialized by drawing them from a zero-mean, unit-variance gaussian distribution; (b) after 5 EM steps; (c) after 10 EM steps; (d) after 20 EM steps. It is clear that the parameterized form of the prior approaches the Laplacian limit during the estimation process.
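The bound illustrated in the figure can be checked numerically. Using the Laplacian-limit forms derived in appendix A, $\varsigma^2 = |\xi|$ and $Q(\xi) = \exp(-\frac{1}{2}|\xi|)\sqrt{2\pi|\xi|}$, the parameterized gaussian should never exceed $\exp(-|s|)$ and should touch it at $s = \pm\xi$. A minimal verification sketch (the grid and the tested $\xi$ values are arbitrary choices):

```python
import numpy as np

def laplace(s):
    return np.exp(-np.abs(s))

def lower_bound(s, xi):
    """Q(xi) * N_s(0, |xi|): the variational lower bound on exp(-|s|)."""
    var = np.abs(xi)
    Q = np.exp(-0.5 * np.abs(xi)) * np.sqrt(2 * np.pi * var)
    gauss = np.exp(-s ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return Q * gauss

s = np.linspace(-4, 4, 801)
for xi in (0.5, 1.0, 2.0):
    b = lower_bound(s, xi)
    assert np.all(b <= laplace(s) + 1e-12)                # never exceeds the prior
    assert np.isclose(lower_bound(xi, xi), laplace(xi))   # tight at s = xi
print("bound verified")
```

The inequality reduces to $(|s| - |\xi|)^2 \geq 0$, so it holds for every $\xi$, with equality exactly at $s = \pm\xi$.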
The proposed EM procedure was randomly initialized and converged in 10 parameter updates, with the variational parameters being updated twice in each pass. As expected, the data likelihood increased monotonically during the EM procedure. The signal-to-noise ratio (SNR) was computed for each of the inferred sources and compared with the results reported in Lee et al. (1999), where the gradient-based method of Lewicki and Sejnowski (2000) was used. The results are given in Table 1. The SNR values are on average improved over those reported in Lee et al. (1999); this simulation using complex natural signals that are sparsely distributed also shows that the Laplace approximation for such data is reasonably accurate for this application. Somewhat interestingly, an EM-based estimation of the model parameters and basis coefficients for the overcomplete mixture of speech signals using Monte Carlo integration techniques provided no measurable
Table 1: SNR Values in dB: Gradient-Based Method and Variational EM Algorithm.

              Speech 1   Speech 2   Speech 3
Gradient         20         17         21
Variational      26         18         24
improvement in SNR values over either of the analytic methods reported. This further confirms the reasonable quality of the Laplace estimate and of the lower bound presented here. Figure 3 shows 10,000 samples of the original sources, the noisy mixtures, and the posterior mean estimates given by equation 3.5. This simulation demonstrates the ability of the method to learn overcomplete representations of natural data that conform to the standard linear model and to infer the most probable coefficients (or sources).
Figure 3: (a–c) The three original sources. (d, e) The noisy observations. (f–h) The posterior mean estimates of the sources after the basis vectors have been learned using the EM method outlined.
5 Discussion

5.1 Relation to Similar Work. Learning an overcomplete basis has been considered in a number of other related works, which will now be briefly discussed. Perhaps the most general framework for learning linear models as described by equation 2.1 is the independent factor analysis (IFA) method developed in Attias (1999). The problematic marginalization of the latent factors to obtain the data likelihood is performed analytically by modeling the prior density as a mixture of independent univariate gaussians. Despite the powerful representational capability of the IFA method, the number of mixture components requires predetermining, and the computational complexity has cubic scaling, as already noted. The IFA model uses the EM algorithm for parameter estimation, and so the standard updates (see equations 3.7 and 3.8) emerge, with additional fixed-point estimates for the mixture component parameters (i.e., the means, variances, and mixing weights). For the specific case where we are trying to model a zero-mean Laplacian prior, each of the gaussian mixture components will also have zero mean, and the posterior moments that emerge for IFA in this case are:
$$E\{s \mid x_n\} = \sum_{q} p(q \mid x_n)\left(A^T \Sigma^{-1} A + V_q^{-1}\right)^{-1} A^T \Sigma^{-1} x_n \tag{5.1}$$

$$E\{s s^T \mid x_n\} = \sum_{q} p(q \mid x_n)\left\{\left(A^T \Sigma^{-1} A + V_q^{-1}\right)^{-1} + E\{s \mid x_n\}\,E\{s \mid x_n\}^T\right\}. \tag{5.2}$$
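Equations 5.1 and 5.2 express the IFA posterior moments as a posterior-weighted sum over mixture components. A minimal sketch of equation 5.1 (the component variances, posterior weights $p(q \mid x_n)$, dimensions, and data here are illustrative assumptions, taken as given rather than estimated):

```python
import numpy as np

rng = np.random.default_rng(1)
M, L = 2, 3
A = rng.normal(size=(M, L))              # basis matrix (illustrative)
Sigma_inv = np.eye(M) / 0.1              # Sigma^{-1}, isotropic for the sketch
Vq = [0.5 * np.eye(L), 2.0 * np.eye(L)]  # per-component latent variances V_q
p_q = np.array([0.3, 0.7])               # posterior p(q | x_n), assumed given
x_n = rng.normal(size=M)

# Equation 5.1: a convex combination over components q of
# (A^T Sigma^{-1} A + V_q^{-1})^{-1} A^T Sigma^{-1} x_n
E_s = sum(
    p * np.linalg.inv(A.T @ Sigma_inv @ A + np.linalg.inv(V)) @ A.T @ Sigma_inv @ x_n
    for p, V in zip(p_q, Vq)
)
print(E_s.shape)
```

Note that each component requires an $L \times L$ inversion, whereas the variational posterior mean discussed below needs only a single $M \times M$ inversion per point.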
The diagonal matrix $V_q$ contains the variances of the $q$th gaussian component for each dimension of the latent variables. The posterior probability of each of the mixture components, or source states as they are termed in Attias (1999), is denoted by $p(q \mid x_n)$. Whereas with the IFA model the posterior moments are a convex sum of elements related to each mixture component defining the prior density model, the variational method yields a much simpler form, which is a guaranteed lower bound to the sparse Laplacian prior. The similarity between the associated posterior moments of both methods is interesting to note. Another interesting similarity is that in the proposed variational method, the additional parameters are updated according to $\xi_{ln}^2 = E\{s_l^2 \mid x_n\}$, and in IFA for a mixture of zero-mean gaussians, the variance of each gaussian component is updated as $v_{l,q_l} = a^{-1}\sum_{n=1}^{N} p(q_l \mid x_n)\,E\{s_l^2 \mid x_n, q_l\}$, where the normalizing coefficient is $a = \sum_{n=1}^{N} p(q_l \mid x_n)$. Both depend on the posterior variance of the source estimates. Interestingly, in addition to using gaussian mixtures, the modeling of complex priors may also be accomplished by using a mixture based on a number of ICA models, as detailed in Lee, Lewicki, and Sejnowski (2000). It will be of value to investigate the ICA
mixture modeling approach to learning overcomplete representations for natural data. The learning of an overcomplete basis (also termed a dictionary) and the associated coefficients for observed data has been studied by Kreutz-Delgado and Rao (2000). In that work, a general approximate maximum-likelihood method is developed for sparseness-inducing priors that are concave–Schur-concave. As with the other approaches discussed in this article, the nonanalytic nature of the integration required to marginalize the coefficients is the starting point of the method developed. In Kreutz-Delgado and Rao (2000), the prior is approximated as a delta function about the approximate MAP estimate of the coefficients, and a general iterative update for the dictionary $A$ and the associated coefficients is given. For the specific case of the Laplacian prior, the estimated MAP coefficient values yield

$$\hat{E}\{s \mid x_n\} = \Pi_n A^T \left(A \Pi_n A^T + \sigma^2 I\right)^{-1} x_n. \tag{5.3}$$
Because the diagonal matrix $\Pi_n = \mathrm{diag}\left[\,|\hat{E}\{s_1 \mid x_n\}|, \ldots, |\hat{E}\{s_L \mid x_n\}|\,\right]$ appears on the right-hand side of equation 5.3, iterative updating of equation 5.3 is required to obtain $\hat{E}\{s \mid x_n\}$. An isotropic noise model with a preset variance $\sigma^2$ is utilized, and this is not estimated during the learning process. The algorithm for updating the dictionary $A$ is an approximate form of equation 3.7, given by $A^{\mathrm{new}} = \left\{\sum_{n=1}^{N} x_n \hat{E}\{s \mid x_n\}^T\right\}\left\{\sum_{n=1}^{N} \hat{E}\{s \mid x_n\}\,\hat{E}\{s \mid x_n\}^T\right\}^{-1}$, and this is proven to converge to a local optimum of the approximate log-likelihood (Kreutz-Delgado & Rao, 2000). There is a most interesting similarity between this approach and the method proposed in this article. We can write the posterior mean for the variational model as $E\{s \mid x_n\} = \Lambda_n A^T (A \Lambda_n A^T + \Sigma)^{-1} x_n$ and compare it with equation 5.3. Apart from the isotropic noise model used in Kreutz-Delgado and Rao (2000), what is particularly interesting is the diagonal matrix $\Pi_n$ for each observation. This should be compared with the matrix $\Lambda_n$, which contains the absolute values of the variational parameters, which in turn are estimated by the posterior variance of the coefficients. This decoupling of the posterior estimates by the variational parameters obviates the iterative estimation of $E\{s \mid x_n\}$ required by equation 5.3. It is important to stress at this point that the likelihood of linear models as given by equation 2.1 is a highly nonlinear function of the parameter values, and as such the likelihood will have many local optima. The convergence of the EM method to a local optimum of the likelihood is guaranteed and so will depend on the initial parameter values. This dependency on the initial parameter values is also the case with, for example, the gradient method of Lewicki and Sejnowski (2000). There will be cases where the local optima yield poor parameter estimates, and reinitialization of the algorithm will be required.
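The contrast drawn here can be made concrete: because $\Pi_n$ depends on the estimate it produces, equation 5.3 must be iterated to a fixed point, whereas the variational form is a single evaluation once $\Lambda_n$ is known. A sketch of the iteration in equation 5.3 (dimensions, data, the initial estimate, and the stopping tolerance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
M, L = 2, 4
A = rng.normal(size=(M, L))        # dictionary (illustrative)
sigma2 = 0.1                       # preset isotropic noise variance
x_n = rng.normal(size=M)

s_hat = np.ones(L)                 # initial coefficient estimate
for _ in range(100):
    # Pi_n = diag[|E{s_1|x_n}|, ..., |E{s_L|x_n}|], rebuilt from the
    # current estimate on every pass -- this is the coupling the
    # variational parameters remove.
    Pi = np.diag(np.abs(s_hat))
    s_new = Pi @ A.T @ np.linalg.inv(A @ Pi @ A.T + sigma2 * np.eye(M)) @ x_n
    if np.max(np.abs(s_new - s_hat)) < 1e-10:
        break
    s_hat = s_new

print(np.round(s_hat, 3))
```

The iteration typically drives several coefficients toward zero, reflecting the sparseness-inducing character of the Laplacian prior.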
This is a universal problem for maximum-likelihood learning, and methods such as simulated annealing in EM-type methods have been proposed to counteract it (Buhmann, 1998).
6 Conclusion
This article has presented a novel method for learning sparse and overcomplete representations. The main point is that for a particular range of sparse priors on the coefficients, a lower-bound quadratic variational approximation to the prior can be made. This approximation renders all integrals required for marginalization over the latent variables analytically tractable, a property originally exploited for Bayesian regression in Jaakkola and Jordan (1997). Indeed, as the approximation is a bound that can be optimized using the additional variational parameters, an implicit means of improving the quality of the estimated data likelihood has been introduced. This overcomes the problem of the accuracy of the likelihood estimate when computing a coding cost, as in Lewicki and Sejnowski (2000). The proposed method can also be considered a method for performing standard complete linear ICA with additive noise for strictly supergaussian signals.³ This variational method for approximating the prior also has utility in learning nonlinear IFA models using ensemble learning methods (Lappalainen & Honkela, 2000). This will prove useful when modeling sparse latent variables and eliminates the exponential complexity incurred when employing a mixture of gaussians. The proposed EM method may also be used to learn a sparse and overcomplete dictionary for observed signals and then to exploit this sparse representation for BSS, as recently proposed in Zibulevsky and Pearlmutter (2001). In this case, a set of parameters corresponding to the mixing matrix must be estimated, in addition to the basis vectors and the associated coefficients; the EM method can easily be extended to accommodate this.

Appendix A
Consider the following heavy-tailed distribution:

$$p(s) = \frac{1}{Z(\beta)} \cosh^{-\frac{1}{\beta}}(\beta s), \tag{A.1}$$

where $Z(\beta)$ is the normalizing coefficient. The log-prior $\log\{p(s)\} = \sum_{l=1}^{L} \log\{p(s_l)\}$ has Hessian

$$\frac{\partial^2}{\partial s^2} \log\{p(s)\} = -\,\mathrm{diag}\left[\beta\,\mathrm{sech}^2(\beta s_1), \ldots, \beta\,\mathrm{sech}^2(\beta s_L)\right], \tag{A.2}$$

a negative semidefinite matrix; indeed, equation A.1 is convex as a function of $s^2$ (Jaakkola, 1997). Therefore, utilizing conjugate forms of a convex function $h(s)$, we use the following definitions (Jaakkola, 1997):

$$h(s) = \max_{\lambda}\left\{\lambda s^2 - \tilde{h}(\lambda)\right\} \quad \text{and} \quad \tilde{h}(\lambda) = \max_{\xi}\left\{\lambda \xi^2 - h(\xi)\right\}. \tag{A.3}$$

Setting $\log\{p(s)\} = h(s)$ and denoting $E(\lambda) = \lambda \xi^2 - h(\xi)$,

$$\frac{\partial E(\lambda)}{\partial \xi} = 2\lambda\xi - \frac{\partial}{\partial \xi}\log\{p(\xi)\}. \tag{A.4}$$

Defining the score of the prior as $\psi(\xi) = \frac{\partial}{\partial \xi}\log\{p(\xi)\}$, then

$$\lambda = \frac{\psi(\xi)}{2\xi}; \tag{A.5}$$

therefore,

$$h(s) \geq \frac{\psi(\xi)}{2\xi}\left\{s^2 - \xi^2\right\} - \frac{1}{\beta}\log\{\cosh(\beta\xi)\}, \tag{A.6}$$

and so

$$\cosh^{-\frac{1}{\beta}}(\beta s) \geq \cosh^{-\frac{1}{\beta}}(\beta\xi)\exp\left\{\frac{\psi(\xi)}{2\xi}\left(s^2 - \xi^2\right)\right\}. \tag{A.7}$$

Finally, some manipulation yields

$$p(s) \propto \cosh^{-\frac{1}{\beta}}(\beta s) = \max_{\xi}\left[Q(\xi)\,N_s(0, \varsigma^2)\right], \tag{A.8}$$

where $\varsigma^2 = \frac{\xi}{\tanh(\beta\xi)}$ and $Q(\xi) = \cosh^{-\frac{1}{\beta}}(\beta\xi)\exp\left\{\frac{\xi\tanh(\beta\xi)}{2}\right\}\sqrt{2\pi\varsigma^2}$. As $\beta \to \infty$, then,

$$p(s) \propto \exp(-|s|) = \sup_{\xi}\left[Q(\xi)\,N_s(0, \varsigma^2)\right], \tag{A.9}$$

where now $\varsigma^2 = |\xi|$ and $Q(\xi) = \exp\left(-\tfrac{1}{2}|\xi|\right)\sqrt{2\pi|\xi|}$. The multivariate form of equation 3.2 follows from the independence assumption on the basis coefficients.

³ In this text, the term supergaussian refers to signals that have a symmetric and positively kurtotic distribution.

Appendix B
A standard relative likelihood term for the linear model under consideration is

$$Q(\theta^{t+1} \mid \theta^{t}) = \sum_{n=1}^{N} \int p(s \mid x_n, \theta^{t})\,\log\left\{p(x_n, s \mid \theta^{t+1})\right\}\,ds. \tag{B.1}$$

The parameters are $\theta = \left\{A, \Sigma, \{\xi_n\}_{n=1,\ldots,N}\right\}$, and the standard linear model update equations, 3.7 and 3.8, follow from setting the gradient of $Q(\theta^{t+1} \mid \theta^{t})$ with respect to $A$ and $\Sigma$ to zero. Similarly, the fixed point for the variational parameters follows:

$$\frac{\partial Q}{\partial \xi_{ln}} = \frac{\partial}{\partial \xi_{ln}} \int p(s \mid x_n, \theta^{t})\left\{\frac{1}{2}\sum_{l=1}^{L}\frac{\psi(\xi_{ln})}{\xi_{ln}}\left\{s_{ln}^2 - \xi_{ln}^2\right\} - \frac{1}{\beta}\log\{\cosh(\beta\xi_{ln})\}\right\} ds = \int p(s \mid x_n, \theta^{t})\,\frac{1}{2}\left[\frac{\psi'(\xi_{ln})}{\xi_{ln}} - \frac{\psi(\xi_{ln})}{\xi_{ln}^2}\right]\left\{s_{ln}^2 - \xi_{ln}^2\right\} ds,$$

where the terms arising from differentiating $-\xi_{ln}^2$ and the $\log\cosh$ term combine into the $-\xi_{ln}^2$ part of the bracket. Setting the gradient to zero and noting that $\int p(s \mid x_n, \theta^{t})\,ds = 1$, then

$$\xi_{ln}^2 = \int p(s \mid x_n, \theta^{t})\,s_{ln}^2\,ds, \tag{B.2}$$

and so the vector form of equation 3.9 follows.

Acknowledgments
This work has been carried out under the support of the Finnish National Technology Agency TEKES. Thanks to A. Kabán for helpful comments and to Harri Valpola for stimulating discussions and assistance with the simulations reported. I am also grateful to the anonymous reviewers for their insightful and helpful comments regarding an earlier version of this article.

References

Amari, S. I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Bell, A., & Sejnowski, T. J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bermond, O., & Cardoso, J. F. (1999). Approximate likelihood for noisy mixtures. In Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation (pp. 325–330). Aussois, France.
Buhmann, J. M. (1998). Data clustering and data visualisation. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer Academic.
Chen, S., Donoho, D. L., & Saunders, M. A. (1996). Atomic decomposition by basis pursuit (Tech. Rep.). Stanford, CA: Department of Statistics, Stanford University.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1–38.
Jaakkola, T. S. (1997). Variational methods for inference and estimation in graphical models. Unpublished doctoral dissertation, MIT.
Jaakkola, T. S., & Jordan, M. I. (1997). Bayesian logistic regression: A variational approach. In Proceedings of the 1997 Conference on Artificial Intelligence and Statistics (pp. 283–294). Fort Lauderdale, FL.
Kreutz-Delgado, K., & Rao, B. D. (2000). FOCUSS-based dictionary learning algorithms. In Wavelet Applications in Signal and Image Processing VII: Proc. of SPIE (Vol. 4119).
Lappalainen, H., & Honkela, A. (2000). Bayesian nonlinear independent component analysis by multi-layer perceptrons. In M. Girolami (Ed.), Advances in independent component analysis (pp. 93–120). Berlin: Springer-Verlag.
Lee, T. S. (1996). Image representation using 2-D Gabor wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10), 959–971.
Lee, T. W., Lewicki, M., Girolami, M., & Sejnowski, T. (1999). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6, 87–90.
Lee, T. W., Lewicki, M., & Sejnowski, T. (2000). ICA mixture models for unsupervised classification of non-gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1–12.
Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America: Optics, Image Science, and Vision, 16(7), 1587–1601.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Rockafellar, R. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2), 587–607.
Zibulevsky, M., & Pearlmutter, B. A. (2001). Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4).

Received July 10, 2000; accepted January 25, 2001.
LETTER
Communicated by Todd Leen
Random Embedding Machines for Pattern Recognition

Yoram Baram
Department of Computer Science, Technion, Israel Institute of Technology, Haifa 32000, Israel

Real classification problems involve structured data that can be essentially grouped into a relatively small number of clusters. It is shown that, under a local clustering condition, a set of points of a given class, embedded in binary space by a set of randomly parameterized surfaces, is linearly separable from other classes, with arbitrarily high probability. We call such a data set a local relative cluster. The size of the embedding set is shown to be inversely proportional to the squared local clustering degree. A simple parameterization by embedding hyperplanes, implementing a voting system, results in a random reduction of the nearest-neighbor method and leads to the separation of multicluster data by a network with two internal layers. This represents a considerable reduction of the learning problem with respect to known techniques, resolving a long-standing question on the complexity of random embedding. Numerical tests show that the proposed method performs as well as state-of-the-art methods and in a small fraction of the time.

1 Introduction
Neural Computation 13, 2533–2548 (2001) © 2001 Massachusetts Institute of Technology

Finding a separable and generalizable representation for class-labeled data is an open problem, central to the areas of pattern recognition, classification, neural networks, and learning theory. It has been motivated by Rosenblatt's (1958) perceptron, consisting of a single linear threshold unit and a learning algorithm, which is known to find a solution in finite time when the classes are linearly separable (Minsky & Papert, 1988). Rosenblatt further proposed using a layer of randomly parameterized linear threshold units and separating the resulting representations by the perceptron learning rule. A similar approach, replacing the perceptron learning rule with a stabilized version of it, was reported to perform well experimentally on complex problems by Gallant and Smith (1987), who, for a training set of size M, proposed an empirical estimate of M/4 embedding cells. Yet while exact embedding by hyperplanes has been addressed in learning theory (Pitt & Warmuth, 1990), a theory justifying the use of random embedding appears to have been lacking.

The effectiveness of networks having a single internal layer of linear threshold units has been questioned in several works. Gori and Scarselli (1998) have argued that such networks are unable to produce closed separation surfaces reliably. Bartlett and Ben-David (1999) have recently shown that the difficulty of finding a separating set of more than a single linear threshold unit increases exponentially with the input dimension, even when a solution exists. Such complexity seems to contradict the underlying motivation for the introduction of artificial neural networks: fast and simple learning algorithms that may be biologically plausible. Indeed, more recent efforts in the development of pattern recognition tools have diverted from that original vision, aiming mainly at increased accuracy, seemingly giving up on speed and simplicity. We propose that many of the theoretical and the practical shortcomings of the original concepts can be resolved by observing that most pattern recognition problems deal with highly structured data and that, as witnessed in many instances (see, e.g., Motwani & Raghavan, 1995, and the simulated annealing method, Kirkpatrick, Gelatt, & Vecchi, 1983), random parameterization reduces complexity and increases computation speed. We propose local grouping of the data about a subset of points, sequentially selected at random from the remaining set of ungrouped points. We call these points clustering points and the associated data groups local relative clusters. We introduce a measure of the relevant structural property, which we call the local clustering degree, and show that, under a condition on this measure, a local relative cluster can be linearly separated and generalized, with arbitrarily high probability, by a system of randomly parameterized separation surfaces, whose number is inversely proportional to the squared local clustering degree. We consider, in particular, two-class problems embedded in binary space by hyperplanes, but the analysis applies similarly to multiclass problems and to embedding by other surfaces. We show that, under local relative clustering, linear separation is implemented by a simple voting mechanism.
This naturally suggests the separation of multiclustered data by a network having two internal layers of linear threshold units, which implements a reduced version of the nearest-neighbor classifier. Performance on real data is finally compared with that of the nearest-neighbor method and of support vector machines.

2 Local Relative Clustering
We consider two disjoint sets $U^+$ and $U^-$ in the $n$-dimensional real space $R^n$ and their union set $U$. Suppose that the parameters $v \in R^n$, $b \in R^1$ of a hyperplane $(v, u) = b$, where $(v, u)$ denotes the inner product, obey some probability law, which we call the embedding rule. Let the probability that a given pair of points $u^{(i)}, u^{(j)} \in U$ are on opposite sides of the hyperplane be $p(u^{(i)}, u^{(j)})$. Given an embedding rule, a subset $U_0 \subset U^+$ is said to be locally relatively clustered of degree $\varepsilon$ at a point $u^* \in R^n$ with respect to $U^-$ if there exists some $\varepsilon > 0$ such that

$$p\left(u^*, u^{(j)}\right) - p\left(u^*, u^{(k)}\right) \geq \varepsilon \quad \text{for any } u^{(j)} \in U^-,\ u^{(k)} \in U_0. \tag{2.1}$$
Figure 1: A clustering point and an embedding hyperplane.
A point $u^*$ that satisfies equation 2.1 is said to be a clustering point of $U_0$. The relevance of these definitions to the notion of clustering is quite obvious. The condition states that each of the points of the locally clustered set is more likely to be on the same side of a random hyperplane as the clustering point than any of the points of the other class. A random hyperplane separates a local relative cluster of a given class from the other class in a probabilistic sense.

Example 1. Figure 1 shows two sets of points, $U^+$ and $U^-$, and a point $*$. Let the embedding rule select a point of $U^-$ at random (that is, with probability $1/M^-$, where $M^-$ is the cardinality of $U^-$), connect it to $*$ by a straight line segment, and pass a hyperplane perpendicular to this line segment at the midpoint. By this embedding rule, $*$ is a clustering point of $U^+$. In fact, by this embedding rule, every point of $U^+$ is a clustering point of $U^+$.

Now suppose that $N$ random hyperplanes, parameterized by $v^{(i)}, b^{(i)}$, $i = 1, \ldots, N$, are independently generated according to the embedding rule. A member $u$ of $U$ will be transformed into (or embedded in) the binary space $\{\pm 1\}^N$ by a layer of $N$ linear threshold units, the $i$th of which performs the function

$$x_i = \begin{cases} 1 & \text{if } \left(v^{(i)}, u\right) \geq b^{(i)} \\ -1 & \text{if } \left(v^{(i)}, u\right) < b^{(i)}, \end{cases} \tag{2.2}$$
where $(v^{(i)}, u)$ denotes the inner product between the two vectors. The resulting vector $x$ is the internal representation of the input vector $u$. An internal representation $x$ will be transformed into a scalar output $y$ by

$$y = \begin{cases} 1 & \text{if } (w, x) \geq 0 \\ -1 & \text{if } (w, x) < 0. \end{cases} \tag{2.3}$$
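Equations 2.2 and 2.3 can be sketched directly, using the midpoint-hyperplane embedding rule described in this section ($v^{(i)} = u^{(i)} - u^*$, $b^{(i)} = 0.5(u^{(i)} + u^*)\cdot(u^{(i)} - u^*)$) and the clustering point's own internal representation as the output weights $w$. The two gaussian clusters below are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20                                     # embedding set size
U_plus = rng.normal(loc=(2, 2), scale=0.3, size=(30, 2))
U_minus = rng.normal(loc=(-2, -2), scale=0.3, size=(30, 2))

u_star = U_plus[0]                         # clustering point picked from U+
picks = U_minus[rng.integers(0, len(U_minus), size=N)]
V = picks - u_star                         # v^(i) = u^(i) - u*
b = 0.5 * np.sum((picks + u_star) * (picks - u_star), axis=1)  # midpoint offsets

def embed(u):
    """Equation 2.2: x_i = +1 if (v^(i), u) >= b^(i), else -1."""
    return np.where(V @ u >= b, 1, -1)

w = embed(u_star)                          # output weights: the clustering point's code
classify = lambda u: 1 if w @ embed(u) >= 0 else -1   # equation 2.3: a vote

acc_plus = np.mean([classify(u) == 1 for u in U_plus])
acc_minus = np.mean([classify(u) == -1 for u in U_minus])
print(acc_plus, acc_minus)
```

The output unit simply counts how many random hyperplanes place the input on the clustering point's side, which is the voting mechanism referred to in the introduction.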
Figure 2: Examples of local relative clusters.
$w$ is a vector of output weights. An input $u$ will be said to be classified correctly if $y = 1$ when $u \in U^+$ and $y = -1$ when $u \in U^-$.

How can a clustering point and embedding hyperplanes be selected? There are many possibilities. The following seems to be a natural choice: Select a data point $u^*$ from a given class as a clustering point. Select a set of points $u^{(i)}$, $i = 1, \ldots, N$, of the other class at random (independently, with the probability that a given point is selected being the reciprocal of the number of points). Pass an embedding hyperplane between each of these points and the clustering point, intersecting the line segment connecting them perpendicularly at the midpoint. The equation of the $i$th hyperplane is $(v^{(i)}, u) = b^{(i)}$, where $v^{(i)} = u^{(i)} - u^*$ and $b^{(i)} = 0.5\left(u^{(i)} + u^*\right)\cdot\left(u^{(i)} - u^*\right)$.

What is the local relative cluster $U_0 \subset U^+$ associated with a given clustering point $u^*$? It is a subset $U_0 \subset U^+$ for which condition 2.1 holds for some $\varepsilon$. Clearly, every data point has an associated local relative cluster of some degree $\varepsilon > 0$ of the same class. This set is not empty, since it contains at least the point itself (for which the clustering degree is 1). The data set is assumed to be given, so there is nothing random about it. Randomness is introduced through the selection of the data points used in the selection of the clustering point and in the parameterization of the embedding set.

Example 2. Consider the four cases depicted in Figure 2. In Figure 2a, it is quite obvious that the entire set $U^+$ is locally relatively clustered (for
Random Embedding Machines for Pattern Recognition
2537
some $\varepsilon > 0$) with respect to $U^-$ at each of the points of $U^+$. In Figures 2a through 2d, local relative clusters of degree $\varepsilon > 0$ of the encircled points are surrounded by dotted lines. Clearly, the definition of a local relative cluster is not restricted to embedding sets consisting of hyperplanes alone. Other surfaces, open and closed (e.g., spheres, rectangles), may be used. The analysis of the following sections holds in this general context.

3 Linear Separability of the Internal Representations Corresponding to a Local Relative Cluster
The internal representations $x^{(i)}$, $i = 1, \ldots, M$, form two sets, $X^+$ and $X^-$, in the space $\{\pm 1\}^N$, whose members correspond to $U^+$ and $U^-$, respectively. Two sets $X_1$ and $X_2$ in $\{\pm 1\}^N$ are said to be linearly separable if there exists some hyperplane in $R^N$ having $X_1$ on one of its sides and $X_2$ on the other. Practical separation will require some margin between the separating hyperplane and each of the two sets, as provided by the following result.

Theorem 1. If a set $U_0 \subset U^+$ is locally relatively clustered of degree $\varepsilon$ at some point $u^* \in R^n$ with respect to $U^-$ and

$$N > \frac{\ln(1/P)}{2\varepsilon^2}, \tag{3.1}$$

then, with probability greater than $(1 - P)^2$, there exists some $d > 0$ such that

$$\left(x^*, x^{(k)}\right) - 0.5N \leq -dN \quad \text{for all } x^{(k)} \in X^- \tag{3.2}$$

and

$$\left(x^*, x^{(j)}\right) - 0.5N \geq dN \quad \text{for all } x^{(j)} \in X_0, \tag{3.3}$$

where $x^*$ is the internal representation of $u^*$, and $X_0$ and $X^-$ are the sets of internal representations corresponding to $U_0$ and $U^-$, respectively.

Theorem 1 states that the sample complexity of the separation problem is inversely proportional to the squared local clustering degree. An immediate consequence of theorem 1 is

Corollary 1. Suppose that the set $U_0 \subset U^+$ is locally relatively clustered at each of its points with respect to the set $U^-$. Then, under condition 3.1, the sets $X_0$ and $X^-$ are linearly separable by any of the internal representations of $U_0$, with probability greater than $(1 - P)^2$.
Corollary 1 suggests that, under local relative clustering, one of the internal representations of the set $U_0$ be used as the output weights vector $w$. Clearly, by the embedding method proposed in the previous section (passing the embedding hyperplanes perpendicularly through the midpoints of the line segments connecting the local clustering point to points of $U^-$), the internal representation of the local clustering point is simply a vector of dimension $N$ whose components are all $-1$. This is a considerable simplification over a possible application of known learning paradigms, such as Rosenblatt's rule, to the internal representations.

Good performance on the labeled data is of little use if it cannot be extended to new, unlabeled points. Such extension has been called generalization. According to Vapnik and Chervonenkis (1971), a classifier generalizes if, for any positive scalar $\varepsilon$,

$$\lim_{M \to \infty} P\left\{\sup_{h}\left|P(h) - \hat{P}(h)\right| > \varepsilon\right\} = 0, \tag{3.4}$$

where $h$ is the parameter vector, $P(h)$ is the underlying probability of misclassification, and $\hat{P}(h)$ is the empirical probability of misclassification calculated from a training set. We have the following result:

Theorem 2. The classifier defined by equations 2.2 and 2.3, with $N$ linear threshold units, $N$ satisfying equation 3.1, generalizes a local cluster, provided that the local clustering degree remains bounded above zero as the data size increases.
The situation assumed in theorem 2 can be visualized by increasing the number of points of $U^+$ in each of the local clusters surrounded by the dotted lines in Figure 2.

4 Random Embedding Machines for the Classification of Multicluster Data

A local classifier, consisting of a single internal layer of linear threshold units and a single external such unit, is defined by a clustering point, randomly selected from $U^+$, and by the set of embedding hyperplanes. The input weights of the internal units are the parameters of the embedding hyperplanes, and the input weights of the external unit, which we call a local clustering unit, are all $-1$. Now suppose that there are training points of $U^+$ that are not assigned correctly by this classifier. Of these points, select one, make it a second clustering point, and construct its embedding set. If this new local classifier does not correctly classify at least a specified number of the misclassified points, it is eliminated ("perform or die"). More clustering points, and hence clustering units, can be added with embedding sets, until only a certain fraction of the data points remains misclassified. The clustering points represent a second internal layer of linear threshold
Figure 3: Random embedding machine with two internal layers. The first layer represents the embedding set, the second the clustering points.
units, connected to an external unit that performs the "or" operation on their outputs (which is achieved by a threshold value of 1). Such a network is depicted in Figure 3. Consider again the class arrangements depicted in Figure 2. Suppose that a clustering point is selected at random from the set $U^+$ and that an embedding set is defined by hyperplanes perpendicular to the lines connecting the clustering point to randomly selected data points of $U^-$ at the midpoints. Case a will be solved by a network with a single internal layer of linear threshold units and a single external such unit. Case b is also likely to be resolved by such a network but may require a second clustering point, depending on the choice of the first. Case c will require several clustering points, hence a second internal layer. Case d is likely to be resolved by two clustering points, one for each of the local clusters of $U^+$, in the following manner: The first clustering point is picked from one of the two local clusters of $U^+$, say, the upper left. It is represented by the upper unit of the second internal layer in Figure 3. Most of the points of the lower right local cluster of $U^+$ will not be assigned correctly by the first local clustering unit. A second clustering point will be selected at random from those points. It is represented by the lower unit of the second internal layer in Figure 3. Most of the points of that local cluster will be assigned correctly by this unit. The external unit, connected to the second layer by unity weights and having a unity threshold, will correctly assign most of the data set. A network of the type depicted in Figure 3 may be constructed for each class (a multiclass problem can be treated similarly to two-class problems). The external unit for each class may output the maximal value of the inner product between the representation in the first internal layer and the weights vectors of the local clustering units.
The output units for the different classes will then compete, the one with the largest output value winning. If a point is assigned to both classes or to none, it will remain unclassified. Indecision can be beneficial (Baram, 1998). Since each of the local classifiers employs
2540
Yoram Baram
random parameters and embeds the input vector in N-dimensional binary space, we call the network a random embedding machine.

4.1 A Value for N. A question of obvious practical importance in designing a random embedding machine is what value to choose for N, the embedding set size. Clearly, we would like to choose the minimal N allowed by the bound, equation 3.1. Consider first the numerator of the right-hand side of equation 3.1. For a wide range of P values, say, 10⁻⁶ < P < 10⁻², the value of ln(1/P) varies within a relatively small range, 4.6 < ln(1/P) < 13.8. An intermediate value, ln(1/P) = 10, corresponding to some P < 10⁻⁴, yields a linear separability probability of value (1 − P)² > 0.9998. Turning to the denominator of equation 3.1, we note that, for a given clustering point, each value of ε defines a local relative cluster, as discussed in section 2. Clearly, 0 ≤ ε ≤ 1. A large value of ε will yield, through equation 3.1, a small value for N. However, the corresponding local relative cluster may be too small, and an unnecessarily large number of such clusters may be required in order to contain most of the training set, leading to an unnecessarily large classifier. On the other hand, a small value for ε may yield, through equation 3.1, an unnecessarily large N. Yet the corresponding local relative clusters may still be small, and the resulting classifier may again be unnecessarily large. A sensible choice for ε is, then, an intermediate value, say, ε = 0.5, which, together with the requirement P < 10⁻⁴, yields
N = 20.  (4.1)
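The numerical claims above are easy to check (a quick sketch; the constants are taken from the text):

```python
import math

# Range of ln(1/P) over 1e-6 < P < 1e-2, as quoted in the text.
print(math.log(1 / 1e-2))   # about 4.6
print(math.log(1 / 1e-6))   # about 13.8

# Linear separability probability (1 - P)^2 for P = 1e-4.
print((1 - 1e-4) ** 2)      # about 0.9998
```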
On a scale of 10, 20, 30, and so on, the value N = 20 produced the best results for both synthetic and real-life examples, presented in section 5.

Random Embedding Machines for Pattern Recognition

4.2 On the Computational Complexity. It should be quite obvious that the computational complexity of the algorithm described in the previous section depends not only on the size of the embedding set but also on the number of clustering points that will be generated, which depends, in turn, on the structure of the data. In order to formalize a correspondence between the structure of the data and the number of clustering points, we introduce the following definition: A data set is uniformly locally clustered of degree ε and order K if there exist some ε and some K such that the data set is contained in the union of K local relative clusters of degree ε. In each of the cases depicted in Figure 2, define the groups of points of U⁺ that are not contained in the areas surrounded by the dotted lines as additional local relative clusters of degree ε, where ε is no greater than the clustering degrees of the surrounded sets. Then Figures 2a and 2b show uniformly locally clustered data sets U⁺ of degree ε and orders 1 and 2, respectively, and such data sets of order 3 are shown, respectively, in Figures 2c and 2d. An immediate consequence of theorem 1 is that if the data are uniformly locally clustered of degree ε and order K, the complexity of the training algorithm, which "guarantees" correct classification with probability greater than (1 − P)², is O((2 ln(1/P)/ε²) M), and that of the testing algorithm (the operational complexity) is O((2 ln(1/P)/ε²) K).
4.3 Voting by Weak Local Experts and Reduced Nearest Neighbor.
Given a new input u, each embedding unit finds whether u is on the same side of the corresponding hyperplane as the associated clustering point u*. For embedding hyperplanes that cut the line segments between randomly selected points of U⁻ and u* perpendicularly at the midpoint, the internal representation of u* is clearly a vector of −1 components. This means that the external unit corresponding to a relative local cluster performs a majority vote among the embedding units. Each of the embedding units is equipped with the quality that it classifies correctly at least one pair of points of the opposite classes, that is, the pair that defines the corresponding embedding hyperplane. This qualifies the unit as a "weak local expert," with a presumed probability of success slightly higher than 0.5 with respect to a local relative cluster. Theorem 1 tells us that the combined operation of a sufficiently large number of these units by voting brings the probability of success to an arbitrarily high value. The minimum-distance local separator of a point u of U⁺ from U⁻ is the intersection of the half-spaces defined by the hyperplanes cutting the line segments connecting u to each of the points of U⁻ perpendicularly at the midpoint. The nearest-neighbor criterion divides Rⁿ into minimum-distance local separators of the data points. The hyperplanes associated with the separator of a point u embed it in a space of dimension equal to the size of U⁻. The reduced nearest-neighbor classifier (Baram, 2000) employs a subset of the local separators of the points of U⁺, which covers U⁺ (the latter property makes the classifier consistent). Clearly, only a few of the embedding hyperplanes end up contributing to the shape of a minimum-distance local separator. Now suppose that instead of starting with all the embedding hyperplanes, only a randomly selected subset of them is used in the construction of a minimum-distance local separator.
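The minimum-distance local separator has a convenient computational form: membership in every bisecting half-space reduces to distance comparisons (an illustrative sketch; the names are ours):

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def in_separator(u, x, neg_points):
    """u lies in the minimum-distance local separator of x exactly when
    u is on x's side of every hyperplane bisecting a segment [x, v],
    i.e., when u is closer to x than to every opposite-class point v."""
    return all(dist2(u, x) < dist2(u, v) for v in neg_points)
```

Passing a random subset of neg_points instead of the full set gives the random approximation described above.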
Also, as in the reduced nearest-neighbor classifier, instead of employing the local separators of all the points of U⁺, only a few are used, provided that their union covers U⁺ (hence, the classifier is still consistent). This is the proposed random embedding machine. It can be viewed as a random approximation of the reduced nearest-neighbor classifier. Finally, we note that a random embedding machine is fundamentally different from the networks proposed by Gallant (1986), Gallant and Smith (1987), Mézard and Nadal (1989), Marchand, Golea, and Rujan (1990), and Frean (1990), all of which involve variations of the perceptron learning rule and consequently are rather slow. The random embedding machine achieves its remarkably fast performance by a different "learning" concept,
Figure 4: Data for example 1.
applying very simple operations to the data points.

5 Examples

5.1 Example 1. Two-class data in an "XOR" arrangement were synthetically generated by independently sampling four gaussian densities of variance 1, as shown in Figure 4. In the four cases, the centers of the densities were at distances 10, 6, 4, and 2, respectively, from each other. The training set and the test set consisted of 500 points each. Ten independent samples of the data sets were generated, and 10 independent selections of the embedding set associated with each clustering point were generated for each sample. The results were averaged over these 100 cases, consisting of 1000 points each. A random embedding machine consisting of two internal layers with four local clustering units (one per local cluster) evolved through the mechanism described in the previous section (starting with one such unit per class, adding units if points of that class are misclassified). Embedding sets of sizes 10, 20, 30, . . . , 100 were examined. The number of classification errors was minimal for N = 20 in the harder cases (Figures 4c and 4d), while it was invariant to the embedding set size in the easier cases (Figures 4a and 4b).

Table 1: Classification Results for Example 1.

                                      Case a      Case b      Case c      Case d
REM results (N = 20)
  Success                             100%        99.35%      95.15%      67.95%
  Flops                               123,580     107,093     109,671     93,401
Nearest-neighbor results
  Success                             100%        99.32%      93.2%       65.88%
  Flops                               2,250,668
SVM results (gaussian RBF kernels)
  Success                             99.38%      98.80%      92.74%      66.67%
  Flops                               1,224,860   1,198,435   1,190,847   1,833,650

The classification results for the random embedding machine (REM), the nearest-neighbor classifier, and a support vector machine (SVM; Vapnik, 1995) employing gaussian radial basis functions with optimized parameters (Joachim, 1998) are given in Table 1. The REM performed with high precision when the classes were highly clustered, and, as could be expected, performance degraded as the classes became highly mixed. The results achieved by the REM were better than those obtained by the nearest-neighbor classifier and by the SVM in the harder cases, and similar in the easier cases. The number of operations required by the REM was considerably smaller than those required by the other two methods in all cases. Testing was practically instantaneous for both the REM and the SVM classifiers.

5.2 Example 2. The Pima Indians Diabetes Database (UCI, 1999) has 768 instances of eight real-valued measurements, corresponding to 268 ill patients and 500 healthy ones. Seven exclusive test sets were selected for cross-validation, each consisting of 100 instances, the rest of the data serving in each case as the training set. As in the case of the previous example, best results were obtained for N = 20. A REM, having four local clustering units in the second internal layer, evolved through the construction mechanism described in the previous section. The classification results are given in Table 2, in comparison to those of the nearest-neighbor method and the optimized SVM employing radial basis functions.

Table 2: Classification Results for Example 2.

              REM (N = 20)    Nearest-Neighbor    SVM
Success       65.51%          66.76%              65.57%
Flops         276,465         2,204,540           1,723,967

The REM required a considerably smaller number of operations, yet its accuracy is comparable to that produced by the other two algorithms.

6 Conclusion
We have shown that separation and generalization can be achieved with arbitrarily high probability by embedding the data in binary space employing a system of elementary separation surfaces, such as hyperplanes, whose number depends on a local clustering property. This leads to the classification of multicluster data by a random embedding machine, whose performance is remarkably faster than that of known methods, with little cost in accuracy.

Appendix

A.1 Proof of Theorem 1. Let x⁽ʲ⁾ be the point closest to x* in X⁻, and let x⁽ˡ⁾ be the point farthest from x* in X₀. Let us define N variables ξₖ, k = 1, …, N, such that

ξₖ = 0 when x*ₖ = x⁽ʲ⁾ₖ,  ξₖ = 1 when x*ₖ ≠ x⁽ʲ⁾ₖ,  (A.1)

and N variables ζₖ, k = 1, …, N, such that

ζₖ = 0 when x*ₖ = x⁽ˡ⁾ₖ,  ζₖ = 1 when x*ₖ ≠ x⁽ˡ⁾ₖ.  (A.2)

Then the Hamming distance between x* and x⁽ʲ⁾ is

d_h(x*, x⁽ʲ⁾) = Σ_{k=1}^{N} ξₖ,  (A.3)

and the Hamming distance between x* and x⁽ˡ⁾ is

d_h(x*, x⁽ˡ⁾) = Σ_{k=1}^{N} ζₖ.  (A.4)

Note that the ξₖ, k = 1, …, N, are statistically independent, due to the statistical independence between the embedding hyperplanes (statistical independence between the hyperplanes follows from the statistically independent choices of the points of U⁻ used, together with u*, in defining the hyperplanes, as described in section 2). Similarly, the ζₖ, k = 1, …, N, are statistically independent. Let

pₖ(x*, x⁽ʲ⁾) = P{ξₖ = 1}  (A.5)

and

p(x*, x⁽ʲ⁾) = (1/N) Σ_{k=1}^{N} pₖ(x*, x⁽ʲ⁾).  (A.6)

Also let

pₖ(x*, x⁽ˡ⁾) = P{ζₖ = 1}  (A.7)

and

p(x*, x⁽ˡ⁾) = (1/N) Σ_{k=1}^{N} pₖ(x*, x⁽ˡ⁾).  (A.8)

Clearly, the relative clustering condition implies that there exists some δ > 0 (in particular, δ = (p(x*, x⁽ʲ⁾) + p(x*, x⁽ˡ⁾))/2) such that p(x*, x⁽ᵏ⁾) < δ − ε/2 for all x⁽ᵏ⁾ ∈ X₀ and p(x*, x⁽ᵏ⁾) > δ + ε/2 for all x⁽ᵏ⁾ ∈ X⁻. The probability that the entire set X⁻ is at a Hamming distance no smaller than δN from x* satisfies

P{d_h(x*, X⁻) ≥ δN} = 1 − P{d_h(x*, x⁽ʲ⁾) < δN}.  (A.9)

Employing Chernoff's bound (Kearns & Vazirani, 1994),

P{S < (p − γ)N} ≤ e^{−2γ²N},  (A.10)

with S = Σ_{k=1}^{N} χₖ, where the χₖ are Bernoulli trial (coin tossing) variables, p = E(χₖ) = Pr(χₖ = 1), and 0 < γ ≤ 1, we obtain

P{d_h(x*, x⁽ʲ⁾) < δN} < e^{−2(p(x*, x⁽ʲ⁾) − δ)²N},  (A.11)

yielding, since p(x*, x⁽ʲ⁾) − δ > ε/2,

P{d_h(x*, x⁽ʲ⁾) < δN} < e^{−ε²N/2}.  (A.12)

The requirement

P{d_h(x*, x⁽ʲ⁾) < δN} < P  (A.13)

is satisfied by

N > 2 ln(1/P)/ε²,  (A.14)

which will give

P{d_h(x*, X⁻) ≥ δN} > 1 − P.  (A.15)

The probability that the entire set X₀ is at a distance no greater than δN from x* satisfies

P{d_h(x*, X₀) ≤ δN} = 1 − P{d_h(x*, x⁽ˡ⁾) > δN}.  (A.16)

Employing the other side of Chernoff's bound (Kearns & Vazirani, 1994),

P{S > (p + γ)N} ≤ e^{−2γ²N},  (A.17)

and following arguments similar to those used above, we obtain

P{d_h(x*, X₀) ≤ δN} > 1 − P  (A.18)

if

N > 2 ln(1/P)/ε².  (A.19)

From equation A.15, we have, with probability greater than 1 − P,

d_h(x*, x⁽ᵏ⁾) = (1/4)‖x* − x⁽ᵏ⁾‖² ≥ δN  for all x⁽ᵏ⁾ ∈ X⁻,  (A.20)

yielding, with probability greater than 1 − P,

(1/2)⟨x*, x⁽ᵏ⁾⟩ − 0.5N ≤ −δN  for all x⁽ᵏ⁾ ∈ X⁻.  (A.21)

Similarly, it follows from equation A.18 that, with probability greater than 1 − P,

d_h(x*, x⁽ᵏ⁾) = (1/4)‖x* − x⁽ᵏ⁾‖² ≤ δN  for all x⁽ᵏ⁾ ∈ X₀,  (A.22)

yielding, with probability greater than 1 − P,

(1/2)⟨x*, x⁽ᵏ⁾⟩ − 0.5N ≥ −δN  for all x⁽ᵏ⁾ ∈ X₀.  (A.23)

Since the events A.21 and A.23 are independent, the probability that both occur is greater than (1 − P)².
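Chernoff's bound A.10 can be checked empirically with a small simulation (the parameters below are illustrative choices of ours, not from the proof):

```python
import math
import random

def chernoff_lower_tail(N, p, gamma, trials=2000, seed=1):
    """Estimate P{S < (p - gamma) N} for S a sum of N Bernoulli(p)
    variables, and return it together with the bound exp(-2 gamma^2 N)."""
    rng = random.Random(seed)
    hits = sum(
        sum(rng.random() < p for _ in range(N)) < (p - gamma) * N
        for _ in range(trials)
    )
    return hits / trials, math.exp(-2 * gamma * gamma * N)
```

For N = 80, p = 0.6, γ = 0.35, the bound is exp(−19.6), and the estimated tail probability stays below it.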
A.2 Proof of Theorem 2. Vapnik and Chervonenkis (1971) showed that, for any distribution and a VC-dimension d,

|P(h) − P̂(h)| < √((d log(M/d) + log(1/δ))/M),  (A.24)

with probability greater than 1 − δ. Baum and Haussler (1989) showed that the VC-dimension of a network of K linear threshold units and W weights is bounded as

d ≤ 2W log₂(eK),  (A.25)

where e is the base of the natural logarithm. Since, by theorem 1, the classifier has N units and O(N²) weights, where N is independent of M, the bound in equation A.25 is independent of M, and the bound in equation A.24 vanishes as M → ∞.

Acknowledgments
I thank Ran El-Yaniv, Mark Zlochin, and the referees for their very helpful comments.

References

Baram, Y. (1998). Partial classification: The benefit of deferred decision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(8), 769–776.
Baram, Y. (2000). A geometric approach to consistent classification. Pattern Recognition, 33, 177–184.
Bartlett, P., & Ben-David, S. (1999). Hardness results for neural network approximation problems. Paper presented at EuroCOLT 99.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2(2), 198–209.
Gallant, S. I. (1986). Optimal linear discriminants. In Proc. Eighth International Conference on Pattern Recognition (pp. 28–31). Paris.
Gallant, S. I., & Smith, D. (1987). Random cells: An idea whose time has come and gone . . . and come again? In Proc. of the IEEE First Int. Conf. on Neural Networks (pp. 671–678). San Diego, CA.
Gori, M., & Scarselli, F. (1998). Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Trans. on Pattern Analysis and Machine Intelligence, 20, 1121–1132.
Joachim, T. (1998). Available online at www-ai.informatik.uni-dortmund.de/FORSCHUNG/VERFAHREN/SVM_LIGHT/svm_light.eng.html.
Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, MA: MIT Press.
Kirkpatrick, S., Gelatt Jr., C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Marchand, M., Golea, M., & Rujan, P. (1990). A convergence theorem for sequential learning in two-layer perceptrons. Europhysics Letters, 11(6), 487–492.
Mézard, M., & Nadal, J. P. (1989). Learning in feedforward layered networks: The tiling algorithm. J. of Physics A, 22, 2191–2204.
Minsky, M. L., & Papert, S. A. (1988). Perceptrons. Cambridge, MA: MIT Press.
Motwani, R., & Raghavan, P. (1995). Randomized algorithms. Cambridge: Cambridge University Press.
Pitt, L., & Warmuth, M. (1990). Prediction preserving reducibility. J. of Computer and System Sciences, 41, 430–467.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of the relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
UCI. (1999). Machine learning databases. Available online at: www.ics.uci.edu/AI/ML/Machine-Learning.html.

Received December 15, 1999; accepted January 15, 2001.
LETTER
Communicated by David MacKay
Manifold Stochastic Dynamics for Bayesian Learning

Mark Zlochin
Yoram Baram
Department of Computer Science, Technion — Israel Institute of Technology, Technion City, Haifa 32000, Israel

We propose a new Markov Chain Monte Carlo algorithm, which is a generalization of the stochastic dynamics method. The algorithm performs exploration of the state-space using its intrinsic geometric structure, which facilitates efficient sampling of complex distributions. Applied to Bayesian learning in neural networks, our algorithm was found to produce results comparable to the best state-of-the-art method while consuming considerably less time.

1 Introduction
In the Bayesian framework, predictions are made by integrating the function of interest over the posterior parameter distribution, which is the normalized product of the prior distribution and the likelihood. Since, in most problems, the posterior distributions are too complex for the integrals to be calculated analytically, approximations are needed. Early work in Bayesian learning for nonlinear models (Buntine & Weigend, 1991; MacKay, 1992) used gaussian approximations to the posterior parameter distribution. However, the gaussian approximation may be poor, especially for multimodal distributions. The Hybrid Monte Carlo (HMC) method (Duane, Kennedy, Pendleton, & Roweth, 1987), introduced to the neural network community by Neal (1993), deals more successfully with multimodal distributions, but it can be rather time-consuming, at least partly due to the anisotropic character of the posterior distribution: the density changes rapidly in some directions while remaining almost constant in others. We present a novel algorithm, Manifold Stochastic Dynamics, which overcomes the problem of anisotropic posterior distributions by using the intrinsic geometric structure of the model space. The article is structured as follows. Section 2 contains a brief description of the Bayesian approach. In section 3 we give an overview of Markov Chain Monte Carlo (MCMC) methods and describe some particular algorithms: Gibbs sampling, the Metropolis algorithm, and Hybrid Monte Carlo, the last being perhaps the most efficient MCMC algorithm known so far. In section 4 we define some differential geometric concepts related to the learning

Neural Computation 13, 2549–2572 (2001) © 2001 Massachusetts Institute of Technology
problem, introduce the new concept of Bayesian geometry, and describe the Manifold Stochastic Dynamics algorithm. Section 5 reports some empirical results.

2 Bayesian Learning

2.1 The General Approach. We assume a parametric class of distributions P(z | θ) of a random variable z, a prior distribution P(θ) over the parameters, and observations z⁽¹⁾, …, z⁽ⁿ⁾. By Bayes' rule, the posterior distribution of the parameters given the observations is

P(θ | z⁽¹⁾, …, z⁽ⁿ⁾) = P(z⁽¹⁾, …, z⁽ⁿ⁾ | θ) P(θ) / P(z⁽¹⁾, …, z⁽ⁿ⁾) ∝ L(θ | z⁽¹⁾, …, z⁽ⁿ⁾) P(θ),  (2.1)
where L(θ | z⁽¹⁾, …, z⁽ⁿ⁾) = P(z⁽¹⁾, …, z⁽ⁿ⁾ | θ) is the likelihood function. In the Bayesian framework, estimation is performed by integrating the quantity of interest over the posterior distribution. For example, if we are interested in the distribution of y⁽ⁿ⁺¹⁾ given the observations and x⁽ⁿ⁺¹⁾, where z⁽ⁿ⁺¹⁾ = (x⁽ⁿ⁺¹⁾, y⁽ⁿ⁺¹⁾) is an input-output pair drawn from the same distribution as the training pairs D = {z⁽ⁱ⁾ = (x⁽ⁱ⁾, y⁽ⁱ⁾), i = 1, …, n}, the Bayesian approach yields the predictive distribution

P(y⁽ⁿ⁺¹⁾ | x⁽ⁿ⁺¹⁾, D) = ∫ P(y⁽ⁿ⁺¹⁾ | x⁽ⁿ⁺¹⁾, θ) P(θ | D) dθ,  (2.2)

where P(θ | D) is given by equation 2.1. If a single-valued guess is needed, the prediction depends on the loss function being used. For example, under squared error loss, the mean of the predictive distribution is an optimal estimate of the quantity of interest in the sense that it minimizes the expected loss.

2.2 Bayesian Regression. The problem at hand is to approximate an unknown function based on observed input-output pairs D, using some given parameterized function class f(x, θ). An example of such a function class is the collection of all neural networks with a fixed architecture, where the parameters are the connection weight values. The function f(x, θ) may be used to define a conditional distribution of the output y given the input x and the parameter θ. One common choice is gaussian, with y having a mean f(x, θ) and a covariance matrix σ²I,
P(y | x, θ) = (1/(√(2π) σ)) exp(−β‖y − f(x, θ)‖²/2),  (2.3)

where ‖·‖ denotes the Euclidean norm and β = σ⁻².
When the distribution of the inputs, P(x), is not modeled, the definition of the likelihood takes the form (Neal, 1996)

L(θ | (x⁽¹⁾, y⁽¹⁾), …, (x⁽ⁿ⁾, y⁽ⁿ⁾)) = ∏_{i=1}^{n} P(y⁽ⁱ⁾ | x⁽ⁱ⁾, θ).  (2.4)
Given the observations D and the input x, the expected value of y under equations 2.2 and 2.3 is

∫ y P(y | x, D) dy = ∫∫ y P(y | x, θ) P(θ | D) dθ dy = ∫ f(x, θ) P(θ | D) dθ.  (2.5)

In the gaussian case presented by equation 2.3, if the normalizing constant √(2π) σ is discarded, the minus log-likelihood can be seen to be proportional to the total squared error,

E = Σ_{i=1}^{n} (y⁽ⁱ⁾ − f(x⁽ⁱ⁾, θ))².  (2.6)
If another error function is used instead of E, then the error e = y − f(x, θ) can be given a different distribution. One common choice is to take p(e) ∝ exp{−Err(e)}, where Err is the error function. This reduces to the gaussian case for Err = E.

3 Markov Chain Monte Carlo Methods
The calculation of the expected value of the output y according to equation 2.5 is a special case of evaluating the expectation of some random variable a(θ) with respect to a distribution Q(θ):

E[a] = ∫ a(θ) Q(θ) dθ.

In cases where the integration cannot be performed analytically or by means of standard numerical methods, we can use the approximation (Fishman, 1996)

E[a] ≈ (1/N) Σ_{t=1}^{N} a(θ⁽ᵗ⁾),

where θ⁽¹⁾, …, θ⁽ᴺ⁾ are generated by an ergodic source so that the marginal distribution p(θ⁽ᵗ⁾) is equal to Q(θ⁽ᵗ⁾).
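In its simplest independent-sampling form, the estimator above reads (an illustrative sketch; the names are ours):

```python
import random

def mc_expectation(a, draw, n):
    """Approximate E[a] by averaging a over n draws from a source whose
    marginal distribution is the target Q."""
    return sum(a(draw()) for _ in range(n)) / n
```

With independent N(0, 1) draws, mc_expectation(lambda t: t * t, draw, n) approaches E[θ²] = 1; an MCMC source is used instead when Q cannot be sampled directly.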
If Q(θ) is a "simple" (e.g., gaussian, gamma) distribution, a sample can be generated rather easily. However, in more complicated models like neural networks, this can be hard. For more complex distributions, Markov chain methods can be used for generating the sample.

3.1 Markov Chains. A Markov chain (Gilks, Richardson, & Spiegelhalter, 1996) is defined by an initial distribution for the first state of the chain, θ⁽¹⁾, and a set of transition probabilities T(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾) for a new state, θ⁽ᵗ⁺¹⁾, to follow the current state, θ⁽ᵗ⁾. An invariant (or stationary) distribution, Q, is one that persists once it is established. This invariance property can be written as follows:

Q(θ′) = ∫ T(θ′ | θ) Q(θ) dθ.  (3.1)

Invariance with respect to Q is implied by a stronger condition of detailed balance:

T(θ′ | θ) Q(θ) = T(θ | θ′) Q(θ′).  (3.2)
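Detailed balance can be verified numerically on a small discrete state-space; the sketch below builds the transition matrix of a Metropolis-type chain with a uniform proposal and checks equation 3.2 (illustrative; the names are ours):

```python
def metropolis_matrix(q):
    """Transition matrix on len(q) states: propose uniformly among the
    other states, accept a move from i to j with probability
    min(1, q[j] / q[i]); rejected mass stays on the diagonal."""
    n = len(q)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                T[i][j] = min(1.0, q[j] / q[i]) / (n - 1)
        T[i][i] = 1.0 - sum(T[i])
    return T

def satisfies_detailed_balance(T, q, tol=1e-12):
    """Check T[i][j] q[i] == T[j][i] q[j] for all pairs (equation 3.2)."""
    n = len(q)
    return all(abs(T[i][j] * q[i] - T[j][i] * q[j]) <= tol
               for i in range(n) for j in range(n))
```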
A chain satisfying the detailed balance condition is said to be reversible. A Markov chain is called ergodic if it has a unique invariant distribution, its equilibrium distribution, to which it converges from any initial distribution (Gilks et al., 1996). If an ergodic Markov chain having Q as its equilibrium distribution can be simulated, a sample can be generated by simply taking consecutive states of the chain. In the resulting sample, there are dependencies between the θ⁽ᵗ⁾, and it can be shown (Neal, 1996) that a high correlation leads to a large variance of the Monte Carlo estimate, necessitating a large sample in order to achieve reasonable accuracy. In order to estimate an expectation with respect to some distribution, Q, we need to construct a Markov chain that (1) has Q as an invariant distribution, (2) is ergodic, (3) converges to Q as rapidly as possible, and (4) has a low correlation between consecutive states. The first property is easily attainable; the others are often much more challenging. The last two properties are interrelated, and their satisfaction will be henceforth referred to as rapid exploration of the state-space. Two simple methods for constructing a Markov chain, for which at least the invariance condition holds, are Gibbs sampling (Geman & Geman, 1984) and the Metropolis algorithm (Metropolis, Rosenbluth, Teller, & Teller, 1953).

3.2 Gibbs Sampling. Gibbs sampling, known in the physics literature as the heat bath method, applies to sampling a parameter vector θ = (θ₁, …, θₚ)
from a distribution Q(θ). Presumably Q(θ) is too complex to be sampled directly, but a value of one component of θ can be sampled from the conditional distribution (under Q), given values for all the other components. It is then possible to simulate a Markov chain in which θ⁽ᵗ⁺¹⁾ is generated from θ⁽ᵗ⁾ as follows:

Pick θ₁⁽ᵗ⁺¹⁾ from Q(θ₁ | θ₂⁽ᵗ⁾, θ₃⁽ᵗ⁾, …, θₚ⁽ᵗ⁾).
Pick θ₂⁽ᵗ⁺¹⁾ from Q(θ₂ | θ₁⁽ᵗ⁺¹⁾, θ₃⁽ᵗ⁾, …, θₚ⁽ᵗ⁾).
…
Pick θⱼ⁽ᵗ⁺¹⁾ from Q(θⱼ | θ₁⁽ᵗ⁺¹⁾, …, θⱼ₋₁⁽ᵗ⁺¹⁾, θⱼ₊₁⁽ᵗ⁾, …, θₚ⁽ᵗ⁾).
…
Pick θₚ⁽ᵗ⁺¹⁾ from Q(θₚ | θ₁⁽ᵗ⁺¹⁾, θ₂⁽ᵗ⁺¹⁾, …, θₚ₋₁⁽ᵗ⁺¹⁾).
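For a concrete instance, the update scheme above applied to a zero-mean bivariate gaussian with unit variances and correlation ρ, whose full conditionals are N(ρ · other, 1 − ρ²) (an illustrative sketch, not from the article):

```python
import random

def gibbs_bivariate_gaussian(rho, n_steps, seed=0):
    """Alternately sample each component from its full conditional."""
    rng = random.Random(seed)
    t1 = t2 = 0.0
    sd = (1.0 - rho * rho) ** 0.5
    chain = []
    for _ in range(n_steps):
        t1 = rng.gauss(rho * t2, sd)   # theta_1 | theta_2
        t2 = rng.gauss(rho * t1, sd)   # theta_2 | theta_1
        chain.append((t1, t2))
    return chain
```

The empirical correlation of a long chain approaches ρ; when ρ is close to 1, the component-wise moves become small, illustrating the slow exploration of strongly correlated targets.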
It can be shown (Geman & Geman, 1984) that such transitions leave the desired distribution Q invariant, but the resulting Markov chain is not necessarily ergodic. Moreover, due to the component-wise updating, the exploration of the state-space can be slow. Although it is relatively simple, Gibbs sampling is not applicable to neural networks because of the rather complex form of the associated posterior distribution. It can be used, however, as a component in a more complex algorithm.

3.3 Metropolis Algorithm. In the Markov chain defined by the Metropolis algorithm, a new state, θ⁽ᵗ⁺¹⁾, is generated from the previous state, θ⁽ᵗ⁾, in the following manner:
1. Generate a candidate state, θ*, from a proposal distribution with a conditional density S(θ* | θ⁽ᵗ⁾) that satisfies S(θ | θ′) = S(θ′ | θ) for all θ, θ′.
2. Accept θ* as the next state θ⁽ᵗ⁺¹⁾ with probability q = min{1, Q(θ*)/Q(θ⁽ᵗ⁾)}, or reject θ* with probability 1 − q and set θ⁽ᵗ⁺¹⁾ = θ⁽ᵗ⁾.

The form of the acceptance probability ensures that the Markov chain produced by the Metropolis algorithm is reversible with respect to the distribution Q(θ) (Neal, 1996). The choice of the candidate distribution is crucial to the ergodicity of the resulting Markov chain and the efficiency of the sampling. One common choice is a gaussian distribution centered at the current state. This means that a random walk is performed in the state-space. The random walk nature of the algorithm leads to a slow exploration of the state-space. For practical purposes, a proposal distribution better adapted to the target distribution, Q(θ), is needed. Next we discuss proposal distributions based
on dynamical methods, which lead to a more systematic exploration of the state-space.

3.4 Dynamical Methods
3.4.1 Canonical Distribution. The problem at hand can be rephrased in terms of sampling from the canonical distribution for the state, q, of a "physical" system, defined in terms of a "potential energy" function E(q) (Neal, 1996):

P(q) ∝ exp(−E(q)).  (3.3)
Note that any probability density that is nowhere zero can be put in this form by simply defining E(q) = −log P(q) − log Z, for any convenient Z. For the regression problem described above, one can take E(q) = βE − log p(q), where p(q) is the prior distribution. The sampling from the canonical distribution can be carried out as follows. A "momentum" variable, p, is introduced, with the same dimension as q. The canonical distribution over the extended phase-space is defined as

P(q, p) ∝ exp(−H(q, p)),  (3.4)
where H(q, p) = E(q) + K(p) is the Hamiltonian, which represents the total energy. K(p) is the kinetic energy due to momentum, defined as

K(p) = Σ_{i=1}^{n} pᵢ²/(2mᵢ),  (3.5)
where pᵢ, i = 1, …, n, are the momentum components and mᵢ is the mass associated with the ith component, so that different components can be given different weights. In this canonical distribution, p and q are independent, q having the marginal distribution ∝ exp(−E(q)) and p the gaussian distribution (the ith component, pᵢ, is distributed according to N(0, mᵢ), and different components are independent). Now we can proceed by defining a Markov chain that converges to the canonical distribution for q and p and then simply ignore the p values (Neal, 1996).

3.4.2 Stochastic Dynamics Method. Sampling from the canonical distribution can be done using the stochastic dynamics method (Andersen, 1980), in which the task is split into two subtasks: sampling uniformly from values of q and p with a fixed total energy, H(q, p), and sampling states with different values of H. The first task is performed by simulating the Hamiltonian
dynamics of the system:

dqᵢ/dt = ∂H/∂pᵢ = pᵢ/mᵢ,
dpᵢ/dt = −∂H/∂qᵢ = −∂E/∂qᵢ,  (3.6)
where t is the time variable. It can be shown (Arnold, 1984) that the dynamical transitions leave the canonical distribution invariant. In practice, Hamiltonian dynamics cannot be simulated exactly, but can be approximated by some discretization using finite time steps. One common approximation is the leapfrog discretization (Neal, 1996), where the state variables are updated in the following way:

pᵢ(t + ε/2) = pᵢ(t) − (ε/2) ∂E(q(t))/∂qᵢ
qᵢ(t + ε) = qᵢ(t) + ε pᵢ(t + ε/2)/mᵢ
pᵢ(t + ε) = pᵢ(t + ε/2) − (ε/2) ∂E(q(t + ε))/∂qᵢ.

Different energy levels are obtained by occasional stochastic Gibbs sampling of the momentum. Since q and p are independent, p may be updated without reference to q by drawing a value with probability density proportional to exp(−K(p)), which, in the case of equation 3.5, can be easily done, since the pᵢ's have independent gaussian distributions.

3.4.3 The Hybrid Monte Carlo Method. In the HMC method (Duane et al., 1987), stochastic dynamic transitions are used to generate candidate states for the Metropolis algorithm. In order to ensure the symmetry of the proposal distribution, the momentum is negated after every deterministic dynamic transition. Adding the random rejection option of the Metropolis algorithm eliminates certain drawbacks of the stochastic dynamics, such as systematic errors due to the leapfrog discretization, since the algorithm ensures that every transition keeps the canonical distribution invariant. However, an empirical comparison between the uncorrected stochastic dynamics and HMC in application to Bayesian learning in neural networks showed that, with an appropriate discretization step size, there is no notable difference between the two methods (Neal, 1996). A modification proposed in Horowitz (1991) to the Gibbs sampling of the momentum is to replace p each time by p·cos(θ) + f·sin(θ), where θ is a small angle and f is distributed according to N(0, diag{m₁, …, mₙ}). While
Mark Zlochin and Yoram Baram
keeping the canonical distribution invariant, this scheme, called momentum persistence, is reported to improve the rate of exploration.

4 Manifold Stochastic Dynamics
The energy landscape in many regression problems is anisotropic. This degrades the performance of HMC in two respects:

1. The dynamics may not be optimal for efficient exploration of the posterior distribution, as suggested by recent studies of gaussian diffusions (Hwang, Hwang-Ma, & Shen, 1993).

2. The resulting differential equations are stiff (Gear, 1971), leading to large discretization errors, which in turn necessitate small time steps, implying that the computational burden is high.

Both problems disappear if, instead of the parameter-space Hamiltonian dynamics used in HMC, we simulate dynamics over a "function" space for which the energy is isotropic and quadratic. This point is best demonstrated by the following simple example. Let us consider the following regression model,
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \theta_1 x_1 \\ \theta_2 x_2 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \end{pmatrix}, \tag{4.1}$$
where the $\epsilon_i$ are zero-mean unit-variance gaussian noise variables. Given a uniform (improper) prior over $(\theta_1, \theta_2)$, an input $(x_1, x_2) = (1, 10)$, and an output $(y_1, y_2)$, the minus log-posterior is (up to a constant)

$$E(\theta_1, \theta_2) = \frac{1}{2}\left((\theta_1 - y_1)^2 + 100\,(\theta_2 - y_2)^2\right). \tag{4.2}$$
For the sake of simplicity, let us assume that the output is $(0, 0)$; hence the minus log-posterior simplifies to

$$E(\theta_1, \theta_2) = \frac{1}{2}\left(\theta_1^2 + 100\,\theta_2^2\right). \tag{4.3}$$
With the Hamiltonian $H = E(\theta_1, \theta_2) + \frac{1}{2}(p_1^2 + p_2^2)$, the dynamics take the form

$$\frac{d\theta_i}{dt} = p_i, \qquad \frac{dp_i}{dt} = -\frac{\partial E}{\partial \theta_i}, \tag{4.4}$$

where $t$ is the time variable.
Manifold Stochastic Dynamics for Bayesian Learning
Figure 1: Exact trajectory (thick line) and leapfrog discretization with 30 steps (stars) for the Hamiltonian dynamics with $E(\theta_1, \theta_2) = \frac{1}{2}(\theta_1^2 + 100\,\theta_2^2)$. The dashed lines show constant energy contours.
As can be seen from Figure 1, a typical trajectory induced by the parameter-space Hamiltonian dynamics is highly oscillatory, and even with relatively small discretization steps, the errors are still large. The overall effect is that it takes a large number of iterations to move to a substantially different region. If, instead of the variables $\theta_1, \theta_2$, the transformed "function-space" version,

$$\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} \theta_1 x_1 \\ \theta_2 x_2 \end{pmatrix} = \begin{pmatrix} \theta_1 \\ 10\,\theta_2 \end{pmatrix}, \tag{4.5}$$

is used, then the energy in this function space is simply

$$\tilde{E}(f_1, f_2) = \frac{1}{2}\left(f_1^2 + f_2^2\right). \tag{4.6}$$
In these new coordinates, the Hamiltonian can be defined as

$$\tilde{H} = \tilde{E}(f_1, f_2) + \frac{1}{2}\left(\tilde{p}_1^2 + \tilde{p}_2^2\right), \tag{4.7}$$

and the Hamiltonian dynamics are described by the following system of differential equations:

$$\frac{df_i}{dt} = \tilde{p}_i, \qquad \frac{d\tilde{p}_i}{dt} = -\frac{\partial \tilde{E}}{\partial f_i} = -f_i. \tag{4.8}$$
These dynamics can be translated back to the coordinates $\theta_1, \theta_2$ using equation 4.5:

$$\frac{d\theta_1}{dt} = \frac{df_1}{dt} = \tilde{p}_1, \qquad \frac{d\theta_2}{dt} = \frac{1}{10}\frac{df_2}{dt} = \frac{1}{10}\tilde{p}_2,$$
$$\frac{d\tilde{p}_1}{dt} = -f_1 = -\theta_1, \qquad \frac{d\tilde{p}_2}{dt} = -f_2 = -10\,\theta_2. \tag{4.9}$$

By setting $p_1 = \tilde{p}_1$, $p_2 = 10\,\tilde{p}_2$, it can be easily verified that equation 4.9 is equivalent to the Hamiltonian dynamics resulting from

$$\hat{H} = E(\theta_1, \theta_2) + \frac{1}{2}\left(p_1^2 + \frac{1}{100}\,p_2^2\right). \tag{4.10}$$
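The stability gain from the rescaled kinetic energy in equation 4.10 can be checked numerically. The following sketch (our own illustration, not code from the paper) runs leapfrog on $E(\theta_1, \theta_2) = \frac{1}{2}(\theta_1^2 + 100\,\theta_2^2)$ with unit masses and with the masses $(1, 100)$ implied by equation 4.10:

```python
import numpy as np

def leapfrog(theta, p, eps, n_steps, minv):
    """Leapfrog integration of E(theta) = 0.5*(theta_1^2 + 100*theta_2^2)
    with kinetic energy 0.5 * sum_i minv_i * p_i^2 (minv = inverse masses)."""
    grad = lambda th: np.array([th[0], 100.0 * th[1]])
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad(theta)
        theta = theta + eps * minv * p
        p = p - 0.5 * eps * grad(theta)
    return theta, p

theta0 = np.array([1.0, 0.1])

# Euclidean kinetic energy (masses 1, 1): the fast mode has frequency 10,
# so leapfrog is unstable for eps > 2/10 = 0.2 and the trajectory diverges.
th_euclid, _ = leapfrog(theta0, np.zeros(2), eps=0.25, n_steps=40,
                        minv=np.array([1.0, 1.0]))

# Kinetic energy of eq. 4.10 (masses 1, 100): both modes oscillate at unit
# frequency, and the same step size remains stable.
th_func, _ = leapfrog(theta0, np.zeros(2), eps=0.25, n_steps=40,
                      minv=np.array([1.0, 0.01]))
```

With unit masses the step size 0.25 exceeds the leapfrog stability limit $2/\omega = 0.2$ of the fast mode, so the trajectory blows up; with the rescaled masses the same step size stays bounded.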
The only difference between the Euclidean Hamiltonian and equation 4.10 is the form of the kinetic energy. Figure 2 shows a typical trajectory induced by the dynamics in the function space. The trajectory is smooth, and the discretization errors are small; as a result, it takes only five iterations to cross the region of low energy.

Figure 2: Exact trajectory (thick line) and leapfrog discretization with five steps (stars) for the dynamics over the function space.

In this example, explicit forward and backward transformations between the "coordinate" representation, $\theta_1, \theta_2$, and the function-space representation, $f_1, f_2$, were possible. In section 4.4 we show that it is possible to simulate the function-space dynamics without those explicit transformations, but rather by using an appropriate kinetic energy form.

4.1 Riemannian Geometry. Before we proceed any further, let us introduce some differential geometric concepts that are necessary for describing the Manifold Stochastic Dynamics algorithm. A Riemannian manifold (Chavel, 1993) is a set $\Theta \subseteq \mathbb{R}^n$ equipped with a metric tensor $G$, which is a positive semidefinite matrix defining the inner product between infinitesimal increments as
$$\langle d\theta_1, d\theta_2 \rangle_G = d\theta_1^T \cdot G \cdot d\theta_2. \tag{4.11}$$

This inner product naturally gives us the norm

$$\|d\theta\|_G^2 = \langle d\theta, d\theta \rangle_G = d\theta^T \cdot G \cdot d\theta. \tag{4.12}$$
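In coordinates, equations 4.11 and 4.12 are ordinary weighted dot products. A minimal sketch (ours, with an arbitrary toy metric):

```python
import numpy as np

G = np.array([[2.0, 0.0], [0.0, 0.5]])   # a toy metric tensor
inner = lambda u, v: u @ G @ v            # inner product of eq. 4.11
norm_sq = lambda u: inner(u, u)           # squared norm of eq. 4.12

u = np.array([1.0, 2.0])
# norm_sq(u) = 2*1^2 + 0.5*2^2 = 4
```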
The Jeffrey prior over $\Theta$, which is a generalization of the uniform distribution, is defined by the density function

$$\pi(\theta) \propto \sqrt{|G(\theta)|}, \tag{4.13}$$

where $|\cdot|$ denotes the determinant (Amari, 1997). It should be noted that the Jeffrey prior may be improper; it does not necessarily integrate to unity (in fact, it can even integrate to infinity).

4.2 Riemannian Geometry of Functions. The choice of the metric tensor is problem dependent, and the question is whether there are any rules for choosing a metric tensor that yields a "natural" geometric structure. We start with a natural distance measure and proceed by defining a metric tensor that is consistent with this distance measure. A natural measure of fitness between the model and the data, adopted in statistics, is the minus log-likelihood function (Barndorff-Nielsen, 1988). In regression, under the gaussian assumption, the minus log-likelihood is proportional to the total squared error,¹ which is simply the squared Euclidean distance between the target point, $(y^{(1)}, \ldots, y^{(l)})$, and the candidate function evaluated at the sample: $E = \sum_{i=1}^{l} (f(x^{(i)}, \theta) - y^{(i)})^2$. More generally, a natural distance measure between models is the Euclidean distance between the corresponding functions restricted to the sample points:
$$\mathrm{dist}(\theta^{(1)}, \theta^{(2)}) = \left\| f_{\theta^{(1)}} - f_{\theta^{(2)}} \right\|_l = \sqrt{\sum_{i=1}^{l} \left( f(x^{(i)}, \theta^{(1)}) - f(x^{(i)}, \theta^{(2)}) \right)^2}. \tag{4.14}$$

¹ As mentioned in section 2.2, the normalizing constant is discarded, since we are interested only in log-likelihood differences, not absolute values.
Hence, the metric tensor must satisfy

$$\langle d\theta, d\theta \rangle_G = (df, df) = (J \cdot d\theta)^T \cdot (J \cdot d\theta) = d\theta^T \cdot J^T \cdot J \cdot d\theta, \tag{4.15}$$

where $J = \left[\frac{\partial f(x^{(i)})}{\partial \theta_j}\right]$ is the Jacobian matrix and $(\cdot, \cdot)$ denotes the regular inner product. The resulting metric tensor is²

$$G = J^T \cdot J = \sum_{i=1}^{l} \nabla_\theta f(x^{(i)}, \theta) \cdot \nabla_\theta f(x^{(i)}, \theta)^T, \tag{4.16}$$

where $\nabla_\theta$ denotes the gradient. For vector-valued functions, a similar argument can be applied. The expression for the metric tensor is analogous to the second part of equation 4.16, the only difference being that the summation is now over both sample points and function coordinates.
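For a concrete check, the two forms of equation 4.16 can be compared against each other. A sketch (ours, with a hypothetical toy model and a finite-difference Jacobian):

```python
import numpy as np

def jacobian(f, theta, xs, h=1e-6):
    """Central-difference Jacobian J[i, j] = d f(x_i, theta) / d theta_j."""
    J = np.zeros((len(xs), len(theta)))
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = h
        J[:, j] = [(f(x, theta + e) - f(x, theta - e)) / (2 * h) for x in xs]
    return J

# Toy model f(x, theta) = theta_1 * x + theta_2 * x^2 at three sample points.
f = lambda x, th: th[0] * x + th[1] * x ** 2
xs = [0.5, 1.0, 2.0]
theta = np.array([1.0, -0.3])

J = jacobian(f, theta, xs)
G = J.T @ J                                           # first form of eq. 4.16
G_sum = sum(np.outer(J[i], J[i]) for i in range(len(xs)))  # second form
```

Both forms produce the same positive semidefinite matrix; the sum-of-outer-products form makes the per-sample contributions explicit.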
where rh denotes the gradient. For vector-valued functions, a similar argument can be applied. The expression for the metric tensor is analogous to the second part of equation 4.16, the only difference being that the summation is now over both sample points and function coordinates. 4.3 Bayesian Geometry. A Bayesian approach would suggest that the manifold geometry should be affected by the prior assumptions about the parameters. As a simple example, let us consider a gaussian prior distribution N ( 0, I / a) . The minus log-posterior given the data D can be written as
¡ log p(h |D) D ¡ log p( D|h ) ¡ log p (h ) , where, again, we keep only the terms involving h and discard the normalizing constant p ( D ). Substituting explicit expressions for the gaussian log-likelihood and the log-prior yields ( ¡ log p(h |D) D
1 2
b
l X i D1
( f ( x ( i) , h ) ¡ y ( i) ) 2 C a
n X kD 1
) (h k ¡ 0) 2 ,
(4.17)
where $\beta$ is the inverse error variance. If we represent the model $\theta$ by the vector $(f(x^{(1)}, \theta), \ldots, f(x^{(l)}, \theta), \theta_1, \ldots, \theta_n)$ and the data by the vector $(y^{(1)}, \ldots, y^{(l)}, 0, \ldots, 0)$, then the minus log-posterior is simply half of the squared weighted distance between the two representations. Generalizing the expression for the distance between

² This is the only symmetric matrix that satisfies condition 4.15.
a model and the data to the distance between two models, the following metric is obtained:

$$\mathrm{dist}(\theta^{(1)}, \theta^{(2)}) = \sqrt{\beta \sum_{i=1}^{l} \left(f(x^{(i)}, \theta^{(1)}) - f(x^{(i)}, \theta^{(2)})\right)^2 + \alpha \sum_{k=1}^{n} \left(\theta_k^{(1)} - \theta_k^{(2)}\right)^2}, \tag{4.18}$$
with the metric tensor

$$G_B = \hat{J}^T \cdot \hat{J} = \beta \cdot G + \alpha \cdot I, \tag{4.19}$$

where $\hat{J}$ is the extended Jacobian,

$$\hat{J}_{ij} = \begin{cases} \sqrt{\beta}\,\dfrac{\partial f(x^{(i)})}{\partial \theta_j}, & i \le l, \\[4pt] \sqrt{\alpha}\,\delta_{i-l,\,j}, & i > l, \end{cases} \tag{4.20}$$

where $\delta_{i,j}$ is the Kronecker delta. The derivation of the metric tensor is similar to the one in the previous section.

Note that as $\alpha \to 0$, $G_B \to \beta G$; that is, as the prior becomes vaguer, we approach a non-Bayesian paradigm. If, on the other hand, $\alpha \to \infty$ or $\beta \cdot G \to 0$, the Bayesian geometry approaches the Euclidean geometry of the parameter space. These are the qualities that we would like the Bayesian geometry to have: if the prior is "strong" in comparison to the likelihood, the prior should also have a stronger influence on the form of the metric tensor.

The definitions above can be applied to any log-concave prior distribution, with the inverse Hessian of the log-prior, $(\nabla_\theta \nabla_\theta \log p(\theta))^{-1}$, replacing $\alpha I$ in equation 4.19. The framework is not restricted to regression. For a general distribution class, it is natural to use the Fisher information matrix, $I = E\{\nabla_\theta \log p(z, \theta) \cdot \nabla_\theta \log p(z, \theta)^T\}$, as a metric tensor (Amari, 1997). The Bayesian metric tensor then becomes

$$G_B = I + (\nabla_\theta \nabla_\theta \log p(\theta))^{-1}. \tag{4.21}$$
When the Fisher information matrix cannot be calculated, it may be replaced by the "empirical information" matrix $\hat{I} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \log p(z_i, \theta) \cdot \nabla_\theta \log p(z_i, \theta)^T$, where $\{z_i\}$ are the sample points.

4.4 Hamiltonian Dynamics over a Bayesian Manifold. While in the linear regression example the dynamics could be simulated directly by working with the new variables, $f_1, f_2$, this is not always possible. However, in order to calculate the trajectory induced by the dynamics in a general function space, there is no need to work in the function space explicitly. Instead,
we can simulate the Hamiltonian dynamics over the Riemannian manifold equipped with the metric tensor given by equation 4.16 or 4.19 (which are consistent with the appropriate function-space metrics). Before presenting the general solution to this problem, let us take another look at the regression example. The definition 4.16 produces the following metric tensor:

$$G = J^T \cdot J = \begin{pmatrix} 1 & 0 \\ 0 & 10 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 10 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}. \tag{4.22}$$

Hence, the Hamiltonian defined by equation 4.10 can be written as

$$\hat{H} = E(\theta_1, \theta_2) + \frac{1}{2}\,p^T \cdot \begin{pmatrix} 1 & 0 \\ 0 & 0.01 \end{pmatrix} \cdot p = E(\theta_1, \theta_2) + \frac{1}{2}\,p^T \cdot G^{-1} \cdot p, \tag{4.23}$$
and the resulting dynamics take the form

$$\frac{d\theta}{dt} = \frac{\partial H}{\partial p} = G^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial \theta} = -\frac{\partial E}{\partial \theta}. \tag{4.24}$$
Using the identity $\dot{\theta} = G^{-1} p$, the kinetic energy can be rewritten as

$$\frac{1}{2}\,p^T \cdot G^{-1} \cdot p = \frac{1}{2}\,\dot{\theta}^T \cdot G \cdot \dot{\theta}, \tag{4.25}$$
meaning that the kinetic energy is measured using the Riemannian metric.

Let us now return to the general case with state variable $q$ and metric tensor $G(q)$. As the previous example showed, the only difference between the Euclidean Hamiltonian and the function-space Hamiltonian is that the kinetic energy in the latter case is measured in the function space rather than in the Euclidean parameter space. The same idea can be applied to any Riemannian space (Arnold, 1984). Let us define the Hamiltonian as in the second part of equation 4.23:

$$H(q, p) = E(q) + \frac{1}{2}\,p^T \cdot G^{-1} \cdot p. \tag{4.26}$$
The dynamics are now governed by the Hamilton equations

$$\frac{dq}{dt} = \frac{\partial H}{\partial p} = G^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\frac{\partial E}{\partial q} - \frac{1}{2}\,p^T \cdot \frac{\partial G^{-1}}{\partial q} \cdot p, \tag{4.27}$$
and, as in the previous example, the kinetic energy term can be rewritten as

$$K = \frac{1}{2}\,p^T \cdot G^{-1} \cdot p = \frac{1}{2}\,\dot{q}^T \cdot G \cdot \dot{q}, \tag{4.28}$$
that is, the kinetic energy is half of the squared velocity length, measured in the Riemannian metric. In the regression case, this means that the velocity magnitude is measured in the function space, as intended.

In the canonical distribution $P(q, p) \propto \exp(-H(q, p))$, where $H(q, p)$ is defined by equation 4.26, the conditional distribution of $p$ given $q$ is a zero-mean gaussian with the covariance matrix $G(q)$, and the marginal distribution over $q$ is proportional to $\exp(-E(q))\,\pi(q)$. Since, in general, $\pi(q)$ is not constant, the resulting distribution is not the required one. One way to deal with this problem is to use a corrected energy function,

$$E'(q) = E(q) + \log \pi(q), \tag{4.29}$$

where $\pi(q)$ is the Jeffrey prior defined by equation 4.13. With this new energy function, the marginal distribution over $q$ is proportional to $\exp(-E'(q))\,\pi(q) = \exp(-E(q))$, as desired.

4.5 Manifold Stochastic Dynamics Algorithm. Solving the Hamilton equations 4.27 requires calculating the derivatives of the inverse metric tensor (or, alternatively, of the metric tensor itself), which may be time-consuming in general. However, in the particular case of regression, we obtain from the definition $G_B = \hat{J}^T \cdot \hat{J}$ an alternative second-order equation in matrix form,
$$\frac{d^2 q}{dt^2} = -G_B^{-1}\left(\hat{J}^T\,\frac{d\hat{J}}{dt}\,\dot{q} + \nabla E'\right), \tag{4.30}$$

in which all the quantities can be efficiently calculated. Note that the modification of the energy function necessitates calculation of $\nabla \log \pi(q)$. To this end, the following identities are used:

$$\frac{\partial \log \pi(q)}{\partial q_i} = \frac{1}{2}\,\frac{\partial \log \det G_B}{\partial q_i} = \frac{1}{2}\operatorname{trace}\left(G_B^{-1}\,\frac{\partial G_B}{\partial q_i}\right) = \operatorname{trace}\left(G_B^{-1}\,\hat{J}^T\,\frac{\partial \hat{J}}{\partial q_i}\right) = \beta\operatorname{trace}\left(G_B^{-1}\,J^T\,\frac{\partial J}{\partial q_i}\right). \tag{4.31}$$
When presented in this form, $\nabla \log \pi(q)$ can be calculated using a modification of the R{backprop} algorithm of Pearlmutter (1994). This algorithm was developed for fast multiplication of an arbitrary vector by the Hessian of a general class of error functions, and it can be shown that the rightmost term of equation 4.31 can be represented as a sum of such vector-Hessian
products. The details of the implementation are beyond the scope of this article.

Because of the wide dispersion (about six orders of magnitude in the empirical study reported below) of the eigenvalues of the metric tensor $G_B$ (and, correspondingly, of the Hessian of the error function), direct discretization of equation 4.30 is numerically unstable. The stability problem can be solved by representing the velocity vector $\dot{q}$ in a local $G_B$-orthogonal basis rather than in regular coordinates.³ This leads to an equivalent pair of equations:

$$\frac{dq}{dt} = R^{-1} \hat{p}, \qquad \frac{d\hat{p}}{dt} = -\left(R^T\right)^{-1} \nabla E' - Q^T \frac{dQ}{dt}\,\hat{p}, \tag{4.32}$$

where $\hat{J} = QR$, with $Q$ orthogonal and $R$ upper triangular, and $\hat{p} = R\dot{q}$. The matrix $R$ can be calculated using the Cholesky decomposition (Golub & van Loan, 1996) of the metric tensor $G_B$. This change of basis leads to a significant improvement of the numerical stability of the simulation. The kinetic energy now becomes simply $\frac{1}{2}\hat{p}^T \cdot \hat{p}$, and the marginal distribution of $\hat{p}$ under the canonical distribution is $N(0, I)$.

Sampling from the canonical distribution is performed by repeating the following two steps:

1. Simulate the Hamiltonian dynamics, equation 4.32, over a Bayesian manifold using the following variation of the midpoint method (Gear, 1971):

$$q\left(t + \frac{\epsilon}{2}\right) = q(t) + \frac{\epsilon}{2}\,R_t^{-1}\,\hat{p}(t)$$
$$\hat{p}\left(t + \frac{\epsilon}{2}\right) = \hat{p}(t) - \frac{\epsilon}{2}\left(R_t^T\right)^{-1} \nabla E'_t - \Delta\left(t, t + \frac{\epsilon}{2}\right)$$
$$q(t + \epsilon) = q(t) + \epsilon\,R_{t+\epsilon/2}^{-1}\,\hat{p}\left(t + \frac{\epsilon}{2}\right) \tag{4.33}$$
$$\hat{p}(t + \epsilon) = \hat{p}(t) - \frac{\epsilon}{2}\left(\left(R_t^T\right)^{-1} \nabla E'_t + \left(R_{t+\epsilon}^T\right)^{-1} \nabla E'_{t+\epsilon}\right) - \Delta(t, t + \epsilon),$$

where

$$\Delta\left(t, t + \frac{\epsilon}{2}\right) = \left(\frac{Q_t + Q_{t+\epsilon/2}}{2}\right)^T \left(Q_{t+\epsilon/2} - Q_t\right) \hat{p}(t)$$
$$\Delta(t, t + \epsilon) = \left(\frac{Q_t + Q_{t+\epsilon}}{2}\right)^T \left(Q_{t+\epsilon} - Q_t\right) \hat{p}\left(t + \frac{\epsilon}{2}\right)$$

approximate $\int_t^{t+\epsilon/2} Q_{t'}^T \frac{dQ_{t'}}{dt'}\,\hat{p}(t')\,dt'$ and $\int_t^{t+\epsilon} Q_{t'}^T \frac{dQ_{t'}}{dt'}\,\hat{p}(t')\,dt'$, respectively. When $G$ is constant, the algorithm 4.33 is equivalent to the leapfrog algorithm.

2. Resample $\hat{p}$ using the momentum persistence scheme described in section 3.4.3. The current value of $\hat{p}$ is replaced by $\hat{p} \cdot \cos(\theta) + \zeta \cdot \sin(\theta)$, where $\theta$ is a small angle and $\zeta$ is distributed according to $N(0, I)$.

³ In the differential geometry literature, this approach is called the method of moving frames (Chavel, 1993).

It should be noted that, similarly to the stochastic dynamics described in section 3.4.2, the Manifold Stochastic Dynamics (MSD) algorithm is not exact, due to the discretization errors. However, as noted in section 3.4.3, the empirical comparison in Neal (1996) showed no significant difference between the HMC algorithm and the uncorrected stochastic dynamics. Similarly, we may expect that with a reasonable choice of discretization step size, the systematic error in MSD will be negligible.

4.6 Scaling with the Number of Parameters and the Sample Size. We would like to compare the time complexities of MSD and HMC for the problem of Bayesian learning with multilayered perceptrons. A single iteration of MSD involves several matrix operations (such as the Cholesky decomposition) on matrices of size $n \times n$ and multiplication of matrices of size $m \times n$. It can be shown (Golub & van Loan, 1996) that the overall complexity of a single iteration is $O(n^2(m + n))$. On the other hand, a single iteration of HMC using a trajectory of length $k$ involves $k$ calculations of the error gradient, yielding an overall complexity of $O(kmn)$.

Neal (1992) showed that when the energy is quadratic (i.e., the distribution is gaussian), $k$ should be chosen to be $O(s_{\max}/s_{\min})$, where $s_{\max}, s_{\min}$ are the square roots of the maximal and the minimal eigenvalues of the energy Hessian. Similar reasoning implies that in the general case, $k$ should be $O(\kappa)$, where $\kappa = \sqrt{c_{\max}/c_{\min}}$, and $c_{\max}, c_{\min}$ are the maximal and
the minimal absolute values of the eigenvalues of the Hessian. Note that the Hessian is no longer constant; hence, $\kappa$ may vary over the parameter space. To the best of our knowledge, there are no theoretical results on the scaling of $\kappa$ (which is known in matrix theory as the condition number). Therefore, we have estimated the dependence of $\kappa$ on the sample size and the number of parameters in the following experiment. We considered a multilayer perceptron (MLP) with one input node, one output node, and a single hidden layer consisting of $H$ nodes. For every $m$ and $H$ in the range $\{2^i,\ i = 2, \ldots, 8\}$, 100 parameter vectors were generated from the gaussian prior $N(0, I)$. For every generated network, a data set was created by taking $m$ inputs distributed according to $N(0, 1)$, propagating the inputs through the network, and adding gaussian noise $N(0, 0.01)$. Then the Hessian of the log-posterior was evaluated using the R{backprop} algorithm (Pearlmutter, 1994), and $\kappa$ was calculated.
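The quantity estimated in this experiment can be sketched as follows (our illustration; a dense eigendecomposition is used here rather than the R{backprop}-based evaluation of the Hessian used in the paper):

```python
import numpy as np

def kappa(H):
    """kappa = sqrt(c_max / c_min), with c_max, c_min the maximal and
    minimal absolute eigenvalues of a symmetric Hessian H."""
    ev = np.abs(np.linalg.eigvalsh(H))
    return np.sqrt(ev.max() / ev.min())

# Sanity check on the toy quadratic of eq. 4.3, whose Hessian is diag(1, 100):
k_toy = kappa(np.diag([1.0, 100.0]))   # sqrt(100 / 1) = 10
```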
Figure 3: Average (over 100 trials) $\log_2 \kappa$ as a function of $\log_2 H$ for $m = 2^2, 2^4, 2^6, 2^8$ (left) and as a function of $\log_2 m$ for $H = 2^2, 2^4, 2^6, 2^8$ (right).
As Figure 3 shows, $\log_2 \kappa$ seems to be an affine function of both $\log_2 H$ and $\log_2 m$ (the fluctuations are due to the randomness in this experiment). The slopes of the graphs on the left are close to 1, which implies that for a fixed $m$, the condition number $\kappa$ is $O(H) = O(n)$. Similarly, the slopes of the graphs on the right are around 0.3; that is, for a fixed $H$, the condition number $\kappa$ is roughly $O(m^{0.3})$.

Since the trajectory length $k$ should be $O(\kappa)$, our experiment suggests $k = O(nm^{0.3})$; hence, the complexity of an iteration of HMC is $O(n^2 m^{1.3})$, as opposed to the $O(n^2(m + n))$ complexity of MSD. In order to compare the time complexities of these two algorithms, the joint behavior of $m$ and $n$ needs to be specified. Let us consider three particular cases of interest:

1. First, let us assume that the number of parameters, $n$, remains constant and the sample size, $m$, goes to infinity. In this case, the complexities of MSD and HMC are $O(m)$ and $O(m^{1.3})$, respectively; that is, MSD is preferable.

2. Next, let us consider the case where the number of parameters grows linearly with the sample size: $n = O(m)$. For this scenario, the complexities are $O(m^3)$ and $O(m^{3.3})$, and MSD is still slightly better.

3. Finally, we may also assume that the sample size is fixed and the number of parameters goes to infinity. This approach is advocated in Neal (1996), where it is proposed that the network size should be taken to be as large as is computationally possible. In this case, the complexities are $O(n^3)$ and $O(n^2)$, respectively, and HMC is asymptotically preferable.
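Before turning to related work, the integrator of section 4.5 can be made concrete in the special case of a constant metric, where the $\Delta$ correction terms vanish (because $Q$ is constant) and scheme 4.33 reduces to leapfrog in the rotated momentum $\hat{p} = R\dot{q}$. A minimal sketch (ours, on a toy quadratic energy):

```python
import numpy as np

def msd_step_constant_G(q, p_hat, grad_Ep, R, eps):
    """One step of scheme 4.33 for a constant metric G = R^T R: leapfrog
    in the rotated momentum p_hat = R qdot, with the Delta terms dropped."""
    Rinv = np.linalg.inv(R)
    p_half = p_hat - 0.5 * eps * Rinv.T @ grad_Ep(q)
    q_new = q + eps * Rinv @ p_half
    p_new = p_half - 0.5 * eps * Rinv.T @ grad_Ep(q_new)
    return q_new, p_new

# Anisotropic quadratic E(q) = 0.5*(q1^2 + 100 q2^2); take G equal to the
# Hessian, with upper-triangular Cholesky factor R = diag(1, 10).
grad = lambda x: np.array([x[0], 100.0 * x[1]])
R = np.linalg.cholesky(np.diag([1.0, 100.0])).T
q, p_hat = np.array([1.0, 0.1]), np.zeros(2)
for _ in range(100):
    q, p_hat = msd_step_constant_G(q, p_hat, grad, R, eps=0.1)
```

In the rotated coordinates both modes oscillate at unit frequency, so a step size of 0.1 conserves the Hamiltonian $E(q) + \frac{1}{2}\hat{p}^T\hat{p}$ to high accuracy, whereas a direct Euclidean leapfrog on this energy would need a roughly ten times smaller step.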
4.7 Relation to Previous Work. The idea of using in MCMC a metric different from the Euclidean one is not entirely new. For example, Bennett (1975)⁴ proposed to use a kinetic energy of the form $\frac{1}{2}\sum_{i,j=1}^{N} \dot{q}_i M_{ij} \dot{q}_j$, where $M$ is some positive-definite symmetric matrix. However, that article is mainly concerned with the case where the matrix $M$ is constant, and its treatment of the nonconstant case is ad hoc. In addition, Neal's (1996) choice of the discretization step sizes, using a diagonal approximation to the Hessian matrix, can be considered an approximation to MSD using a diagonal metric tensor. Still, this work appears to be the first where the assumption of working in a general Riemannian space is made explicit and a general method for the construction of the appropriate metric is presented.

4.8 Zero Temperature Limit. It is interesting to consider the limiting behavior of the usual stochastic dynamics and the MSD algorithms on the regression problem when the inverse noise variance $\beta$ goes to infinity (the physical analog is the temperature going to zero). For simplicity, we restrict our attention to the case of a single leapfrog step per iteration, Gibbs resampling of the momentum with no persistence (for both algorithms), and $m_i = 1$ (for the stochastic dynamics). In the usual stochastic dynamics, $q$ is updated according to

$$q_i(t + \epsilon) = q_i(t) + \epsilon\,p_i\left(t + \frac{\epsilon}{2}\right) = q_i(t) + \epsilon\left(p_i(t) - \frac{\epsilon}{2}\,\frac{\partial E(q(t))}{\partial q_i}\right).$$
Using $E = \frac{\beta}{2}\sum_{i=1}^{l}\left(f(x^{(i)}, q) - y^{(i)}\right)^2 + \frac{\alpha}{2}\sum_{k=1}^{n} q_k^2$, it follows that⁵

$$q_i(t + \epsilon) = q_i(t) - \frac{\gamma}{2}\,\frac{\partial E(q(t))}{\partial q_i} + \tilde{o}(1), \tag{4.34}$$

where from scaling considerations we set $\epsilon = \sqrt{\gamma/\beta}$. Similarly, in MSD, the updating rule is

$$q_i(t + \epsilon) = q_i(t) - \frac{\gamma}{2}\left(G^{-1}\,\frac{\partial E(q(t))}{\partial q}\right)_i + \tilde{o}(1), \tag{4.35}$$

where $G$ is defined by equation 4.16 and $\gamma = \epsilon^2$. It can be seen that in the zero temperature limit, the usual stochastic dynamics reduces to the gradient-descent algorithm, whereas MSD reduces to the well-known Gauss-Newton algorithm for solving least-squares problems.

⁴ We thank an anonymous referee for drawing our attention to this article.
⁵ The meaning of $a = \tilde{o}(1)$ is $\lim_{\beta \to \infty} a = 0$ with probability 1.
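The zero-temperature correspondence can be illustrated on a toy linear least-squares problem (our sketch, not an experiment from the paper): the plain update follows the gradient, while the metric-preconditioned update $-G^{-1}\nabla E$ is exactly a Gauss-Newton step and converges immediately on a quadratic energy.

```python
import numpy as np

# Linear least-squares: E(theta) = 0.5 * ||J theta - y||^2, with an
# anisotropic Jacobian as in the running example.
J = np.array([[1.0, 0.0], [0.0, 10.0]])
y = np.array([0.5, 2.0])
grad = lambda th: J.T @ (J @ th - y)
G = J.T @ J                      # metric tensor of eq. 4.16 (Gauss-Newton matrix)

theta_gd = np.zeros(2)           # zero-temperature limit of stochastic dynamics
theta_gn = np.zeros(2)           # zero-temperature limit of MSD
for _ in range(50):
    theta_gd = theta_gd - 0.005 * grad(theta_gd)               # gradient descent
    theta_gn = theta_gn - np.linalg.solve(G, grad(theta_gn))   # Gauss-Newton
# The exact minimizer is theta* = (0.5, 0.2); Gauss-Newton reaches it in a
# single step, while gradient descent crawls along the flat direction.
```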
5 Empirical Comparison

5.1 Robot Arm Problem. We compared the performance of the MSD algorithm with that of the standard HMC algorithm. The comparison was carried out using MacKay's robot arm problem, a common benchmark for Bayesian methods in neural networks (MacKay, 1992; Neal, 1996). The task in the robot arm problem is to learn the mapping from the two input variables, $x_1$ and $x_2$, representing the joint angles of a robot arm, to the two target values, $y_1$ and $y_2$, representing the resulting arm position. The data were generated as follows:
$$y_1 = 2.0\cos x_1 + 1.3\cos(x_1 + x_2) + e_1$$
$$y_2 = 2.0\sin x_1 + 1.3\sin(x_1 + x_2) + e_2,$$

where $e_1, e_2$ are independent zero-mean gaussian noise variables with standard deviation 0.05. The data set that Neal and MacKay used contained 200 examples in the training set and 400 in the test set.

For the learning task, we used a neural network with two input units, one hidden layer containing eight tanh units, and two linear output units. The hyperparameter $\beta$ was set to its correct value of $\frac{1}{0.05^2} = 400$, and $\alpha$ was chosen to be 1.

5.2 Algorithms. We compared MSD with two versions of HMC: with 100 and with 200 leapfrog steps per iteration, henceforth referred to as HMC100 and HMC200, respectively. In both versions of HMC, all the masses were set to 1. MSD was run with a single discretization step per iteration using equation 4.33. In all three algorithms, the momentum was resampled using persistence with $\cos(\theta) = 0.95$. The energy gradient was calculated using backpropagation, and the gradient of $\log \pi(q)$ was calculated using a modification of the R{backprop} algorithm of Pearlmutter (1994), as outlined in section 4.5.

The preliminary experiments with MSD showed that it is not necessary to use the whole sample for calculating the metric tensor. It was found that an approximation using only 20% of the training set yielded results comparable to the case where the whole training set was used. Accordingly, in the following simulations, the metric tensor was calculated using a random 20% subsample of the training set.

We tried the heuristic choice of HMC step sizes described in Neal (1996) and found that when the hyperparameters are fixed, this choice actually slows HMC down with respect to a single step size for all components. This is probably due to the fact that these heuristics make a diagonal approximation to the Hessian, while the latter in practice is far from diagonal.
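The data-generation process of section 5.1 can be sketched as follows (our illustration; the joint-angle input ranges are an assumption, since the text does not specify how the inputs were drawn):

```python
import numpy as np

rng = np.random.default_rng(0)

def robot_arm_data(n, rng):
    """Generate n examples of the robot arm mapping described above.
    The input intervals below are assumed, not taken from this paper."""
    x1 = rng.uniform(-1.932, -0.453, n)
    x2 = rng.uniform(0.534, 3.142, n)
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + rng.normal(0, 0.05, n)
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + rng.normal(0, 0.05, n)
    return np.stack([x1, x2], axis=1), np.stack([y1, y2], axis=1)

X_train, Y_train = robot_arm_data(200, rng)   # training set
X_test, Y_test = robot_arm_data(400, rng)     # test set
```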
In addition, even the estimates of the diagonal elements of the Hessian are quite poor, as they are based on the values of the hyperparameters alone and
ignore the values of the weights. Consequently, in the experiment reported below, these heuristics were not used.⁶

Following Rasmussen (1996), the discretization step size for HMC was chosen so as to keep the rejection rate below 5%. An equivalent criterion of an average error in the Hamiltonian below 0.05 was used for MSD. In the preliminary runs, it was found that the discretization errors of both HMC and MSD are typically much higher in the regions of large energy (hence, low probability). As a result, while in the region of high probability the rejection rate of HMC was below 5%, if the simulation was stuck in a low-probability region for a long time, the rejection rate was occasionally over 90%. Consequently, in the test runs reported below, the initial state for the Markov chain was produced using a simple energy minimization algorithm (the scaled conjugate gradient algorithm; Møller, 1993), run for 100 iterations, starting from a point sampled from the prior. This ensured that the Markov chains started close to the region of high probability.⁷

All three sampling algorithms were run 10 times, each time for 3000 iterations, with the first 1000 samples discarded in order to allow the Markov chains to converge to the desired distribution. Each run used a different random seed and started from a random state common to all three algorithms. The subsample used for the evaluation of the metric tensor in MSD was replaced every 100 iterations in order to reduce the possible influence of a "bad" subsample.

5.3 Results. One appropriate measure of the rate of state-space exploration is the weights autocorrelation (Neal, 1996). As shown in Figure 4, the behavior of MSD was clearly superior to that of HMC100 and comparable to that of HMC200. Another value of interest is the average squared error over the training and the test sets. The predictions for both sets were made as follows.
A subsample of 100 parameter vectors was generated by taking every twentieth sample vector, starting from iteration 1001. The predicted function value was the average over the empirical function distribution of this subsample. The statistics of the average squared errors are presented in Table 1. The average test and training errors for all three algorithms are similar (which is not surprising, since they were intended to sample from the same distribution). However, the standard deviation for MSD is lower than that for HMC (though the difference relative to HMC200 is not as significant as the one relative to HMC100), meaning that the estimates obtained using MSD are more reliable.

⁶ It should be noted, however, that if the hyperparameters are allowed to change and different groups of weights are given separate hyperparameters (which may vary significantly), these heuristics can be useful, as they provide automatic scaling of the weights.
⁷ Naturally, robustness to "bad" initialization could be achieved by using smaller discretization steps, but this would lead to slower mixing.
Figure 4: Average (over the 10 runs) autocorrelation for the square root of the average squared magnitude of input-to-hidden (left) and hidden-to-output (right) weights, for HMC100, HMC200, and MSD. The horizontal axis gives the lags, measured in number of iterations.

Table 1: Statistics (over 10 Runs) for the Average Training and Test Errors.

              Training Error                    Test Error
          Average    Standard Deviation     Average    Standard Deviation
HMC100    0.00527    2.9 x 10^-4            0.00594    2.1 x 10^-4
HMC200    0.00511    1.8 x 10^-4            0.00588    1.8 x 10^-4
MSD       0.00522    1.5 x 10^-4            0.00593    1.6 x 10^-4
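The prediction procedure just described (discard burn-in, thin the chain, then average the model outputs over the retained parameter vectors) can be sketched as follows (our illustration, with a hypothetical one-parameter model):

```python
import numpy as np

def posterior_predict(f, samples, x, burn_in=1000, thin=20, n_keep=100):
    """Monte Carlo prediction: keep every `thin`-th parameter vector after
    `burn_in`, then average the model outputs f(x, theta) over them."""
    kept = samples[burn_in:burn_in + thin * n_keep:thin]
    return np.mean([f(x, th) for th in kept], axis=0)

# Toy check with a linear model f(x, theta) = theta * x and a fake "chain".
f = lambda x, th: th * x
chain = np.linspace(0.0, 3.0, 4000)   # stand-in for a run of MCMC samples
pred = posterior_predict(f, chain, x=2.0)
```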
The computational requirements, measured in number of floating-point operations per iteration, are presented in Table 2. The computational load of MSD was about one-fifth of that of HMC100 and 10 times lower than that of HMC200.

Table 2: Computational Load (in Floating-Point Operations per Iteration).

HMC100        HMC200        MSD
4.8 x 10^6    9.6 x 10^6    1.0 x 10^6

6 Conclusion

We have described a new algorithm for efficient sampling from complex distributions such as those appearing in Bayesian learning with nonlinear models. The empirical comparison shows that our algorithm achieves results superior to the best obtained by existing algorithms in considerably shorter computation time.

Acknowledgments
Part of this work was presented at NIPS'99.

References

Amari, S. (1997). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Andersen, H. C. (1980). Molecular dynamics simulations at constant pressure and/or temperature. Journal of Chemical Physics, 3, 589–603.
Arnold, V. I. (1984). Mathematical methods of classical mechanics. New York: Springer-Verlag.
Barndorff-Nielsen, O. E. (1988). Parametric statistical models and likelihood. New York: Springer-Verlag.
Bennett, C. H. (1975). Mass tensor molecular dynamics. Journal of Computational Physics, 19, 267–279.
Buntine, W. L., & Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.
Chavel, I. (1993). Riemannian geometry: A modern introduction. Cambridge: Cambridge University Press.
Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195, 216–222.
Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. New York: Springer-Verlag.
Gear, C. W. (1971). Numerical initial value problems in ordinary differential equations. Englewood Cliffs, NJ: Prentice Hall.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Golub, G. H., & van Loan, C. F. (1996). Matrix computation. Baltimore: Johns Hopkins University Press.
Horowitz, A. M. (1991). A generalized guided Monte Carlo algorithm. Physics Letters B, 268, 247–252.
Hwang, C.-R., Hwang-Ma, S.-Y., & Shen, S.-J. (1993). Accelerating gaussian diffusion. Annals of Applied Probability, 3, 897–913.
MacKay, D. J. C. (1992). Bayesian methods for adaptive models. Unpublished doctoral dissertation, California Institute of Technology.
Metropolis, N., Rosenbluth, A., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
Neal, R. M. (1992). Bayesian training of backpropagation networks by the hybrid Monte Carlo method (Tech. Rep. No. CRG-TR-92-1). Toronto: Department of Computer Science, University of Toronto.
Neal, R. M. (1993). Bayesian learning via stochastic dynamics. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 475–482). San Mateo, CA: Morgan Kaufmann.
Neal, R. M. (1996). Bayesian learning for neural networks. Berlin: Springer-Verlag.
Pearlmutter, B. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.
Rasmussen, C. (1996). A practical Monte Carlo implementation of Bayesian learning. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Received July 30, 1999; accepted January 22, 2001.
LETTER

Communicated by Joachim Buhmann

Resampling Method for Unsupervised Estimation of Cluster Validity

Erel Levine
Eytan Domany
Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel

We introduce a method for validation of results obtained by clustering analysis of data. The method is based on resampling the available data. A figure of merit that measures the stability of clustering solutions against resampling is introduced. Clusters that are stable against resampling give rise to local maxima of this figure of merit. This is presented first for a one-dimensional data set, for which an analytic approximation for the figure of merit is derived and compared with numerical measurements. Next, the applicability of the method is demonstrated for higher-dimensional data, including gene microarray expression data.

1 Introduction
Cluster analysis is an important tool for investigating and interpreting data. Clustering techniques are the main tool used for exploratory data analysis, when one is dealing with data about whose internal structure little or no prior information is available. Cluster algorithms are expected to produce partitions that reflect the internal structure of the data and identify "natural" classes and hierarchies present in it. A wide variety of clustering algorithms have been proposed. Some have their origins in graph theory, whereas others are based on statistical pattern recognition, self-organization methods, and more. More recently, algorithms rooted in statistical mechanics have been introduced. Comparing the relative merits of various methods is made difficult by the fact that when applied to the same data set, different clustering algorithms often lead to markedly different results. In some cases such differences are expected, since different algorithms make different (explicit or implicit) assumptions about the structure of the data. If the particular set that is being studied consists, for example, of several clouds of data points, with each cloud spherically distributed about its center, methods that assume such structure (e.g., k-means) will work well. On the other hand, if the data consist of a single nonspherical cluster, the same algorithms will fare miserably, breaking it up into a hierarchy of partitions. Since for the cases of interest one does not know which assumptions are satisfied by the data, a researcher

Neural Computation 13, 2573–2593 (2001) © 2001 Massachusetts Institute of Technology
may run into severe difficulties in interpreting results. By preferring one algorithm's clusters over those of another, he or she may reintroduce his or her biases about the underlying structure, precisely those biases that he or she hoped to eliminate by employing clustering techniques. In addition, the differences in the sensitivity of different algorithms to noise, which inherently exists in the data, also yield a major contribution to the difference between their results. The ambiguity is made even more severe by the fact that even when one sticks exclusively to one's favorite algorithm, the results may depend strongly on the values assigned to various parameters of the particular algorithm. For example, if there is a parameter that controls the resolution at which the data are viewed, the algorithm produces a hierarchy of clusters (a dendrogram) as a function of this parameter. One then has to decide which level of the dendrogram best reflects the "natural" classes present in the data. Needless to say, one wishes to answer these questions in an unsupervised manner, making use of nothing more than the available data. Various methods and indicators that come under the name "cluster validation" attempt to evaluate the results of cluster analysis in this manner (Jain & Dubes, 1988). Numerous studies suggest direct and indirect indices for evaluation of hard clustering (Jain & Dubes, 1988; Bock, 1985), probabilistic clustering (Duda & Hart, 1973), and fuzzy clustering (Windham, 1982; Pal & Bezdek, 1995) results. Hard clustering indices are often based on some geometrical motivation to estimate how compact and well separated the clusters are; an example is Dunn's index (Dunn, 1974) and its generalizations (Bezdek & Pal, 1995); others are statistically motivated, for example, comparing the within-cluster scattering with the between-cluster separation (Davies & Bouldin, 1979). Probabilistic and fuzzy indices are not considered here. Indices proposed for these methods are based on likelihood-ratio tests (Duda & Hart, 1973), information-based criteria (Cutler & Windham, 1994), and more. Another approach to cluster validity includes some variant of cross-validation (Fukunaga, 1990). Such methods were introduced in the context of both hard clustering (Jain & Moreau, 1986) and fuzzy clustering (Smyth, 1996). The approach presented here falls into this category. In this article, we present a method to help select which clustering result is more reliable. The method can be used to compare different algorithms, but it is most suitable for identifying, within the same algorithm, those partitions that can be attributed to the presence of noise. In these cases, a slight modification of the noise may alter the cluster structure significantly. Our method controls and alters the noise by means of resampling the original data set. In order to illustrate the problem we wish to address and its proposed solution, consider the following example. A scientist is investigating mice and comes to suspect that there are several types of them. She therefore measures two features of the mice (such as weight and shade), looks for clusters in this two-dimensional data set, and indeed finds two clusters.
Figure 1: Two samples, drawn from an eight-shaped uniform distribution. Sample a is somewhat nontypical.
She can therefore conclude that there are two types of mice in her lab. Or can she? Imagine that the data that the scientist collects can be represented by the points in Figure 1a. This data set was in fact taken from an eight-shaped uniform distribution (with a relatively narrow "neck" at the middle). Hence, no partition really exists in the underlying structure of the data, and unless one makes explicit assumptions about the shape of the data clouds, one should identify a single cluster. The particular cluster algorithm used breaks the data into two clusters along the horizontal gap seen in Figure 1a. This gap happens to be the result of fluctuations in the data or noise in the sampling (or measurement) process. More typical data sets, such as that of Figure 1b, do not have such gaps and are not broken into two clusters by the algorithm. If more than a single sample had been available, it would have been safe to assume that this particular gap would not have appeared in most samples. The partition into two clusters in that case would be easily identified as unreliable.¹ In most cases, however, only a single sample is available, and resampling techniques are needed in order to generate several ones. In this article we propose a cluster validation method based on resampling (Good, 1999; Efron & Tibshirani, 1993): subsets of the data under investigation are constructed randomly, and the cluster algorithm is applied to each subset. The resampling scheme is introduced in section 2, and a figure of merit is proposed to identify the stable clustering solutions, which are

¹ By "unreliable" we mean that one cannot infer whether the partition into two clusters is only the consequence of noise and therefore cannot answer the question of whether it is likely to reappear in another random sample.
less likely to be the results of noise or fluctuations. The proposed procedure is tested in section 3 on a one-dimensional data set, for which an analytical expression for the figure of merit is derived and compared with the corresponding numerical results. In section 4 we demonstrate the applicability of our method to two artificial data sets (in d = 2 dimensions) and to real (very high-dimensional) DNA microarray data.

2 The Resampling Scheme
Let us denote the number of data points in the set to be studied by N. Typically, application of any cluster algorithm necessitates choosing specific values for some parameters. The results yielded by the clustering algorithm may depend strongly on this choice. For example, some algorithms (e.g., the c-shell fuzzy clustering in Bezdek, 1981, and the iso-data algorithm in Cover & Thomas, 1991) take the expected number of clusters as part of their input. Other algorithms (e.g., Valley-Seeking, of Fukunaga, 1990) have the number of neighbors of each point as an external parameter. In particular, many algorithms have a parameter that controls the resolution at which clusters are identified. In agglomerative clustering methods, for example, this parameter defines the level of the resulting dendrogram at which the clustering solution is identified (Jain & Dubes, 1988). For the K-nearest-neighbor algorithm (Fukunaga, 1990), a change in the number of neighbors of each point, K, controls the resolution. For deterministic annealing (DA) (Rose, Gurewitz, & Fox, 1990) and the superparamagnetic clustering algorithm (SPC) (Blatt, Wiseman, & Domany, 1996; Domany, 1999), this role is played by the temperature T. As this control parameter is varied, the data points get assigned to different clusters, giving rise to a hierarchy. At the lowest resolution, all N points belong to one cluster, whereas at the other extreme, one has N clusters, with a single point in each. As the resolution parameter varies, clusters of data points break into subclusters, which break further at a higher level. Sometimes the aim is to generate precisely such a dendrogram. In other cases, one would like to produce a single partitioning of the data, which captures a particular important aspect. In such a case, we wish to identify that value of the resolution control parameter at which the most reliable, natural clusters appear.
In these situations, the resolution parameter plays the role of one (probably the most important) member of the family of parameters of the algorithm that needs to be fixed. Let us denote the full set of parameters of our algorithm by V. Any particular clustering solution can be presented in the form of an N × N cluster connectivity matrix T_ij, defined by

    T_ij = { 1   if points i and j belong to the same cluster,
           { 0   otherwise.                                        (2.1)
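As a concrete sketch (ours, not the authors' code), the connectivity matrix of equation 2.1 can be computed from a vector of integer cluster labels:

```python
# Illustrative sketch: build the cluster connectivity matrix T_ij of
# equation 2.1 from one integer cluster label per point.
import numpy as np

def connectivity_matrix(labels):
    """T_ij = 1 if points i and j share a cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)
```

For example, labels [0, 0, 1] yield a matrix whose upper-left 2 × 2 block is all ones.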
In order to validate this solution, we construct an ensemble of m such matrices and make comparisons among them. This ensemble is created by constructing m resamples of the original data set. A resample is obtained by selecting at random a subset of size fN of the data points. We call 0 ≤ f ≤ 1 the dilution factor. We apply to every one of these subsets the same clustering procedure used on the full data set, using the same set of parameters V. This way we obtain for each resample μ, μ = 1, ..., m, its own clustering results, summarized by the fN × fN matrix T^(μ). We define a figure of merit M(V) for the clustering procedure (and the choice of parameters) that we used. The figure of merit is based on comparing the connectivity matrices of the resamples, T^(μ) (μ = 1, ..., m), with the original matrix T:

    M(V) = ⟨⟨ δ_{T_ij^(μ), T_ij} ⟩⟩_μ .    (2.2)
The averaging implied by the notation ⟨⟨·⟩⟩_μ is twofold. First, for each resample μ, we average over all those pairs of points ij that were neighbors in the original sample and have both survived the resampling.² Second, this average value is averaged over all the m different resamples. Clearly, 0 ≤ M ≤ 1, with M = 1 for a perfect score. The figure of merit M measures the extent to which the clustering assignments obtained from the resamples agree with those of the full sample. An important assumption we have made implicitly in this procedure is that the algorithm's parameters are "intensive," that is, their effect on the quality of the result is independent of the size of the data set. We can generalize our procedure in several ways to cases when this assumption does not hold (Levine, 2000).³

After calculating M(V), we have to decide whether we accept the clustering result obtained using a particular value of the clustering parameters. For very low and very high values of M, the decision may be easy, but for midrange values, we may need some additional information to guide our decision. In such a case, the best way to proceed is to change the values of the clustering parameters and go through the whole process once again. Having done so for some set of parameter choices V, we study the way M(V) varies as a function of V. Optimal sets of parameters V* are identified by locating the maxima of this function. It should be noted, however, that some of these maxima are trivial and should not be used. Other maxima reflect different clustering solutions, which may all be considered valid.

² For various definitions of neighbors, see Fukunaga (1990).
³ For example, we can define our figure of merit on the basis of pairwise comparisons of our resamples and find an optimal set of parameters V₁* in the way explained below. Next, we look for parameters V₂* for which clustering of the full sample yields closest results to those obtained for the resamples (clustered at V₁*).
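The resampling scheme just described (cluster the full set, cluster m random subsets of size fN, and average the agreement of the connectivity matrices) can be sketched in Python. This is an illustrative sketch, not the authors' code: the pair average below runs over all surviving pairs rather than only originally neighboring pairs, and `cluster_fn` stands in for an arbitrary clustering routine.

```python
# Hypothetical sketch of the resampling figure of merit of equation 2.2,
# simplified to average over ALL surviving pairs of points.
import numpy as np

def connectivity(labels):
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]

def figure_of_merit(data, cluster_fn, f=0.5, m=20, rng=None):
    """M: mean agreement between full-sample and resample connectivities."""
    rng = np.random.default_rng(rng)
    n = len(data)
    T_full = connectivity(cluster_fn(data))
    scores = []
    for _ in range(m):                                  # m random subsets
        idx = rng.choice(n, size=int(f * n), replace=False)
        T_sub = connectivity(cluster_fn(data[idx]))     # cluster the resample
        # compare resample pairs with the corresponding full-sample pairs
        scores.append(np.mean(T_sub == T_full[np.ix_(idx, idx)]))
    return float(np.mean(scores))
```

With a clustering routine that is deterministic in each point's value, every resample reproduces the full-sample assignments and M = 1; noise-induced partitions lower the score.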
For example, samples out of hierarchical distributions may have several meaningful solutions, corresponding to different levels of the hierarchy. Here we demonstrate the case of a single robust solution (examples with several solutions were given in Levine, 2000). Let us denote by A(N) the computational complexity of the clustering algorithm over a sample of size N with a given parameter set. The computational complexity of the resampling scheme just outlined can be estimated by vmA(fN), where v is the number of different parameter sets among which one is searching for a best solution. Our procedure can be summarized in the following algorithmic form:

Step 0. Choose values for the parameters V of the clustering algorithm.

Step 1. Perform clustering analysis of the full data set.

Step 2. Construct m subsets of the data set by randomly selecting fN of the N original data points.

Step 3. Perform clustering analysis for each subset.

Step 4. Based on the clustering results obtained in Steps 1 and 3, calculate M(V), as defined in equation 2.2.

Step 5. Vary the parameters V, and identify stable clusters as those for which a local maximum of M is observed.

3 Analysis of a One-Dimensional Model
To demonstrate the procedure outlined above, we consider a clustering problem that is simple enough to allow an approximate analytical calculation of the figure of merit M and its dependence on a parameter that controls the resolution. Consider a one-dimensional data set that consists of points x_i, i = 1, ..., N, selected from two identical but displaced uniform distributions, such as the one shown in Figure 3. The distributions are characterized by the mean distance between neighboring points, d = 1/λ, and the distance or gap between the two distributions, Δ. Distances between neighboring points within a cluster are distributed according to the Poisson distribution,

    P(s) ds = λ e^{−λs} ds.    (3.1)

The results of a clustering algorithm that reflects the underlying distribution from which the data were selected should identify two clusters in this data set. Consider a simple nearest-neighbor clustering algorithm, which assigns two neighboring points to the same cluster if and only if the distance between the two is smaller than a threshold a. Clearly, α = λa is the dimensionless parameter of the algorithm that controls the resolution of the clustering. For very small α ≪ 1, no two points belong to the same cluster, and the number of clusters equals N, the number of points; at the other extreme, α ≫ 1,
all pairs of neighbors are assigned to the same cluster, and hence all points reside in one cluster. Starting from α ≫ 1 and reducing it gradually, one generates a dendrogram. At an intermediate value of α, we may get any number of clusters between one and N. Hence, if we picked some particular value of α at which we obtained some clusters, we must face the dilemma of deciding whether these clusters are "natural" or the result of the fluctuations (i.e., noise) present in the data. In other words, we would like to validate the clustering solution obtained for the full data set for a given value of α. We do this using the resampling scheme described above. A resample is generated by independently deciding for each data point whether it is kept in the resample (with probability f) or is discarded (with probability 1 − f). This procedure is repeated m times, yielding m resamples. All length scales of the original problem get rescaled by the resampling procedure by a factor of 1/f; the mean distance between neighboring points in the resampled set is d′ = 1/λf, and the distance between the two uniform distributions is Δ′ = Δ/f. Clustering is therefore performed with a rescaled threshold a′ = a/f on any resample; the resolution parameter keeps its original value, α′ = a′/d′ = α.

We first wish to get an approximate analytical expression for the figure of merit M(α) described above. To do this, we consider the gaps between data points rather than the points themselves. Let us denote by b the distance between the data point i (of the original sample) and its nearest left neighbor, b = x_i − x_{i−1}, with the two points on the same side of the gap Δ. We first assume that this edge is not broken by the clustering algorithm, b < a. Given a resample that includes point i, we define b′ in the same fashion. The new, resampled left neighbor of i resides in the same cluster as i if b′ < a′; the probability that this happens is given by (Levine, 2000)

    P₁(β) = Σ_{m=1}^{∞} [f² (1 − f)^{m−1} / (m − 1)!] γ(m, α/f − β),    (3.2)

where the dimensionless variable β = λb was introduced. Here γ(n, z) is Euler's incomplete gamma function,

    γ(n, z) = ∫₀^z e^{−t} t^{n−1} dt,    (3.3)

except that in our convention, we take γ(n, z < 0) = γ(n, 0). Similarly, if points i and i − 1 were not assigned to the same cluster in the original sample, then the probability that the same would happen in a resample is (Levine, 2000)

    P₂(β) = Σ_{m=1}^{∞} [f² (1 − f)^{m−1} / (m − 1)!] Γ(m, α/f − β),    (3.4)
where Γ(n, z) is the other incomplete gamma function,

    Γ(n, z) = ∫_z^{∞} e^{−t} t^{n−1} dt,    (3.5)

and we take Γ(n, z < 0) = Γ(n, 0) = (n − 1)!, so each term Γ(m, α/f − β)/(m − 1)! = 1 for α ≤ fβ. We now calculate the index M in an approximate manner by averaging P₁ and P₂ over all edges. This calculation matches the definition 2.2 of M in spirit. For pairs residing within a true cluster, the averaging is done by integrating over all possible values of β:

    A(α) = ∫₀^α e^{−β} P₁(β) dβ + ∫_α^{∞} e^{−β} P₂(β) dβ.    (3.6)
For pairs that lie on different sides of the gap, we should compare α only with the size of the gap Δ,

    B(δ, α) = { P₁(δ)   if α ≥ δ,
              { P₂(δ)   if α < δ,    (3.7)

where the dimensionless variable δ = λΔ was introduced. Clearly, in the one-dimensional example, there are many fewer edges of the second kind than of the first. This, however, is not the case for data in higher dimensions, so we give equal weights to the two terms A and B,⁴

    M(α) = (1/2) [A(α) + B(δ, α)].    (3.8)
We now plot M as a function of the resolution parameter α for both f = 1/2 and f = 2/3 (see Figure 2a), assuming the intercluster distance δ = 5. A clear peak can be observed in both curves at α ≈ 2.5 and α ≈ 3.3, respectively. Similarly, for δ = 10 clear peaks are identified at α ≈ 5 and α ≈ 7 (see Figure 2b). As we will see, these peaks correspond to the most stable clustering solution, which indeed recovers the original clusters. The trivial solutions of a single cluster (of all data points) and the opposite limit of N single-point clusters are also stable and appear as the maxima of M(α) at α ≪ 1 and α ≫ 1, respectively. In order to test how good our approximate analytic evaluation of M is, we clustered the one-dimensional data set of Figure 3 (top) and calculated the index M as defined in equation 2.2. The data set consists of N = 300 data points sampled from two uniform distributions of mean nearest-neighbor distance 1/λ = 1 and shifted by Δ = 10. The dendrogram obtained by varying α is shown in Figure 3 (bottom). It clearly exhibits two stable clusters,

⁴ The ratio between the two terms is of the order (d − 1)/d, where d is the dimensionality of the problem. For high-dimensional problems, this ratio is close to 1.
Figure 2: Mean behavior of M as a function of the geometric threshold, according to equation 3.8, for two clusters. The function is evaluated for the intercluster distances (a) Δλ = 5 and (b) Δλ = 10, with dilution parameters f = 1/2 and f = 2/3.
with stability indicated by the wide range of values of α over which the two clusters "survive." Next, we generated 100 resamples of size 150 (i.e., f = 1/2) and applied the geometrical clustering procedure described above to each resample.
Figure 3: Resampling results for a one-dimensional data set. Three hundred points were chosen uniformly from two clusters, separated by Δλ = 10. The histogram of the data is given in the top panel. We performed 100 resamples of 150 points (i.e., f = 1/2) and 100 resamples of 200 points (f = 2/3), to calculate two versions of M. In the middle panel, we plot M as a function of the resolution parameter α. The peak between α ≈ 4 and α ≈ 7 corresponds to the correct two-cluster solution, as can be seen from the dendrogram shown in the bottom panel.
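The nearest-neighbor threshold algorithm used in these one-dimensional experiments can be sketched as follows (illustrative code; the function name is ours):

```python
# Sketch of the nearest-neighbor threshold clustering of section 3:
# two neighboring points share a cluster iff their distance is below a.
import numpy as np

def threshold_clusters(points, a):
    """Return an integer cluster label per point (points in any order)."""
    order = np.argsort(points)
    x = np.asarray(points)[order]
    # start a new cluster wherever the gap to the left neighbor is >= a
    breaks = np.diff(x) >= a
    labels_sorted = np.concatenate([[0], np.cumsum(breaks)])
    labels = np.empty(len(x), dtype=int)
    labels[order] = labels_sorted
    return labels
```

For example, with points [0, 1, 2, 10, 11] and a = 5, the gap of 8 starts a second cluster.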
Figure 4: Three-ring problem. Fourteen hundred points are sampled from a three-component distribution (described in the text).
By averaging over the different resamples, the figure of merit M was calculated for different values of the dimensionless resolution parameter α, as shown in Figure 3 (middle). The peak between α ≈ 4 and α ≈ 7, corresponding to the most stable, "correct" clustering solution, is clearly identified. The whole procedure was also done with samples of size 200 (f = 2/3), giving similar results. In this case, as in Figure 2, the nontrivial peak extends to somewhat higher α-values when using larger f. In both cases, however, the location of the peak corresponds to the same, intuitively correct solution. The agreement between our approximate analytical curve of Figure 2b for M(α) and the numerically obtained exact curve of Figure 3 (middle) is excellent and most gratifying.

4 Applications

4.1 Two-Dimensional Toy Data. The analysis of the previous section predicts a typical behavior of M as a function of the parameters that control resolution; in particular, it suggests that one can identify a stable, "natural" partition as the one obtained at a local maximum of the function M. This prediction was based on an approximate analytical treatment and backed up by numerical simulations of one-dimensional data. Here we demonstrate that this behavior is also observed for a toy problem, which consists of the two-dimensional data set shown in Figure 4. The angular coordinates of the data points are selected from a uniform distribution, θ ~ U[0, 2π].
Figure 5: Clustering solution of the three-ring problem as a function of the resolution parameter (the temperature).
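The three-ring data set of Figure 4 can be generated as follows (a sketch using the ring parameters quoted in the text; the function name and seed handling are ours):

```python
# Sketch generating the three-ring toy data: uniform angles and normally
# distributed radii around R = 4.0, 2.0, 1.0 (parameters from section 4.1).
import numpy as np

def three_rings(rng=None):
    rng = np.random.default_rng(rng)
    parts = []
    for R, sigma, n in [(4.0, 0.2, 800), (2.0, 0.1, 400), (1.0, 0.1, 200)]:
        theta = rng.uniform(0.0, 2.0 * np.pi, n)   # theta ~ U[0, 2*pi]
        r = rng.normal(R, sigma, n)                # r ~ N[R, sigma]
        parts.append(np.column_stack([r * np.cos(theta), r * np.sin(theta)]))
    return np.vstack(parts)                        # 1400 points in 2-D
```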
The radial coordinates are normally distributed, r ~ N[R, σ], around three different radii R. The outer "ring" (R = 4.0, σ = 0.2) consists of 800 points, and the inner "rings" (R = 2.0, 1.0, σ = 0.1) consist of 400 and 200 points, respectively. The algorithm we chose to work with is the superparamagnetic clustering (SPC) algorithm, recently introduced by Blatt et al. (1996; Domany, 1999). This algorithm provides a hierarchical clustering solution. A single parameter T, called "temperature," controls the resolution: higher temperatures correspond to higher resolutions. A variation of T generates a dendrogram. The outcome of the algorithm depends also on an additional parameter K, described in section 4.3.1. The data of Figure 4 were clustered with K = 20. The results of the clustering procedure are presented in Figure 5. For T < 0.02, we get two clusters: the two inner rings constitute one cluster. A stable phase, in which the three rings are clearly identified as three clusters, appears in the temperature range 0.03 ≤ T ≤ 0.08. In order to identify the value of T that yields the "correct" solution, we generated and clustered 20 different resamples from this toy data set, with a dilution factor of f = 2/3. The resolution parameter (temperature) of each resample was rescaled so that the transition temperature at which the single large cluster breaks up agrees with the temperature of the same transition in the original sample. The function M(T), plotted in Figure 6, exhibits precisely the expected behavior, with two trivial maxima and an additional one at T = 0.05. This value indeed corresponds to the "correct" solution of three clusters. Note that this maximum is not attained at the transition to the three-cluster solution (T ≈ 0.03) but at a slightly higher temperature (T ≈ 0.05). This shift
Figure 6: M as a function of the temperature for the three-ring problem.
reflects the fact that close to the transition, the two-cluster solution is still obtained for some resamples. In order to compare our figure of merit with some known index, we applied several different clustering algorithms to the three-ring data set. For the K-means and DA algorithms, where the number of expected clusters must be provided, we searched for three clusters. A typical three-cluster solution generated by these two methods is shown in Figure 7; the three rings are not identified as clusters. Among the hierarchy of clustering solutions generated by SPC and the average linkage algorithm (described in the next section), there appeared some that were very close to the intuitively correct one; we took these as the cluster assignment that is to be validated. We used our resampling scheme with 100 resamples and f = 2/3 for each of the methods. Although the solutions of SPC and average linkage yielded high M values (above 0.9), the results obtained by K-means and DA gave low values (below 0.6). We then calculated another index, R, defined by

    R = (1/C) Σ_c (S_c^in / S_c^out),
    S_c^in = (1/|c|²) Σ_{i,j∈c} d_ij,
    S_c^out = min_{c′} (1/(|c||c′|)) Σ_{i∈c, j∈c′} d_ij,    (4.1)
where C is the number of clusters, d_ij is the distance between points i and j, and the summation is over all clusters (Jain & Dubes, 1988). This index measures the mean inner distance of the clusters relative to the distance from their nearest neighbor. Clearly this index looks for compact, well-separated clusters. It is therefore not surprising that of the clustering solutions proposed by the different methods, the one with the minimal value of R is that of Figure 7, obtained by the K-means algorithm (R ≈ 0.56, to be compared
Figure 7: Clustering solution of the three-ring problem obtained by the K-means algorithm with K = 3.
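The index R of equation 4.1 can be sketched as follows (our illustrative code, not the authors'; the function assumes at least two clusters):

```python
# Sketch of the index R of equation 4.1: mean within-cluster distance
# relative to the distance to the nearest other cluster (needs >= 2 clusters).
import numpy as np

def r_index(points, labels):
    points, labels = np.asarray(points), np.asarray(labels)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = np.unique(labels)
    total = 0.0
    for c in clusters:
        in_c = labels == c
        s_in = d[np.ix_(in_c, in_c)].mean()          # (1/|c|^2) sum_{i,j in c} d_ij
        s_out = min(d[np.ix_(in_c, labels == c2)].mean()
                    for c2 in clusters if c2 != c)   # nearest other cluster
        total += s_in / s_out
    return total / len(clusters)
```

Compact, well-separated clusters give small R, which is exactly the bias discussed in the text.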
with R ≈ 0.87 for the three-ring solution of SPC). We find that our figure of merit M assigns the highest score to the intuitively expected solution, whereas the R score is minimal for an "incorrect" solution.

4.2 A Single Cluster—Dealing with Cluster Tendency. A frequent problem of clustering methods is the so-called cluster tendency, that is, the tendency of the algorithm to partition any data, even when no natural clusters exist. In particular, agglomerative algorithms, which provide a hierarchy of clusters, always generate some hierarchy as a control parameter is varied. We expect this hierarchy to be very sensitive to noise, and thus unstable against resampling. We tested this assumption for the data set of Figure 1a. The test was performed using two clustering methods: the SPC algorithm and the average-linkage clustering algorithm, an agglomerative hierarchical method. The average-linkage algorithm starts with N distinct clusters, one for each point, and forms the hierarchy by successively merging the closest pair of clusters and redefining the distance between all other clusters and the new one. This step is repeated N − 1 times until only a single element remains. The output of this hierarchical method is a dendrogram. (For more details, see Jain & Dubes, 1988.) We performed our resampling scheme with m = 20 resamples and a dilution factor of f = 2/3, for different levels of resolution. The results are shown in Figure 8. The only stable solutions identified by both algorithms
Figure 8: M as a function of the resolution parameter for the single-cluster problem. Clustering is performed using the SPC algorithm (left) and the average-linkage algorithm (right). The only natural "partitions" are a single cluster or N (single point) clusters.
are the trivial ones, of either a single cluster or N clusters. In this case, obviously the single-cluster solution is also the natural one.

4.3 Clustering DNA Microarray Data. We turn now to apply our procedure to real-life data. We present here validation of cluster analysis performed for DNA microarray data. Gene-array technology provides a broad picture of the state of a cell by monitoring the expression levels of thousands of genes simultaneously. In a typical experiment, simultaneous expression levels of thousands of genes are viewed over a few tens of cell cultures at different conditions. (For details, see Chee et al., 1996; Brown & Botstein, 1999.) The experiment whose analysis is presented here is on colon cancer tissues (Alon et al., 1999). Expression levels of n_g = 2000 genes were measured for n_t = 62 different tissues, out of which 40 were tumor tissues and 22 were normal ones. Clustering analysis of such data has two aims:
Searching for groups of tissues with similar gene expression profiles. Such groups may correspond to normal versus tumor tissues. For this analysis, the n_t tissues are considered as the data points, embedded in an n_g-dimensional space.

Searching for groups of genes with correlated behavior. For this analysis, we view the genes as the data points, embedded in an n_t-dimensional space, and we hope to find groups of genes that are part of the same biological mechanism.

Following Alon et al., we normalize each data point (in both cases) such that the standard deviation of its components is one and its mean vanishes.
Figure 9: Figure of merit M as a function of K for the colon tissue data.
This way the Euclidean distance between two data points is trivially related to the Pearson correlation between the two.

4.3.1 Clustering Tissues. The main purpose of this analysis is to check whether one can distinguish between tumor and normal tissues on the basis of gene expression data. Since we know in what aspect of the data's structure we are interested, the working resolution in this problem is determined as the value at which the data first break into two (or more) large clusters (containing, say, more than 10 tissues). There may be other parameters for the clustering algorithm, which should be determined. For example, the SPC algorithm has a single parameter K, which determines how many data points are considered as neighbors of a given point. The algorithm places edges or arcs that connect pairs of neighbors i, j, and assigns a weight J_ij to each edge, whose value decreases with the distance |x_i − x_j| between the neighboring data points. Hence, the outcome of the clustering process may depend on K, the number of neighbors connected to each data point by an edge. We would like to use our resampling method to determine the optimal value for this parameter. We clustered the tissues, using SPC, for several values of K. For each case, we identified the temperature at which the first split into large clusters occurred. For each case, we performed the same resampling scheme, with m = 20 resamples of size (2/3)n_t, and calculated the figure of merit M(K). The results obtained for several values of K are plotted in Figure 9. Very low and very high values of K give similarly stable solutions. In the low-K case, each point is practically isolated, and the data break immediately into microscopic clusters. The high-K case is just the opposite: each point is considered "close" to very many other points, and no macroscopic partition can be obtained.
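The normalization just described can be sketched as follows (our illustrative code). For vectors normalized this way, a standard identity gives ‖u − v‖² = 2n(1 − ρ), where n is the number of components and ρ the Pearson correlation; this is the "trivial relation" referred to above.

```python
# Sketch of the per-point normalization used for the microarray data:
# each row is shifted to zero mean and scaled to unit standard deviation,
# after which squared Euclidean distance and Pearson correlation satisfy
# ||u - v||^2 = 2 n (1 - rho).
import numpy as np

def normalize_rows(X):
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=1, keepdims=True)
    return Z / Z.std(axis=1, keepdims=True)
```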
Resampling Method for Cluster Validity
Figure 10: Clustering of the tissues, using K = 8. Leaves of the tree are colored according to the known classification into normal (dark) and tumor (light). The two large clusters marked by arrows recover the tumor versus normal classification.
At K = 8 we observe, however, another peak in M. At this K, the clustering algorithm yields two large clusters, which correspond to the two "correct" clusters of normal and tumor tissues, as shown in Figure 10. Such solutions appear also for some higher values of K, but in these cases the clusters are not stable against resampling.

4.3.2 Clustering Genes. Cluster analysis of the genes is performed in order to identify groups of genes that act cooperatively. Having identified a cluster of genes, one may look for a biological mechanism that makes their expression correlated. Trying to answer this question, one may, for example, identify common promoters of these genes (Getz, Levine, Domany, & Zhang, 2000). One may also use one or more clusters of genes to reclassify the tissues (Getz, Levine, & Domany, 2000), looking for clinical manifestations associated with the expression levels of the selected groups of genes. Therefore, in this case, it is important to assess the reliability of each particular gene cluster separately. The SPC clustering algorithm was used to cluster the genes. A resulting dendrogram is presented in Figure 11, in which each box represents a cluster
Figure 11: Dendrogram of genes. Clusters of size seven or larger are shown as boxes. Selected clusters are circled and numbered, as explained in the text.
of genes. Our resampling scheme has been applied to these data using m = 25 resamples of size 1200 (f = 0.6). This time, however, we calculated M(C) for each of the stable clusters C of the full original sample, one at a time. Since the structure of the dendrogram is sensitive to even slight changes in the data, we focused our attention only on the stable clusters emerging from clustering the full set of genes. For each stable cluster C, M was calculated at the temperature at which C was identified. We therefore get a stability measure for each cluster. Next, we focus our attention on the clusters with the highest scores. First, we considered the top 20 clusters. If one of these clusters is a descendant of another one from the list of 20, we discard it. After this pruning, we were left with 6 clusters, which are circled and numbered in Figure 11 (the numbers are not related to the stability score). We are now ready to interpret these stable clusters. The first three consist of known families of genes. Number 1 is a cluster of ribosomal proteins. The genes of cluster 2 are all cytochrome C genes, which are involved in energy transfer. Most of the genes of cluster 3 belong to the HLA-2 family, which are histocompatibility antigens. Cluster 4 contains a variety of genes, some related to metabolism. When
trying to cluster the tissues based on these genes alone, we find a stable partition into two clusters, which is not consistent with the tumor versus normal labeling, but rather with a change in an experimental protocol (Getz, Levine, & Domany, 2000). Clusters 5 and 6 also contain genes of various types. The genes of these clusters have the most typical behavior: all the genes of cluster 5 are highly expressed in the normal tissues but not in the tumor ones, and all the genes of cluster 6 behave the other way around. To summarize, a clustering stability score based on resampling enabled us to zero in on clusters with typical behavior, which may have biological meaning. Using resampling enabled us to select clusters without making any new assumption, a major advantage in exploratory research. The downside of this method, however, is its computational burden. We had to perform clustering analysis 20 times for a rather large data set. This would be the typical case for DNA microarray data.

5 Discussion
This work proposes a method to validate clustering analysis results, based on resampling. It is assumed that a cluster that is robust to resampling is less likely to be the result of a sampling artifact or of fluctuations. The strength of this method is that it requires no additional assumptions. Specifically, no assumption is made about the structure of the data, the expected clusters, or the noise in the data. Only the available data are used. We introduced a figure of merit, M, which reflects the stability of the cluster partition against resampling. The typical behavior of this figure of merit as a function of the resolution parameter allows clear identification of natural resolution scales in the problem. Using this figure of merit, one can easily choose among different clustering solutions the one that is most robust. However, this figure of merit does not reflect the generalization performance of this clustering solution to new data. The question of natural resolution levels is inherent in the clustering problem and thus emerges in any clustering scheme. The resampling method introduced here is general; it is applicable to any kind of data set and any clustering algorithm. Note, however, that if the data are so sparse that dilution can eliminate some of the underlying modes, our resampling scheme should not be used for cluster validation. For a simple one-dimensional model, we derived an analytical expression for our figure of merit and its behavior as a function of the resolution parameter. Local maxima were identified for values of the parameter corresponding to stable clustering solutions. Such solutions can be either trivial (at very low and very high resolution) or nontrivial, revealing the genuine internal structure of the data. Resampling is a viable method provided the original data set is large
enough, so that a typical resample still reflects the same underlying structure. If this is the case, our experience shows that a dilution factor of f ≈ 2/3 works well for both small and large data sets.

Acknowledgments
This research was partially supported by the Germany-Israel Science Foundation. We thank Ido Kanter and Gad Getz for discussions.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96, 6745–6750.
Bezdek, J. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press.
Bezdek, J., & Pal, N. (1995). Cluster validation with generalized Dunn's indices. Proc. 1995 2nd NZ Int'l. (pp. 190–193).
Blatt, M., Wiseman, S., & Domany, E. (1996). Super-paramagnetic clustering of data. Physical Review Letters, 76, 3251–3255.
Bock, H. (1985). On significance tests in cluster analysis. J. Classification, 2, 77–108.
Brown, P., & Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nature Genetics, 21, 33–37.
Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X. C., Stern, D., Winkler, J., Lockhart, D. J., Morris, M. S., & Fodor, S. P. A. (1996). Accessing genetic information with high-density DNA arrays. Science, 274, 610–614.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley-Interscience.
Cutler, A., & Windham, M. (1994). Information-based validity functionals for mixture analysis. In Proc. 1st US/Japan Conf. on Frontiers in Statistical Modeling (pp. 149–170).
Davies, D., & Bouldin, D. (1979). A cluster separation measure. IEEE Trans. PAMI, 224–227.
Domany, E. (1999). Super-paramagnetic clustering of data—the definitive solution of an ill-posed problem. Physica A, 263, 158.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley-Interscience.
Dunn, J. (1974). A fuzzy relative isodata process and its use in detecting compact well-separated clusters. J. Cybernetics, 3, 32–57.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap.
London: Chapman & Hall.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego: Academic Press.
Getz, G., Levine, E., & Domany, E. (2000). Coupled two-way clustering of DNA
microarray data. Proc. Natl. Acad. Sci. USA, 97, 12079.
Getz, G., Levine, E., Domany, E., & Zhang, M. (2000). Super-paramagnetic clustering of yeast gene expression profiles. Physica A, 279, 457.
Good, P. (1999). Resampling methods. New York: Springer-Verlag.
Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall.
Jain, A., & Moreau, J. (1986). Bootstrap technique in cluster analysis. Pattern Recognition, 20, 547–569.
Levine, E. (2000). Unsupervised estimation of cluster validity—Methods and applications. Unpublished M.Sc. thesis, Weizmann Institute of Science.
Pal, N., & Bezdek, J. (1995). On cluster validity for the fuzzy c-means model. IEEE Trans. Fuzzy Systems, 3, 370–376.
Rose, K., Gurewitz, E., & Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65, 945–948.
Smyth, P. (1996). Clustering using Monte Carlo cross-validation. In KDD-96 Proceedings, Second International Conference on Knowledge Discovery and Data Mining (pp. 126–133).
Windham, M. (1982). Cluster validity for the fuzzy c-means clustering algorithm. IEEE Trans. PAMI, 4, 357–363.

Received May 31, 2000; accepted January 4, 2001.
LETTER
Communicated by Erkki Oja
The Whitney Reduction Network: A Method for Computing Autoassociative Graphs

D. S. Broomhead
Department of Mathematics, University of Manchester Institute of Science and Technology, Manchester M60 1QD, U.K.

M. J. Kirby
Department of Mathematics, Colorado State University, Fort Collins, CO 80523, U.S.A.

This article introduces a new architecture and associated algorithms ideal for implementing the dimensionality reduction of an m-dimensional manifold initially residing in an n-dimensional Euclidean space, where n ≫ m. Motivated by Whitney's embedding theorem, the network is capable of training the identity mapping employing the idea of the graph of a function. In theory, a reduction to a dimension d that retains the differential structure of the original data may be achieved for some d ≤ 2m + 1. To implement this network, we propose the idea of a good projection, which enhances the generalization capabilities of the network, and an adaptive secant basis algorithm to achieve it. The effect of noise on this procedure is also considered. The approach is illustrated with several examples.

1 Data Parameterization
Neural Computation 13, 2595–2616 (2001) © 2001 Massachusetts Institute of Technology

The application of analytical transforms, such as the Fourier or wavelet transform, is an established technique for investigating low-dimensional data sets. Analogously, constructing custom empirical transforms (and their inverses) is an attractive means for extracting and manipulating information in large, high-dimensional data sets. The subject of this investigation concerns the construction of (invertible) dimensionality-reducing transformations of data sets for the case where the initial, or ambient, dimension is much greater than the intrinsic (topological) dimension. There are two fundamental approaches for representing data in a reduced coordinate system: data parameterization and data encapsulation. The distinction between the two approaches lies in the differing assumptions concerning how the data set A occupies the data space. Data encapsulation is based on the assumption that the data are a subset of a linear subspace and seeks a coordinate change that makes this apparent. Data parameterization, on the other hand, assumes that the data set is produced by a discrete sampling of an m-dimensional submanifold M; in this case, we
seek to represent the data as a graph from a suitable linear subspace to its orthogonal complement. The parameterization of a data set A ⊂ R^n may be achieved by determining a new coordinate system that is the result of a dimensionality-reducing mapping y = G(x) ∈ B ⊂ R^d, where d < n. The associated reconstruction mapping x = H(y) ∈ A ⊂ R^n takes the data and maps them back to the original ambient coordinates. Thus, H provides a global d-dimensional parameterization of the data. By contrast, data encapsulation seeks to map, typically via a unitary transformation, the data into a subspace of reduced dimension. If the reduced data may be reconstructed via another linear transformation, then the new coordinate system may be viewed as encapsulating the entirety of the data. In this setting, every data point in A lies in the span of the encapsulating basis; if a component of a data point does not lie in this span, it appears as the residual, or error, when the data are reconstructed. For data that reside on a submanifold M, the number of parameters required to encapsulate, or span, the data set is typically significantly larger than the number of dimensions required to parameterize it nonlinearly. Therefore, except in special cases, even an optimal linear parameterization (i.e., data encapsulation) will not produce a representation of the data that reflects the intrinsic dimensionality of M. This article presents a new method for determining nonlinear parameterizations of data sets that are subsets of high-dimensional spaces. At the center of our approach is Whitney's theorem, which motivates the architecture of what we will refer to as the Whitney reduction network (WRN). In this architecture, the reduction mapping G is a projection designed to retain the (differential) structure of the data.
In addition, this projection is empirically constructed to be good in the sense that its inverse will be well conditioned, a critical feature of networks with good generalization properties. A natural way to quantify the idea of a well-conditioned nonlinear inverse is to require that H is Lipschitz with a small Lipschitz constant. Since G is also Lipschitz by virtue of being a projection, a useful consequence of this is that the topological dimensions of A and of its reduced form B are equal, since bi-Lipschitz mappings are dimensionality preserving (see Falconer, 1990). Since the choice of G determines the Lipschitz constant of the inverse, we propose an adaptive basis algorithm that optimizes this. (Note that if the data set is not smooth but fractal, then it is necessary to replace the topological dimension by the Hausdorff dimension in the preceding discussion.) In section 2, a discussion of Whitney's embedding theorem is presented; in section 3, the network architecture and implementation are presented; in section 4, the distinguishing features of the WRN are summarized; in section 5, we analyze the effects of noise on the procedure; in section 6, we present three illustrative examples of the methodology. Finally, in section 7, the main results are summarized, and some future work is outlined. More details concerning the mathematical theory of the proposed reduction network are presented in Broomhead and Kirby (2000).
2 Whitney’s Embedding Theorem
This article is motivated by the Whitney embedding theorem, which shows that it is always possible to represent a compact, finite-dimensional, differentiable manifold as a submanifold of a vector space (Hirsch, 1976). Roughly speaking, given an m-dimensional differentiable manifold M, we can find a mapping to the Euclidean space R^{2m+1} that is diffeomorphic onto its image. (A diffeomorphism is a differentiable map with a differentiable inverse.) In this sense it might be said that R^{2m+1} is large enough to contain a "diffeomorphic copy" of every m-dimensional differentiable manifold (Guillemin & Pollack, 1974). The proof given in Hirsch (1976) has two stages. The first, and preliminary, stage shows that there is some sufficiently large q for which there exists an embedding in R^q. So we concentrate on the situation where we have an m-dimensional submanifold, M, of R^q and note that here q may be very large. The central idea of the rest of the proof is to show that there exists an "admissible" projection from R^q to R^{q−1} and then to apply this argument recursively until we reach a value of q beyond which the argument fails. Since we are dealing with compact manifolds, "admissible" here means that the projection is an injective immersion when restricted to M. A projection that is an injective immersion has a smooth inverse when restricted to its image; it is this that is interesting to us, since it suggests the possibility of a lossless data compression technique. This proof of the Whitney embedding theorem demonstrates that for each q > 2m + 1, there is an open dense set of such projections, which is to say that in a suitable topology, every neighborhood of a projection that is not admissible contains a projection that is (the dense part); and, moreover, every sufficiently small perturbation of an admissible projection is also admissible (the open part). There is an appealing way to visualize these ideas.
Imagine M as a submanifold of R^q and connect each pair of points in M with a straight line segment. We shall refer to these as the secants of M. Now consider S, the set of unit vectors parallel to the secants—the unit secants. These can be thought of as points on the (q − 1)-dimensional unit sphere, since this is the set of all unit vectors in R^q. We can also associate projections from R^q to R^{q−1} with points on the (q − 1)-dimensional unit sphere by labeling each projection with the unit vector that it maps to zero. Clearly, the set of unit secants constitutes the set of inadmissible projections in the sense that a projection that is also a unit secant must identify at least two distinct points in M and consequently is not invertible. The method of proof is then to show that the set of unit secants is nowhere dense in the (q − 1)-dimensional unit sphere provided that q > 2m + 1. A set that is nowhere dense has no subsets that are open, so that any point in a nowhere dense set must have neighboring points not in the set. This is enough to establish the existence of a projection that is an injection. A similar approach
is adopted to establish a similar result for the set of projections parallel to tangents of M. These correspond to projections that are not immersions. A subset that is nowhere dense is, from a topological point of view, a very small set. Although this remark should be treated with some caution—there are well-known examples of nowhere dense sets with positive measure—it suggests our approach to compression, since it says that an arbitrarily selected projection from R^q to R^{q−1} will have a smooth inverse whenever q > 2m + 1.

3 The WRN Architecture
The architecture of our network is driven by the decomposition of a data point x ∈ A ⊂ M under the action of a projector P,

x = Px + (I − P)x,
(3.1)
where, by definition, P² = P. If we let p = Px and q = Qx (where Q = I − P), then we view any element x as being the sum of the portion of x in the range of P, that is, p ∈ R(P), and the portion in the null space of P, that is, q ∈ N(P). If the rank of P is d, where d > 2m, then Whitney's theorem ensures the existence of a global map from the range of the projector to its null space, that is, q = f(p). This provides a parameterization of the data set A in terms of p as

x = p + f(p).
(3.2)
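A minimal numerical check of the projector decomposition underlying equations 3.1 and 3.2 (an illustration, not taken from the paper): build a rank-d projector from an orthonormal basis and verify that P² = P and x = Px + (I − P)x.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3

# Orthonormal basis for R^n from a QR decomposition; first d columns span range(P).
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
U1, U2 = U[:, :d], U[:, d:]

P = U1 @ U1.T          # rank-d projector
Q = U2 @ U2.T          # complementary projector, Q = I - P

x = rng.normal(size=n)
p, q = P @ x, Q @ x    # p in range(P), q in null(P)

assert np.allclose(P @ P, P)            # P is idempotent: P^2 = P
assert np.allclose(Q, np.eye(n) - P)    # Q = I - P
assert np.allclose(p + q, x)            # x = Px + (I - P)x
```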
The inverse of P takes a projected data point Px ∈ PA and maps it back to x.

3.1 The Reduction Mapping. To begin, we need a good projector P of rank d. This will parameterize the data. The corresponding orthogonal projector Q gives the residual of the linear approximation and hence provides the target data for the nonlinear function approximation of f. Given an appropriate orthonormal basis for R^n, U = [u_1 | u_2 | ··· | u_n], the rank-d projector is defined by P = Û_1 Û_1ᵀ, where the reduced n × d matrix Û_1 = [u_1 | ... | u_d]. Similarly, the complementary projector is defined by Q = Û_2 Û_2ᵀ, where Û_2 = [u_{d+1} | ... | u_n]. We note that the quantity Px ∈ R(P) is an n-tuple in the ambient basis, that is, Px = (u_1ᵀ x) u_1 + ··· + (u_dᵀ x) u_d. It is the expansion coefficients that provide the d-dimensional representation, and these are given as p̂ = Û_1ᵀ x ∈ R^d. We eschew principal component analysis (PCA) because it is based on minimizing mean-square projection residuals rather than achieving a well-conditioned inverse. Rather, we propose to optimize our projector P by requiring that the inequality

‖Px − Py‖ ≥ k* ‖x − y‖
(3.3)
be satisfied for all x, y ∈ A and some fixed tolerance k* > 0. Note that this criterion is applied pointwise, whereas PCA is based on optimizing an average quantity. The tolerance k* is a measure of the maximum permissible shortening of the distance between any two projected data points. Equivalently, as shown below, k* is a lower bound on the norm of the projected secants. Note that by construction, 0 < k* ≤ 1, with k* = 1 when P represents a unitary transformation. Thus, the goal of the first learning phase is to determine a basis that is good in the sense that the dimension d of the range of P is as small as possible for a given tolerance k*. We employ the fact that given a projector P and a data set A, the minimum projected interpoint distance k* may be calculated directly from

‖P k̂‖ = ‖Px − Py‖ / ‖x − y‖,
(3.4)
where a unit secant is defined as k̂ = (x − y) / ‖x − y‖ for any x, y ∈ A (see Broomhead & Kirby, 2000, for details). Given an n-dimensional basis, we may use the above formula for the minimum projected secant norm to determine the number of dimensions d required to satisfy the condition

‖P k̂‖ ≥ k*
(3.5)
for all k̂ ∈ S, which is equivalent to the criterion of equation 3.3. We now propose a procedure for determining a good projection in the sense of equation 3.3 or 3.5. Our approach is adaptive and is initialized using the principal components of the matrix of secants K. This is an n × N matrix, where N = P(P − 1)/2 is the total number of secants (P here denoting the number of data points) and n is the ambient dimension (in practice, a subset of the secants is sufficient). We project the columns of K onto d-dimensional subspaces spanned by the principal components of K and look for the smallest value of d for which equation 3.5 is satisfied for all the secants. We denote the secants that do not satisfy equation 3.5 for d − 1 as "bad" secants. The matrix S whose columns consist of bad secants is used iteratively to update the initial covariance matrix H = K Kᵀ. The update involves reweighting the bad secants by an amount proportional to an adjustable parameter a:

H′ = (1 − a/N) H + (a/m) S Sᵀ,
(3.6)
where m is the number of columns in the matrix S. This iteration acts to reduce the number of dimensions d required such that ‖P_d k̂‖ ≥ k* for all k̂ ∈ S. (Here, for clarity, we have indicated the dimension d of the projector as P_d.)

Adaptive Basis Algorithm
1. Compute the initial basis via PCA of the secants.
2. Determine the smallest dimension d such that ‖P_d k̂‖ ≥ k* for all k̂ ∈ S.
3. Find the matrix S of bad secants, defined by {k̂ ∈ S : ‖P_{d−1} k̂‖ < k*}.
4. Update the covariance matrix via equation 3.6.
5. Compute the new basis consisting of the eigenvectors of H′.
6. Stop if the basis is satisfactory; else return to 2.

3.2 The Reconstruction Mapping. Given that the data have been parameterized via a good projection, the problem is now to determine a mapping for reconstructing the data. The need for an inverse mapping is in analogy with analytical transform methods. Our objective is the construction of an empirical dimension-preserving transform of the data to facilitate its analysis. The existence of a (Lipschitz) inverse assures us that the projection has preserved the dimension of the data. The reconstruction phase rebuilds a data point x ∈ R^n from its projection p = Px by learning the associated value q = Qx, that is, the projection onto the orthogonal complement. The superposition of these values then rebuilds a data point; the identity mapping is I = P + Q, or
x = Px + Qx. It is most natural to do these computations in the appropriate bases; the representations

p̂ = Û_1ᵀ x,  q̂ = Û_2ᵀ x

have dimensions d and n − d, respectively. The (strictly) nonlinear portion of the reconstruction involves fitting a function f with the projected data set {p̂} as the input and the orthogonal complement data set {q̂} as the output. (It is this function f that is guaranteed to exist by Whitney's theorem.) This mapping f is approximated by f̃, providing an estimate q̃̂ for q̂:

p̂ → q̃̂ = f̃(p̂).
During the training phase, the target points {q̂} are required for the second internal layer. The dotted connections from the input layer to the second internal layer represent the mapping x → q̂ = Û_2ᵀ x ∈ R^{n−d}. This portion of the network is required for training purposes only. Finally, the data are reconstructed (i.e., transformed to the original ambient coordinates) in two parallel stages x = p + q. A mapping from the first internal layer to the output layer accomplishes the strictly linear reconstruction: p = Û_1 p̂. This mapping is not learned by the network but comes from the linear inversion of the projection. The nonlinear component of the reconstruction is the result of the parameterization, which produces an approximation to q̂, that is, q̃̂ = f̃(p̂). Thus, the desired reconstruction q = Û_2 q̂ is now approximated by a mapping from the second internal layer to the output layer: q̃ = Û_2 q̃̂. Hence, x is approximated as x ≈ x̃ = p + q̃. In summary, the network approximates the identity mapping as P̃⁻¹ ∘ P: x → x̃, where this mapping may be written

x̃ = Û_1 p̂ + Û_2 f̃(p̂).
(3.7)
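Steps 1 and 2 of the adaptive basis algorithm of section 3.1 (the initial PCA-of-secants basis and the selection of the smallest admissible d) can be sketched as follows; the data set, tolerance, and helper names are illustrative assumptions:

```python
import numpy as np

def unit_secants(X):
    """All unit secants k = (x_i - x_j) / ||x_i - x_j|| of the data set."""
    diffs = X[:, None, :] - X[None, :, :]
    iu = np.triu_indices(len(X), k=1)
    K = diffs[iu]
    return K / np.linalg.norm(K, axis=1, keepdims=True)

def smallest_good_d(X, kappa):
    """Smallest d such that ||P_d k|| >= kappa for every unit secant k,
    with P_d built from the principal components of the secants."""
    S = unit_secants(X)
    # Right singular vectors of the secant matrix (rows = unit secants)
    # are the eigenvectors of the secant covariance H = K K^T.
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    for d in range(1, X.shape[1] + 1):
        proj = S @ Vt[:d].T          # coordinates of each secant in the d-basis
        if np.linalg.norm(proj, axis=1).min() >= kappa:
            return d
    return X.shape[1]

# Data on a closed curve embedded in R^3 (a 1-D manifold, m = 1).
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.stack([np.cos(t), np.sin(t), 0.3 * np.sin(2 * t)], axis=1)
d = smallest_good_d(X, kappa=0.3)
assert 1 <= d <= 3
```

Here d = 1 fails (some secants are nearly perpendicular to any single direction), while a two-dimensional secant basis keeps all projected secant norms above the tolerance.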
The (well-conditioned) inverse q̂ = f(p̂) may be fit using the radial basis function approach (Broomhead & Lowe, 1988). Here we employed the thin plate spline radial basis function (RBF) w(r) = r² ln r with randomly selected centers. The least-squares problem was solved using the singular value decomposition.

4 Features of the Whitney Reduction Network
Various features distinguish the WRN from other techniques, such as linear and nonlinear principal components.

4.1 Decomposition of Reconstruction into Linear and Nonlinear Parts.
The reconstruction of the data achieved via equation 3.2 is a decomposition into separate linear and nonlinear pieces. The term p is the linear reconstruction, while q = f(p) is the nonlinear term fit by RBFs. Note that p is immediately available once a basis has been selected for projection, since it is just the linear reconstruction of the projected data. In other words, it is
not necessary to use a neural algorithm to learn the weights associated with the linear component of the reconstruction. Another interpretation of this decomposition is that the WRN automatically identifies dependent and independent variables in the function approximation process. This approach is distinct from nonlinear (as well as linear) PCA, which uses the original data as the target for the reconstruction. Note also that in nonlinear PCA, the bottleneck variables may change during the course of training the inverse, while the independent variables produced by the WRN are found and then fixed before the nonlinear inverse is approximated.

4.2 Projection Optimized for Producing Good Generalization. A mapping is said to be ill conditioned if small perturbations in the domain lead to large changes in the range. If an ill-conditioned map is approximated (say, using RBFs or multilayer perceptrons) on a fixed training set, it will not generalize well to nontraining data. The inverse mapping of the WRN is well conditioned as a result of the requirement that the projection P satisfy the optimality criterion given by equation 3.3, from which it follows that
‖P⁻¹x − P⁻¹y‖ / ‖x − y‖ ≤ 1 / k*;
(4.1)
that is, P⁻¹ is Lipschitz with constant 1/k*. Since the left-hand side of equation 4.1 is exactly the ratio of the change in the image to the associated change in the domain, we conclude that the absolute condition number κ̂ (defined in Trefethen & Bau, 1997) is bounded by our k* as
κ̂ ≤ 1 / k*.
(4.2)
Hence, by designing the projection such that equation 3.3 is satisfied for a reasonably large k*, the inverse mapping will be well conditioned. In addition, it can be shown (see Broomhead & Kirby, 2000) that the RBF approximation is also Lipschitz. Neither standard nor nonlinear PCA constrains the inverse mappings to be well conditioned. Indeed, our examples show that the measure of conditioning k* attained by the PCA basis of the data can be significantly inferior to that for secant bases.

4.3 Linear Optimization Problem. We fit the nonlinear portion of the inverse mapping using RBFs since this leads to a linear optimization problem for determining the unique weights. By contrast, the model parameters of the bottleneck network require nonlinear optimization methods, which are inherently nonunique.
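This linearity is easy to demonstrate (an illustrative sketch, not the authors' code): with the thin plate spline w(r) = r² ln r and fixed, randomly chosen centers, the RBF weights solve an ordinary linear least-squares problem. A low-order polynomial term is appended here, a common companion to thin plate splines, though the paper does not specify one.

```python
import numpy as np

def tps(r):
    """Thin plate spline w(r) = r^2 ln r, with the limit w(0) = 0."""
    return np.where(r > 0, r**2 * np.log(np.where(r > 0, r, 1.0)), 0.0)

rng = np.random.default_rng(0)
p = np.linspace(0.0, 3.0, 50)          # 1-D "projected" inputs
q = np.sin(p)                          # targets to reconstruct

centers = rng.choice(p, size=15, replace=False)   # randomly selected centers
Phi = tps(np.abs(p[:, None] - centers[None, :]))  # RBF design matrix
Phi = np.column_stack([Phi, np.ones_like(p), p])  # appended affine terms

# Once the centers are fixed, the weights are the solution of a linear
# least-squares problem -- solvable by SVD, with a unique minimum-norm answer.
w, *_ = np.linalg.lstsq(Phi, q, rcond=None)
assert np.max(np.abs(Phi @ w - q)) < 0.1   # small training residual
```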
4.4 Estimate for Dimension of Data Set. Our approach provides an objective means for estimating the dimension of the manifold from which the data are sampled. Standard PCA measures the distribution of data in a subspace of the ambient space and thus gives an unreliable estimate of the manifold dimension. Also, the bottleneck, or autoassociative, network architecture for implementing nonlinear PCA does not provide a direct means to estimate the dimension of a data set.

4.5 Estimate for Complexity of the Approximation. The number of parameters required by the network may be estimated as a function of k*. To this end, it can be shown (see Broomhead & Kirby, 2000) that

ν ≝ √(1 + k_w² ‖W‖² (n_c + 1)) > 1 / k*,
where k_w is the Lipschitz constant of w, n_c is the number of RBF centers, and ‖W‖ is the two-norm of the weight matrix W. This quantity ν is a characteristic of the RBF and may be a useful measure of the complexity of the network.

4.6 Application to Problems of Very High Dimension. Given that the reduction mapping P is linear and the nonlinear component of the inverse is approximated once P has been fixed, it follows that the size of the ambient dimension that this network can reduce is far greater than for fully nonlinear bottleneck networks. In addition, the nonlinear mapping has a d-dimensional domain and an (r − d)-dimensional range, where r is the rank of the data matrix.

5 Projecting Noisy Data
Depending on the application, it can be the case that the data set to be compressed has been corrupted by noise. Noise that perturbs data points so that they are no longer on their manifold leads to uncertainty in the directions of the secants. As a result, direct application of the methodology to data with noise needs some care. Here we propose an approach for dealing with noise that amounts to filtering the smallest secants, the ones most affected by the noise. The motivation for this approach is made clear by analyzing the effect of noise on a given secant. If we assume that the noise is isotropic with compact support (as would be the case with quantization noise, for example), then each point may be associated with a hypersphere of radius ε, which defines the region within which the point with the noise added is assumed to lie. Joining two points, each of which has added noise, forms a set of possible secants consisting of all lines between the hyperspheres associated with each point. For example, in two dimensions, the secant may now be visualized as being generated by a "dumbbell" configuration, as shown in Figure 1.
Figure 1: Let k be the secant between two data points. If the points have uniformly distributed added noise, they may be displaced in the directions a and b, respectively. The new vector k + a − b represents the noise-perturbed secant. The points a, b may be any points in the corresponding hyperspheres, drawn here in two dimensions.
To determine the potentially destructive effect of the noise on our good projection, we estimate the maximum angle between possible secants in the dumbbell. Writing Δ = b − a, we have the unit secant (k + Δ)/‖k + Δ‖. The largest possible angle between two secants arises when Δ = 2ε ε̂ and Δ′ = −2ε ε̂, as shown in Figure 2; here ε̂ is a unit vector perpendicular to k. From elementary trigonometry, it follows that

sin θ = 4ε‖k‖ / (‖k‖² + 4ε²).    (5.1)
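As a quick numerical sanity check of equation 5.1 (our own minimal sketch; the function name is an invention for illustration):

```python
import numpy as np

def max_secant_angle(k_norm, eps):
    """Maximum angle (radians) between possible noisy secants, equation 5.1:
    sin(theta) = 4*eps*||k|| / (||k||^2 + 4*eps^2)."""
    return np.arcsin(4 * eps * k_norm / (k_norm**2 + 4 * eps**2))

# Small noise, eps << ||k||: theta is near zero, the secant direction is reliable.
print(max_secant_angle(k_norm=10.0, eps=0.01))

# Worst case, ||k|| = 2*eps: sin(theta) = 1, so the secants may be perpendicular.
print(max_secant_angle(k_norm=0.2, eps=0.1))
```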
For the small-noise case we may assume ε ≪ ‖k‖, and hence θ is near zero; there is essentially no uncertainty concerning the true direction of the secant. As the magnitude of the noise increases (i.e., as the radius of the hypersphere about each data point grows), the maximum angle between potential secants increases (see Figure 3). We conclude from equation 5.1
Figure 2: The maximum perturbation of the secant may be represented geometrically as a triangle in which the added noise vectors Δ and Δ′ point in opposite directions.
Figure 3: The two pairs of concentric circles represent differing levels of noise for two data points. The smaller noise level, represented by the solid lines, produces a maximum angle θ₁ between extreme secants. The dotted lines correspond to a larger noise level and show greater uncertainty in the true direction of the secant. Here the noisy secant could differ from the actual one by an angle of θ₂.
that in the worst case, that is, when ‖k‖ = 2ε, the secants could actually be perpendicular. It is apparent from these arguments that the orientation of the shortest secants generated by the data is the most sensitive to additive noise. Thus, we propose a filtering procedure, which amounts to removing from the set of secants S all secants shorter than a cutoff value. One might anticipate that as long as the cutoff parameter is large enough, its actual value is not critical, given that our goal is a good projection rather than an optimal one.

6 Illustrative Examples

6.1 Data on the Boundary of a Pringle. We begin with a simple, easily visualizable problem for which the PCA basis produces a bad (not one-to-one) projection of the data set. The boundary of a pringle, as shown on the left of Figure 4, is an embedding of the circle in R^3 and is defined as the triplet (sin θ, cos θ, sin 2θ). It may be argued analytically that the projection onto the two most energetic secant PCA basis vectors produces the circle in
Figure 4: (Left) Example of a one-dimensional manifold embedded in R^3. (Middle) Projection onto the best two secant basis vectors, along the z-axis. (Right) Projection onto the best two data PCA vectors, along the y-axis.
the middle of the figure, while the projection onto the two most energetic data PCA vectors produces the lemniscate on the right (Broomhead & Kirby, 2000). Although the PCA projection is not invertible, the secant basis gives an embedding of the pringle in two dimensions. Now we add uniform noise with compact support—points drawn from the interval [−0.1, 0.1]—to the pringle (see the top left of Figure 5). The top-right figure shows the set of unit secants for the pringle data with no added noise. These are confined to four segments, the complement of which—the region where permissible projections can be found—consists of two four-fold stars centered at the north and south poles. The bottom-left figure shows the set of unit secants for the noisy pringle data. The effect of adding noise is to spread the unit secants over the whole sphere, filling in the admissible regions for projections in the neighborhood of the poles. Our analysis in section 5 suggests that these troublesome unit secants derive from secants with small norm. We filter the data by removing all secants with length smaller than 1.8. This number was determined empirically to ensure that the effect—the north and south poles becoming significantly devoid of secants—was visible in Figure 5. The principal component basis constructed from the filtered unit secants produced an admissible projection in the polar region that was skewed away from the pole by nonuniformity in the distribution of the unit secants. Next we applied the adaptive secant basis algorithm to determine an improved projection for these data. The adaptation procedure rotated this direction until it was essentially collinear with the z-axis. In this case, the adaptive algorithm actually found the optimum projection.

6.2 Case Study: A Noisy Circle in 32 Dimensions. In this section we present the results of applying the WRN to a synthetic data set consisting of a narrow pulse moving with unit velocity on a circle: g(x − t), where g(θ) = g(θ + 2π).
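The data set can be generated directly; the following sketch uses the sampling described in the text (c = 25, 32 spatial samples at half-integer intervals, 128 integer time steps, gaussian noise of variance 0.025 or 0.05), although the periodic wrap of g is ignored here for simplicity:

```python
import numpy as np

c = 25.0
x = 0.5 * np.arange(32)            # 32 spatial samples at half-integer intervals
t = np.arange(128)                 # 128 time samples at integer intervals

# Traveling gaussian pulse g(x - t), sampled into a 32 x 128 data matrix.
U_clean = np.exp(-((t[None, :] - x[:, None]) ** 2) / c)

# Data sets I and II: add zero-mean gaussian noise of variance 0.025 and 0.05.
rng = np.random.default_rng(0)
data_I = U_clean + np.sqrt(0.025) * rng.standard_normal(U_clean.shape)
data_II = U_clean + np.sqrt(0.05) * rng.standard_normal(U_clean.shape)
print(data_I.shape)
```

With noise added, the 32 × 128 matrix is generically of full rank 32, as noted below.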
In the absence of noise, these data are topologically a circle—that is, a one-dimensional manifold—in a high-dimensional function space. If
Figure 5: (Top left) Uniform noise on the interval [−0.1, 0.1] added to the curve. (Top right) Unit secant set for the noise-free data. (Bottom left) Associated secants for the curve with uniform noise added. (Bottom right) Filtered secant set.
we approximate g as a traveling gaussian pulse and add a noise term, we get a set of functions whose space-time translational symmetry has been broken:

g(x, t) = e^{−(t−x)²/c} + η(x, t).

The noise term η(x, t) has been added to test the robustness of the WRN for data that do not reside exactly on a manifold. For this purpose, two data sets (labeled I and II) were generated by adding normally distributed noise with zero mean and variances 0.025 and 0.05, respectively, to the traveling pulse (see Figure 6). Taking c = 25, the function was sampled at 32 points in the x-direction at half-integer intervals and at 128 points in the t-direction at integer intervals. The resulting data matrices have size 32 × 128. Thus,
Figure 6: (Top) Right-traveling wave data corrupted with normally distributed noise with zero mean and variance 0.025. (Bottom) Right-traveling wave data corrupted with normally distributed noise with zero mean and variance 0.05.
the ambient space for the data is R^32, and the data matrix has full rank n = 32, regardless of the level of the added noise. The noisy data sit only approximately on a one-dimensional manifold (m = 1), and we propose to determine the actual number of dimensions required in the WRN for reconstruction to within the level of the noise.¹

6.2.1 Learning Phase I: Finding a Good Basis. The application of PCA to translationally invariant data produces sinusoidal eigenvectors. In addition, it has been shown that the principal components of the secants of translationally invariant data are also sinusoidal (Broomhead & Kirby, 2000). Hence, for translationally invariant data, the bases produced by PCA on the data and on the secants are identical.
¹ In the absence of noise, these data reside on a one-dimensional manifold (m = 1), which may be reduced to R^3 (and possibly R^2, although then we cannot expect suitable projections to be open and dense) and reconstructed without loss by Whitney's theorem.
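The claim that translation invariance yields sinusoidal principal components can be checked on an idealized, exactly circulant version of the pulse data (our own sketch; the paper's sampled data are only approximately circulant):

```python
import numpy as np

n = 32
x = np.arange(n)
template = np.exp(-((x - n / 2) ** 2) / 25.0)   # gaussian pulse, c = 25

# Circulant data matrix: each row is the pulse shifted by one sample,
# an exactly translation-invariant analogue of the traveling wave.
D = np.array([np.roll(template, s) for s in range(n)])

U, s, Vt = np.linalg.svd(D)
v = Vt[1]                                       # a non-constant singular vector
spectrum = np.abs(np.fft.rfft(v))
peak = spectrum.max()
# Energy concentrates at a single spatial frequency: the vector is sinusoidal.
print(np.argmax(spectrum), peak / np.linalg.norm(spectrum))
```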
When noise is added to translationally invariant data, the result is data that are no longer translationally invariant. As a result, the adaptive secant algorithm now produces basis vectors that differ significantly from the sinusoids. (Note that the unadapted secant basis is still essentially sinusoidal until mode 7, and PCA on the data produces similar eigenvectors.) To compare the results of the adapted secant basis algorithm with and without minimum-secant-norm filtering, see Figure 7a. This plot shows the smallest dimension for which the minimum norm of the projected secants is above the selected tolerance k* = 0.5, as a function of the adaptation iteration. The secant basis was adapted with no secant filtering (top curve) and with a secant cutoff of 0.25 (bottom curve) for data set I. The adapted secant algorithm with filtering reduces the reduction dimension from 14 to 3. Without filtering, the adaptation reduces the required dimension only to 12. The results for data set II (not shown) show a similar improvement through adaptation: filtering the secants produced a 5-dimensional basis, while the unfiltered data set produced a 12-dimensional adapted basis. Thus, while added gaussian noise appears to increase the reduction dimension, the proposed secant filtering procedure greatly mitigates this problem. Indeed, the 3-dimensional reduction suggested by Whitney's theorem is obtained with tolerance k* = 0.3. The performances of the PCA data basis, the PCA secant basis, and the adapted PCA secant bases are compared in Figure 7b. The adapted secant basis is clearly superior at maximizing the minimum projected norm. Thus, the projection onto this basis will produce the best-conditioned inverse. The basis produced by PCA of the data will actually lead to a seriously ill-conditioned inverse with poor generalization capabilities.
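The criterion compared in Figure 7b, the minimum norm of the projected unit secants as a function of projection dimension, can be sketched as follows. This is our own illustration on a small synthetic data set; only the secant cutoff value 0.25 is taken from the text:

```python
import numpy as np

def min_projected_secant_norm(unit_secants, basis, d):
    """Smallest norm of the unit secants projected onto the first d basis vectors.
    A value bounded away from zero means no secant is collapsed by the projection,
    and its reciprocal bounds the Lipschitz constant of the inverse."""
    projected = unit_secants @ basis[:, :d]
    return np.linalg.norm(projected, axis=1).min()

# Toy stand-in: a noisy circle in R^32.
rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
X = np.zeros((128, 32))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += 0.01 * rng.standard_normal(X.shape)

# All pairwise secants; filter with the cutoff 0.25 before normalizing.
i, j = np.triu_indices(len(X), k=1)
S = X[i] - X[j]
lengths = np.linalg.norm(S, axis=1)
keep = lengths >= 0.25
S = S[keep] / lengths[keep, None]

# Secant PCA basis: right singular vectors of the unit-secant matrix.
_, _, Vt = np.linalg.svd(S, full_matrices=False)
for d in (1, 2, 3):
    print(d, min_projected_secant_norm(S, Vt.T, d))
```

The minimum projected norm is non-decreasing in d, since enlarging the subspace can only increase each projected length.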
To examine geometrically the difference between the adapted secant basis and the PCA data (and secant) basis of Fourier vectors, we have plotted the time evolution of the first four one-dimensional subspaces, or space-time modes, for the adapted secant basis. Specifically, Figure 8 shows the discrete space-time modes obtained by projecting the data onto the one-dimensional subspaces associated with the most important adapted secant basis vectors. These adapted space-time modes are seen to have markedly different profiles from those obtained with PCA on the data or the secants (i.e., Fourier modes).

6.2.2 Learning Phase II: The Well-Conditioned Inverse. In this section, we examine the results of data reconstructions for several different projection dimensions and bases. According to equation 3.7, the reconstruction of the data consists of the sum of two components: the linear inverse, or demapping, p = Û₁p̂, which is accomplished by reverting the projected data back to their original coordinates, and the nonlinear inverse, which approximates q by the RBF as q̃ = Û₂f̃(p̂). The results presented here serve to illustrate this linear-nonlinear decomposition, as well as to highlight the difference
Figure 7: (a) Minimum projection dimension that achieves tolerance k* = 0.5 as a function of adaptation iteration for data set I. (b) Minimum norm of the projected unit secants of data set I as a function of dimension for several bases. For the secant bases, a minimum secant cutoff of 0.25 was used. The results were similar for data set II.
Figure 8: Projection of the right-traveling wave onto the one-dimensional subspaces of the first four adapted secant singular vectors.
between the parameterization and encapsulation of the data set. The examples that follow pertain to data set I, as described above.

Adapted secant PCA basis, d = 2. This example employs the adapted secant basis, for which the minimum projected secant norm onto two dimensions is approximately 0.4. Thus, by construction, the dimensionality of the data will be preserved and the inverse mapping will be well conditioned, with a Lipschitz constant less than 2.5. The linear reconstruction term p = Û₁p̂ is displayed at the top of Figure 9. Given the large deviation from the original
Figure 9: (Top) Two-dimensional linear demapping of the right-traveling wave (relative error 0.99997). Variance 0.025, secant basis with cutoff 0.025. (Middle) Two-dimensional nonlinear demapping only (relative error 0.77954). (Bottom) Two-dimensional full reconstruction of the right-traveling wave, that is, the superposition of the linear and nonlinear demapping layers (relative error 0.032987).
data, we see that two dimensions will not span, or encapsulate, the data without significant residual. The nonlinear component of the inverse—the RBF approximation to q, defined as q̃ = Û₂f̃(p̂)—is shown in the middle of Figure 9. Here the RBF is a mapping from PX ⊂ R^2 to R^30, so the nonlinear component of the reconstruction is not to the full original ambient space, but rather to the smaller null space of the projector P. The full reconstruction x̃ = p + q̃ is shown at the bottom of Figure 9, and the relative error ‖x − x̃‖/‖x‖ was approximately 0.03, suggesting the data have been reconstructed roughly to the noise level. In fact, for data set I, ‖η‖/‖g‖ was approximately 0.024. No attempt was made here to optimize the locations of the centers in the RBF reconstruction; in fact, 12 centers were simply selected at random from the data.

Data PCA basis, d = 3. Our second example employs a basis produced by PCA of the data matrix. Retaining three dimensions results in a linear reconstruction with a 38% relative error. However, the minimum projected secant norm onto this basis is approximately 0.04 (see Figure 7b), significantly worse than for the adapted secant basis. This relatively poor conditioning manifested itself in the approximation of the nonlinear term, which was much more difficult to compute than for the adapted secant basis; considerable effort was needed to achieve an error comparable to that obtained above.
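The two-stage inverse described above (linear demapping p = Û₁p̂ plus an RBF approximation q̃ = Û₂f̃(p̂) of the discarded component) can be sketched end to end. The toy data set, basis choice, and RBF width below are our own stand-ins; only the use of 12 randomly chosen centers follows the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in data: a closed curve in R^8 (a toy analogue of the traveling-wave set).
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.zeros((200, 8))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X[:, 2] = 0.5 * np.sin(2 * t)          # structure outside the projection plane
X += 0.01 * rng.standard_normal(X.shape)

# Orthonormal basis U split into the projection part U1 (d columns) and its
# orthogonal complement U2; here U comes from PCA of the data.
d = 2
U, _, _ = np.linalg.svd(X.T @ X)
U1, U2 = U[:, :d], U[:, d:]
p_hat = X @ U1                          # reduced coordinates
q = X @ U2                              # component to recover nonlinearly

# RBF approximation of q as a function of p_hat; 12 centers chosen at random.
centers = p_hat[rng.choice(len(p_hat), size=12, replace=False)]

def phi(P):
    """Gaussian RBF design matrix (width chosen by hand for this toy problem)."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / 0.5)

W, *_ = np.linalg.lstsq(phi(p_hat), q, rcond=None)
q_tilde = phi(p_hat) @ W

# Full reconstruction x_tilde = p + q_tilde, mapped back to ambient coordinates.
x_tilde = p_hat @ U1.T + q_tilde @ U2.T
rel_err = np.linalg.norm(X - x_tilde) / np.linalg.norm(X)
print(rel_err)
```

The nonlinear term recovers most of what the two-dimensional linear demapping alone leaves behind.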
It is possible, however, that the points associated with a family of images lie on a submanifold rather than a subspace. In this situation, the representation of the data may be achieved via the WRN. Again, the goal is to construct a nonlinear parameterization for the images by computing a graph of the surface on which the data lie. To test the effectiveness of the Whitney architecture to the problem of the representation of high-dimensional images, we applied the procedure to an ensemble of 200 faces. Note that since we seek to represent the faces as the graph of a function, our approach is fundamentally different from the fully nonlinear approach in Cottrell and Metcalfe (1993). The images were normalized for lighting as in Sirovich and Kirby (1987) and Kirby and Sirovich (1990), and the background was eliminated partially by constructing 310 £ 380 pixel cameos. In Figure 10, we have shown the results, in terms of the original image coordinate system, of the linear, nonlinear, and full reconstruction, and compare them to the original
Figure 10: (Top left) Linear reconstruction p using a 10-dimensional secant basis projection. (Top right) Nonlinear RBF reconstruction q̃ from the 10-dimensional secant subspace to its 190-dimensional orthogonal complement. (Bottom left) Total reconstruction x̃ = p + q̃. (Bottom right) Original mean-subtracted image. The data in this example came from the Vail schoolchildren data set (provided by Walter Bender of MIT). This figure originally appeared in Kirby (2001) and is reprinted with permission.
image. Here the data are all mean subtracted. It is clear that a 10-dimensional projection onto the secant basis, when linearly reconstructed, is visually a poor approximation to the original data (this statement remains true for the projection onto the first 10 KL eigenpictures—data PCA). However, by
virtue of the fact that we have imposed the constraint equation 3.3 on the projection, we are assured of a lossless reconstruction. Indeed, the nonlinear mapping, again constructed using RBFs, provides the detail as the image of this projection (see the top right of Figure 10). The success of the full reconstruction confirms that the projection is indeed invertible (see the bottom left of Figure 10). The secant PCA basis performs only slightly better than the data PCA basis when compared in terms of the minimum projected secant norm criterion: for a 10-dimensional projection, the minimum projected secant norm was 0.30 for the secant PCA basis and 0.18 for the data PCA basis. These numbers indicate that 10 is a reasonable number of dimensions for the projection, as invertibility is guaranteed for the entire data set. The similar performance of the two bases is very likely due to the fact that such a small data set (200 images) was used. A 10-dimensional subspace is very large for just 200 points, and thus it is easy to find a projection that is invertible. Significantly more data are required to compare the performance of the different methods, as well as to establish firmly whether the face data actually lie on a surface. Nonetheless, this example successfully illustrates the goal of obtaining an invertible projection in a high-dimensional setting.

7 Conclusion
We propose the WRN as a method for data reduction especially suited to problems where the data are sampled from an m-dimensional manifold residing in a high-dimensional ambient space. Theoretical insight is provided by Whitney's theorem, and the architecture is motivated by its constructive proof. A key idea is that a linear reduction mapping coupled with a nonlinear inverse is mathematically all that is required to compress data on a manifold of dimension m to dimension 2m + 1, however large the ambient dimension may be. In addition, a projection that optimizes the conditioning of the inverse mapping is proposed; this approach ensures good generalization properties of the inverse. The examples demonstrate that the method is effective for globally representing data on a manifold, even in the presence of noise. The preliminary results on digital images of faces indicate that this architecture scales well to problems of higher dimension but that, as expected, correspondingly large amounts of data are required.

Acknowledgments
This research has been supported in part by the Engineering and Physical Sciences Research Council, U.K., in the form of a visiting research fellowship to M. K. This research has also been supported by the National Science Foundation under grants DMS-9505863 and INT-9513880. We thank Jerry Huke, Doug Hundley, and Rick Miranda for their input concerning this
research, and Walter Bender for providing the raw data used to compute Figure 10.

References

Atick, J. J., Griffin, P. A., & Redlich, A. N. (1996). The vocabulary of shape: Principal shapes for probing perception and neural response. Network: Computation in Neural Systems, 7, 1–5.
Broomhead, D., & Kirby, M. (2000). New approach for dimensionality reduction: Theory and algorithms. SIAM J. of Applied Mathematics, 60(6), 2114–2142.
Broomhead, D., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355.
Cottrell, G. W., & Metcalfe, J. (1993). Empath: Face, emotion and gender recognition using holons. In R. Lippman, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 564–571). San Mateo, CA: Morgan Kaufmann.
Everson, R. M., & Sirovich, L. (1995). The Karhunen-Loève transform for incomplete data. Journal of the Optical Society of America A, 12(8), 1657–1664.
Falconer, K. (1990). Fractal geometry: Mathematical foundations and applications. New York: Wiley.
Guillemin, V., & Pollack, A. (1974). Differential topology. Englewood Cliffs, NJ: Prentice Hall.
Hirsch, M. W. (1976). Differential topology. Berlin: Springer-Verlag.
Kirby, M. (2001). Geometric data analysis. New York: Wiley.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Trans. PAMI, 12(1), 103–108.
Sirovich, L., & Kirby, M. (1987). A low-dimensional procedure for the characterization of human faces. J. of the Optical Society of America A, 4, 519–524.
Trefethen, L. N., & Bau, D., III. (1997). Numerical linear algebra. Philadelphia: SIAM.

Received July 14, 1998; accepted January 17, 2001.
LETTER
Communicated by Katsushi Ikeuchi and Steven Zucker
Enhanced 3D Shape Recovery Using the Neural-Based Hybrid Reflectance Model

Siu-Yeung Cho
Tommy W. S. Chow
Dept. of Electrical Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong
[email protected]
[email protected]

It is known that most real surfaces are usually neither perfectly Lambertian nor ideally specular; rather, they are formed by a hybrid of these two models. Such a hybrid reflectance model still suffers under noise, strong specular, and unknown-reflectivity conditions. In this article, these limitations are addressed, and a new neural-based hybrid reflectance model is proposed. The goal of this method is to optimize a proper reflectance model by learning the weights and parameters of a hybrid structure of feedforward neural networks and radial basis function networks, and to recover the 3D object shape by the shape-from-shading technique with the resulting model. Experiments, including synthetic and real images, were performed to demonstrate the performance of the proposed reflectance model under different specular effects and noise environments.
1 Introduction
Shape from shading (SFS) is one of the main developments of computer vision; it deals with reconstructing the three-dimensional (3D) shape of an object from its two-dimensional (2D) image intensity. The basic idea of the SFS technique was first suggested by Horn (1970) and studied further with his colleagues (Horn & Brooks, 1989; Horn & Sjoberg, 1990). The classic formulation of the SFS problem is based on using the conventional Lambertian model together with minimizing a cost function by the calculus of variations (Horn & Brooks, 1986). In fact, the success of SFS depends on two major elements: a better reflectance model and an effective SFS algorithm. The importance of developing a better reflectance model stems from the fact that such models are able to describe the surface shape accurately from the image intensity. Thus far, formulating a new reflectance model has been difficult. As the simple Lambertian model was established to describe the relationship between the surface normal and the light source direction, certain
Neural Computation 13, 2617–2637 (2001) © 2001 Massachusetts Institute of Technology
restrictive assumptions have to be made. First, surface reflection is generally assumed to be diffuse. Second, a distant point light source is assumed to be known. Third, images are assumed to be formed by orthographic projection. Despite its simplicity and immense popularity, the Lambertian model is unable to model a reflecting surface with a strong specular component. To enable SFS algorithms to cope with the specular effect, different types of specular or non-Lambertian models have been studied. Healey and Binford (1988) employed the Torrance–Sparrow model (Torrance & Sparrow, 1967) as the specular model, assuming that a surface is composed of small, randomly oriented, mirror-like facets. The specular intensity is described as the product of four components: the energy of the incident light, the Fresnel coefficient, the facet orientation distribution function, and the geometrical attenuation factor. A non-Lambertian model proposed by Lee and Kuo (1997) depends not only on the gradient variable but also on the position and distance of the light source. The surface depths are obtained directly, but the computation is highly complex. Most recently, a novel approach using a self-learning feedforward neural network was proposed to model the reflectance, which relaxes some of the Lambertian assumptions (Cho & Chow, 1999a). Although with that approach knowledge of the illumination is not required by using such a self-learning model (Cho & Chow, 1999b), it is unable to model a strong specular component on the reflecting surface. The surface recovered by the neural-network-type model is distorted by the unknown specular components formed by highlighting effects reflected from the object surface. As a result, the information about the surface orientation needed to recover the object shape is lost.
Nayar, Ikeuchi, and Kanade (1990) proposed a reflectance model that consists of three components: a diffuse lobe, a specular lobe, and a specular spike. The Lambertian model was used to represent the diffuse lobe, while the Torrance–Sparrow model was used to model the other specular components. More generally, Nayar, Ikeuchi, and Kanade (1991) also proposed a generalized reflectance framework based on the geometrical optics approach, which comprises all basic components of the diffuse and specular reflections. But all of these conventional SFS techniques fall short in reconstructing surfaces for most real applications because they do not use a generalized reflectance model. There is a need to establish a new model. In this article, a new neural-based hybrid reflectance model is developed to enhance overall SFS performance. This study establishes a nonlinear hybrid structure of a feedforward neural network (FNN) and a radial basis function (RBF) network to generalize the hybrid reflectance model. Based on this model design, the fundamental parameters of the Lambertian surface are approximated by the FNN, whereas the other, special parameters of the non-Lambertian surface are approximated by the RBF network. In our study, the RBF network is selected because of its classification ability. As a result, the developed RBF-network-based reflectance model exhibits
an excellent separating capability for the feature classification problem and good skill at solving the ill-posed hypersurface reconstruction problem. Based on this hybrid approach, all components of diffuse and specular reflection are generalized by the well-trained hybrid model. Under this novel approach, the illuminant information is no longer required, and the evaluation of the surface depth is performed iteratively. The shape recovery computation is efficient and robust for real image applications.

2 Neural-Based Hybrid Reflectance Model
Lambertian surfaces have only diffuse reflectance—they reflect light in all directions. Suppose that the recovery of the surface shape, represented by z(x, y), from shaded images depends on the systematic variation of image brightness with surface orientation, where z is the height field and x and y span the 2D pseudo-plane over the domain Ω of the image plane. For most conventional SFS algorithms, the Lambertian reflectance model used to represent a surface illuminated by a single point light source is given as

R_diffuse(p(x, y), q(x, y)) = η n sᵀ,   ∀(x, y) ∈ Ω,    (2.1)
where R_diffuse is the diffuse intensity, η is the composite albedo, s = (cos τ sin σ, sin τ sin σ, cos σ) is the illuminant source direction, and τ and σ denote the tilt and slant angles, respectively. The surface normal can be represented as

n = ( −p / √(p² + q² + 1),  −q / √(p² + q² + 1),  1 / √(p² + q² + 1) ),

where p = ∂z/∂x and q = ∂z/∂y are the surface gradients. An ideal Lambertian surface requires a known and distant light source according to this model, but in most practical cases the surface is not Lambertian, because the light source is often located at a finite distance and at an unknown position. A specular component occurs when the incident angle of the light source is equal to the reflected angle; it is formed by a specular spike and a specular lobe. Assume that the surfaces of interest are smooth (i.e., gently undulating). Under this assumption, both the specular spike and the lobe are significant only in a narrow region around the specular direction. In this sense, equation 2.2 is a simple delta function describing the specular reflection:

R_specular = B δ(θ_s − 2θ_r),    (2.2)
where R_specular is the specular intensity, B is the strength of the specular component, θ_s is the angle between the light source direction and the viewing direction, and θ_r is the angle between the surface normal and the viewing
direction. Given these conditions, such a simple model hardly exists in general real-world applications. Healey and Binford (1988) therefore derived a more refined model by simplifying the Torrance–Sparrow model (Torrance & Sparrow, 1967), in which a gaussian distribution is used to model the facet orientation function and the other components are treated as constants. The model is described as

R_specular = (1 / (√(2π) β)) exp(−α² / (2β²)),    (2.3a)
where α is the angle between the surface normal n and the vector h, such that

α = arccos(n · h),   with   h = (s + v) / ‖s + v‖.
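The following sketch evaluates equation 2.3a numerically with the halfway vector and α as just defined; the particular directions and roughness value are arbitrary illustrations, not values from the text:

```python
import numpy as np

def specular_intensity(n, s, v, beta):
    """Simplified Torrance-Sparrow specular term (equation 2.3a):
    R = 1/(sqrt(2*pi)*beta) * exp(-alpha^2 / (2*beta^2)),
    with alpha the angle between the normal n and the halfway vector h."""
    h = (s + v) / np.linalg.norm(s + v)
    alpha = np.arccos(np.clip(n @ h, -1.0, 1.0))
    return np.exp(-alpha**2 / (2 * beta**2)) / (np.sqrt(2 * np.pi) * beta)

n = np.array([0.0, 0.0, 1.0])                 # surface normal
v = np.array([0.0, 0.0, 1.0])                 # viewing direction
s_mirror = np.array([0.0, 0.0, 1.0])          # h coincides with n: alpha = 0
s_off = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)

beta = 0.1                                    # roughness (standard deviation)
peak = specular_intensity(n, s_mirror, v, beta)
off = specular_intensity(n, s_off, v, beta)
print(peak, off)   # intensity falls off sharply away from the specular direction
```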
The vector h is usually called the halfway vector and represents the normalized vector sum of the light source direction s and the viewing direction v. If the incident angle (s, n) and the emitting angle (v, n) are equal, the reflection takes its maximum value; in this case the exponent is zero, since the vectors h and n coincide. Because of the reciprocal exponential term, the observed reflection decreases quickly as the angle (h, n) increases. The factor β is the standard deviation, which can be interpreted as a measure of the roughness of the surface. A more sophisticated model based on the geometrical optics approach has also been presented as a specular reflectance model (Nayar et al., 1991):

R_specular = k_spec (L_i dω_i / cos θ_r) exp(−α² / (2β²)),    (2.3b)
where k spec represent the fractions of incident energy determined by the Fresnel coefcients and the geometrical attenuation factor. The term coshr describes the emitting angle that the radiance of the surface in the viewing direction is determined. Although these models are widely used for the approximation of the specular component, the critical parameters (i.e., the light source and the viewing direction) are required a priori. Obviously, this poses major constraints for both the Lambertian and the specular models being applicable to most practical situations, in which knowledge on these parameters is often unavailable. There is a need to establish a new model to relax these constraints. Basically, more practical reectance models are neither only Lambertian nor only specular surfaces. They should be formed as a hybrid structure of a combination on the specular surface and the Lambertian surface. Incorporating more physical reectance parameters and effects is inevitable for generating a much generalized reectance model. This certainly makes the SFS problem very difcult because it requires solving nonlinear equations with a very complicated model. Because
Enhanced 3D Shape Recovery
2621
the Lambertian model can be viewed as an arbitrary continuous function, a feedforward neural network framework was proposed to provide an approximation of the reflectance model (Cho & Chow, 1999a). On the other hand, based on the model of equation 2.3, the specular model can be viewed as a form of an arbitrary gaussian basis function. From the theoretical point of view of universal approximation, an FNN network and an RBF network can therefore provide approximations of the Lambertian model and the specular model, respectively. As a result, there are advantages to establishing a hybrid neural-based reflectance model that combines the FNN and RBF networks. In our approach, all the diffuse components are modeled as a nonlinear function approximated by the FNN network, whereas all the specular components are modeled as a gaussian-type facet orientation function, approximated by the RBF network. The variables of both diffuse and specular reflectivity are simplified and tuned by the self-learning scheme according to the surface properties. These variables are represented by the trained neural network weights and parameters together with the nonlinear activation functions. This neural approach has a significant advantage over the conventional model: a proper reflectance model is obtained even though the viewing and illuminating conditions are unknown. Suppose that an FNN network and an RBF network, defined as mapping functions F_FNN(·) and F_RBF(·), are used for the approximation of the Lambertian and specular models, respectively. Let a = (p, q)^T be an input vector, let Θ(·) be a bounded, nonlinear continuous function, and let V_pq denote the p-q dimensional space. The space of continuous functions on V_pq is denoted C(V_pq); then, given any function f ∈ C(V_pq), there exist sets of real values of the free parameters such that the mapping function F(·): V_pq → R¹ approximates f.
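As a concrete reference point, the conventional specular lobe of equation 2.3 (the quantity the RBF branch will learn to approximate without being told s and v) can be evaluated directly when those directions are known. The following sketch uses illustrative values for k_spec and β; the specific vectors are toy inputs, not values from the paper:

```python
import numpy as np

def halfway(s, v):
    """Halfway vector h = (s + v) / ||s + v||."""
    h = s + v
    return h / np.linalg.norm(h)

def specular(n, s, v, k_spec=0.8, beta=0.2, L_i=1.0, domega=1.0):
    """Gaussian specular lobe R = k_spec (L_i dω / cos θ_r) exp(-α²/(2β²))."""
    h = halfway(s, v)
    alpha = np.arccos(np.clip(np.dot(n, h), -1.0, 1.0))  # angle between n and h
    cos_theta_r = np.dot(n, v)                           # emitting-angle term
    return k_spec * (L_i * domega / cos_theta_r) * np.exp(-alpha**2 / (2 * beta**2))

# Mirror configuration: s and v symmetric about n, so h coincides with n (α = 0)
n = np.array([0.0, 0.0, 1.0])
s = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
v = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2)
print(specular(n, s, v))   # ≈ 1.131; the lobe decays rapidly as v tilts away
```

As the text notes, evaluating this model requires s and v a priori, which is exactly the constraint the hybrid neural model removes.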
Based on the universal approximation property of the FNN network (Hornik, Stinchcombe, & White, 1989), a proper function F_FNN can be defined as the Lambertian reflectance model in the following form:

    R_diffuse(p, q) = F_FNN(a),                                        (2.4)

                    = Θ_a( v_0 + Σ_{k=1}^{N} v_k Θ_a(w_k a_{i,j}) ),   (2.5)
where w_k denotes the weight vector connecting the input and the hidden layer, v_k and v_0 denote the weights and the bias connecting the hidden layer and the output layer, and N is the number of hidden units. The activation function is of the form Θ_a(x) = 1/(1 + exp(−x)). With this formulation, the light source vector s is not required, because the trained network weights are able to represent the proper reflectance model from the object orientation information. In order to model the specular components, the RBF network is used, based on the universal approximation property of the RBF network (Park
2622
Siu-Yeung Cho and Tommy W. S. Chow
& Sandberg, 1991). The following form can be deduced:

    R_specular(p, q) = F_RBF(a),                                       (2.6)

                     = v_0 + Σ_{k=1}^{N} v_k Θ_b(‖a − c_k‖),           (2.7)
where {Θ_b(‖a − c_k‖) | k = 1, 2, . . . , N} is a set of N arbitrary functions known as radial basis functions, and ‖·‖ denotes the Euclidean norm. v_k is a linear weight, and the c_k are RBF centers. The activation function is a gaussian, Θ_b(x_k) = exp(−x_k²/2σ_k²) for x_k ≥ 0 and σ_k > 0, where σ_k is an RBF width. In this formulation, the viewing and lighting information for the specular model is not required, because the free parameters are able to represent the reflectivity of the proper model from the specular effects of the object orientation information. Surface reflectance properties are neither purely Lambertian nor purely specular. They most likely consist of a mixture of both diffuse and specular properties, which has to be represented by a linear combination of the two components. Thus, in our study, we propose that the FNN- and RBF-based reflectance models be combined, much as Nayar et al. (1990) proposed. The proposed hybrid neural-based reflectance model is expressed as

    R_hybrid = (1 − ω) F_FNN(a) + ω F_RBF(a),                          (2.8)

             = (1 − ω) Θ_a( v_0 + Σ_{k=1}^{N} v_k Θ_a(w_k a_{i,j}) )
               + ω ( v_0 + Σ_{k=1}^{N} v_k Θ_b(‖a − c_k‖) ),           (2.9)
where ω is the nonzero weighting factor that sets the contribution of the specular component in the hybrid model. Experimental studies were performed by selecting different values of ω for reconstructing the synthetic sphere object under different illumination conditions. The results are shown in Figure 1. If the value is too large (i.e., very close to one), the specular effect may be overemphasized during reconstruction, and corresponding distortions can occur. In our studies, a typical value of 0.7 is chosen to maintain a balance between avoiding distortion and retaining the contribution of the specular component in the proposed hybrid model. In this reflectance model, the fundamental parameters of the Lambertian surfaces are approximated by the FNN network, whereas the other parameters of the non-Lambertian surfaces are approximated by the RBF network. The reflectance properties of hybrid surfaces can be determined
[Figure 1 here: mean absolute depth error plotted against five illumination conditions (from light and view at τ = 0°, σ = 0° up to light at τ = 315°, σ = 15° with view at τ = 315°, σ = 45°) for ω = 0.01, 0.1, 0.5, 0.7, 0.8, and 0.95, with ω = 0.01 giving the largest errors and ω = 0.7 the smallest.]

Figure 1: Comparative results of the different values of ω under different illumination conditions for the synthetic sphere reconstruction.
without prior knowledge of the relative strengths of the Lambertian and the specular components. Moreover, the light source and viewing directions are not required for the SFS computation.

3 Learning Shape from Shading Algorithm by the Hybrid Reflectance Model

In solving the SFS problem by the proposed hybrid reflectance model, the cost function is commonly taken as

    E_T = ∬_V [ (I − R_hybrid)²
          + λ( (∂p/∂x)² + (∂p/∂y)² + (∂q/∂x)² + (∂q/∂y)² ) ] dx dy.    (3.1)
The first term is the intensity error, and the second term is a smoothness constraint given by the spatial derivatives of p and q; λ is a positive smoothness parameter. Based on this objective function, the free parameters of the neural-based hybrid reflectance model and the object surface gradients are determined by a unified learning mechanism. Through the learning process, the FNN weights, the RBF parameters, and the surface gradients are computed iteratively, where the FNN and RBF parameters are optimized by a specific learning algorithm for the hybrid neural network structure. The surface depths are estimated by a variational approach on a discrete grid of points.

3.1 Variational Approach for Solving the SFS Problem. In this approach, based on the method proposed by Ikeuchi and Horn (1981), solving the SFS problem rests on the first-order nonlinear partial differential equation derived in the form of the Euler-Lagrange equations. The variational method is used to calculate the surface gradient (p, q) and the object surface depth z on the discrete grid. Using the calculus of variations to minimize the cost function in equation 3.1, a set of discrete recursive equations with the neural-based hybrid reflectance model is formed:

    p_{i,j}(n+1) = p̄_{i,j}(n) + (ε²/4λ)(I_{i,j} − R_hybrid) ∂R_hybrid/∂p,   (3.2)

    q_{i,j}(n+1) = q̄_{i,j}(n) + (ε²/4λ)(I_{i,j} − R_hybrid) ∂R_hybrid/∂q,   (3.3)

where ε denotes the spacing between picture cells. p̄_{i,j} and q̄_{i,j} are the averages of p_{i,j} and q_{i,j}, respectively, over the four nearest neighbors:

    p̄_{i,j} = (p_{i+1,j} + p_{i−1,j} + p_{i,j+1} + p_{i,j−1}) / 4,
    q̄_{i,j} = (q_{i+1,j} + q_{i−1,j} + q_{i,j+1} + q_{i,j−1}) / 4.

∂R_hybrid/∂p and ∂R_hybrid/∂q are the spatial derivatives of the hybrid model
with respect to the surface gradients. They can be expressed as

    ∂R_hybrid/∂p = (1 − ω) ∂F_FNN(a)/∂p + ω ∂F_RBF(a)/∂p,
    ∂R_hybrid/∂q = (1 − ω) ∂F_FNN(a)/∂q + ω ∂F_RBF(a)/∂q.
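One relaxation sweep of equations 3.2 and 3.3 can be sketched in NumPy. Here R is a generic reflectance function of (p, q), its partial derivatives are taken numerically, and the toy image and linear stand-in reflectance are illustrative only (the real method uses the hybrid neural model):

```python
import numpy as np

def neighbor_avg(u):
    """Four-neighbor average (the p-bar, q-bar of eqs. 3.2-3.3), edge-replicated."""
    up = np.pad(u, 1, mode='edge')
    return (up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2]) / 4.0

def sfs_sweep(p, q, I, R, lam=1.0, eps=1.0, h=1e-4):
    """One iteration of the coupled gradient updates (3.2)-(3.3)."""
    Rpq = R(p, q)
    dRdp = (R(p + h, q) - Rpq) / h        # numerical dR/dp
    dRdq = (R(p, q + h) - Rpq) / h        # numerical dR/dq
    resid = I - Rpq                       # intensity residual
    p_new = neighbor_avg(p) + eps**2 / (4 * lam) * resid * dRdp
    q_new = neighbor_avg(q) + eps**2 / (4 * lam) * resid * dRdq
    return p_new, q_new

# Toy example: a linear "reflectance" and a constant image
R = lambda p, q: 0.5 * p + 0.5 * q
I = np.full((8, 8), 0.4)
p = np.zeros((8, 8)); q = np.zeros((8, 8))
for _ in range(50):
    p, q = sfs_sweep(p, q, I, R)
print(np.abs(I - R(p, q)).max())          # residual shrinks over iterations
```

The averaging term propagates the smoothness constraint, while the residual term pulls the gradients toward matching the observed intensity.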
In the above, λ acts as a smoothness parameter for the regularization. After calculating the new surface gradients at iteration n + 1, the surface
depth can then be estimated iteratively from the updated surface gradients. Finding a best-fit surface z for the given gradient components, by inserting the integrability constraint into the cost function of equation 3.1, gives the estimated surface depth

    z_{i,j}(n+1) = z̄_{i,j}(n) − (ε/4)(f_{i,j} + g_{i,j}),              (3.4)

where z̄_{i,j} is an average of z of the same form as p̄_{i,j}; f_{i,j} and g_{i,j} are estimates of the partial derivatives of p and q, respectively, such that f_{i,j} = (p_{i+1,j} − p_{i−1,j})/2 and g_{i,j} = (q_{i,j+1} − q_{i,j−1})/2. Note that the subscript i in the discrete case corresponds to x in the continuous case, and the subscript j corresponds to y.

3.2 Learning Algorithm for the Neural-Based Hybrid Reflectance Model. Instead of determining the physical reflectivity parameters of the hybrid
reflectance model, the free parameters and weights of the proposed hybrid-structure neural network are optimized to generalize the proper reflectance model. In accordance with the cost function in equation 3.1, the discrete form is written as

    E_T = Σ_{i,j∈D} (I_{i,j} − R_hybrid)²
        + λ Σ_{i,j∈D} ( (∇_x p_{i,j})² + (∇_y p_{i,j})² + (∇_x q_{i,j})² + (∇_y q_{i,j})² ),   (3.5)

where D is the index set of the image. {∇_x p_{i,j}, ∇_y p_{i,j}, ∇_x q_{i,j}, ∇_y q_{i,j}} are discrete estimates of the partial derivatives of p and q, such that

    ∇_x p_{i,j} = (p_{i+1,j} − p_{i−1,j})/2,   ∇_y p_{i,j} = (p_{i,j+1} − p_{i,j−1})/2,
    ∇_x q_{i,j} = (q_{i+1,j} − q_{i−1,j})/2,   ∇_y q_{i,j} = (q_{i,j+1} − q_{i,j−1})/2.
From the cost function of equation 3.5, the discrete form of the smoothness constraint is independent of the neural-based model, so the minimization of the cost function with respect to the network parameters operates on only the intensity error term. Thus, the cost function of equation 3.5 can be written in the form

    E_T = Σ_{i,j∈D} ( I_{i,j} − (1 − ω) F_FNN(a_{i,j}) )²
        + Σ_{i,j∈D} ( I_{i,j} − ω F_RBF(a_{i,j}) )².                    (3.6)
Conventionally, the optimization of both FNN and RBF networks is based on a supervised learning process using a gradient-descent method. But this method has shortcomings: the local minima problem and slow convergence. Consequently, in our approach, a modified heuristic learning scheme is suggested for computing the free parameters of this
hybrid neural network structure. This learning scheme is modified from the method reported in Cho and Chow (1999b). The algorithm is a hybrid-type algorithm combining the least-squares method and a penalty-based optimization. It is known that this algorithm speeds up convergence and avoids getting stuck in local minima. Based on the derivation of the proposed algorithm, the weight-update equations for the FNN model are as follows. For the output layer of the FNN,

    v(n+1) = (1/(1−ω)) ( Σ_{i,j∈D} W_{i,j} W_{i,j}^T )^{−1} ( Σ_{i,j∈D} Θ_a^{−1}(I_{i,j}) W_{i,j}^T ),   (3.7)
and for the hidden layer of the FNN,

    w_k(n+1) = w_k(n) + η_w( −∂E_T/∂w_k + ∇E_pen(w_k) ),   1 ≤ k ≤ N,   (3.8)
where Θ_a^{−1}(·) is the inverse of the activation function, and the vectors

    W_{i,j} = ( 1, Θ_a(w_1 a_{i,j}), . . . , Θ_a(w_N a_{i,j}) )^T

and v = (v_0, v_1, . . . , v_N)^T are denoted as above. η_w is the learning rate, and ∂E_T/∂w_k is the error gradient with respect to w_k. The proposed learning algorithm can also be modified to compute the free parameters of the RBF model, so that the update equations for the RBF model are given as follows. For the linear parameters of the RBF,

    v(n+1) = (1/ω) ( Σ_{i,j∈D} W_{i,j} W_{i,j}^T )^{−1} ( Σ_{i,j∈D} I_{i,j} W_{i,j}^T ),   (3.9)
for the RBF centers,

    c_k(n+1) = c_k(n) + η_c( −∂E_T/∂c_k + ∇E_pen(c_k) ),                (3.10)
and for the RBF widths,

    σ_k(n+1) = σ_k(n) + η_σ( −∂E_T/∂σ_k + ∇E_pen(σ_k) ),                (3.11)
where the vectors v = (v_0, v_1, . . . , v_N)^T and

    W_{i,j} = ( 1, Θ_b(‖a_{i,j} − c_1‖), . . . , Θ_b(‖a_{i,j} − c_N‖) )^T

are denoted as above. η_c and η_σ are the learning rates for the RBF centers and widths, respectively. In the above formulation, ∇E_pen(·) is a penalty term used in the penalty optimization approach to provide an uphill force in the search space, so that the learning process does not get stuck in local minima. The penalty term is assigned as a gaussian-like function,

    ∇E_pen(x) = −(2/μ) |x(n−1) − x*| exp( −(x(n−1) − x*)^T (x(n−1) − x*) / μ² ),   (3.12)
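A sketch of this penalty gradient in NumPy; x* denotes the parameter value at the detected local minimum and μ the penalty weighting factor (as described below), and the specific vectors are illustrative:

```python
import numpy as np

def grad_E_pen(x, x_star, mu):
    """Gaussian-shaped penalty gradient of eq. 3.12: an uphill force that
    pushes parameters away from a detected local minimum x_star."""
    d = x - x_star
    return -(2.0 / mu) * np.abs(d) * np.exp(-(d @ d) / mu**2)

x_star = np.array([1.0, -0.5])       # parameters stuck at a local minimum
near = grad_E_pen(x_star + 0.01, x_star, mu=0.5)
far = grad_E_pen(x_star + 5.0, x_star, mu=0.5)
print(near, far)                     # the force vanishes far from x_star
```

The exponential envelope confines the perturbation to a neighborhood of radius roughly μ around x*, so distant parameters are left untouched.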
where x* is the suboptimal parameter vector assumed to be stuck at a local minimum, and μ is a penalty weighting factor. Note that the selection and adjustment of the penalty weighting factor are crucial to the overall performance of the algorithm; the corresponding conditions are based on the convergence characteristics and are described in detail in Cho and Chow (1999b). The proposed mechanism can be briefly summarized as follows: the parameters v are optimized by the least-squares method, and the other parameters are optimized by the gradient-descent method for the first few iterations. Then, if the convergence gets stuck in a local minimum, the penalty term is calculated by equation 3.12 and the parameters are updated by equations 3.8, 3.10, and 3.11. Afterward, once the error starts to decrease, μ should be increased. This loop is repeated until the stopping criterion is met. In accordance with these formulations, the learning mechanism of the proposed approach is illustrated in Figure 2.

4 Experimental Results and Discussion
In this section, the performance of the neural-based hybrid reflectance model is investigated using two synthetic and two real images. The algorithm was written in Matlab scripts and run on a Sun Ultra 5 workstation. In all cases, FNN and RBF models of size 2-10-1 are used in the hybrid structure of the proposed model. The boundary condition of all reconstruction algorithms is specified by taking the average of the two nearest values of the appropriate gradient component to obtain the nearest value of the depth in the interior of the grid (Ikeuchi & Horn, 1981). A discussion of quantitative analysis and noise sensitivity is included.
Figure 2: An iterative SFS algorithm with the neural-based hybrid reflectance model.

4.1 Quantitative Results for the Synthetic Sphere Object. Quantitative results are obtained by reconstructing a synthetic sphere object. The true depth map of the synthetic sphere object is generated mathematically by

    z(x, y) = sqrt( r² − x² − y² )  if x² + y² ≤ r²,  and 0 otherwise,   (4.1)

where r = 52 and −63 ≤ x, y ≤ 64. The synthetic image was generated using the depth map of equation 4.1, and the surface gradients were computed using the forward discrete approximation. The shaded images were generated according to the image reflectance model, with the specular components based on the Torrance–Sparrow model. Different lighting and viewing directions can be applied to generate the corresponding images for the investigations. Figure 3 shows the synthetic sphere images with different specular effects under different illumination conditions. Comparisons between our proposed neural-based hybrid reflectance model, Nayar et al.'s generalized hybrid model (Nayar et al., 1991), and the FNN-based model (Cho & Chow, 1999a) are tabulated in Table 1. Clearly, the results indicate that the performance of our proposed neural-based model is substantially better than that of the Nayar et al. generalized hybrid reflectance model under different illumination conditions. From the errors under unknown specular information, the errors obtained by the proposed
Table 1: Comparative Results for the Reconstruction of the Synthetic Sphere Object. Entries are mean / maximum / variance of the absolute depth error.

| Viewing Direction, V | Lighting Direction, s | Hybrid Model (Nayar et al., 1991) | FNN-Based Model (Cho & Chow, 1999a) | Neural-Based Hybrid (FNN + RBF) |
|---|---|---|---|---|
| τ = 0°, σ = 0° | τ = 0°, σ = 0° | 0.0387 / 0.1896 / 0.5×10⁻³ | 0.0593 / 0.2144 / 1.1×10⁻³ | 0.0385 / 0.185 / 7.4×10⁻⁴ |
| τ = 45°, σ = 45° | τ = 45°, σ = 15° | 0.0717 / 0.2668 / 3.0×10⁻³ | 0.0996 / 0.5004 / 4.5×10⁻³ | 0.05 / 0.308 / 8.6×10⁻⁴ |
| τ = 135°, σ = 45° | τ = 135°, σ = 15° | 0.0729 / 0.2955 / 3.3×10⁻³ | 0.0929 / 0.5045 / 4.7×10⁻³ | 0.055 / 0.344 / 9.0×10⁻⁴ |
| τ = 225°, σ = 45° | τ = 225°, σ = 15° | 0.0682 / 0.2846 / 3.3×10⁻³ | 0.1031 / 0.5278 / 5.1×10⁻³ | 0.053 / 0.3435 / 8.9×10⁻⁴ |
| τ = 315°, σ = 45° | τ = 315°, σ = 15° | 0.0644 / 0.2071 / 1.7×10⁻³ | 0.1009 / 0.5163 / 4.6×10⁻³ | 0.0545 / 0.375 / 9.4×10⁻⁴ |
| τ = 45°, σ = 45° | Unknown | 0.0828 / 0.3135 / 2.2×10⁻³ | 0.1079 / 0.4018 / 3.6×10⁻³ | 0.0659 / 0.3438 / 2.2×10⁻³ |
| Unknown | τ = 45°, σ = 15° | 0.0736 / 0.3543 / 2.0×10⁻³ | 0.1144 / 0.5895 / 5.6×10⁻³ | 0.061 / 0.4034 / 2.5×10⁻³ |
| Unknown | Unknown | 0.1002 / 0.399 / 2.3×10⁻³ | 0.1086 / 0.4584 / 3.7×10⁻³ | 0.0791 / 0.4465 / 3.5×10⁻³ |
Figure 3: The synthetic sphere images under different specular effects at different locations. (a) Tilt = 45°, slant = 45°. (b) Tilt = 135°, slant = 45°. (c) Tilt = 225°, slant = 45°. (d) Tilt = 315°, slant = 45°.
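The synthetic test surfaces of equations 4.1 and 4.2 can be generated directly. A sketch (grid bounds for the sphere follow the text; the 64 × 64 grid for the vase is an assumed discretization):

```python
import numpy as np

def sphere_depth(r=52, lo=-63, hi=64):
    """Synthetic sphere depth map of eq. 4.1."""
    x, y = np.meshgrid(np.arange(lo, hi + 1), np.arange(lo, hi + 1))
    inside = x**2 + y**2 <= r**2
    return np.where(inside, np.sqrt(np.maximum(r**2 - x**2 - y**2, 0)), 0.0)

def vase_depth(n=64):
    """Synthetic vase depth map of eq. 4.2."""
    x = np.linspace(-0.5, 0.5, n)
    y = np.linspace(0.0, 1.0, n)
    X, Y = np.meshgrid(x, y)
    f = 0.15 - 0.1 * Y * (6 * Y + 1)**2 * (Y - 1)**2 * (3 * Y - 2)
    return np.sqrt(np.maximum(f**2 - X**2, 0.0))

z = sphere_depth()
print(z.shape, z.max())   # (128, 128) grid; peak depth r = 52 at the center
```

Forward differences of these depth maps then give the ground-truth gradients (p, q) used to shade the test images.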
neural hybrid model are lower than those of the generalized hybrid model and the FNN-based model by about 40% and 37%, respectively. Clearly, our proposed hybrid reflectance model is able to deliver much more accurate results under unknown specular conditions.

4.2 Noise Sensitivity Results for the Synthetic Vase Object. The noise sensitivity of the proposed neural-based hybrid reflectance model is examined in this section. The synthetic vase object was used for the investigations. The true depth map of the synthetic vase object is generated
Figure 4: The synthetic objects under different noise conditions. (a) No noise, (b) 5% gaussian noise, (c) 5% Rayleigh noise.
mathematically by

    z(x, y) = sqrt( f(y)² − x² ),                                       (4.2)

where f(y) = 0.15 − 0.1 y (6y + 1)² (y − 1)² (3y − 2), −0.5 ≤ x ≤ 0.5, and 0.0 ≤ y ≤ 1.0. The synthetic vase image was generated using the depth map of equation 4.2, and the shaded images with specular effects were generated according to the conventional specular reflectance model. Two noise types, gaussian and Rayleigh, were used, and different noise levels were added to the image intensities. Figure 4 presents the synthetic vase images under noise-free, 5% gaussian noise, and 5% Rayleigh noise conditions. The specular component is located at the center of the object. Table 2 compares the performance of the proposed neural-based hybrid reflectance model with the Nayar et al. (1991) generalized hybrid model
and the FNN-based reflectance model. In the high-noise-level condition (10% noise), the mean depth error obtained by the neural-based hybrid model is significantly lower than those of the other methods. Consistent results were also obtained across images. We can conclude that our approach provides much more accurate reconstruction in a noisy environment.

4.3 Real Image Results. Two real images, Pepper and Flowerpot, were used for validation. The performance is compared with that of the conventional Lambertian, Torrance–Sparrow, and Nayar et al. generalized hybrid reflectance models. The image size is 128 × 128. These images are quite difficult to handle because they contain non-Lambertian surfaces with different reflectivity properties, and the directions of the light sources are unknown. For the conventional Lambertian model, estimating the light source direction is required; in this study, Zheng and Chellappa's (1991) method was used for light source estimation.
4.3.1 Pepper Image. An image of a single pepper is shown in Figure 5a. The estimated 3D shape of the pepper obtained by our proposed approach after only 20 iterations is shown in Figure 5b. Estimation of the light source direction is not required in our proposed neural-based hybrid reflectance model. Figure 5c shows the 3D shape recovered using the Lambertian model. The result is far from satisfactory despite using Zheng and Chellappa's (1991) method to estimate the unknown light source. Figure 5d shows the result obtained by the Lambertian model with an incorrect light source. In this case, the recovered 3D pepper loses much information because of the abnormal lighting condition. The 3D shapes recovered by the original hybrid reflectance model for non-Lambertian surfaces, under known and unknown reflectivity parameters, are shown in Figures 5e and 5f, respectively. The surfaces recovered by the generalized reflectance model (Nayar et al., 1991) under known and unknown illumination conditions are shown in Figures 5g and 5h, respectively. Although the performance of Nayar et al.'s models is better than that of the Lambertian model, the recovered 3D pepper is still unsatisfactory, owing to the incorrect light source estimate.

4.3.2 Flowerpot Image. Figure 6a shows a real image of a flowerpot. This image is quite difficult to handle because it contains non-Lambertian surfaces with a strong specular component, and the reflectance is formed under an unknown, multiple light source condition. The 3D flowerpot recovered by the neural-based hybrid reflectance model after 30 iterations is shown in Figure 6b. As with the pepper, estimation of the lighting and viewing directions is not required. In this illustration, the performance of our proposed approach is very promising compared with the other conventional models. Figures 6c and 6d show the 3D shapes recovered by the Torrance–Sparrow model with estimated light
Table 2: Noise Sensitivity Investigation for the Synthetic Vase Object Reconstruction. Entries are mean / maximum / variance of the absolute depth error.

| Noise Type | Noise Level | Nayar et al. (1991) Hybrid Model (Lambertian + Torrance–Sparrow) | FNN-Based Model (Cho & Chow, 1999a) | Neural-Based Hybrid Model (FNN + RBF) |
|---|---|---|---|---|
| No noise | 0% | 0.0343 / 0.522 / 2.2×10⁻³ | 0.0486 / 0.495 / 2.8×10⁻³ | 0.0308 / 0.528 / 2.6×10⁻³ |
| Gaussian | 2% | 0.051 / 0.558 / 2.4×10⁻³ | 0.0535 / 0.502 / 3.7×10⁻³ | 0.0432 / 0.500 / 2.9×10⁻³ |
| Gaussian | 5% | 0.057 / 0.558 / 2.4×10⁻³ | 0.0543 / 0.497 / 3.6×10⁻³ | 0.0542 / 0.514 / 3.1×10⁻³ |
| Gaussian | 10% | 0.058 / 0.557 / 2.4×10⁻³ | 0.0681 / 0.519 / 4.0×10⁻³ | 0.0484 / 0.502 / 3.1×10⁻³ |
| Rayleigh | 2% | 0.061 / 0.555 / 2.4×10⁻³ | 0.0635 / 0.548 / 4.2×10⁻³ | 0.0429 / 0.471 / 3.0×10⁻³ |
| Rayleigh | 5% | 0.0593 / 0.553 / 2.7×10⁻³ | 0.0774 / 0.627 / 5.9×10⁻³ | 0.0487 / 0.459 / 3.0×10⁻³ |
| Rayleigh | 10% | 0.0711 / 0.581 / 3.0×10⁻³ | 0.0771 / 0.677 / 5.1×10⁻³ | 0.0486 / 0.459 / 2.8×10⁻³ |
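The gaussian and Rayleigh corruption used in these noise tests can be sketched as follows; interpreting the percentage level as a fraction of the image's intensity spread is our assumption, and the function names are illustrative:

```python
import numpy as np

def add_noise(I, level, kind="gaussian", rng=None):
    """Corrupt image intensities with additive noise at a fractional level
    (e.g. 0.05 for the 5% conditions), then clip back to [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    scale = level * (I.std() if I.std() > 0 else 1.0)
    if kind == "gaussian":
        noise = rng.normal(0.0, scale, I.shape)
    elif kind == "rayleigh":
        noise = rng.rayleigh(scale, I.shape)
        noise -= noise.mean()        # keep only the fluctuations
    else:
        raise ValueError(kind)
    return np.clip(I + noise, 0.0, 1.0)

I0 = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))   # toy shaded image
print(add_noise(I0, 0.05).shape, add_noise(I0, 0.05, "rayleigh").shape)
```

Unlike gaussian noise, a Rayleigh sample is strictly positive, so its mean is subtracted here to avoid a systematic brightness shift.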
Figure 5: Results for the pepper image. (a) Original pepper image. (b) 3D plot of the surface reconstructed by the proposed neural-based hybrid reflectance model. (c) 3D plot of the surface reconstructed by the Lambertian model with the light source direction estimated by Zheng and Chellappa's method. (d) 3D plot of the surface reconstructed by the Lambertian model with an incorrect light source direction. (e) 3D plot of the surface reconstructed by the original hybrid reflectance model with the light source direction estimated by Zheng and Chellappa's method. (f) 3D plot of the surface reconstructed by the original hybrid model with incorrect parameter settings. (g) 3D plot of the surface reconstructed by the generalized reflectance model proposed by Nayar et al. (1991). (h) 3D plot of the surface reconstructed by the generalized reflectance model proposed by Nayar et al. (1991) under uncertain illumination conditions.
Figure 6: Results for the flowerpot image. (a) Original flowerpot image. (b) 3D plot of the surface reconstructed by the proposed neural-based hybrid reflectance model. (c) 3D plot of the surface reconstructed by the Torrance–Sparrow model with the estimated lighting direction. (d) 3D plot of the surface reconstructed by the Torrance–Sparrow model with an unknown light source direction. (e) 3D plot of the surface reconstructed by the original hybrid reflectance model with the estimated light source direction. (f) 3D plot of the surface reconstructed by the original hybrid model with incorrect parameter settings. (g) 3D plot of the surface reconstructed by the generalized reflectance model proposed by Nayar et al. (1991). (h) 3D plot of the surface reconstructed by the generalized reflectance model proposed by Nayar et al. (1991) under uncertain illumination conditions.
source direction and an unknown light source direction, respectively. The 3D flowerpot recovered by the original hybrid reflectance model under known and unknown illumination conditions is shown in Figures 6e and 6f, respectively. In addition, the surfaces recovered by the generalized reflectance model (Nayar et al., 1991) under known and unknown illumination conditions are shown in Figures 6g and 6h, respectively. Most of the 3D shapes recovered by both models appear quite good under the known illumination condition, but they are still unsatisfactory under the unknown illumination condition. We can conclude that the proposed neural-based hybrid reflectance model delivers much better results than those yielded by the other models under specular effects and unknown reflectivity parameters in real image applications.

5 Conclusion
In practical applications, most real surfaces are neither perfectly Lambertian nor ideally specular but are formed by a hybrid of these two models. A new neural-based hybrid reflectance model for real image applications is proposed in this article. This reflectance model is based on learning the free weights and parameters of the hybrid structure of the FNN and RBF networks, which are able to model the physical parameters of both the diffuse and the specular reflectance models. With this model, the SFS algorithm becomes more robust and effective for 3D reconstruction of most objects under different lighting and specular conditions. The proposed neural-based hybrid reflectance model is also very robust in noisy environments. These characteristics make the hybrid neural model applicable to most real-world applications, where the reflectivity parameters are never given and the image suffers from noise and specular effects. The experimental results corroborate the high quality of shape reconstruction and the better performance of the neural-based hybrid reflectance model. We also conclude that the proposed hybrid structure of FNN- and RBF-based models outperforms the FNN-based model and other conventional models, such as the Torrance–Sparrow model and Nayar's model, under strong and unknown specular conditions.

References

Cho, S. Y., & Chow, T. W. S. (1999a). Shape recovery from shading by a new neural-based reflectance model. IEEE Transactions on Neural Networks, 10, 1536–1541.

Cho, S. Y., & Chow, T. W. S. (1999b). Training multilayer neural networks using fast global learning algorithm: Least-squares and penalized optimization methods. Neurocomputing, 25, 115–131.

Healy, G., & Binford, T. O. (1988). Local shape from specularity. Computer Vision, Graphics, and Image Processing, 42, 62–86.

Horn, B. K. P. (1970). Shape from shading: A smooth opaque object from one view. Unpublished doctoral dissertation, MIT.

Horn, B. K. P., & Brooks, M. J. (1986). The variational approach to shape from shading. Computer Vision, Graphics, and Image Processing, 33, 174–208.
Horn, B. K. P., & Brooks, M. J. (1989). Shape from shading. Cambridge, MA: MIT Press.

Horn, B. K. P., & Sjoberg, R. W. (1979). Calculating the reflectance map. Applied Optics, 18, 1770–1779.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Ikeuchi, K., & Horn, B. K. P. (1981). Numerical shape from shading and occluding boundaries. Artificial Intelligence, 17, 141–184.

Lee, K. M., & Kuo, C. C. J. (1997). Shape from shading with a generalized reflectance map model. Computer Vision and Image Understanding, 67, 143–160.

Nayar, S. K., Ikeuchi, K., & Kanade, T. (1990). Determining shape and reflectance of hybrid surfaces by photometric sampling. IEEE Transactions on Robotics and Automation, 6, 418–430.

Nayar, S. K., Ikeuchi, K., & Kanade, T. (1991). Surface reflection: Physical and geometrical perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 611–634.

Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246–257.

Torrance, K. E., & Sparrow, E. M. (1967). Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America, 57, 1105–1114.

Zheng, Q., & Chellappa, R. (1991). Estimation of illuminant direction, albedo, and shape from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 680–702.

Received April 6, 2000; accepted February 9, 2001.
Neural Computation, Volume 13, Number 12
CONTENTS

Article
  Synchronization of the Neural Response to Noisy Periodic Synaptic Input. A. N. Burkitt and G. M. Clark. 2639

Note
  Specification of Training Sets and the Number of Hidden Neurons for Multilayer Perceptrons. L. S. Camargo and T. Yoneyama. 2673

Letters
  Spotting Neural Spike Patterns Using an Adversary Background Model. Itay Gat and Naftali Tishby. 2681
  Intrinsic Stabilization of Output Rates by Spike-Based Hebbian Learning. Richard Kempter, Wulfram Gerstner, and J. Leo van Hemmen. 2709
  Information Transfer Between Rhythmically Coupled Networks: Reading the Hippocampal Phase Code. Ole Jensen. 2743
  Gaussian Process Approach to Spiking Neurons for Inhomogeneous Poisson Inputs. Ken-ichi Amemori and Shin Ishii. 2763
  Dual Information Representation with Stable Firing Rates and Chaotic Spatiotemporal Spike Patterns in a Neural Network Model. Osamu Araki and Kazuyuki Aihara. 2799
  Neurons with Two Sites of Synaptic Integration Learn Invariant Representations. Konrad P. Körding and Peter König. 2823
  Linear Constraints on Weight Representation for Generalized Learning of Multilayer Networks. Masaki Ishii and Itsuo Kumazawa. 2851
  Feedforward Neural Network Construction Using Cross Validation. Rudy Setiono. 2865
ARTICLE
Communicated by Carl van Vreeswijk
Synchronization of the Neural Response to Noisy Periodic Synaptic Input A. N. Burkitt [email protected] Bionic Ear Institute, East Melbourne, Victoria 3002, Australia
G. M. Clark [email protected] Department of Otolaryngology, Bionic Ear Institute and University of Melbourne, Royal Victorian Eye and Ear Hospital, East Melbourne, Victoria 3002, Australia
The timing information contained in the response of a neuron to noisy periodic synaptic input is analyzed for the leaky integrate-and-fire neural model. We address the question of the relationship between the timing of the synaptic inputs and the output spikes. This requires an analysis of the interspike interval distribution of the output spikes, which is obtained in the gaussian approximation. The conditional output spike density in response to noisy periodic input is evaluated as a function of the initial phase of the inputs. This enables the phase transition matrix to be calculated, which relates the phase at which the output spike is generated to the initial phase of the inputs. The interspike interval histogram and the period histogram for the neural response to ongoing periodic input are then evaluated using the leading eigenvector of this phase transition matrix. The synchronization index of the output spikes is found to increase sharply as the inputs become synchronized. This enhancement of synchronization is most pronounced for large numbers of inputs and lower frequencies of modulation, and also for input rates near the critical input rate. However, the mutual information between the input phase of the stimulus and the timing of output spikes is found to decrease at low input rates as the number of inputs increases. The results show close agreement with those obtained from numerical simulations for large numbers of inputs.

1 Introduction

The degree of phase locking in the response of neurons to noisy periodic synaptic input plays an important role in auditory processing and has been studied for a considerable time (Gerstein & Kiang, 1960; Rose, Brugge, Anderson, & Hind, 1967). Gerstein and Kiang presented an auditory stimulus to anesthetized cats and observed a multimodal interspike interval
A. N. Burkitt and G. M. Clark
histogram (ISIH), resulting from spikes that tend to fire near the peak of the stimulus and hence cluster around integer multiples of the stimulus period. Rose and colleagues analyzed the phase-locked responses in terms of the phase histogram, in which the responses are plotted in terms of their relation to the phase of the stimulus. Subsequent authors extended this work to examine the relationship between the spike rate and synchrony of the response of auditory neurons to low- and midfrequency single tones (Anderson, 1973; Johnson, 1980). The neural responses to acoustical stimuli in cats indicated that spikes are phase locked up to around 3–5 kHz in mammals (Rose et al., 1967; Johnson, 1980) and up to frequencies of 8 kHz in the barn owl (Köppl, 1997). This behavior was simulated using a Monte Carlo simulation of the diffusion equation with a periodically varying drift parameter (Gerstein & Mandelbrot, 1964), and ISIHs were obtained that resembled the experimental results. In this same article, Gerstein and Mandelbrot gave an analytical solution to the integrate-and-fire neural model in which the decay of the potential across the cell membrane is ignored; this is called the perfect integrator neural model. Although this is clearly an unphysiological approximation, it nevertheless provides a description that captures some of the characteristics of the spontaneous activity in auditory nerve fibers (Gerstein & Mandelbrot, 1964; Iyengar & Liao, 1997). However, Monte Carlo simulations were required to compare the theory with experiment when acoustic stimuli were presented. The results of the simulations showed many of the same features as the experimental data, including the multimodal peaks of the ISIHs (Gerstein & Mandelbrot, 1964). One of the simplest and most widely used classes of neural models capable of capturing this behavior is the integrate-and-fire model, first introduced by Lapicque (1907).
In this class of models, the synaptic inputs produce a postsynaptic response of the membrane potential, which passively propagates to the soma, where a spike is generated when the summed contributions exceed a voltage threshold. The models typically lack any of the specific currents that underlie spiking, and thus the complex mechanisms by which the sodium and potassium conductances cause action potentials to be generated are not part of the model. Pulse-based models have, however, been developed in which the time courses of the rate constants are approximated by a pulse, and such models bridge the divide between the classical integrate-and-fire models and the Hodgkin-Huxley models (Destexhe, 1997). An important assumption is also that the synaptic inputs interact only in a linear manner, so that phenomena involving nonlinear interaction of synaptic inputs, such as paired-pulse facilitation-depression and synaptic inputs that depend on postsynaptic currents, are neglected. Nevertheless, a comparison of the integrate-and-fire model with the Hodgkin-Huxley model when both receive a stochastic input current found that the threshold model provides a good approximation (Kistler, Gerstner, & van Hemmen, 1997). A recent review of the integrate-and-fire neural model that provides details of the
Synchronization to Noisy Periodic Input
variations of the model and comparisons with both experimental data and other models is given in Koch (1999). An analysis of the integrate-and-fire model with periodic input, using a deterministic differential equation without any noise, found three regions in parameter space (the space defined by the frequency of synaptic inputs, average rate of inputs, synchronization of inputs, and the time constant of the membrane potential decay) with distinctly different behaviors (Keener, Hoppensteadt, & Rinzel, 1981): a phase-locking region, a region with apparently nonperiodic behavior, and a region where the firing eventually dies out. This extended earlier studies on systems with a deterministic response to sinusoidal stimuli (Rescigno, Stein, Purple, & Poppele, 1970; Stein, French, & Holden, 1972) and a study of phase locking in which the threshold was sinusoidally modulated (Glass & Mackey, 1979). Subsequent work on the effect of a periodically varying rate of input to noisy integrate-and-fire neurons has focused largely on stochastic counterparts to this approach, using either numerical simulations, such as in the study of the temporal acuity and localization in the barn owl auditory system (Gerstner, Kempter, van Hemmen, & Wagner, 1996), or stochastic differential equations (Bulsara, Elston, Doering, Lowen, & Lindenberg, 1996; Lánský, 1997). Much of the recent work on the response of neural systems to noisy periodic synaptic input has been motivated by an interest in the applicability of stochastic resonance to these systems, in which the detection of a subthreshold periodic stimulus is enhanced by the presence of noise (Fauve & Heslot, 1983). Stochastic resonance has been extensively studied by analyzing the effect of a noisy weak periodic stimulus on a system in a double-well potential, a review of which is given in Gammaitoni, Hänggi, Jung, and Marchesoni (1998).
The phenomenon of stochastic resonance was demonstrated in neural systems by using external noise applied to crayfish mechanoreceptor cells (Douglass, Wilkens, Pantazelou, & Moss, 1993), and this generated considerable interest in the analysis of stochastic resonance in neural models (Longtin, 1993; Stemmler, 1996; Chialvo, Longtin, & Müller-Gerking, 1997). However, much of this work involved the analysis of the more mathematically tractable situation in which the phase of the input stimulus is reset after each output spike is generated (Plesser & Tanaka, 1997), a situation that is implausible in the neurophysiological context. A method for overcoming this shortcoming has recently been proposed (Plesser & Geisel, 1999a, 1999b; Shimokawa, Pakdaman, & Sato, 1999) in which the spike output distribution is considered as a function of the phase of the stimulus at which both the summation is initiated and the output spike is generated. Consequently, it is possible to find the stationary spike phase distribution (the average phase distribution) that results over the whole time course of the periodic stimulus (Hohn & Burkitt, 2001). A further impetus to the study of periodic synaptic inputs has been the observation of synchronized periodical activity of neurons in various brain areas, such as the oscillations observed in the visual cortex (Eckhorn et al.,
1988; Gray, König, Engel, & Singer, 1989). This has led to the hypothesis that synchronized activity of neurons may play a role in brain functioning, such as pattern segmentation and feature binding (von der Malsburg & Schneider, 1986; Eckhorn et al., 1988; Gray & Singer, 1989; Singer, 1993; Usher, Schuster, & Niebur, 1993). It has also been proposed that the information contained in a spike sequence could be coded using the relationship between the spike times and the oscillation of the periodic background signal (Hopfield, 1995; Jensen & Lisman, 1996; Tsodyks, Skaggs, Sejnowski, & McNaughton, 1996). In a recent article (Kempter, Gerstner, van Hemmen, & Wagner, 1998), the response of an integrate-and-fire neuron to noisy periodic spike input is studied, and a thorough analysis of the relationship between the input and output rates over the range of input vector strengths (also called the synchronization index) is carried out. Their results show how the output rate increases with both the input rate and synchrony and how it depends on the neural parameters, such as the threshold, the number of synapses, and the time course of the postsynaptic response to the inputs. They are also able to identify the conditions under which a neuron can act as a coincidence detector and thus convert a temporal code into a rate code. They identify two parameters, the coherence gain (which provides a measure of the mean output firing rate for synchronized versus random input) and the quality factor for coincidence detection (which provides a measure of the difference in the neural response to random and synchronized inputs), which largely characterize the performance of a coincidence detector, and they plot these quantities for representative values of the neural parameters over the range of input vector strengths. However, since their analysis concerns only the output rate, they are not able to predict quantities that depend on the details of the timing of individual output spikes.
In this article, we provide an important extension of these results by analyzing the time distribution of output spikes in response to noisy periodic synaptic input, that is, the probability of an output spike's being produced at a particular phase of the periodic input. Our analysis enables us to find the interspike interval distribution of the output spikes, the synchronization index of the outputs, and the phase distribution of the outputs. In order to examine the amount of information that the output spikes carry about the stimulus, the mutual information between the input phase of the stimulus and the timing of output spikes is calculated. The analysis is carried out in the gaussian approximation, in which the amplitude of the postsynaptic response to an individual input spike is small, similar to the diffusion approximation (Tuckwell, 1988a). For neurons with large numbers of small-amplitude inputs, this approximation proves to be very accurate, as will be discussed in section 3.2, where the results are compared with numerical simulations. In the following section, we give an outline of the model, including the neural model (in section 2.1) and the synaptic input (in section 2.2). In order
to calculate the interspike interval (ISI) distribution, it is first necessary to calculate the probability density of the membrane potential, which is done in section 2.3. The ISI distribution is then calculated in two steps: first by calculating the conditional ISI distribution (i.e., conditional on the initial phase of the inputs) in section 2.4 and then by forming the appropriate average over the initial phases in section 2.5. The calculation of the mutual information between the input phase and the timing of output spikes is outlined in section 2.6. The typical ISI distributions obtained are illustrated in section 3.1. The synchronization index is analyzed in section 3.2, and the results of the calculation of the mutual information are presented in section 3.3. The results are discussed in section 4. The appendixes contain an exact analysis of the ISI distribution for the perfect integrator model with noisy periodic synaptic input.

2 Methods

In this section, we calculate the ISI distribution and the phase distribution for the leaky integrate-and-fire neural model. The notation follows that in Burkitt and Clark (1999, 2000), with the straightforward generalization to include the frequency and phase of the periodic synaptic input, as described below.

2.1 Neural Model. In the analysis presented here, we consider a single neuron with N independent inputs (afferent fibers), labeled with index k. The membrane potential is assumed to be reset to its initial value V(0) = v_0 at time t = 0 after an action potential (AP) has been generated. An AP (spike) is produced only when the membrane voltage exceeds the threshold, V_th, which differs from the reset potential by h = V_th − v_0. The potential is the sum of the excitatory and inhibitory postsynaptic potentials (EPSPs and IPSPs, respectively),

\[ V(t) = v_0 + \sum_{k=1}^{N} a_k s_k(t), \qquad s_k(t) = \sum_{m=1}^{\infty} u_k(t - t_{km}), \tag{2.1} \]
where the index k = 1, …, N denotes the afferent fiber and the index m denotes the mth input from that fiber, whose time of arrival is t_{km} (0 < t_{k1} < t_{k2} < ⋯ < t_{km} < ⋯). The amplitude of the inputs from fiber k is a_k, which is positive for EPSPs and negative for IPSPs, and the time course of an input at the site of spike generation is described by the synaptic response function u_k(t) for the leaky integrate-and-fire model. The results we present here are for the shot-noise model, where the synaptic response function is

\[ u_k(t) = \begin{cases} e^{-t/\tau} & \text{for } t \ge 0 \\ 0 & \text{for } t < 0, \end{cases} \tag{2.2} \]
where τ is the time constant of the membrane potential. Consequently, the membrane potential has a discontinuous jump of amplitude a_k upon the arrival of an EPSP and then decays exponentially between inputs. The decay of the EPSP across the membrane means that the contribution from EPSPs that arrive earlier has partially decayed by the time that later EPSPs arrive. The natural unit of time is the membrane time constant, τ, and other units of time, such as the period of the inputs, T = 2π/ω, are given here in units of τ. Synaptic response functions with an arbitrary time course may also be analyzed, which may allow a more accurate model of the time course of the incoming EPSP. The above synaptic response function, equation 2.2, is identical for all inputs, but the analysis also allows response functions with different time courses for different input fibers. This enables, for example, the effect of propagation of the PSPs along the dendritic tree to be modeled for inputs with synapses at different distances from the site at which the neuron generates an action potential (typically the axon hillock).

2.2 Synaptic Input. The degree of phase locking (or synchronization) is measured by the vector strength r (Goldberg & Brown, 1969), also known as the synchronization index (Anderson, 1973; Johnson, 1980),

\[ r = \left( r_S^2 + r_C^2 \right)^{1/2}, \qquad r_S = \frac{1}{N} \sum_{m=0}^{M-1} h_m \sin\!\left( \frac{2\pi m}{M} \right), \qquad r_C = \frac{1}{N} \sum_{m=0}^{M-1} h_m \cos\!\left( \frac{2\pi m}{M} \right), \tag{2.3} \]

where here h_m denotes the number of spikes in the mth bin of a period histogram with M bins and N denotes the total number of spikes (N = Σ_m h_m). The vector strength takes values between zero (a flat period histogram) and one (all spikes in one bin of the period histogram). We assume that the input rate on each of the input fibers is identical. The time-dependent rate of arrival of input spikes at a synapse is periodic with frequency ω and initial phase φ_in, that is, the phase of the input at the time when the summation commences,

\[ \lambda(t) = \bar\lambda_{\rm in} \left( 1 + 2 r_{\rm in} \cos(\omega t + \phi_{\rm in}) \right), \qquad 0 \le r_{\rm in} \le 0.5, \tag{2.4} \]
where λ̄_in is the time-averaged input rate on a single fiber and r_in is the synchronization index of the input spikes. With this parameterization of the input rate, the value r_in = 0.5 represents a highly modulated input, and values of r_in closer to zero represent inputs that contain less of the frequency-dependent component (the requirement that the rate be nonnegative restricts r_in to the interval [0, 0.5] with this parameterization). Frequency and rate are given in units of the membrane time constant (as cycles or spikes per unit of the membrane time constant τ), and the frequency and average rate of the inputs can be varied independently. The use of the term frequency here and throughout the rest of the analysis refers to this periodic modulation of the rate of inputs (it is not synonymous with the average rate of inputs). This input rate function represents an inhomogeneous (i.e., nonstationary) Poisson process (Papoulis, 1991). Other inhomogeneous Poisson rate models of periodic synaptic input have also been examined in the literature, such as modeling the rate with a sum of gaussian distributions with centers separated by the mean period and the width chosen to give the required synchronization index (Kempter et al., 1998; appendix A of their article summarizes the properties of the inhomogeneous Poisson process).

2.3 The Probability Density of the Potential. The probability density of the potential V(t) is denoted by p(v, t | v_0; ω, φ), which is the probability that the potential has the value v at time t, given that the initial condition is V(0) = v_0, the frequency of the input is ω, and the initial phase is φ. This initial condition corresponds to the reset of the membrane potential to v_0 immediately after the previous spike, which is assumed to occur at time t = 0. The probability density is given by the following normalized expectation value, which is the integrated average over the time distribution of synaptic inputs (Burkitt & Clark, 1999, 2000),

\[ p(v, t \mid v_0; \omega, \phi) = E\{ \delta(V(t) - v) \}, \tag{2.5} \]
where δ(x) is the Dirac delta function and the expectation value is over the (random) times of arrival of the synaptic inputs. The frequency and phase dependence of the probability density arise through this averaging over the arrival times of the synaptic input, which is described here explicitly by λ(t), equation 2.4. Using the Fourier integral representation of the Dirac delta function, the probability density may be written as

\[
\begin{aligned}
p(v, t \mid v_0; \omega, \phi) &= \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\{ix(v - v_0)\}\, E\left\{ \exp\left( -ix \sum_{k=1}^{N} a_k s_k(t) \right) \right\} \\
&= \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\{ix(v - v_0)\} \prod_{k=1}^{N} F_k(x, t; \omega, \phi) \\
&= \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\left\{ ix(v - v_0) + \sum_{k=1}^{N} \ln F_k(x, t; \omega, \phi) \right\},
\end{aligned} \tag{2.6}
\]
where the function F_k(x, t; ω, φ) is given by

\[ F_k(x, t; \omega, \phi) = E\left\{ \exp\left( -i x a_k s_k(t) \right) \right\}, \tag{2.7} \]
and we have used the independence of the contributions from each afferent fiber to obtain the product of F_k's in equation 2.6. Expanding the exponential in powers of a_k, we obtain

\[
\begin{aligned}
\ln F_k(x, t; \omega, \phi) &= \ln E\left\{ 1 - i x a_k s_k(t) - \frac{x^2 a_k^2}{2} s_k^2(t) + O(a_k^3) \right\} \\
&= \ln\left[ 1 - i x a_k E\{s_k(t)\} - \frac{x^2 a_k^2}{2} E\{s_k^2(t)\} + O(a_k^3) \right] \\
&= -i x a_k E\{s_k(t)\} - \frac{x^2 a_k^2}{2} \left[ E\{s_k^2(t)\} - \left( E\{s_k(t)\} \right)^2 \right] + O(a_k^3),
\end{aligned} \tag{2.8}
\]
where O(y) denotes a remainder term that is smaller than C × |y³| when y lies in a neighborhood of 0, for some positive constant C. We carry out the analysis in the gaussian approximation and henceforth neglect the O(a_k³) terms. This expansion to second order in powers of a_k is an approximation, which allows the resulting equations to be solved formally, and we do not address questions such as the convergence of the expansion. The accuracy of the results obtained will be compared in section 3.2 with those obtained using numerical simulation in order to gain confidence in our method and results. We expect the approximation to work best for large values of N and small amplitudes of the postsynaptic potential (PSP) (voltage is measured in units of the difference h between the threshold, V_th, and reset, v_0, potentials). In many neurons, this approximation provides an accurate description (Abeles, 1991; Douglas & Martin, 1991). The probability density function is evaluated as

\[
\begin{aligned}
p(v, t \mid v_0; \omega, \phi) &= \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\left\{ ix\left( v - v_0 - U(t; \omega, \phi) \right) - \frac{x^2}{2} C(t; \omega, \phi) \right\} \\
&= \frac{1}{\sqrt{2\pi C(t; \omega, \phi)}} \exp\left\{ -\frac{\left( v - v_0 - U(t; \omega, \phi) \right)^2}{2 C(t; \omega, \phi)} \right\},
\end{aligned} \tag{2.9}
\]

with

\[ U(t; \omega, \phi) = \sum_{k=1}^{N} a_k E\{s_k(t)\}, \qquad C(t; \omega, \phi) = \sum_{k=1}^{N} a_k^2 \left( E\{s_k^2(t)\} - \left( E\{s_k(t)\} \right)^2 \right). \tag{2.10} \]
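Equations 2.9 and 2.10 state that, to this order, the shot-noise potential is gaussian with mean U and variance C. As a sanity check, the potential can be drawn directly from equations 2.1 and 2.2 for constant-rate input and its sample moments compared with the closed forms obtained by evaluating E{s(t)} and E{s²(t)} for the exponential response function. This is a minimal sketch, not the paper's simulation code; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: N fibers of rate lam each, EPSP amplitude
# a = h/N with h = 1, time measured in units of tau = 1.
N, a, lam, tau, t_end = 64, 1.0 / 64, 1.0, 1.0, 3.0

def sample_potential(n_trials=20000):
    """Draw V(t_end) - v_0 from the shot-noise model, equations 2.1-2.2,
    for constant-rate (r_in = 0) Poisson input pooled over the N fibers."""
    v = np.empty(n_trials)
    for i in range(n_trials):
        n = rng.poisson(N * lam * t_end)      # pooled number of inputs
        ts = rng.uniform(0.0, t_end, n)       # their arrival times
        v[i] = a * np.sum(np.exp(-(t_end - ts) / tau))
    return v

# Closed-form mean and variance for constant rate and u(t) = exp(-t/tau):
U = N * a * lam * tau * (1.0 - np.exp(-t_end / tau))
C = N * a**2 * lam * (tau / 2.0) * (1.0 - np.exp(-2.0 * t_end / tau))

v = sample_potential()
```

The sample mean and variance of `v` agree with U and C to within sampling error; these are the first two cumulants of the shot noise and are exact, while the gaussian approximation consists of dropping the higher cumulants, which carry further powers of a.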
This expression for C(t; ω, φ) ensures that C(t; ω, φ) > 0 for cases of interest. These expressions allow inputs with arbitrary amplitude distributions, including inhibitory inputs, to be considered. However, for simplicity, we consider here the situation where the postsynaptic potentials from the inputs are all excitatory and equal in both amplitude, a_k = a > 0, and time course, u_k(t) = u(t), and where the rates on each of the N afferent fibers are identical, λ_k(t) = λ(t). Consequently, we drop the index k in the following section, since all input fiber characteristics are identical. In the case of an inhomogeneous Poisson process with rate λ(t) on each fiber, the expressions for U(t; ω, φ) and C(t; ω, φ) take the forms (Papoulis, 1991)

\[
\begin{aligned}
U(t; \omega, \phi) &= N a\, E\{s(t)\} = N a \int_0^t \lambda(t')\, u(t - t')\, dt', \\
C(t; \omega, \phi) &= N a^2 \left( E\{s^2(t)\} - \left( E\{s(t)\} \right)^2 \right) = N a^2 \int_0^t \lambda(t')\, u^2(t - t')\, dt'.
\end{aligned} \tag{2.11}
\]
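For the exponential response function and the sinusoidally modulated rate of equation 2.4, the integrals in equation 2.11 are easily evaluated by quadrature, which also yields the gaussian density of equation 2.9. A minimal sketch (parameter values illustrative only, not taken from the simulations):

```python
import numpy as np

# Illustrative parameters; time in units of the membrane time constant.
N, a, tau = 64, 1.0 / 64, 1.0
lam_bar, r_in, omega, phi = 1.0, 0.25, 1.0, 0.0

def lam(t):
    """Periodically modulated input rate, equation 2.4."""
    return lam_bar * (1.0 + 2.0 * r_in * np.cos(omega * t + phi))

def u(t):
    """Exponential synaptic response function, equation 2.2."""
    return np.exp(-t / tau) * (t >= 0.0)

def trap(y, x):
    """Trapezoidal quadrature."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def U_and_C(t, n=2000):
    """Mean U(t; omega, phi) and variance C(t; omega, phi), equation 2.11."""
    tp = np.linspace(0.0, t, n)
    U = N * a * trap(lam(tp) * u(t - tp), tp)
    C = N * a**2 * trap(lam(tp) * u(t - tp) ** 2, tp)
    return U, C

def density(v, t, v0=0.0):
    """Gaussian density of the potential, equation 2.9."""
    U, C = U_and_C(t)
    return np.exp(-((v - v0 - U) ** 2) / (2.0 * C)) / np.sqrt(2.0 * np.pi * C)
```

The resulting density is normalized over v, and U(t) oscillates about its time-averaged value with the phase of the input, which is the source of the phase dependence of the first-passage time density below.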
The dependence of the expressions for U(t; ω, φ) and C(t; ω, φ) above on frequency and initial phase is therefore given explicitly through the dependence on λ(t) (refractory effects are ignored here, but a method for incorporating them is given in section 4).

2.4 First-Passage Time to Threshold. In order to calculate the probability density of output spikes, it is necessary to find the time at which a spike is generated, that is, the time at which the summed membrane potential crosses the threshold, V_th = v_0 + h, for the first time (called the first-passage time). The conditional first-passage time to threshold density, f_h(t; ω, φ), which depends on the frequency, ω, and phase, φ, of the synaptic input, is obtained from the integral equation (Plesser & Tanaka, 1997; Burkitt & Clark, 1999) (for v ≥ V_th)

\[ p(v, t \mid v_0; \omega, \phi) = \int_0^t dt'\, f_h(t'; \omega, \phi)\, \hat p(v, t \mid V_{\rm th}, t', v_0), \tag{2.12} \]
where p̂(v_2, t_2 | v_1, t_1, v_0) is an approximation to the conditional probability density of the potential. The conditional probability density, p(v_2, t_2 | v_1, t_1, v_0), is defined as the probability that the potential has the value v_2 at time t_2, given that V(t_1) = v_1 and the reset value of the potential after a spike is v_0. Our approximation to the conditional probability density, p̂(v_2, t_2 | v_1, t_1, v_0), has the same definition with the added restriction that
the potential is subthreshold in the time interval 0 ≤ t < t_1. If the fluctuations in the input currents are not temporally correlated, then the approximate expression is exact, as is the case for both the perfect integrator model and the shot-noise model (see equation 2.2). However, the two expressions will differ if there are temporal correlations, which occur if the presynaptic spikes have correlations or the synaptic currents have a finite duration. This conditional probability density clearly depends on ω and φ, such that in every case the phase φ' associated with a particular time t' is given by φ' = [ωt' + φ] mod 2π, where φ is the initial phase (at time t = 0). This gives a direct and straightforward correspondence between times, t, and phases, φ, of the input. Therefore, for notational convenience, the dependence of every quantity on the frequency and phase will be left implicit in this section. Note that equation 2.12 makes no assumptions about the stationarity of the conditional probability density (it does not require time-translational invariance). The conditional probability density is given by

\[ p(v_2, t_2 \mid v_1, t_1, v_0) = \frac{p(v_2, t_2, v_1, t_1 \mid v_0)}{p(v_1, t_1 \mid v_0)}, \tag{2.13} \]
and the joint probability density p(v_2, t_2, v_1, t_1 | v_0), which is the probability that V(t) takes the value v_1 at time t_1 and the value v_2 at time t_2 (given that the reset value of the potential after a spike is generated is again v_0), may be evaluated (Burkitt & Clark, 2000):

\[
\begin{aligned}
p(v_2, t_2, v_1, t_1 \mid v_0) &= E\{ \delta(V(t_1) - v_1)\, \delta(V(t_2) - v_2) \} \\
&= \int_{-\infty}^{\infty} \frac{dx_2}{2\pi} \int_{-\infty}^{\infty} \frac{dx_1}{2\pi} \exp\left\{ i x_2 (v_2 - v_0) + i x_1 (v_1 - v_0) + N \ln F(x_2, x_1, t_2, t_1) \right\},
\end{aligned} \tag{2.14}
\]
where

\[ F(x_2, x_1, t_2, t_1) = E\left\{ \exp\left( -i x_2 a s(t_2) - i x_1 a s(t_1) \right) \right\}. \tag{2.15} \]
The function ln F(x_2, x_1, t_2, t_1) is expanded in the amplitude a of the synaptic response function as before (see equation 2.8) and contains cross-terms in x_2 x_1:

\[
\begin{aligned}
\ln F(x_2, x_1, t_2, t_1) &= \ln E\Big\{ 1 - i a x_2 s(t_2) - i a x_1 s(t_1) - \frac{x_2^2}{2} a^2 s^2(t_2) - \frac{x_1^2}{2} a^2 s^2(t_1) - x_2 x_1 a^2 s(t_2) s(t_1) + O(a^3) \Big\} \\
&= -i x_2 a E\{s(t_2)\} - i x_1 a E\{s(t_1)\} - \frac{x_2^2}{2} a^2 \left( E\{s^2(t_2)\} - (E\{s(t_2)\})^2 \right) - \frac{x_1^2}{2} a^2 \left( E\{s^2(t_1)\} - (E\{s(t_1)\})^2 \right) \\
&\quad - x_2 x_1 a^2 \left( E\{s(t_2) s(t_1)\} - E\{s(t_2)\} E\{s(t_1)\} \right) + O(a^3).
\end{aligned} \tag{2.16}
\]
The integrals in equation 2.14 may be evaluated in the gaussian approximation (Burkitt & Clark, 2000) by making the change of variable y_1 = x_1 + κ(t_2, t_1) x_2, where κ(t_2, t_1) is chosen in order to eliminate the cross-term in x_2 y_1,

\[ \kappa(t_2, t_1) = \frac{E\{s(t_2) s(t_1)\} - E\{s(t_2)\} E\{s(t_1)\}}{E\{s^2(t_1)\} - (E\{s(t_1)\})^2} = \frac{C(t_2, t_1)}{C(t_1, t_1)}, \tag{2.17} \]
where C(t_2, t_1) is the autocovariance of s(t). In the case of an inhomogeneous Poisson process with rate λ(t) on each fiber, the autocovariance of s(t) is (Papoulis, 1991)

\[ C(t_2, t_1) = \int_0^{t_1} dt'\, \lambda(t')\, u(t_1 - t')\, u(t_2 - t'), \qquad t_2 \ge t_1. \tag{2.18} \]
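For the exponential response function, the autocovariance of equation 2.18 factorizes as C(t_2, t_1) = e^{-(t_2 - t_1)/τ} C(t_1, t_1), so that κ(t_2, t_1) of equation 2.17 is simply e^{-(t_2 - t_1)/τ}; there is no extra temporal correlation, which is why p̂ is exact for the shot-noise model. A small numerical check (a sketch; parameter values illustrative):

```python
import numpy as np

# Illustrative parameters; time in units of the membrane time constant.
lam_bar, r_in, omega, tau = 1.0, 0.25, 1.0, 1.0

def lam(t):
    """Periodically modulated input rate, equation 2.4."""
    return lam_bar * (1.0 + 2.0 * r_in * np.cos(omega * t))

def autocov(t2, t1, n=4000):
    """Autocovariance C(t2, t1) of s(t), equation 2.18, for the
    exponential response function u(t) = exp(-t/tau), with t2 >= t1."""
    tp = np.linspace(0.0, t1, n)
    y = lam(tp) * np.exp(-(t1 - tp) / tau) * np.exp(-(t2 - tp) / tau)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(tp)) / 2.0)

def kappa(t2, t1):
    """kappa(t2, t1) = C(t2, t1) / C(t1, t1), equation 2.17."""
    return autocov(t2, t1) / autocov(t1, t1)
```

With these definitions, kappa(t, t) = 1 and kappa(t + s, t) = exp(-s/tau) regardless of the modulation of the rate.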
The y_1 and x_2 integrals are now independent gaussian integrals and may be evaluated. The y_1 integral yields exactly p(v_1, t_1 | v_0, ω, φ), and the x_2 integral gives the conditional probability density,

\[ p(v_2, t_2 \mid v_1, t_1, v_0) = \frac{1}{\sqrt{2\pi \gamma(t_2, t_1)}} \exp\left\{ -\frac{\left[ v_2 - v_0 - U(t_2) - \kappa(t_2, t_1)\left( v_1 - v_0 - U(t_1) \right) \right]^2}{2 \gamma(t_2, t_1)} \right\}, \tag{2.19} \]
where

\[ \gamma(t_2, t_1) = C(t_2) - \kappa^2(t_2, t_1)\, C(t_1) = N a^2\, \frac{C(t_2, t_2)\, C(t_1, t_1) - C^2(t_2, t_1)}{C(t_1, t_1)}, \tag{2.20} \]
which ensures that γ(t_2, t_1) is positive definite. Equation 2.12, which defines the conditional first-passage time density, is a Volterra integral equation, which may be numerically evaluated using standard methods (Press, Flannery, Teukolsky, & Vetterling, 1992; Plesser & Tanaka, 1997) or variations on standard methods (details of the actual method used here are found in appendix B).
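The general shape of such a solver can be sketched with a midpoint rule for a first-kind Volterra equation g(t) = ∫₀ᵗ K(t, t') f(t') dt', which has the same structure as equation 2.12. This is only a sketch of the idea, not the scheme of appendix B, and the toy kernel below is chosen because its solution is known:

```python
import numpy as np

def solve_volterra_first_kind(K, g, T, n):
    """Solve g(t) = int_0^t K(t, t') f(t') dt' for f by the midpoint
    rule: stepping forward in t, each new midpoint value of f is fixed
    by the quadrature over the history already computed."""
    dt = T / n
    tm = (np.arange(n) + 0.5) * dt            # midpoints t_{j - 1/2}
    f = np.zeros(n)
    for i in range(1, n + 1):
        ti = i * dt
        hist = dt * np.sum(K(ti, tm[: i - 1]) * f[: i - 1])
        f[i - 1] = (g(ti) - hist) / (dt * K(ti, tm[i - 1]))
    return tm, f

# Toy check: K(t, t') = exp(-(t - t')) with f == 1 gives g(t) = 1 - exp(-t),
# so the recovered f should be close to 1 everywhere.
tm, f = solve_volterra_first_kind(
    lambda t, tp: np.exp(-(t - tp)), lambda t: 1.0 - np.exp(-t), 5.0, 500)
```

In the application to equation 2.12, g plays the role of p(V_th, t | v_0; ω, φ) from equation 2.9 and K the role of p̂ from equation 2.19.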
The solution of the perfect integrator version of the integrate-and-fire model, in which the decay of the potential across the membrane is neglected (or, equivalently, the membrane time constant is effectively infinite in comparison with the other timescales of the problem), is given in appendix A.

2.5 The Spike Phase Distribution and Synchronization Index. In the case of homogeneous Poisson inputs, the first-passage time density is identical with the interspike interval density. However, in the case of inhomogeneous Poisson inputs, this is no longer the case, except in the (unbiological) situation in which the phase of the input resets after each output spike is generated. It is therefore necessary now to examine explicitly the dependence of the conditional first-passage time density (the ISI distribution) on the phase of the input. In order to find the interspike interval density ρ(t; ω) generated by inhomogeneous Poisson inputs with frequency ω, it is necessary to form the appropriate average over the initial phases φ of the conditional first-passage time densities (Plesser & Geisel, 1999a, 1999b),

\[ \rho(t; \omega) = \int_0^{2\pi} d\phi\, f_h(t; \omega, \phi)\, \chi^{(s)}(\phi), \tag{2.21} \]
where χ^(s)(φ) is the stationary distribution of phases, evaluated as follows (Plesser & Geisel, 1999a). A phase transition density may be defined by

\[ g(\phi', \phi) = \int_0^{\infty} dt\, f_h(t; \omega, \phi)\, \delta\left( [\omega t + \phi] \bmod 2\pi - \phi' \right), \tag{2.22} \]
which gives the probability density for output spikes with phase φ' when the initial phase is φ. Therefore, defining χ^(n)(φ) to be the output spike phase density for the nth spike,

\[ \chi^{(n+1)}(\phi') = \int_0^{2\pi} d\phi\, g(\phi', \phi)\, \chi^{(n)}(\phi), \tag{2.23} \]
and the stationary spike phase density, χ^(s)(φ), is given by the (nontrivial) solution of

\[ \chi^{(s)}(\phi') = \int_0^{2\pi} d\phi\, g(\phi', \phi)\, \chi^{(s)}(\phi). \tag{2.24} \]
By discretizing the phase axis, the phase transition density, g(φ', φ), becomes a transition matrix g_{φ'φ}, and the stationary spike phase distribution vector, χ^(s)_φ, is the eigenvector corresponding to the eigenvalue 1 (Plesser & Geisel, 1999a).
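In matrix form, this is an ordinary eigenvector computation. A minimal sketch, with an illustrative toy circulant matrix standing in for the g_{φ'φ} that would be built from the first-passage density of equation 2.22:

```python
import numpy as np

# Toy transition matrix on M phase bins: entry g[i, j] is the probability
# that the next output spike falls in phase bin i given initial phase bin
# j (each column sums to one). The circulant gaussian shape is illustrative
# only; in the paper, g is built from the first-passage time density.
M = 40
phases = 2.0 * np.pi * np.arange(M) / M
g = np.empty((M, M))
for j in range(M):
    # output phase concentrated roughly pi/2 ahead of the initial phase
    d = (phases - phases[j] - np.pi / 2.0 + np.pi) % (2.0 * np.pi) - np.pi
    col = np.exp(-(d ** 2) / 0.5)
    g[:, j] = col / col.sum()

# Stationary spike phase distribution: the eigenvector of eigenvalue 1
# (equation 2.24), normalized to a probability distribution.
w, v = np.linalg.eig(g)
chi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
chi = chi / chi.sum()
```

A column-stochastic matrix always has the eigenvalue 1, and the corresponding (Perron) eigenvector can be taken entrywise nonnegative, so the normalization by the sum yields a valid distribution.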
2.6 Mutual Information Calculation. In addition to the interspike interval distribution, we would like to know the amount of information that the neural response (i.e., the spike times) contains about the phase of the periodic stimulus. The distribution of the input phase, φ_in, can be assumed to be uniform between 0 and 2π. We wish to determine the amount of information gained about the input phase of the stimulus when the output is observed for one cycle, and we consider the situation in which the neuron fires at most one spike per cycle (in the regime in which we are interested, this will be the case, as will be discussed in section 3). Consequently, it is necessary to evaluate only the probability of firing no spike, p_0, or firing one spike, p_1. In the general case, in which the output rate of firing is described by λ_out(t), the probability p_0 is given by (Papoulis, 1991)

\[ p_0 = \exp\left[ -\int_0^{T} \lambda_{\rm out}(t)\, dt \right]. \tag{2.25} \]
For small values of the average output rate, λ̄_out (defined as the number of spikes per unit time averaged over one cycle), this gives the probability of firing a spike, p_1, of λ̄_out/ω. The probability of firing a spike at phase φ_out, given that the input has a phase φ_in, is given by

\[ P(\phi_{\rm out} \mid \phi_{\rm in}) = \bar\lambda_{\rm out}\, \chi^{(s)}(\phi_{\rm out} - \phi_{\rm in}) / (2\pi \omega). \tag{2.26} \]
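Under the at-most-one-spike-per-cycle assumption, equation 2.25 reduces the per-cycle spike statistics to a single quadrature; a minimal sketch (the output rate profile below is made up purely for illustration):

```python
import numpy as np

omega = 1.0
T = 2.0 * np.pi / omega                      # one stimulus period
t = np.linspace(0.0, T, 1000)
lam_out = 0.05 * (1.0 + np.cos(omega * t))   # illustrative output rate

# Probability of no spike in one cycle, equation 2.25 (trapezoid rule).
integral = float(np.sum((lam_out[1:] + lam_out[:-1]) * np.diff(t)) / 2.0)
p0 = float(np.exp(-integral))
p1 = 1.0 - p0     # probability of one spike, valid for low output rates
```

For this rate profile the cosine integrates to zero over a full period, so the exponent is just the mean rate times the period.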
The mutual information I_M is given by (Cover & Thomas, 1991)

\[ I_M = \int_0^{2\pi} d\phi_{\rm out} \int_0^{2\pi} \frac{d\phi_{\rm in}}{2\pi}\, P(\phi_{\rm out} \mid \phi_{\rm in}) \log_2\left( P(\phi_{\rm out} \mid \phi_{\rm in}) \right) - \int_0^{2\pi} d\phi_{\rm out}\, P(\phi_{\rm out}) \log_2\left( P(\phi_{\rm out}) \right), \tag{2.27} \]

where P(φ_out) is the probability of firing at phase φ_out, averaged over φ_in: P(φ_out) = λ̄_out/(2πω). This calculation could in principle be extended to take into account situations in which more than one spike is emitted in a cycle. In order to do this, it is necessary to calculate the conditional probability of multiple spikes, at phases φ_out,1, φ_out,2, and so on, given φ_in, which can be done using the above methods. However, in the analysis presented here, we are interested only in the regime with low output spike rates and therefore need consider only one output spike per cycle.

3 Results

3.1 Interspike Interval Distribution. The interspike interval distribution is given by equation 2.21, and the typical form of the distribution is
Figure 1: Plots of the interspike interval distribution, ρ(t; ω), against time (in units of the stimulus period, T) for N = 64 inputs, an input frequency of ω = 1.0, an average input rate of λ̄_in = 1.0, and amplitude of the individual EPSPs a = h/64, where h is the difference between the threshold and reset values of the potential. Frequency and rate are given as cycles (spikes) per unit of the membrane time constant τ. The four plots are for input synchronizations of (a) r_in = 0.0, (b) r_in = 0.1, (c) r_in = 0.25, and (d) r_in = 0.5.
illustrated in Figure 1 for different values of the input synchronization. All four plots are for N = 64 inputs, an average rate of input on each input fiber of λ̄_in = 1.0, and EPSP amplitudes a = h/64, where h is the difference between the threshold and reset values of the potential. Frequency, ω, and rate, λ, are measured here in units of the time constant of the membrane, τ; that is, a frequency (rate) of 1.0 corresponds to one cycle (spike) per τ units of time, and the input rate is given by the time-varying function λ(t) in equation 2.4. Figure 1a shows the ISI distribution where there is no input synchronization, r_in = 0, which is the typical ISI distribution in response to a homogeneous Poisson input (Burkitt & Clark, 2000). Figures 1b, 1c, and 1d show the ISI distributions for inhomogeneous Poisson input with input synchronization r_in = 0.1, 0.25, 0.5, respectively. These plots show a multimodal response typical of those recorded, for example, in response to low-frequency acoustical stimulation (less than 1 kHz) in the auditory nerve (Rose et al., 1967). As the input synchronization is increased, the peaks in the
Figure 2: Plots of the period distribution, χ^(s)(φ) (the stationary spike phase density, equation 2.24), for N = 64 inputs, an input frequency of ω = 1.0, an input rate of λ̄_in = 1.0, and amplitudes of the individual EPSPs a = h/64. The four plots are for input synchronizations of r_in = 0.0, 0.1, 0.25, 0.5, as labeled, and have been phase-shifted so that all peaks are at the same phase.
ISI distribution become more pronounced, with each peak corresponding to one cycle of the inputs. The period distributions of the spike times are plotted in Figure 2 for the four sets of parameters used in Figure 1. The synchronization indices of the output spikes in these four cases are r_out = 0 for Figure 1a with r_in = 0, r_out = 0.46 for Figure 1b with r_in = 0.1, r_out = 0.78 for Figure 1c with r_in = 0.25, and r_out = 0.90 for Figure 1d with r_in = 0.5. This solution for the stationary spike phase density, equation 2.22, was found by discretizing each period of the inputs into 40 time steps and solving the resulting eigenvalue equation. In Figure 2, the peaks of the distributions have been shifted to a phase of π in order to facilitate their comparison.

For a given set of neural parameters (N, a, θ) and no input synchronization, r_in = 0, variations in the average input rate, λ̄_in, cause the average rate of output spikes, λ̄_out, to vary in a monotonic fashion. In the limit of a large number of input fibers, N → ∞, there is a critical value of the input rate, (λ̄_in)_crit, below which the average rate of outputs vanishes (Burkitt & Clark, 2000):

$$
(\bar\lambda_{\mathrm{in}})_{\mathrm{crit}} = \frac{\theta}{N a \tau}. \tag{3.1}
$$
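As a quick sanity check of equation 3.1, the critical rate can be computed for the parameter choice used in the figures, a = θ/N, for which the critical rate reduces to 1/τ independent of N. This is an illustrative sketch with our own variable names:

```python
def critical_input_rate(theta, N, a, tau):
    """Critical average input rate (eq. 3.1): below this value the
    average output rate vanishes in the N -> infinity limit."""
    return theta / (N * a * tau)

# With EPSP amplitude a = theta/N (as in Figures 1-3) and tau = 1,
# the critical rate is 1/tau = 1.0 regardless of N.
theta, tau = 1.0, 1.0
for N in (16, 32, 64, 128):
    print(N, critical_input_rate(theta, N, a=theta / N, tau=tau))
```

This makes explicit why the subcritical examples below use ω = λ̄_in = 0.9, 0.5, or 0.35: they lie below the critical value set by the chosen amplitude scaling.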
A. N. Burkitt and G. M. Clark
Figure 3: Plot of the average rate of output, λ̄_out, as a function of the synchronization index of the input, r_in, for N = 64 (dotted lines) and N = 128 (solid lines) inputs, for input frequencies of ω = 0.8, 0.9, 1.0, 1.2 (the average input rate, λ̄_in, is equal to the frequency, ω, namely one spike per period of the inputs). The amplitudes of the individual EPSPs are a = θ/N.
For finite values of N, when the synchronization of the inputs increases in this "subcritical" situation (when λ̄_in < (λ̄_in)_crit), the coincident arrival of the inputs can result in a nonvanishing rate of outputs (Kempter et al., 1998). This is illustrated in Figure 3, in which the average rate of outputs, λ̄_out, is plotted as a function of the input synchronization, r_in, for N = 64 (dotted lines) and N = 128 (solid lines) and a range of values of the input frequencies (the average input rates here are one spike per period in all cases, so that λ̄_in = ω). The plotted points are the results of the analytical solution and are connected by lines for the various values of ω shown on the figure. The results show that an increase in the synchronization of the input, while keeping the average input rate constant, leads to an increase in the average output rate. The effect on the interspike interval distribution is illustrated in Figure 4, in which the neural parameters have the same values as in Figure 1 but the input frequencies (and input rates) take the values ω = 0.9 (left plot) and 0.8 (right plot); the resulting output synchronization indices are r_out = 0.91 and 0.92, respectively. Consequently, the effect of a reduction in the frequency and average rate of inputs is both to decrease the average rate of output spikes and to increase the number of peaks in the ISI distribution.
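The synchronization (vector-strength) indices r_out quoted throughout can be estimated from a simulated spike train as the magnitude of the mean unit phase vector. A minimal illustrative sketch (our own function names, not the paper's code):

```python
import cmath

def synchronization_index(spike_times, period):
    """Vector strength of a spike train relative to a stimulus period:
    1 = perfect phase locking, 0 = uniformly distributed phases."""
    phases = [2.0 * cmath.pi * (t % period) / period for t in spike_times]
    mean_vector = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return abs(mean_vector)

# Perfectly locked spikes (one per cycle, identical phase) give r = 1.
locked = [k + 0.25 for k in range(100)]
print(synchronization_index(locked, period=1.0))  # -> 1.0
```

Applied to spike times drawn from the model, this statistic is what the figures report as r_out.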
Figure 4: Plots of the interspike interval distribution, ρ(t; ω), against time (in units of the stimulus period, T) for N = 64 inputs. The average input rate is λ̄_in = ω, input synchronization is r_in = 0.5, and amplitudes of the individual EPSPs are a = θ/64. The input frequencies are (a) ω = 0.9 and (b) ω = 0.8.
3.2 Analysis of Synchronization. The dependence of the synchronization index of the output spikes on the frequency of the time-varying rate of inputs is illustrated in Figure 5, which shows how the synchronization index decreases for increasing frequencies. The average rate of the inputs, λ̄_in, in this plot is the same as the frequency; that is, there is on average one incoming spike per fiber per cycle of the inputs. For each of the four values of N = 16, 32, 64, 128, the amplitude of the individual EPSPs is given by a = θ/N, where θ is the difference between the threshold and reset values of the potential. The synchronization index is large for all cases in this plot and closest to one when the number of inputs is greatest (the case N = 128 in Figure 5). The synchronization index of the sinusoidal input rate is r_in = 0.5, so the outputs represent an enhancement of the synchronization relative to the input. The cutoff on the low-frequency side of the plot is determined by the vanishing of the probability of firing (no output spike being produced), and on the high-frequency side, values are plotted up to where the entrainment approaches 100% (output spikes always being generated in the first cycle of the inputs). The reduction of the output synchronization at higher frequencies observed in Figure 5 is predominantly caused by the resultant high input rate, since the input rate on each fiber was chosen to average one spike per cycle of the input. Figure 6 shows the relationship between the synchronization index of the output spikes and the frequency of the inputs when the average rate of inputs is kept constant, λ̄_in = 1.0. The decrease in output synchronization with increasing frequency is much more gradual in this case. The nonmonotonic part at lower frequencies is ascribed to resonance effects that
Figure 5: The synchronization index of the output, r_out, is plotted against the frequency of the input, ω (measured in units with τ = 1), for a sinusoidally varying rate of inputs with synchronization index r_in = 0.5. The amplitude of individual EPSPs is a = θ/N. The average input rate, λ̄_in, is the same as the frequency in all cases, and the numbers of inputs on the plot are N = 16, 32, 64, 128.
occur when the width of the EPSPs is of the same order as the period of the inputs (Kempter et al., 1998). The dependence of the synchronization index of the output spikes on the synchronization index of the input, r_in, is illustrated in Figure 7 for a frequency of ω = 1.0 and an average input rate of λ̄_in = 1.0 (one spike per input fiber per cycle). The synchronization index of the input, equation 2.4, varies from 0.0 to 0.5, so that all points on the plot show a greater degree of synchronization of the output than is evident in the input. This enhancement of synchronization is more pronounced for large numbers of inputs. In the subcritical case discussed in section 3.1, where λ̄_in < (λ̄_in)_crit, this enhancement of synchronization is more pronounced, as can be seen in Figure 7b. In this plot, the neural parameters have the same values as in Figure 7a, but the input frequency and rate are ω = λ̄_in = 0.9, below the critical value of 1.0. This same pattern of enhancement of synchronization for subcritical rates of input is also illustrated in Figure 8, for which (λ̄_in)_crit = 0.5. The primary factor in the observed enhancement of synchronization is the rate of inputs. This is illustrated in Figure 9, in which the top left plot shows the results for the case where the input frequency and rate are the
Figure 6: The synchronization index of the output, r_out, is plotted against the frequency, ω (measured in units with τ = 1), for a sinusoidally varying rate of inputs with synchronization index r_in = 0.5. The amplitude of individual EPSPs is a = θ/N. The average input rate is kept constant, λ̄_in = 1.0, and the numbers of inputs on the plot are N = 16, 32, 64, 128.
Figure 7: The output synchronization index, r_out, is plotted as a function of the input synchronization index, r_in, for (a) a frequency of ω = 1.0 and average input rate λ̄_in = 1.0 (in units with τ = 1) and (b) a frequency of ω = 0.9 and average input rate λ̄_in = 0.9. The amplitude of individual EPSPs is a = θ/N. The number of inputs on both plots is N = 16, 32, 64, 128 (bottom to top plots, respectively).
Figure 8: The output synchronization index, r_out, is plotted as a function of the input synchronization index, r_in, for N = 16, 32, 64, 128 (bottom to top plots, respectively). The amplitude of the individual EPSPs is a = θ/2N. The input rate, λ̄_in, and frequency, ω, are the same in each plot, with (a) ω = λ̄_in = 0.5 and (b) ω = λ̄_in = 0.35.
same, and the top right plot shows the results when the average input rate is fixed at λ̄_in = 1.0 for all four frequencies. The plots clearly indicate that it is the reduction in the input rate that leads to the enhancement of synchronization at subcritical input rates.

A comparison of the results obtained using the analysis of section 2 with the results from numerical simulations is presented in Figure 10. The plots show the output synchronization index for a number of inputs ranging from N = 16 to 1024 and two values of the input synchronization index, r_in = 0.1, 0.2, with an input rate and frequency both of 1.0. The results of the numerical simulation are connected by the dotted line, and the error bars are the statistical error over simulations with 10,000 output spikes, which decrease as the number of inputs, N, increases. The agreement between the two sets of results increases for larger values of N.

3.3 Analysis of Mutual Information. The above results indicate that a subthreshold mean rate of input, λ̄_in < (λ̄_in)_crit, generates an output synchronization, r_out, which increases as the number of input fibers, N, is increased. However, in this regime, the average output rate of firing, λ̄_out, decreases dramatically, as shown in Figure 3. Both quantities, λ̄_out and r_out, provide some indication of the amount of sensory information that a sensory mechanism can extract from its input. The question then remains: how much information can be gained from subthreshold input on the basis of the output spikes that it generates? We address this question using the mutual
Figure 9: The synchronization index, r_out, is plotted as a function of the synchronization index of the input, r_in, for N = 128 inputs. The amplitude of individual EPSPs is a = θ/N. The average input rate and frequency are (a) equal, λ̄_in = ω (average of one input spike per period), (b) constant average input rate λ̄_in = 1.0 with frequencies ω = 0.8, 0.9, 1.0, and 1.2 (top to bottom plots, respectively, in both a and b), and (c) constant frequency ω = 1.0 with average input rates λ̄_in = 0.8, 0.9, 1.0, and 1.2.
information I_M between the input phase of the stimulus and the timing of the output spikes, as described in section 2.6. For subthreshold input, Figure 11 shows how the mutual information decreases as the input rate decreases and as the number of inputs increases. The range of parameters plotted in this figure is limited to the regime in which the average output spike rate is low, in order that the probability of there being one output spike per cycle is low (and the probability of two output spikes in one cycle is negligible), as required by the approximation of the mutual information (see section 2.6). These results indicate that although the output synchronization increases when there is a small number of output spikes, the amount of information that can be gained about the phase of the input signal decreases. Likewise, although the results illustrated in Figures 6, 7, 8, and 10 indicate that the output synchronization increases with increasing N (and, hence, decreasing postsynaptic amplitude a), the mutual information between the input phase and spike time decreases.
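The notion of mutual information between a discretized stimulus phase and an output symbol can be illustrated with a generic plug-in estimator. This is only an illustrative sketch of the quantity being measured, not the specific approximation of section 2.6:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples,
    e.g. (input phase bin, output spike phase bin)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A deterministic relation over 4 equiprobable phase bins carries log2(4) bits.
samples = [(b, b) for b in range(4) for _ in range(25)]
print(mutual_information(samples))  # -> 2.0
```

With few output spikes per cycle, the samples available to such an estimate become sparse, which is the regime the paper's low-rate approximation addresses.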
Figure 10: The output synchronization index, r_out, is plotted as a function of the number of input fibers, N, for two values of the input synchronization index, r_in = 0.1, 0.2. The frequency is ω = 1.0 and the average input rate λ̄_in = 1.0 (in units with τ = 1). The amplitude of individual EPSPs is a = θ/N. The solid line connects the results of the solution using the first passage time density, and the dotted line connects the results of a numerical simulation, with the error bars representing the statistical error for 10,000 output spikes.
4 Discussion

The analysis presented in this study establishes the relationship between both the timing and average rate of spike outputs using the gaussian approximation in an integrate-and-fire neuron with the neural parameters (number of afferent fibers N, PSP amplitudes a, threshold θ, membrane time constant τ) and the input parameters of the periodically modulated spike input (frequency ω and average rate λ̄_in). For any given set of parameters, the ISI distribution may be evaluated, and representative plots are given in Figures 1 and 4. The period distribution may be evaluated, as illustrated in Figure 2. The relationship between the average rate of the output spikes, λ̄_out, and the degree of synchronization of the input spikes is illustrated in Figure 3 for a number of frequencies. The frequency dependence of the synchronization of the output spikes is shown in Figures 5 and 6 for the two cases where the input rate is the same as the input frequency (an average of one spike per cycle of the modulated rate) and where the input rate is kept fixed as the frequency varies. In each case, the effect of
Figure 11: The mutual information, I_M, is plotted as (a) a function of the average input rate, λ̄_in, for two values of the number of inputs, N = 64 (solid line) and 128 (dashed line), and (b) a function of the number of inputs with λ̄_in = 0.8. In both plots, the frequency is ω = λ̄_in (in units with τ = 1), and the input synchronization index is r_in = 0.5. The amplitude of individual EPSPs is a = θ/N.
increasing the number of incoming fibers is to increase the synchronization index of the output spikes. The importance of the critical input rate, which may equivalently be formulated as a critical threshold (Kempter et al., 1998), is demonstrated in Figures 7b and 8, which show that the output synchronization is enhanced for subcritical input rates. That this subcritical enhancement is due primarily to the average input rate and not the frequency is illustrated in Figure 9. The results obtained with our method are compared with the results of numerical simulations in Figure 10, which shows that the method becomes more accurate as the number of inputs increases. However, the higher synchronization of the output spikes observed for smaller input rates λ̄_in and larger N (or, equivalently, smaller a) cannot be interpreted as meaning that the neuron functions optimally in this regime. The calculation of the mutual information between the input phase and output spike times in this regime indicates that the mutual information decreases as the number of inputs increases, as shown in Figure 11. The rapid decrease in the number of output spikes in this regime is responsible for the observed decrease in the amount of mutual information.
This study extends earlier studies of coincidence detection and the neural response to noisy periodic synaptic input in a number of important ways. Many of the earlier examinations of the question of coincidence detection were numerical, in which the response to a train of spike inputs was simulated, using either a threshold model with a shot-noise response (Colburn, Han, & Culotta, 1990; Carney, 1992; Joris, Carney, Smith, & Yin, 1994; Diesmann, Gewaltig, & Aertsen, 1999) or a membrane-conductance model (Murthy & Fetz, 1994; Rothman, Young, & Manis, 1993; Rothman & Young, 1996). The results of such simulations are, of course, subject to statistical uncertainties in the accuracy of the results obtained and frequently involve massive computational resources, which limits the parameter space that can be explored. Nevertheless, such studies have been instrumental in establishing many of the central features of the neural response, such as the importance of the convergence of inputs to the postsynaptic neuron and the effect of the coincident arrival of the inputs (Joris et al., 1994). A comprehensive analytical study of the effect of noisy periodic spike input to an integrate-and-fire neuron has been carried out (Kempter et al., 1998), in which the relationship between the input and output rates over the range of input synchronization is analyzed and the conditions under which a neuron can act as a coincidence detector are identified by analyzing the coherence gain and the quality factor for coincidence detection. Their analysis, however, centered on the rate at which output spikes were generated and did not give detailed information on the timing of individual output spikes. The results presented here extend their study by providing an analysis of the timing distribution of output spikes in relation to the phase of the periodic input.
Our analysis enables us to find the interspike interval distribution of the output spikes, the synchronization index of the outputs, and the phase distribution of the outputs. The results presented here illustrate some of the features of stochastic resonance: namely, how the presence of noise can enhance the response of a threshold system to a subthreshold signal. Stochastic resonance, which was originally proposed as an explanation for the ice ages (Benzi, Sutera, & Vulpiani, 1981), has been found to play a role in the sensory pathways of various organisms (Douglas et al., 1993; Braun, Wissing, Schaefer, & Hirsch, 1994). Stochastic resonance has been extensively studied using a two-well potential model in which, in the absence of both noise and external forcing, there are two stable solutions. If the periodic external forcing (signal) is not sufficiently large to overcome the potential barrier, then the system remains in its initial state, while in the presence of noise, the system may jump between the two states at random times. Such transitions are found to be correlated with the (subthreshold) signal and cause a peak in the power spectrum at the frequency of the signal (Benzi et al., 1981). In this way, an increase in the noise can cause an improvement in the output signal-to-noise ratio. The resemblance between interspike interval
histograms and the residence-time distributions of noisy driven bistable systems led to the application of stochastic resonance to neural systems (Bulsara et al., 1991; Longtin, 1993; Chialvo & Apkarian, 1993; Moss, Douglas, Wilkens, Pierson, & Pantozelou, 1993; Bulsara, Elston, Doering, Lowen, & Lindenberg, 1996; Stemmler, 1996; Chialvo et al., 1997), and in particular to the integrate-and-fire neural model (Plesser & Geisel, 1999a, 1999b; Shimokawa et al., 1999). The results presented in Figures 7b and 8b show that the synchronization index of the output spikes for subcritical inputs (see section 3.2) is enhanced relative to that of critical or supracritical inputs, and thus show that stochastic resonance plays a role (Hohn & Burkitt, 2001). There are a number of approximations in the analysis presented here whose effect requires closer examination. Some of the strengths and shortcomings of the integrate-and-fire model were discussed in section 1, such as the lack of specific currents that underlie spiking and the linear nature of the summation of the postsynaptic responses. The model therefore does not incorporate any of the postulated nonlinear effects in the dendritic tree (Softky, 1994), nor is it capable of modeling the dynamic nature of synapses (Tsodyks & Markram, 1997; Abbott, Varela, Sen, & Nelson, 1997). The gaussian approximation used in this analysis requires that the amplitude of the individual PSPs be small, since the higher-order terms in equations 2.8 and 2.16 are neglected. This is a reasonable approximation in situations where many EPSPs are required to reach threshold, and the comparison with the results obtained by numerical simulation in Figure 10 indicates excellent agreement in the case where there is a large number of small-amplitude inputs.
The expression for the conditional first-passage time density, equation 2.12, is exact for situations where the inputs are uncorrelated, but may introduce errors when the presynaptic spikes are not generated by an (inhomogeneous) Poisson process or when the synaptic currents have a finite duration (i.e., when more general synaptic response functions are used, such as those that include finite rise times of the postsynaptic response). However, it is straightforward to extend the analysis presented here to other forms of periodically varying input, such as the sum of gaussian distributions (Kempter et al., 1998). In the analysis presented here, we have ignored the refractory properties of neurons, in which there is a short time immediately following the generation of a spike during which subsequent spikes are suppressed. It is possible to include an absolute refractory period in a straightforward manner by introducing a dead time after a spike is generated, during which incoming postsynaptic potentials have no effect and no further output spike can be generated. We denote this absolute refractory time by t_AR. Its effect is to delay the subsequent summation of the postsynaptic potentials following a spike, and it therefore corresponds to introducing a phase shift of φ_AR = ω t_AR. The conditional output spike density incorporating this dead time, which we denote by f_θ,AR(t; ω, φ), is then simply given by the
phase-shifted original conditional output spike density,

$$
f_{\theta,\mathrm{AR}}(t;\omega,\varphi) =
\begin{cases}
0 & \text{for } t < t_{\mathrm{AR}}, \\
f_{\theta}(t - t_{\mathrm{AR}};\, \omega,\, \varphi + \varphi_{\mathrm{AR}}) & \text{for } t \ge t_{\mathrm{AR}},
\end{cases} \tag{4.1}
$$
such that the summation of the potential starts with a time delay of t_AR, which is also reflected in a shift of the initial phase φ → φ + φ_AR. It is straightforward to extend the analysis presented here to include the effect of a distribution of synaptic strengths and of inhibitory PSPs (Burkitt & Clark, 2000), which is important because of the role that inhibition plays in synchronizing networks of neurons (Nischwitz & Glünder, 1995; Bush & Sejnowski, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Neltner, Hansel, Mato, & Meunier, 2000; Tiesinga & José, 2000). It is also possible to include the effect of reversal potentials (Burkitt, 2000, 2001). In order to match the results of the analysis with experimental results of measurements in the auditory pathway, it would also be necessary to consider the distribution of axonal propagation times (i.e., the jitter in the spike travel times along the nerve fibers) that results from differences in axonal lengths, axonal diameters, internodal distances, cell-body area, and cell-body shape (Liberman & Oliver, 1984). This could be done by introducing a timing delay modeled as a gaussian distribution with a width matched to the experimentally observed spread in axonal propagation times, as originally proposed by Anderson (1973) to explain the observed loss of synchronization in auditory nerve spikes with increasing stimulus frequency; that is, the period histogram goes from a pure half-sine-wave distribution at very low frequencies to a completely flat distribution at high frequencies, where all phase information is lost. Even without the inclusion of axonal delays, the results presented here show a fall-off of synchronization with increasing stimulus frequency, although not as rapidly as observed in recordings from the auditory nerve.
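The dead-time construction of equation 4.1 amounts to a simple wrapper around the original conditional density: zero during the refractory period, then the original density shifted in both time and initial phase. A sketch with a toy density standing in for f_θ (all names are ours):

```python
from math import exp

def with_dead_time(f, t_ar, omega):
    """Wrap a conditional output spike density f(t, phase) with an
    absolute refractory period t_ar, as in eq. 4.1: zero during the
    dead time, then f shifted by t_ar in time and omega*t_ar in phase."""
    phi_ar = omega * t_ar

    def f_ar(t, phi):
        if t < t_ar:
            return 0.0
        return f(t - t_ar, phi + phi_ar)

    return f_ar

# Toy density (an exponential, not the paper's f_theta) to show the shift.
toy = lambda t, phi: exp(-t)
shifted = with_dead_time(toy, t_ar=0.5, omega=1.0)
print(shifted(0.25, 0.0))  # -> 0.0 (inside the dead time)
```

For t ≥ t_AR, `shifted(t, phi)` simply equals `toy(t - 0.5, phi + 0.5)`, mirroring the phase shift φ → φ + φ_AR.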
The results also show the same pattern of increase in synchronization for low output rates that is observed in the auditory nerve (Johnson, 1980), although a more detailed comparison would require the inclusion of the axonal delays. A model for individual auditory nerve fiber responses enables the modeling of responses of the various cell types in the cochlear nucleus, which is the first stage of processing in the auditory pathway. The enhancement of synchronization observed in the results of this study has also been observed in recordings of the anteroventral cochlear nucleus (AVCN) (Joris et al., 1994). Indeed, the low-frequency fibers in the trapezoid body, which is the output tract of the AVCN, show very precise phase locking, with synchronization indices that are mostly larger than 0.9 and can approach 0.99 (Joris et al., 1994), considerably larger than any synchronization index observed in the auditory nerve fibers. The different neural parameters of the various cell types in the cochlear nucleus (and for nuclei in higher stages of the auditory pathway) provide them with different spatiotemporal
summation properties that will be reflected in the temporal characteristics of their spike outputs and the resulting coding of various perceptual features of the acoustic signal (Bruce, Irlicht, & Clark, 1998). For example, it is possible to incorporate systematic differences between the phases of the inputs, such as for modeling inputs from the auditory nerve to neurons in the cochlear nucleus, which originate at points distributed along a section of the basilar membrane and therefore have relative phase differences due to the traveling wave of the basilar membrane (Bruce et al., 1998; Kuhlmann, Burkitt, Paolini, & Clark, 2001). In conclusion, this study quantifies the degree to which the output spikes remain phase-locked to the periodic input and thus are able to retain and even enhance the temporal information contained in the inputs. The existence of both rate and temporal information in neural transmission and processing plays an important role in the auditory system and possibly in the higher centers of the brain.

Appendix A: Analysis of the Perfect Integrator Neural Model

One of the simplest neural models is the perfect integrator (or leakless integrate-and-fire) model, in which there is no decay of the potential with time. Although neglecting the membrane decay constant is clearly unrealistic, the model nevertheless provides a reasonable approximation when the synaptic inputs summate to threshold over a much shorter timescale than the decay constant. The solution for the case in which there is a homogeneous Poisson process of rate λ with only excitatory synaptic input follows straightforwardly from the Poisson statistics. If the threshold is reached with m_θ synaptic inputs, where the individual postsynaptic inputs are assumed to have uniform amplitude, then the output spike distribution is exactly the distribution of the m_θ-th input,

$$
f_{m_\theta}(t) = \frac{\lambda^{m_\theta}\, t^{m_\theta - 1}\, e^{-\lambda t}}{(m_\theta - 1)!}. \tag{A.1}
$$
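Equation A.1 is the Erlang (gamma) density of the m_θ-th arrival of a Poisson process. A minimal sketch using only the standard library (variable names are ours):

```python
from math import exp, factorial

def erlang_density(t, m_theta, lam):
    """Density of the m_theta-th arrival of a Poisson process of rate
    lam (eq. A.1): the output spike density of the perfect integrator
    with purely excitatory, uniform-amplitude input."""
    return lam**m_theta * t**(m_theta - 1) * exp(-lam * t) / factorial(m_theta - 1)

# Crude check that the density has unit mass (Riemann sum on a fine grid).
dt = 0.001
total = sum(erlang_density(k * dt, 10, 2.0) * dt for k in range(1, 20000))
print(round(total, 3))  # -> 1.0
```

The mean first-passage time, m_θ/λ, follows directly from this density: the neuron fires on average after m_θ interarrival times of mean 1/λ.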
The solution of the spike output density for the homogeneous Poisson process with both excitatory and inhibitory inputs may be obtained by using the renewal equation for the first passage time density, obtained using Laplace transforms (Tuckwell, 1988b),

$$
f_{m_\theta}(t) = \frac{m_\theta}{t} \left(\frac{\lambda_E}{\lambda_I}\right)^{m_\theta/2} \exp\{-(\lambda_E + \lambda_I) t\}\; I_{m_\theta}\!\left(2t\sqrt{\lambda_E \lambda_I}\right), \tag{A.2}
$$

where λ_E, λ_I are the Poisson rates of excitatory and inhibitory inputs, respectively, m_θ is the value of the threshold above the resting potential (in units of the amplitude of the identical individual postsynaptic inputs), and I_m is the modified Bessel function.
A solution may also be obtained by approximating the random arrival times using a Wiener process (Gerstein & Mandelbrot, 1964). The solution for the spike output distribution, again identical to the density of the first passage time to threshold and obtained using the renewal equation and Laplace transforms, is (Gerstein & Mandelbrot, 1964; Tuckwell, 1988b)

$$
f_\theta(t) = \frac{\theta}{\sqrt{2\pi \sigma^2 t^3}}\, \exp\left\{-\frac{(\theta - \mu t)^2}{2 \sigma^2 t}\right\}, \tag{A.3}
$$
where the threshold is given by V_th = v_0 + θ. The mean and variance are given by

$$
\mu = N a \lambda, \qquad \sigma^2 = N a^2 \lambda, \tag{A.4}
$$
where N is the number of incoming fibers, each with a Poisson rate λ of inputs that have an amplitude a. The solution for the inhomogeneous Poisson process is derived from the above solution for the homogeneous case by noting that an inhomogeneous Poisson process with a time-dependent rate λ(t) can be converted to a standard temporally homogeneous Poisson process by the change in timescale (Tuckwell, 1989),

$$
t' = \int_0^t \lambda(s)\, ds \equiv \Lambda(t). \tag{A.5}
$$

This transformation explicitly gives the conditional output spike density at time t:

$$
f_\theta(t;\, \omega, \varphi_0) = \lambda(t)\, \hat f_\theta(\Lambda(t)) = \frac{\theta\, \lambda(t)}{\sqrt{2\pi N a^2 \Lambda^3(t)}}\, \exp\left\{-\frac{(\theta - N a \Lambda(t))^2}{2 N a^2 \Lambda(t)}\right\}. \tag{A.6}
$$
This expression is valid for any inhomogeneous Poisson process with rate λ(t) that is periodic with frequency ω and initial phase φ_0. Inhibitory as well as excitatory inputs can be incorporated into this expression as long as the inhibition has the same rate function λ(t). The expressions for μ and σ in the homogeneous (constant rate) case of equation A.3 become

$$
\mu = N_E a_E \lambda_E - N_I a_I \lambda_I, \qquad \sigma^2 = N_E a_E^2 \lambda_E + N_I a_I^2 \lambda_I, \tag{A.7}
$$

where N_E (N_I) is the number of excitatory (inhibitory) incoming fibers, each with a Poisson rate of intensity λ_E (λ_I) and amplitude of inputs a_E (a_I).
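The time change A.5 can be made concrete for a sinusoidally modulated rate. This sketch assumes the common form λ(t) = λ̄(1 + 2 r_in cos(ωt + φ₀)), for which Λ(t) has a closed form; the exact λ(t) of equation 2.4 may differ, so treat this as illustrative:

```python
from math import cos, sin, pi

def rate(t, lam_bar, r_in, omega, phi0):
    """Assumed sinusoidal input rate with synchronization index r_in
    (nonnegative for r_in <= 0.5)."""
    return lam_bar * (1.0 + 2.0 * r_in * cos(omega * t + phi0))

def capital_lambda(t, lam_bar, r_in, omega, phi0):
    """Integrated rate Lambda(t) = int_0^t lambda(s) ds (eq. A.5)."""
    return lam_bar * (t + 2.0 * r_in / omega
                      * (sin(omega * t + phi0) - sin(phi0)))

# Over one full period the modulation averages out: Lambda(T) = lam_bar * T.
omega = 2.0 * pi              # frequency 1.0 in units of tau
T = 2.0 * pi / omega
print(round(capital_lambda(T, 1.0, 0.5, omega, 0.3), 6))  # -> 1.0
```

Substituting Λ(t) and λ(t) into A.6 then gives the conditional output spike density for this modulated input.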
Consequently, when inhibition is included in equation A.6, the expression Na² in both denominators becomes N_E a_E² + N_I a_I², and the term Na in the numerator of the exponential becomes N_E a_E − N_I a_I. The interspike interval distribution is obtained by the appropriate average over the initial phase, as given by equation 2.21, using the stationary spike phase density of the phase transition matrix, as described in section 2.5. For the perfect integrator model, the synchronization index of the output spikes is exactly equal to the synchronization index of the inputs, as is straightforwardly confirmed by numerical simulation.

Appendix B: Numerical Solution of the Volterra Integral Equation

The integral equation that is to be solved numerically is a Volterra equation of the first kind,

$$
g(t) = \int_0^t ds\, K(t, s)\, \rho(s), \tag{B.1}
$$
which we wish to solve for ρ(t). In the situation here, the function K(t, s) has a singularity of (t − s)^{−1/2}. In addition, we wish to evaluate ρ(t) at evenly spaced intervals, since this function is required for the stationary spike phase distribution, which is solved numerically as an eigenvalue problem at equally distributed phases (see section 2.5). The singularity is eliminated by transforming the integration variable to x = √(t − s) (Press et al., 1992). Consequently,

$$
g(t) = \int_0^{\sqrt{t}} dx\, 2x\, K(t, t - x^2)\, \rho(t - x^2), \tag{B.2}
$$
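The effect of the substitution x = √(t − s) can be checked on a kernel with a known integral: for the purely singular case K(t, s)·ρ(s) = (t − s)^{−1/2}, the exact value is 2√t, and the transformed integrand is the constant 2. An illustrative sketch (our own function names):

```python
from math import sqrt

def singular_quadrature(t, n):
    """Midpoint Riemann sum of int_0^t (t-s)^(-1/2) ds: the integrand
    blows up at s = t, so convergence is slow."""
    h = t / n
    return sum(h / sqrt(t - (k + 0.5) * h) for k in range(n))

def transformed_quadrature(t, n):
    """Same integral after x = sqrt(t - s) (eq. B.2): the integrand
    2x * (1/x) = 2 is regular, so even a coarse grid is exact."""
    h = sqrt(t) / n
    return sum(2.0 * h for _ in range(n))

t = 1.0  # exact value is 2*sqrt(t) = 2.0
print(round(transformed_quadrature(t, 10), 6))  # -> 2.0
print(round(singular_quadrature(t, 10), 3))     # underestimates; not converged
```

The same regularization is what makes the discretization below accurate with evenly spaced t-values.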
and the integrand is now completely regular. In order to discretize the t-domain of the integral into m evenly spaced intervals of size h = t/m, it is necessary to carry out a nonlinear discretization ĥ in the x-domain (Press et al., 1992):

$$
t_k = kh, \qquad
x_k = \sqrt{t_m - t_{m-k}} = \sqrt{kh}, \quad k = 1, 2, \ldots, m, \qquad
\hat h_k = x_k - x_{k-1} = \sqrt{h}\left(\sqrt{k} - \sqrt{k-1}\right). \tag{B.3}
$$
The algorithm for calculating the first passage time density ρ(t) is then

$$
\rho_0 = 0, \qquad
\rho_m = \frac{g_m - \sum_{k=1}^{m-1} \left(\hat h_{m-k} + \hat h_{m-k+1}\right) \sqrt{t_m - t_k}\; K_{m,k}\, \rho_k}{\hat h_1 H}, \qquad m = 1, 2, \ldots, \tag{B.4}
$$
where

$$g_m = p(V_{th}, mh \mid v_0;\, v, w)$$
$$\rho_m = f_h(mh;\, v, w)$$
$$K_{m,k} = p(V_{th}, mh \mid V_{th}, kh, v_0)$$
$$H = \lim_{s \to t} \sqrt{t - s}\; p(V_{th}, t \mid V_{th}, s, v_0). \tag{B.5}$$
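For concreteness, the scheme of equations B.3 and B.4 can be sketched in code. This is not the authors' implementation: the function names are ours, and the test problem (kernel $K(t,s) = (t-s)^{-1/2}$, $g(t) = \tfrac{4}{3}t^{3/2}$, exact solution $\rho(s) = s$, which satisfies the $\rho(0) = 0$ assumption built into the scheme) is an illustrative choice, not the first-passage-time kernel of the text.

```python
import math

def solve_volterra_first_kind(g, K, H, t_max, m):
    """Product-trapezoidal scheme of equations B.3-B.4 for
    g(t) = int_0^t ds K(t,s) rho(s), where K(t,s) ~ (t-s)^(-1/2)
    near s = t and H = lim_{s->t} sqrt(t-s) K(t,s).
    Assumes rho(0) = 0, as for a first passage time density."""
    h = t_max / m                      # uniform step in t
    t = [k * h for k in range(m + 1)]
    # nonuniform mesh widths in x = sqrt(t - s):
    # h_hat[k] = sqrt(h) * (sqrt(k) - sqrt(k - 1))
    h_hat = [0.0] + [math.sqrt(h) * (math.sqrt(k) - math.sqrt(k - 1))
                     for k in range(1, m + 1)]
    rho = [0.0] * (m + 1)              # rho_0 = 0
    for n in range(1, m + 1):
        acc = 0.0
        for k in range(1, n):
            # sqrt(t_n - t_k) * K(t_n, t_k) is finite despite the singular K
            acc += ((h_hat[n - k] + h_hat[n - k + 1])
                    * math.sqrt(t[n] - t[k]) * K(t[n], t[k]) * rho[k])
        rho[n] = (g(t[n]) - acc) / (h_hat[1] * H)
    return t, rho

# test problem: K(t,s) = (t-s)^(-1/2), rho(s) = s, hence g(t) = (4/3) t^(3/2)
t, rho = solve_volterra_first_kind(
    g=lambda t: (4.0 / 3.0) * t ** 1.5,
    K=lambda t, s: 1.0 / math.sqrt(t - s),
    H=1.0, t_max=1.0, m=200)
```

The update is explicit: each $\rho_n$ is obtained from the already-computed $\rho_1, \ldots, \rho_{n-1}$, with the diagonal (singular) contribution isolated in the $\hat{h}_1 H$ denominator.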
Acknowledgments

We thank an anonymous referee for proposing the calculation given here of the mutual information. This work was funded by the Cooperative Research Centre for Cochlear Implant, Speech and Hearing Research, the Bionic Ear Institute, and the National Health and Medical Research Council (NHMRC, Project Grant 990816) of Australia.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Anderson, D. J. (1973). Quantitative model for the effects of stimulus frequency upon synchronization of auditory nerve discharges. J. Acoust. Soc. Am., 54, 361–364.
Benzi, R., Sutera, A., & Vulpiani, A. (1981). The mechanism of stochastic resonance. J. Phys. A, 14, L453–L457.
Braun, H. A., Wissing, H., Schaefer, K., & Hirsch, M. C. (1994). Oscillation and noise determine signal transduction in shark multimodal sensory cells. Nature, 367, 270–273.
Bruce, I. C., Irlicht, L. S., & Clark, G. M. (1998). A mathematical analysis of spatiotemporal summation of auditory nerve firings. Information Sciences, 111, 303–334.
Bulsara, A. R., Elston, T. C., Doering, C. R., Lowen, S. B., & Lindenberg, K. (1996). Cooperative behavior in periodically driven noisy integrate-fire models of neuronal dynamics. Phys. Rev. E, 53, 3958–3969.
Bulsara, A., Jacobs, E. W., Zhou, T., Moss, F., & Kiss, L. (1991). Stochastic resonance in a single neuron model: Theory and analog simulation. J. Theor. Biol., 152, 531–555.
Burkitt, A. N. (2000). Interspike interval variability for balanced networks with reversal potentials for large numbers of inputs. Neurocomput., 32–33, 313–321.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Manuscript submitted for publication.
Burkitt, A. N., & Clark, G. M. (1999). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output in neural systems. Neural Comput., 11, 871–901.
Burkitt, A. N., & Clark, G. M. (2000). Calculation of interspike intervals for integrate-and-fire neurons with Poisson distribution of synaptic inputs. Neural Comput., 12, 1789–1820.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comput. Neurosci., 3, 91–110.
Carney, L. H. (1992). Modelling the sensitivity of cells in the anteroventral cochlear nucleus to spatiotemporal discharge patterns. Phil. Trans. R. Soc. Lond., B336, 403–406.
Chialvo, D. R., & Apkarian, A. V. (1993). Modulated noisy biological dynamics: Three examples. J. Stat. Phys., 70, 375–391.
Chialvo, D. R., Longtin, A., & Müller-Gerking, J. (1997). Stochastic resonance in models of neuronal ensembles. Phys. Rev. E, 55, 1798–1808.
Colburn, H. S., Han, Y., & Culotta, C. P. (1990). Coincidence model of MSO responses. Hear. Res., 49, 335–346.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Destexhe, A. (1997). Conductance-based integrate-and-fire models. Neural Comput., 9, 503–514.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Douglas, J. K., Wilkens, L., Pantozelou, E., & Moss, F. (1993). Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature, 365, 337–340.
Douglas, R., & Martin, K. (1991). Opening the grey box. Trends Neurosci., 14, 286–293.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Fauve, S., & Heslot, F. (1983). Stochastic resonance in a bistable system. Phys. Lett., 97A, 5–7.
Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev. Mod. Phys., 70, 223–287.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstein, G. L., & Kiang, N. Y.-S. (1960). An approach to the quantitative analysis of electrophysiological data from single neurons. Biophys. J., 1, 15–28.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Glass, L., & Mackey, M. C. (1979). A simple model for phase locking of biological oscillators. J. Math. Biol., 7, 339–352.
Goldberg, J. M., & Brown, P. B. (1969). Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: Some physiological mechanisms of sound localization. J. Neurophysiol., 32, 613–636.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Hohn, N., & Burkitt, A. N. (2001). Shot noise in the leaky integrate-and-fire neuron. Phys. Rev. E, 63, 031902.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Iyengar, S., & Liao, Q. (1997). Modeling neural activity using the generalized inverse Gaussian distribution. Biol. Cybern., 77, 289–295.
Jensen, O., & Lisman, J. E. (1996). Hippocampal CA3 region predicts memory sequences: Accounting for the phase precession of place cells. Learning and Memory, 3, 279–287.
Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J. Acoust. Soc. Am., 68, 1115–1122.
Joris, P. X., Carney, L. H., Smith, P. H., & Yin, T. C. T. (1994). Enhancement of neural synchronization in the anteroventral cochlear nucleus. I. Responses to tones at the characteristic frequency. J. Neurophysiol., 71, 1022–1036.
Keener, J. P., Hoppensteadt, F. C., & Rinzel, J. (1981). Integrate-and-fire models of nerve membrane response to oscillatory input. SIAM J. Appl. Math., 41, 503–517.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Comput., 10, 1987–2017.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. Oxford: Oxford University Press.
Köppl, C. (1997). Phase locking to high frequencies in the auditory nerve and cochlear nucleus magnocellularis of the barn owl. J. Neurosci., 17, 3312–3321.
Kuhlmann, L., Burkitt, A. N., Paolini, A., & Clark, G. M. (2001). Analysis of spatiotemporal summation in leaky integrate-and-fire neurons: Application to bushy cells in the cochlear nucleus receiving converging auditory nerve fiber input. Unpublished manuscript. East Melbourne, Victoria: Bionic Ear Institute.
Lánský, P. (1997). Sources of periodical force in noisy integrate-and-fire models of neuronal dynamics. Phys. Rev. E, 55, 2040–2043.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarization. J. Physiol. (Paris), 9, 620–635.
Liberman, M. C., & Oliver, M. E. (1984). Morphometry of intracellularly labeled neurons of the auditory nerve: Correlations with functional properties. J. Comp. Neurol., 223, 164–176.
Longtin, A. (1993). Stochastic resonance in neuron models. J. Stat. Phys., 70, 309–327.
Moss, F., Douglas, J. K., Wilkens, L., Pierson, D., & Pantozelou, E. (1993). Stochastic resonance in an electronic FitzHugh-Nagumo model. Ann. N.Y. Acad. Sci., 706, 26–41.
Murthy, V. N., & Fetz, E. E. (1994). Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Comput., 6, 1111–1126.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous networks of spiking neurons. Neural Comput., 12, 1607–1641.
Nischwitz, A., & Glünder, H. (1995). Local lateral inhibition: A key to spike synchronization? Biol. Cybern., 73, 389–400.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill.
Plesser, H. E., & Geisel, T. (1999a). Bandpass properties of integrate-fire neurons. Neurocomput., 26–27, 229–235.
Plesser, H. E., & Geisel, T. (1999b). Markov analysis of stochastic resonance in a periodically driven integrate-fire neuron. Phys. Rev. E, 59, 7008–7017.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Phys. Lett. A, 225, 228–234.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in Fortran: The art of scientific computing. Cambridge: Cambridge University Press.
Rescigno, A., Stein, R. B., Purple, R. L., & Poppele, R. E. (1970). A neuronal model for the discharge patterns produced by cyclic inputs. Bull. Math. Biophys., 32, 337–353.
Rose, J. E., Brugge, J. F., Anderson, D. J., & Hind, J. E. (1967). Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. J. Neurophysiol., 30, 769–793.
Rothman, J. S., & Young, E. D. (1996). Enhancement of neural synchronization in computational models of ventral cochlear nucleus bushy cells. Aud. Neurosci., 2, 47–62.
Rothman, J. S., Young, E. D., & Manis, P. B. (1993). Convergence of auditory nerve fibers onto bushy cells in the ventral cochlear nucleus: Implications of a computational model. J. Neurophysiol., 70, 2562–2583.
Shimokawa, T., Pakdaman, K., & Sato, S. (1999). Time-scale matching in the response of a leaky integrate-and-fire neuron model to periodic stimulus with additive noise. Phys. Rev. E, 59, 3427–3443.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neurosci., 58, 13–41.
Stein, R. B., French, A. S., & Holden, A. V. (1972). The frequency response, coherence, and information capacity of two neuronal models. Biophys. J., 12, 295–322.
Stemmler, M. (1996). A single spike suffices: The simplest form of stochastic resonance in model neurons. Network: Comput. Neural Syst., 7, 687–716.
Tiesinga, P. H. E., & José, J. V. (2000). Robust gamma oscillations in networks of inhibitory hippocampal interneurons. Network: Comput. Neural Syst., 11, 1–23.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., & McNaughton, B. L. (1996). Population dynamics and theta rhythm phase precession of hippocampal place cell firing: A spiking neuron model. Hippocampus, 6, 271–280.
Tuckwell, H. C. (1988a). Introduction to theoretical neurobiology: Vol. 1, Linear cable theory and dendritic structure. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1988b). Introduction to theoretical neurobiology: Vol. 2, Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Usher, M., Schuster, H. G., & Niebur, E. (1993). Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Comput., 5, 570–586.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
White, J. A., Chow, C. C., Rit, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.

Received May 16, 2000; accepted February 13, 2001.
NOTE
Communicated by Todd Leen
Specification of Training Sets and the Number of Hidden Neurons for Multilayer Perceptrons

L. S. Camargo
[email protected]
T. Yoneyama
[email protected]
Instituto Tecnológico de Aeronáutica, S. J. dos Campos, Brazil
This work concerns the selection of input-output pairs for improved training of multilayer perceptrons, in the context of approximation of univariate real functions. A criterion for the choice of the number of neurons in the hidden layer is also provided. The main idea is based on the fact that Chebyshev polynomials can provide approximations to bounded functions up to a prescribed tolerance, and, in turn, a polynomial of a certain order can be fitted with a three-layer perceptron with a prescribed number of hidden neurons. The results are applied to a sensor identification example.

1 Introduction

The main objective of this work is to present a procedure for selection of a set of input-output pairs and the number of neurons required in the hidden layer that yield improved performance in the training of neural networks that approximate real-valued univariate functions ($f: \mathbb{R} \to \mathbb{R}$). Several authors have shown that three-layered neural networks with sigmoidal units in the hidden layer suffice to approximate a large class of mappings (Cybenko, 1989; Chen, Chen, & Liu, 1995). One problem is the selection of the size of the hidden layer, because underestimated numbers of neurons can lead to poor approximation and generalization capabilities, while excessive nodes could result in overfitting and eventually make the search for the global optimum more difficult (Lee & Coghill, 1995). When the function to be approximated is a polynomial with a known degree, then the number of neurons in the hidden layer can be chosen, for instance, using the results in Scarselli and Tsoi (1998). The number of hidden-layer units is also investigated in Suzuki (1998). On the other hand, the training process could also be improved by a parsimonious yet representative set of input-output pairs. The importance of minimal network construction is pointed out in Kendall and Hall (1992) and Elsken (1997). The problem of minimal topology size can be found in Rudolph (1997).
The interpolation theory based on Chebyshev polynomials provides a convenient way to select the degree of polynomials approximating a given function up to a certain prescribed accuracy and also fixes the input-output pairs. Chebyshev polynomials have been used with neural networks, but in a different context, in Lee and Jeng (1998) and Basios, Bonushkina, and Ivanov (1997). Although other polynomials such as Laguerre's and Legendre's could also be used, the Chebyshev polynomials provide optimal placing of interpolating points and yield minimax absolute error (see, for instance, Burden & Faires, 1985). Universal approximation bounds are treated by Barron (1993, 1994). Combining the polynomial interpolation theory and a theorem by Scarselli and Tsoi (1998), a method is proposed for specification of the training sets and the number of hidden neurons for multilayer perceptrons in a systematic manner. The proposed method is direct in the sense that the number of hidden neurons is fixed a priori rather than by pruning techniques, such as in Chintalapudi and Pal (1998). In principle, one could attempt to extend the results to the multidimensional case by using, for example, techniques such as described by Sendov and Andreev (1994).

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2673–2680 (2001)

2 Universal Approximation

Consider a neural network with one hidden layer. Let the input-output map be described by $y_{NN} = f_{NN}(x, w)$, where $y_{NN} \in \mathbb{R}$ is the output of the network, $x \in X \subset \mathbb{R}$ is the input, and $w \in V \subset \mathbb{R}^m$ is the weight vector. The objective is to adjust $w$ in such a way that the network approximates a given function $y = f(x)$. Given an $\epsilon > 0$, $w$ is to be found so that $|y_i - f_{NN}(x_i, w)| < \epsilon$, where $y_i = f(x_i)$ and the input-output pairs $(y_i, x_i)$, $i = 1, \ldots, n$, are to be chosen conveniently (Weaver, Baird, & Polycarpou, 1998). Results in Cybenko (1989) and Chen et al. (1995) guarantee that if $f: [0, 1] \to \mathbb{R}$ is a continuous function and $\sigma: \mathbb{R} \to \mathbb{R}$ is of sigmoidal type, then given an $\epsilon > 0$, there exist
$$f_{NN}(x, w) = \sum_{j=1}^{m} v_j\, \sigma(w_j x + b), \tag{2.1}$$

where $m \in \mathbb{Z}$; $v_j, w_j \in \mathbb{R}$; $b \in \mathbb{R}$; and $w = [w_1, \ldots, w_m]$, such that

$$\bigl| f_{NN}(x, w) - f(x) \bigr| < \epsilon \quad \forall x \in [0, 1]. \tag{2.2}$$
It is worth noting that the result can be trivially extended to domains of form [xmin , xmax ] by scaling. However, the number of neurons required in the hidden layer is not provided. This issue is to be tackled in the sequel.
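Equation 2.1 and the sup-norm criterion of equation 2.2 translate directly into code. The following is our own illustrative sketch (function names and the constant-target check are ours, not the authors'):

```python
import math

def f_nn(x, v, w, b):
    """One-hidden-layer sigmoidal network of equation 2.1:
    f_NN(x, w) = sum_{j=1}^m v_j * sigma(w_j * x + b)."""
    sigma = lambda z: 1.0 / (1.0 + math.exp(-z))
    return sum(vj * sigma(wj * x + b) for vj, wj in zip(v, w))

def sup_error(f, v, w, b, grid):
    """Empirical version of criterion 2.2: max |f_NN(x) - f(x)| over a grid."""
    return max(abs(f_nn(x, v, w, b) - f(x)) for x in grid)

# sanity check: a single unit with zero input weight outputs sigma(b)
# everywhere, so with b = 0 it represents the constant function 0.5 exactly
grid = [i / 100.0 for i in range(101)]
err = sup_error(lambda x: 0.5, [1.0], [0.0], 0.0, grid)
```

Note that equation 2.1 uses a single shared bias $b$ across the hidden units, which the sketch preserves.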
3 Size of the Hidden Layer

Polynomials can be approximated up to any degree of precision by neural networks such as the one described by equation 2.1 with a certain fixed number of hidden-layer neurons. In fact, a polynomial of order $n$ can be fitted with $n$ neurons in the hidden layer if a single threshold is used, or with $(n+1)^2$ neurons if more thresholds are used (respectively, theorems 2 and 4 of Scarselli & Tsoi, 1998). Considering the advantages reported in Scarselli and Tsoi (1998), the second alternative is adopted in this work.

Theorem (Scarselli and Tsoi, 1998). Let $c_0, \ldots, c_n$ be real numbers, $p(x) = c_0 + c_1 x + \cdots + c_n x^n$ a univariate polynomial in $x$, where $n$ is a nonnegative integer. Furthermore, suppose that $\sigma$ is $C^{n+1}$ in open neighborhoods of 0 and satisfies $\sigma^{(n)}(0) \neq 0$, and that $b_1, \ldots, b_{n+1}$ are real values satisfying $b_j \neq b_i$ for $j \neq i$. Then there exist real weights $w_{0,1}, \ldots, w_{0,n+1}, \ldots, w_{n,1}, \ldots, w_{n,n+1}$, depending on a real $a$, such that $p_a(x)$, defined by

$$p_a(x) = \sum_{i=1}^{n+1} \sum_{l=0}^{n} w_{l,i}(a)\, \sigma(alx + alb_i), \tag{3.1}$$
converges uniformly to $p$ on every bounded interval as $a \downarrow 0$.

Consider now a given function $f(x)$ together with interpolation points $\{x_i;\ i = 0, 1, \ldots, N\}$, $x_0 < x_1 < \cdots < x_N$, and let $p(x)$ be such that

$$f(x_i) = p(x_i), \quad \forall x_i;\ i = 0, 1, \ldots, N. \tag{3.2}$$
By a result of Lagrange (see Maron & Lopez, 1990, or Atkinson, 1989), it is known that there is a unique polynomial $p(x)$ of degree $N$ such that equation 3.2 is verified. For any $c$ in $[a, b]$, the error at $c$ is given by

$$f(c) - p(c) = \frac{f^{(N+1)}(\xi)}{(N+1)!}\,(c - x_0) \cdots (c - x_N) \tag{3.3}$$
for some $\xi \in [a, b]$. If the points $\{x_i;\ i = 0, 1, \ldots, N\}$ are Chebyshev nodes and a bound $|f^{(N+1)}| \leq M$ is provided, then

$$|\text{error at } c| \leq \frac{M\,(b - a)^{N+1}}{2^{2N+1}\,(N+1)!}, \tag{3.4}$$

where $N$ is the degree of the Chebyshev polynomial required in the interpolation (see, for instance, Burden & Faires, 1985). Hence, given an $\epsilon > 0$, it is possible to estimate the value of $N$ such that $|\text{error at any } c| < \epsilon$ and, moreover, to compute the location of the interpolation points. Combining
this fact with Scarselli and Tsoi's theorem, it can be noticed that a neural net of $n = (N+1)^2$ hidden nodes suffices to achieve accuracy $\epsilon$ (provided that the neural net is adequately trained), and Chebyshev nodes constitute a convenient set of input-output pairs. The procedure can be summarized as follows:

Step 1: Estimate $M$, the maximum of $|f^{(N+1)}|$ over $[a, b]$.
Step 2: Find $N$ that achieves the desired accuracy using equation 3.4.
Step 3: Compute the Chebyshev nodes $\{x_i;\ i = 0, 1, \ldots, N\}$, and find the corresponding outputs $\{y_i = f(x_i);\ i = 0, 1, \ldots, N\}$. (Actually, $f(\cdot)$ may not be available in an explicit form and could be found by simulation, by data acquisition, or extracted from a table.)
Step 4: Train a three-layer perceptron with $(N+1)^2$ neurons in the hidden layer.

In many situations of interest, the target function is a representation of known physical laws, and, moreover, experimental tests can be performed a priori in the laboratory, so that the required bound $M$ can be estimated with reasonable accuracy.

4 Examples

As a first practical example of a nonlinear function, consider the voltage characteristics of the Copper/Constantan thermocouple (Doeblin, 1990) seen in Figure 1, where $E$ is the output voltage (mV) and $T$ is the temperature at the sensing junction (°C), with the cold junction kept at 0°C. It is known that a polynomial of degree 2 suffices to fit the data in the considered range. The three Chebyshev nodes are seen in Table 1. The initial weights of the neural net were randomly generated, and the resulting $f_{NN}$ is shown in Figure 1.

As a second example of a nonlinear function, consider a pyrometer, a temperature sensor based on the detection of electromagnetic radiation with wavelengths in the visible and infrared portions of the spectrum. The most common type of radiation-sensing instrument in industrial applications employs a blackened thermopile as the detector and focuses the radiation by means of either lenses or mirrors.
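The node computation of Step 3 and the bound of equation 3.4 can be sketched as follows. The formulas are standard; the temperature range $[0, 350]$°C is our assumption, read off the axes of Figure 1, and with it the node formula reproduces the three thermocouple x-values of Table 1.

```python
import math

def chebyshev_nodes(a, b, n):
    """n Chebyshev nodes on [a, b] (roots of T_n mapped from [-1, 1]),
    returned in ascending order."""
    return sorted(0.5 * (a + b)
                  + 0.5 * (b - a) * math.cos((2 * i + 1) * math.pi / (2 * n))
                  for i in range(n))

def error_bound(M, a, b, N):
    """Right-hand side of equation 3.4: M (b-a)^(N+1) / (2^(2N+1) (N+1)!)."""
    return M * (b - a) ** (N + 1) / (2 ** (2 * N + 1) * math.factorial(N + 1))

# three nodes on the assumed thermocouple range [0, 350] degrees C,
# approximately [23.4, 175.0, 326.6] as in Table 1
nodes = chebyshev_nodes(0.0, 350.0, 3)
```

Given the bound $M$ of Step 1, one would increase $N$ until `error_bound(M, a, b, N)` drops below the desired tolerance $\epsilon$ (Step 2).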
Table 1: Chebyshev Nodes for the Thermocouple Example.

Node    x-axis (T)    y-axis (E)
1       23.4          1.431
2       175.0         9.459
3       326.6         15.374

[Figure 1: Thermocouple: Output voltage × temperature. The plot compares the Chebyshev nodes, the experimental data, and the neural network's output, for temperatures from 0 to 350 degrees Celsius and output voltages from 0 to 20 mV.]

The temperature is read as an output voltage given by Doeblin (1990):

$$E = KT^4, \tag{4.1}$$
where $K$ is constant and $T$ is absolute temperature, so that five Chebyshev nodes were used (see Table 2). The initial weights were again generated randomly, and the resulting $f_{NN}$ is shown in Figure 2.

Table 2: Chebyshev Nodes for the Pyrometer Example.

Node    x-axis (T)    y-axis (E)
1       311.6         0.20
2       563.9         1.3
3       972.0         8.3
4       1380.2        28.1
5       1632.5        50.4

[Figure 2: Pyrometer: Output voltage × temperature. The plot compares the Chebyshev nodes, the experimental data, and the neural network's output, for temperatures from 200 to 1800 degrees Celsius and output voltages from 0 to 60 mV.]

5 Conclusion

A method was proposed for the specification of a parsimonious set of input-output pairs for the training of a three-layer perceptron and also the required number of neurons in its hidden layer. The specification is based on polynomial interpolation theory, so that for an a priori given function, a neural net can be obtained that achieves prescribed accuracy. The main drawback is the need to estimate the bounds on the derivative of the function to be approximated, which may be difficult when it is not given in an explicit form (for instance, when the input-output pairs are to be found by simulation or experimentation). However, in many cases of interest, the input-output function to be fitted may be of a known class (although the parameters may be unknown), so that in applications such as adaptive compensators for sensors, for example, the availability of a criterion to choose the training set and the architecture of the neural net may be useful. The extension of the proposed method to the multivariate case is not trivial, although one can note that if $f: \mathbb{R}^d \to \mathbb{R}$ is a polynomial of order $n$, then the number of required neurons in the hidden layer is

$$\binom{n + d - 1}{d - 1}, \tag{5.1}$$

(see Sendov and Andreev, 1994), and an approximation could be obtained, in principle, using methods such as those described in Sendov and Andreev (1994).
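The count in equation 5.1 is a plain binomial coefficient and can be evaluated directly; the function name below is ours, for illustration only.

```python
from math import comb

def hidden_neurons_multivariate(n, d):
    """Equation 5.1: suggested hidden-layer size for a polynomial of
    order n in d variables, the binomial coefficient C(n + d - 1, d - 1)."""
    return comb(n + d - 1, d - 1)

# e.g., an order-3 polynomial in 2 variables gives C(4, 1) = 4,
# and an order-2 polynomial in 3 variables gives C(4, 2) = 6
```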
When the target function is time invariant, it may be easier to estimate directly the coefficients of the approximating polynomial. However, when adaptation is needed (such as in the case of self-calibrating sensors and instrumentation), it is frequently easier to fine-tune the weights of a neural network than to adjust directly the coefficients of a polynomial.

Acknowledgments

We are indebted to the anonymous reviewers whose suggestions were crucial in substantially improving this article. This work was supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grant 96/12317-1.

References

Atkinson, K. (1989). An introduction to numerical analysis. New York: Wiley.
Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39, 930–945.
Barron, A. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 115–133.
Basios, V., Bonushkina, A., & Ivanov, V. (1997). A method for approximating one-dimensional functions. Computers and Mathematics with Applications, 34, 687–693.
Burden, R., & Faires, J. (1985). Numerical analysis. Boston: Prindle, Weber and Schmidt.
Chen, T., Chen, H., & Liu, R. (1995). Approximation capability in $C(\mathbb{R}^n)$ by multilayer feedforward networks and related problems. IEEE Transactions on Neural Networks, 6(1), 25–30.
Chintalapudi, K., & Pal, N. (1998). Novel scheme to determine the architecture of a multi-layer perceptron. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. San Diego, CA.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303–314.
Doeblin, E. O. (1990). Measurement systems, application and design. New York: McGraw-Hill.
Elsken, T. (1997). Even on finite test sets smaller nets may perform better. Neural Networks, 10, 369–385.
Kendall, G., & Hall, T. (1992). Ockam's nets: Improving generalization by minimal network construction. In Proceedings of the 1992 Artificial Neural Networks in Engineering. Fairfield, CT.
Lee, K., & Coghill, G. (1995). Optimal sizing of feedforward neural networks: Case studies. In Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems. Auckland, New Zealand.
Lee, T., & Jeng, J. (1998). The Chebyshev polynomials based unified model neural networks for function approximation. IEEE Trans. SMC, 28, 925–935.
Maron, J. M., & Lopez, R. J. (1990). Numerical analysis: A practical approach. Reading, MA: Addison-Wesley.
Rudolph, S. (1997). On topology, size and generalization of nonlinear feedforward neural networks. Neurocomputing, 16, 1–22.
Scarselli, F., & Tsoi, A. C. (1998). Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1), 15–37.
Sendov, B., & Andreev, A. (1994). Approximation and interpolation theory. In P. Ciarlet & J. Lions (Eds.), Handbook of numerical analysis (pp. 223–379). Amsterdam: North-Holland.
Suzuki, S. (1998). Constructive function-approximation by three-layer artificial neural networks. Neural Networks, 11, 1049–1058.
Weaver, S., Baird, L., & Polycarpou, M. (1998). An analytical framework for local feedforward networks. IEEE Transactions on Neural Networks, 9(3), 473–482.

Received December 28, 1999; accepted February 14, 2001.
LETTER
Communicated by Carl van Vreeswijk
Spotting Neural Spike Patterns Using an Adversary Background Model

Itay Gat
[email protected]
Naftali Tishby
[email protected]
Institute of Computer Science and Center for Neural Computation, Hebrew University, Jerusalem, 91904 Israel

The detection of a specific stochastic pattern embedded in an unknown background noise is a difficult pattern recognition problem, encountered in many applications such as word spotting in speech. A similar problem emerges when trying to detect a multineural spike pattern in a single electrical recording, embedded in the complex cortical activity of a behaving animal. Solving this problem is crucial for the identification of neuronal code words with specific meaning. The technical difficulty of this detection is due to the lack of a good statistical model for the background activity, which rapidly changes with the recording conditions and activity of the animal. This work introduces the use of an adversary background model. This model assumes that the background "knows" the pattern sought, up to first-order statistics, and this "knowledge" creates a background composed of all the permutations of our pattern. We show that this background model is tightly connected to the type-based information-theoretic approach. Furthermore, we show that computing the likelihood ratio amounts to decomposing the log-likelihood distribution according to the types of the empirical counts. We demonstrate the application of this method to the detection of reward patterns in the basal ganglia of behaving monkeys, yielding some unexpected biological results.
1 Introduction
One of the intriguing questions in brain research concerns the neural code: the code used by neurons in the brain to process, store, and convey information. There is strong evidence for the theory that this code is conveyed by action potentials (usually referred to as spike trains), but we lack a full understanding of how these spike trains hold the information needed. The understanding of this code is a key issue for understanding how the brain works.

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2681–2708 (2001)
In order to gain more information about this code, researchers usually attempt to associate the spike trains taken from extracellular recordings with the stimuli the animal was presented with at the time. The next step is to characterize this association under different conditions. In principle, this association can be discovered in one of two ways: computing the response of cells to a known stimulus, or working out the proximity of a given response to the stimulus (the commonly used method). This second method has proven to be quite successful and has revealed some important basic mechanisms of the neuronal code (e.g., Georgopoulos, Schwartz, & Kettner, 1986; Graybiel, Aosaki, Flaherty, & Kimura, 1994). It does, however, suffer from some severe limitations: the alignment problem and the noncontinuous search, to name only two. The alignment problem stems from the assumption that every time the animal is presented with the same stimulus, it goes through the same thinking process. This is clearly a problematic assumption. We are not able to reconstruct the exact same process in the brain, as we are unable to control directly many relevant variables. The noncontinuous problem refers to the fact that the search is not conducted over all of the continuous recordings taken from the animal's brain. The search is conducted only a short time before and after an event that we consider important has occurred. It is possible that the response we are looking for appears in a non-stimulus-locked fashion. It might appear in unexpected places, which are not searched using the current methods. The main problem that has prevented researchers from conducting this "reverse search" lies in the nature of the spike trains, which hide the neural code. The spike trains (especially in areas involved in complicated tasks) are known to be of a low rate (5–20 spikes per second) and of high variability (Abeles, 1991). The combination of these facts creates a signal with a very high variance.
Most of the research was therefore concerned with reducing this variance by aligning several repetitions of the same task and calculating an average response. In the work presented here, when choosing a method, we took "the one less traveled by": searching for a single-trial response in a continuum of data. We made use of the fact that modern technology enables us to record from several cells simultaneously and therefore to reduce the variance of the response. Our problem is therefore the detection of a specific pattern of multineural spike trains within the ongoing neural activity of these cells. We assume for simplicity that the pattern exists prior to the search. No attempt was made to look for ways of self-constructing such patterns. Given such a pattern, a fundamental question is, How unique or characteristic is this pattern to this particular stimulus? In other words, how specific is this "code word" in the ongoing activity of these cells? To answer this question, one should try to detect, or spot, this specific pattern within a single recording, independent of any stimulus. This is a rather difficult pattern recognition task, considering the low firing rate of these cells.
Spotting Neural Spike Patterns
The decision-theoretic problem of spotting a pattern of sequenced data in a "noisy background" is classically considered a binary statistical hypothesis test. In such a test, the two possible sources are a statistical model of the pattern and a background statistical model. Given a complete description of the two sources, there is a fully understood statistical approach, the likelihood ratio test (LRT), first derived by Neyman and Pearson (Lehmann, 1959). When one or more of the possible sources are not fully defined, the LRT is not applicable. In many cases, the researcher knows only that the sources belong to a certain parametric family and receives a training sequence emitted from one of the sources. This problem, also known as the two-sample problem (Lehmann, 1959), requires a decision as to whether the training sequence and a candidate test sequence were both emitted from the same source. An asymptotically optimal information-theoretic approach to the two-sample problem was proposed by Ziv (1988) and extended by Gutman (1989). Their method relies on the information-theoretic notion of the typicality of the empirical sequences and has been shown to provide the best exponential decay of the probability of classification error. Strictly speaking, their methods are restricted to finite-alphabet Markov sources, for which the empirical entropy can be efficiently estimated by universal compression. The information-theoretic approach clearly suggests, however, that an optimal detection method is possible without knowing the specific statistical model of the background, using the empirical type of the classified sequence. The main drawback of these methods lies in the need to estimate the probability over the finite alphabet from the data. In the next section, we present the basic setup and definitions of this work, the statistical assumptions, and the exact details of treating the spike trains.
Section 3 presents the problems that may occur when using the wrong background model for spotting a pattern in such data. Section 4 presents a more plausible background model. The calculation for this background model will be given in detail. Section 5 shows an application of this technique for recordings performed in the basal ganglia of behaving monkeys. The last section discusses the method presented in this work and shows possible extensions of it.
2 Definitions

2.1 Statistical Assumptions. In the work presented here, we concentrate on the detection of a neural response in multiunit recordings. An example of such a response is given in Figures 1A and 1B. These figures show several of the problematic features of neuronal data, such as the low firing rate and the fast changes in rate. A prominent response is observed in these data (see Figure 1C); nevertheless, it is far from trivial to observe the same response in a single trial.
Itay Gat and Naftali Tishby
Figure 1: Spike trains of a single trial, and the alignment of many repetitions. (A, B) Examples of four cells recorded simultaneously from the basal ganglia. Each line represents the action potentials of one cell. (C) The alignment of many repetitions according to some external stimulus.
The response is mainly manifested in an elevation in firing rate (between 150 and 220 ms after the alignment point), followed by a reduction in rate (to 400 ms) and another elevation (to 500 ms). Due to the low firing rate of the cells (a few spikes per second), this pattern may be influenced by the existence (or absence) of a single spike. The statistical description of these data is based on the common assumption that the spike trains are the outcome of a nonhomogeneous Poisson process (Abeles, 1991). According to this view, the spikes within a short period of time (a bin) obey a Poisson distribution. A pattern to be searched for in such data is therefore parameterized by a sequence of $M$ firing rates, $(\lambda_1, \lambda_2, \ldots, \lambda_M) = \boldsymbol{\lambda}$. The values of $\boldsymbol{\lambda}$ are the estimated expected numbers of spikes in the corresponding bins of the pattern. Notice that when the bin sizes are sufficiently small, such a pattern may account for arbitrarily precise spike timing. When the bins are wider, larger variability and randomness in the pattern are allowed, but there are fewer parameters to estimate. This pattern is therefore of sufficient generality, where the bin sizes are determined such that the spike rate parameters are reliably estimated
Figure 2: PSTH model of the spike pattern. The histograms of four cells recorded simultaneously are presented. The activity of the cells was measured in bins of 20 ms. The firing-rate estimate is given by bars whose height is proportional to the rate. The rate units are counts per bin.
from the training data. The $\boldsymbol{\lambda}$ rates are usually estimated by alignment of several spike trains around an external stimulus. The example pattern in Figure 2 represents the spike activity of four neurons recorded in the basal ganglia of a behaving monkey. (A detailed description of this specific pattern is presented in section 5.) An observed firing sample is a vector of spike counts (the number of spikes found in each bin in a single trial), $(n_1, n_2, \ldots, n_M) = \mathbf{n}$. This vector is the outcome of a counting process that splits the data into bins and, within each bin, counts the number of spikes. The observed firing sample will be used in this work as the data point to be searched for using the pattern defined above. Examples of slicing the data into bins and computing the count vectors are shown in Figure 3.

2.2 Log-Likelihood Distribution. The basic statistical relationship between the data points and the pattern is the likelihood, the statistical value that defines how likely a pattern is given a data point. Given a
Figure 3: An illustration of the spotting process. The template (top) is shifted over the spike trains. Each shift produces a parsing of the spike trains into a matrix; in each bin of the matrix, the number of spikes is counted. The output matrices are shown in the lower part of the figure. The task of spotting is to assign probabilities to the combination of the counting matrices and the template being sought.
variable-rate Poisson pattern with $M$ bins, $(\lambda_1, \lambda_2, \ldots, \lambda_M) = \boldsymbol{\lambda}$,

\[ \sum_{i=1}^{M} \lambda_i = \Lambda, \tag{2.1} \]

and a multicell spike train, we can produce the sample sequence of counts, $(n_1, n_2, \ldots, n_M) = \mathbf{n}$,

\[ \sum_{i=1}^{M} n_i = N, \tag{2.2} \]

by slicing the spike trains into bins of the same width as the bins used to estimate the rates of the Poisson pattern. The likelihood of the observed counts under the pattern is

\[ p_{\lambda}(\mathbf{n}) = \prod_{i=1}^{M} e^{-\lambda_i} \frac{\lambda_i^{n_i}}{n_i!}. \tag{2.3} \]

When summing this probability over all possible count sequences with the same total number of spikes, $N$, one obtains

\[ \sum_{\mathbf{n}} p_{\lambda}(\mathbf{n}) = \sum_{\mathbf{n}} \prod_{i=1}^{M} e^{-\lambda_i} \frac{\lambda_i^{n_i}}{n_i!} = e^{-\Lambda} \frac{\Lambda^N}{N!}, \tag{2.4} \]
which is the well-known composition rule for Poisson distributions. In other words, the Poisson distribution remains invariant under bin size changes when summing over all possible internal partitions. In the current formulation, we do not give special treatment to the relationship between bins from the same cell. We assume all bins to be independent of each other, no matter which cell they were recorded from.

2.3 Spotting Paradigm. Using the building blocks defined in this section, we can now construct the search for a pattern in continuous data. We first need a statistical description of the pattern with the following details: the length of the pattern, the number of cells used, the bin width, and the values of $\lambda$ in those bins. We can then build a window whose width is the width of the pattern and slide it over the continuous spike trains. This sliding, performed with short, predefined shifts, creates frames of data. We can then compute the likelihood of each such frame and conclude whether it is similar to the original pattern sought. The main issue discussed in the following sections is the exact details of this computation, more specifically, what we should consider as the background model for our data. A background model should describe the activity of the cells when the pattern is not present. As described above, we do not start out with a reasonable background model, and we require a plausible one for this search. Figure 3 gives an example of such a spotting paradigm.
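As a concrete illustration of equations 2.3 and 2.4 and of the sliding-window paradigm of section 2.3, the following sketch (function and variable names are ours, not from the text) computes the likelihood of a count frame, checks the composition rule numerically, and parses a toy count stream into frames:

```python
import math
from itertools import product

def likelihood(n, lam):
    # equation 2.3: p_lambda(n) = prod_i exp(-lam_i) * lam_i**n_i / n_i!
    return math.prod(math.exp(-l) * l ** k / math.factorial(k)
                     for k, l in zip(n, lam))

def frames(counts, M, shift=1):
    # section 2.3: slide a window of M bins over a long count sequence
    return [counts[i:i + M] for i in range(0, len(counts) - M + 1, shift)]

lam = [0.3, 0.7, 0.5]          # illustrative rate pattern

# composition rule (equation 2.4): summing p_lambda(n) over all count
# vectors with the same total N gives exp(-Lambda) * Lambda**N / N!
N = 3
lhs = sum(likelihood(n, lam)
          for n in product(range(N + 1), repeat=len(lam)) if sum(n) == N)
Lam = sum(lam)
rhs = math.exp(-Lam) * Lam ** N / math.factorial(N)

# likelihood of every frame of a toy binned spike stream
stream = [0, 1, 0, 2, 0, 0, 1]
frame_liks = [likelihood(f, lam) for f in frames(stream, len(lam))]
```

In the multicell case, a frame would simply concatenate the bins of all cells, since the formulation treats all bins as independent.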
3 Selection of Background Model
3.1 Using a Wrong Background Model. The LRT can be shown to be optimal given a correct background model. In this section we briefly show the consequences of using a background model that is wrong. The simplest assumption regarding the background may be that it is constant over all the data. The LRT in this case reduces to the likelihood of the examined data point (see equation 2.3). The drawback of this method is evident when an example is examined. Suppose we wish to search for the pattern shown in Figure 2. It follows from our formulation that the frame with the highest likelihood for this pattern is the frame that contains no spikes at all; this is a consequence of the fact that for a Poisson variable with rate less than 1, the most likely value is 0. Applying this procedure to data with high variability would supply us only with the frames in the data whose spike count is lowest. This can be demonstrated by selecting different similarity thresholds for spotting under this background model. When the threshold is set higher (so that the probability of crossing it becomes lower), the total number of counts in the spotted examples goes down monotonically, as shown in Figure 4. The problem shown here is not, however, limited to cases in which the expected number of counts in each bin is very small. Rather, it results from using an inadequate background model.
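The failure mode described above is easy to reproduce. Under the constant-background assumption, the score of a frame reduces to equation 2.3, and when every rate is below one spike per bin, the all-empty frame outscores a frame that actually resembles the pattern; a minimal sketch with illustrative numbers:

```python
import math

def log_likelihood(n, lam):
    # log of equation 2.3
    return sum(-l + k * math.log(l) - math.log(math.factorial(k))
               for k, l in zip(n, lam))

lam = [0.3, 0.5, 0.8]                        # expected counts per bin, all below 1
ll_empty = log_likelihood([0, 0, 0], lam)    # a frame with no spikes at all
ll_match = log_likelihood([0, 1, 1], lam)    # a frame resembling the pattern
# the empty frame gets the higher likelihood, so thresholding on equation 2.3
# alone preferentially "spots" the quietest stretches of the recording
```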
Figure 4: The average number of spikes in examples that crossed the threshold decreases monotonically as the similarity threshold is increased. The data presented here were computed by the following process: we chose a uniformly distributed set of $\boldsymbol{\lambda}$ and created 10,000 simulations of random Poisson spike trains according to the correct $\boldsymbol{\lambda}$. For each of these simulations, we counted the number of spikes and computed the likelihood under the assumption of a constant background distribution.
3.2 Adversary Background Model. We suggest using an adversary background model. This model is adversary in the sense that it has first-order statistics of our data and uses these to "confuse" us. It produces examples with probability according to the given distribution family, with parameters taken from the first-order statistics of the current data. The background model cannot distinguish between the different examples (all having the same first-order statistics), and therefore they are all assigned the same probability. The following spotting problem is an example of the use of an adversary background model. Suppose we want to spot the handwritten digit 2 in a natural writing environment; the person can write whatever she wishes, and we attempt to spot only the cases of 2. For this purpose, we can collect many examples of how people write the digit. It is a much harder task to produce a model that describes the default scribbling of a person. From this point onward, we will assume that we have a statistical model for the writing of 2 and that we were also able to estimate its parameters using our training examples. Given a scribble, we would divide it into N segments and call the positions of the end points of these segments our first-order statistics. The adversary background model would therefore be composed of all the scribbles that go through the same end points as our original example. An example is shown in Figure 5. Figure 5A shows the example presented to the spotter. The first-order statistics defined in this example are given in Figure 5B and consist of a finite set of points defined on the scribble. Two examples of the adversary background are also shown; they imitate the scribble up to its first-order statistics, that is, they go through the same points defined on the scribble (see Figures 5C and 5D). The ultimate goal is to compute the LRT of the example with the background, as seen in Figure 5E.
The adversary model is optimal if all that is known about the background is the first-order statistics. This is due to the use of a uniform distribution over the permutations of the data, which stems from the principle of maximum entropy: the background model is the maximum entropy distribution subject to the partial information of the first-order statistics. Any other classifier that does not impose a uniform distribution can be seen as adding some structure to the background model. If this structure does not stem from correct information about the background, it will on average increase the probabilistic distance between the background and the pattern. Such an increase creates a less tight decision boundary, and therefore the decision would be less accurate. In general, what we have here is a case in which the LRT, which may be seen as a linear separator in the probability simplex (Cover & Thomas, 1991), can be extended according to our data points. An example of this phenomenon is shown in Figure 6. In this figure, we see two linear separators, each for a different background model. This figure shows that some
Figure 5: An adversary background model for spotting the digit 2 in an unknown background. In the first row, we see a data example of 2 (the figure on the left), followed by an indication of the first-order statistics being used (the points on the digit) and two examples of our adversary background model, which shares these first-order statistics. The second row shows that the LRT is computed between the original example (and our statistical model) and the background examples.
of our data points could not have been spotted had the wrong background model been used. Using the adversary background model, one still has to choose the first-order statistics to be used. The question is, To what extent is the adversary background able to imitate the data points? We also need to choose the parameters of these statistics (the number of points used in the case described above).

4 Computing the Adversary Background Model

4.1 Adversary Background Model and Types. In the case of spotting a neuronal response in an unknown background activity, we choose our statistic as the type class of our data point. The type class (Cover & Thomas, 1991) is defined as the class of all spike patterns that have the same empirical
Figure 6: Spotting via the adversary background model, shown in the probability simplex. (A) The regular spotting mechanism in the probability simplex. We have two models, marked P and Q. Given a set of data points, each is spotted according to the LRT of the models. This is represented as a linear separator between P and Q. (B) Using the adversary background model, we have a different background model for each data point and therefore a different separator.
distribution of bin counts ($c_0$ empty bins, $c_1$ bins with one spike, $c_2$ with two spikes, and so on). The type class is completely characterized by the vector of numbers $c_0, c_1, c_2, \ldots$. Given a data point, we therefore need to calculate the LRT between the data and the background model consisting of all possible manifestations of the same type class. In other words, we need to compute the probability of the data point according to the distribution of the type class. In order to do that, we will show what the distribution of the type class is and how to estimate its parameters. We start with the distribution of the log-likelihood (see equation 2.3) and later show how to decompose this distribution into the much narrower type class distributions. Finally, we present the method for computing the LRT of a given data example according to the type class distribution.

4.2 Self-Averaging of the Log-Likelihood. Let us examine the logarithm of the likelihood, $\log p_{\lambda}(\mathbf{n})$, in equation 2.3. This expression is a sum over $M$ independent random variables,
\[ \log p_{\lambda}(\mathbf{n}) = \sum_{i=1}^{M} \left( -\lambda_i + n_i \log \lambda_i - \log(n_i!) \right) = -\Lambda + \sum_{i=1}^{M} \left( n_i \log \lambda_i - \log(n_i!) \right). \tag{4.1} \]
When $N$ and $M$ are large enough, this is a sum over many independent random terms, and by the central limit theorem, the distribution of this sum approaches a normal (gaussian) distribution, centered around its mean. The mean is simply given by

\[ \mu_{\lambda} = \langle \log p_{\lambda}(\mathbf{n}) \rangle_{\lambda} = -\Lambda + \sum_{i=1}^{M} \lambda_i \log \lambda_i - \sum_{i=1}^{M} \langle \log(n_i!) \rangle_{\lambda} \tag{4.2} \]

and can be easily determined from the model rate vector $\boldsymbol{\lambda}$. The last term can be evaluated numerically for each $n_i$ (notice that the $n_i$ are small, so the Stirling approximation is not valid). To illustrate this, suppose that $\lambda_i = 0.15$; then $\langle \log(n_i!) \rangle_{\lambda}$ is governed by the first four values of $n_i$:

\[ \langle \log(n_i!) \rangle_{\lambda} \approx 0 \cdot e^{-0.15} + 0 \cdot e^{-0.15}\, 0.15 + \log(2) \cdot e^{-0.15}\, \frac{0.15^2}{2} + \log(6) \cdot e^{-0.15}\, \frac{0.15^3}{6}. \tag{4.3} \]
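The truncated expectation of equations 4.2 and 4.3 is easy to evaluate numerically; a small sketch (the function name and truncation depth are our choices):

```python
import math

def mean_log_factorial(lam, n_max=20):
    # <log(n!)> under Poisson(lam), truncated at n_max terms
    return sum(math.log(math.factorial(n)) * math.exp(-lam) * lam ** n
               / math.factorial(n) for n in range(n_max + 1))

full = mean_log_factorial(0.15)
# the four-term truncation of equation 4.3 (the n = 0 and n = 1 terms vanish)
four_terms = (math.log(2) * math.exp(-0.15) * 0.15 ** 2 / 2
              + math.log(6) * math.exp(-0.15) * 0.15 ** 3 / 6)
```

For rates well below one spike per bin, the four-term truncation already agrees with the longer sum to within a fraction of a percent.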
The evaluation of the variance of the sample log-likelihood, equation 4.1, under the model distribution is similar:

\[ \sigma_{\lambda}^2 = \mathrm{Var}(\log p_{\lambda}(\mathbf{n})) = \mathrm{Var}\!\left( -\Lambda + \sum_{i=1}^{M} \left( n_i \log \lambda_i - \log(n_i!) \right) \right). \tag{4.4} \]

The first term, $-\Lambda$, is constant, and its variance is 0. We are left with

\[ \sum_{i=1}^{M} \mathrm{Var}\!\left( n_i \log \lambda_i - \log(n_i!) \right). \tag{4.5} \]

Using the expression for $\mu_{\lambda}$, the variance of each term in this sum can be written as

\[
\begin{aligned}
&\left\langle (n_i \log \lambda_i)^2 + \log^2(n_i!) - 2 \log \lambda_i \, n_i \log(n_i!) \right\rangle - \left( \lambda_i \log \lambda_i - \langle \log(n_i!) \rangle \right)^2 \\
&\quad = (\lambda_i^2 + \lambda_i) \log^2 \lambda_i + \langle \log^2(n_i!) \rangle - 2 \log \lambda_i \langle n_i \log(n_i!) \rangle \\
&\qquad - \left( \lambda_i^2 \log^2 \lambda_i - 2 \lambda_i \log \lambda_i \langle \log(n_i!) \rangle + \langle \log(n_i!) \rangle^2 \right) \\
&\quad = \lambda_i \log^2 \lambda_i + \mathrm{Var}(\log(n_i!)) - 2 \log \lambda_i \left( \langle n_i \log(n_i!) \rangle - \lambda_i \langle \log(n_i!) \rangle \right). \tag{4.6}
\end{aligned}
\]
The model and its sample log-likelihood distributions are shown in Figure 7.
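The self-averaging claim can be checked by simulation: draw many count vectors from the model and compare the empirical mean of the log-likelihood with $\mu_{\lambda}$ of equation 4.2. A sketch (the rates, sample size, and names below are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.1, 0.4, 0.8, 0.3, 0.6] * 20)   # M = 100 bins

def mean_log_fact(l, n_max=30):
    # <log(n!)> under Poisson(l), truncated sum (cf. equation 4.3)
    return sum(math.lgamma(n + 1) * math.exp(-l) * l ** n / math.factorial(n)
               for n in range(n_max + 1))

# equation 4.2
mu = -lam.sum() + (lam * np.log(lam)).sum() - sum(mean_log_fact(l) for l in lam)

# Monte Carlo estimate of the same mean
samples = rng.poisson(lam, size=(20000, lam.size))
log_lik = (-lam + samples * np.log(lam)
           - np.vectorize(math.lgamma)(samples + 1.0)).sum(axis=1)
emp_mu = float(log_lik.mean())
```

The empirical mean lands within a small fraction of the spread of $\mu_{\lambda}$, and a histogram of `log_lik` reproduces the roughly gaussian shape of Figure 7.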
Figure 7: Distribution of the model log-likelihood $\log p_{\lambda}(\mathbf{n})$ for samples generated by a nonhomogeneous Poisson model. The histogram of the simulated data (bars) is compared with the estimated normal distribution (solid line).
4.3 Computing the Log-Likelihood Distribution of a Sample Class. Let $T(\mathbf{n})$ denote the type class of the count vector $\mathbf{n}$, and let $c_i(\mathbf{n})$ denote the number of bins with $i$ spike counts in $\mathbf{n}$. Clearly, for patterns with $M = \sum_i c_i$ bins, the size of $T(\mathbf{n})$ is given by

\[ |T(\mathbf{n})| = \frac{M!}{\prod_i c_i(\mathbf{n})!}. \tag{4.7} \]

Since sample type classes are disjoint sets, the sample log-likelihood distribution is uniquely decomposed into a sum of the log-likelihood distributions for each type, that is,

\[ P_{\lambda}(\log p_{\lambda}(\mathbf{n})) = \sum_{T(\mathbf{n})} P_{\lambda}(T(\mathbf{n})) \times P_{\lambda}(\log p_{\lambda}(\mathbf{n}) \mid T(\mathbf{n})). \tag{4.8} \]

When given a sample, its type is easily computed. The idea is therefore to use only one term in this sum for detection rather than the whole sum. In other words, the sample type serves, together with the sample likelihood, as a sufficient statistic for each term in this mixture distribution. It remains only to evaluate the different distributions in this decomposition.
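Equation 4.7 is simply the multinomial count of bin arrangements; computing the occupancies $c_i$ and $|T(\mathbf{n})|$ takes a few lines (names are illustrative):

```python
import math
from collections import Counter

def type_class_size(n):
    # equation 4.7: |T(n)| = M! / prod_i c_i(n)!
    M = len(n)
    c = Counter(n)    # c[i] = number of bins holding exactly i spikes
    return math.factorial(M) // math.prod(math.factorial(k) for k in c.values())

# six bins with c_0 = 3 empty, c_1 = 2 single-spike, c_2 = 1 double-spike:
# 6! / (3! * 2! * 1!) = 60 distinct arrangements
size = type_class_size([0, 1, 0, 2, 1, 0])
```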
The probability of the type class $T(\mathbf{n})$ under the model $\boldsymbol{\lambda}$ can be determined through

\[ P_{\lambda}(T(\mathbf{n})) = \sum_{\mathbf{n}' \in T(\mathbf{n})} p_{\lambda}(\mathbf{n}') = |T(\mathbf{n})| \, \langle p_{\lambda}(\mathbf{n}) \rangle_{T(\mathbf{n})} \approx |T(\mathbf{n})| \, e^{\langle \log p_{\lambda}(\mathbf{n}) \rangle_{T(\mathbf{n})}}, \tag{4.9} \]

where $\langle \cdots \rangle_{T(\mathbf{n})} = \frac{1}{|T(\mathbf{n})|} \sum_{\mathbf{n}' \in T(\mathbf{n})} \cdots$ denotes an average over all the members of the type class $T(\mathbf{n})$, that is, an average over all the $|T(\mathbf{n})|$ permutations of bins. The last exponential approximation in equation 4.9 is based on the fact that $p_{\lambda}(\mathbf{n})$ is a product over many positive, statistically independent terms (each with a well-behaved distribution) and is thus distributed log-normally for a large enough $M$ (its log "self-averages"). The type probabilities $P_{\lambda}(T(\mathbf{n}))$ should be normalized to 1,

\[ \sum_{T(\mathbf{n})} P_{\lambda}(T(\mathbf{n})) = 1, \tag{4.10} \]

so each term can be considered as the type prior. A configuration is defined as an infinite sequence $c_0, c_1, \ldots$, where $c_0$ is the number of empty bins, $c_1$ the number of bins with one count, and so on; $\mathrm{Type}(t)$ is the number of ways to place the configuration $t$ into the $M$ bins, and the remaining Poisson factor is the probability of the configuration. Therefore, for a uniform rate $\lambda_i$,

\[ P(\mathrm{Type}(t)) = \left( e^{-\lambda_i} \right)^{c_0} \times \left( e^{-\lambda_i} \lambda_i \right)^{c_1} \times \left( e^{-\lambda_i} \frac{\lambda_i^2}{2} \right)^{c_2} \times \left( e^{-\lambda_i} \frac{\lambda_i^3}{6} \right)^{c_3} \cdots \tag{4.11} \]
4.4 The Log-Likelihood Distribution in Each Type. Next, we evaluate the parameters of the distribution of $\log p_{\lambda}(\mathbf{n})$ in each type, $P_{\lambda}(\log p_{\lambda}(\mathbf{n}) \mid T(\mathbf{n}))$. As can be seen from the histogram of the log-likelihood in the type (see Figure 8), the central limit theorem applies even better to this distribution, which can thus be considered normal. We now calculate the mean and variance of this normal distribution, that is, the mean and variance of $\log p_{\lambda}(\mathbf{n})$ in each type. The log-likelihood can be written as

\[ \log p_{\lambda}(\mathbf{n}) = \sum_{i=1}^{M} \left[ -\lambda_i - \log(n_i!) + n_i \log \lambda_i \right] = -\Lambda - L(\mathbf{n}) + \sum_{i=1}^{M} n_i \log \lambda_i, \tag{4.12} \]

with the notation

\[ L(\mathbf{n}) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{M} \log(n_i!). \tag{4.13} \]
Figure 8: Histogram of the model log-likelihood in one sample type, with the gaussian approximation.
Notice that in each type class, the counts $\mathbf{n}$ are fixed, and the only randomness is in the permutation of the bins, that is, a permutation $\pi(i)$ between the indices of $n_i$ and $\lambda_i$. Since all these permutations are equally likely in each type (which is also the assumption about the adversary background), the type mean $\mu_{T(\mathbf{n})}$ is readily evaluated as follows:

\[ \mu_{T(\mathbf{n})} = \langle \log p_{\lambda}(\mathbf{n}) \rangle_{T(\mathbf{n})} = -\Lambda - L(\mathbf{n}) + \left\langle \sum_{i=1}^{M} n_{\pi(i)} \log \lambda_i \right\rangle_{\pi}. \tag{4.14} \]

Since the last term is averaged over all the permutations of the sequence $\{n_i\}$, each count $n_i$ is multiplied by each of the rates $\log \lambda_i$ the same number of times. Thus,

\[ \mu_{T(\mathbf{n})} = -\Lambda - L(\mathbf{n}) + \langle n \rangle \sum_{i=1}^{M} \log \lambda_i = -\Lambda - L(\mathbf{n}) + \frac{N}{M} \sum_{i=1}^{M} \log \lambda_i. \tag{4.15} \]
To compute the mean of the normal distribution of the model log-likelihood in a specific type, one should compute the sum of all rates, $-\Lambda$ (fixed for the model), the sum of $\log(n_i!)$ (fixed for the type), and the sum of the log rates of the bins, $\ell(\lambda) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{M} \log \lambda_i$ (fixed for the model), multiplied by the mean number of spikes in the bins.
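Equation 4.15 can be verified directly against a brute-force average of the log-likelihood over all bin permutations; a minimal sketch (function names are ours):

```python
import math
from itertools import permutations

def log_lik(n, lam):
    # log of equation 2.3
    return sum(-l + k * math.log(l) - math.log(math.factorial(k))
               for k, l in zip(n, lam))

def type_mean(n, lam):
    # equation 4.15: mu_T = -Lambda - L(n) + (N / M) * sum_i log(lam_i)
    M, N = len(n), sum(n)
    L_n = sum(math.log(math.factorial(k)) for k in n)
    return -sum(lam) - L_n + (N / M) * sum(math.log(l) for l in lam)

lam = [0.2, 0.5, 0.9, 1.4]     # illustrative rates
n = [1, 0, 2, 0]               # one member of the type class
brute = sum(log_lik(p, lam) for p in permutations(n)) / math.factorial(len(n))
mu = type_mean(n, lam)
```

Averaging over all $M!$ arrangements (duplicates included) weights every distinct member of the type equally, so the two numbers coincide.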
The evaluation of the variance is somewhat trickier. We first compute the square of the mean:

\[ \mu_{T(\mathbf{n})}^2 = \langle \log p_{\lambda}(\mathbf{n}) \rangle_{T(\mathbf{n})}^2 = \Lambda^2 + L^2(\mathbf{n}) + \left( \frac{N}{M} \right)^2 \ell^2(\lambda) + 2 \Lambda L(\mathbf{n}) - 2 \Lambda \frac{N}{M} \ell(\lambda) - 2 L(\mathbf{n}) \frac{N}{M} \ell(\lambda), \tag{4.16} \]

then average the square of $\log p_{\lambda}(\mathbf{n})$ over the type permutations:

\[
\begin{aligned}
\langle \log^2 p_{\lambda}(\mathbf{n}) \rangle_{T(\mathbf{n})} &= \left\langle \left( -\Lambda - L(\mathbf{n}) + \sum_{i=1}^{M} n_{\pi(i)} \log \lambda_i \right)^2 \right\rangle_{T(\mathbf{n})} \\
&= \Lambda^2 + L^2(\mathbf{n}) + \left\langle \left( \sum_{i=1}^{M} n_{\pi(i)} \log \lambda_i \right)^2 \right\rangle_{T(\mathbf{n})} + 2 \Lambda L(\mathbf{n}) \\
&\quad - 2 \Lambda \frac{N}{M} \ell(\lambda) - 2 L(\mathbf{n}) \frac{N}{M} \ell(\lambda). \tag{4.17}
\end{aligned}
\]

The only terms that contribute to the variance are those that contain both $\{n_i\}$ and $\{\lambda_i\}$, and we are left with

\[ \mathrm{Var}(\log p_{\lambda}(\mathbf{n})) = \left\langle \sum_{i,j} n_{\pi(i)} n_{\pi(j)} \log \lambda_i \log \lambda_j \right\rangle_{T(\mathbf{n})} - \left( \frac{N}{M} \right)^2 \ell^2(\lambda). \tag{4.18} \]
The first sum can be divided into two parts: the first for $i = j$ and the second for $i \neq j$. Since both $i$ and $j$ undergo the same permutation,

\[ \left\langle \sum_{i,j} n_{\pi(i)} n_{\pi(j)} \log \lambda_i \log \lambda_j \right\rangle_{\pi} = \left\langle \sum_{i} n_{\pi(i)}^2 \log^2 \lambda_i \right\rangle_{\pi} + \left\langle \sum_{i \neq j} n_{\pi(i)} n_{\pi(j)} \log \lambda_i \log \lambda_j \right\rangle_{\pi}. \tag{4.19} \]

Averaging, as before, over all the permutations in the type, this can be written as

\[ \frac{\sum_{i=1}^{M} n_i^2}{M} \sum_{i} \log^2 \lambda_i + \frac{\sum_{i \neq j} n_i n_j}{M^2 - M} \sum_{i \neq j} \log \lambda_i \log \lambda_j, \tag{4.20} \]
where the second term can be rewritten as

\[ \frac{1}{M^2 - M} \left( \Big( \sum_i n_i \Big)^2 - \sum_i n_i^2 \right) \left( \ell^2(\lambda) - \sum_i \log^2 \lambda_i \right). \tag{4.21} \]
Finally, the type log-likelihood variance can be written simply as

\[ \sigma_{T(\mathbf{n})}^2 = \frac{1}{M-1} \left( \sum_i n_i^2 - \frac{N^2}{M} \right) \left( \sum_i \log^2 \lambda_i - \frac{\ell^2(\lambda)}{M} \right), \tag{4.22} \]

or, through the variances of the counts and log rates, as

\[ \sigma_{T(\mathbf{n})}^2 = \frac{M^2}{M-1} \, \mathrm{Var}(\{n_i\}) \, \mathrm{Var}(\{\log \lambda_i\}). \tag{4.23} \]
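Equation 4.23 can be checked the same way, against the exact population variance of the log-likelihood over all bin permutations; a sketch with illustrative values:

```python
import math
from itertools import permutations

def log_lik(n, lam):
    # log of equation 2.3
    return sum(-l + k * math.log(l) - math.log(math.factorial(k))
               for k, l in zip(n, lam))

def pop_var(xs):
    # population variance over the bins, as in the derivation
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def type_variance(n, lam):
    # equation 4.23: sigma_T^2 = M^2 / (M - 1) * Var({n_i}) * Var({log lam_i})
    M = len(n)
    return M * M / (M - 1) * pop_var(n) * pop_var([math.log(l) for l in lam])

lam = [0.2, 0.5, 0.9, 1.4]
n = [1, 0, 2, 0]
vals = [log_lik(p, lam) for p in permutations(n)]
brute = pop_var(vals)
sig2 = type_variance(n, lam)
```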
Equation 4.23 clearly shows that when either the $\{n_i\}$ or the $\{\lambda_i\}$ are uniform, the variance of the type log-likelihood vanishes, and the log-likelihood is sharply centered on the type mean. Using the above expressions for the mean and variance of the type log-likelihood distribution, we obtain the type decomposition of $P_{\lambda}$, equation 4.8, as

\[ P(\log p_{\lambda}(\mathbf{n})) = \sum_{T(\mathbf{n})} P_{\lambda}(T(\mathbf{n})) \, P(\log p_{\lambda}(\mathbf{n}) \mid T(\mathbf{n})) = \sum_{T(\mathbf{n})} e^{\left( \log |T(\mathbf{n})| + \mu_{T(\mathbf{n})} \right)} \, e^{-\frac{(\log p(\mathbf{n}) - \mu_{T(\mathbf{n})})^2}{2 \sigma_{T(\mathbf{n})}^2} - \frac{1}{2} \log\left( 2\pi \sigma_{T(\mathbf{n})}^2 \right)}. \tag{4.24} \]
An example of such a decomposition is given in Figure 9, which shows how the log-likelihood distribution is formed out of the narrower distributions of the type classes. The last factor in equation 4.24 is the log-normal probability, for which we have just computed the mean and the variance, and the first factor is

\[ |T(\mathbf{n})| \exp\!\left( \langle \log p(\mathbf{n}) \rangle_{T(\mathbf{n})} \right), \tag{4.25} \]

or, alternatively,

\[ \exp\!\left( \log |T(\mathbf{n})| + \langle \log p(\mathbf{n}) \rangle_{T(\mathbf{n})} \right). \tag{4.26} \]

For a given spike sample, both the type and the model log-likelihood are known, and our final probabilistic score is

\[ S_{\lambda}(\mathbf{n}) = \log |T(\mathbf{n})| + \mu_{T(\mathbf{n})} - \frac{(\log p(\mathbf{n}) - \mu_{T(\mathbf{n})})^2}{2 \sigma_{T(\mathbf{n})}^2} - \frac{1}{2} \log\left( 2\pi \sigma_{T(\mathbf{n})}^2 \right), \tag{4.27} \]

where both $\mu_{T(\mathbf{n})}$ and $\sigma_{T(\mathbf{n})}^2$ depend on the model $\boldsymbol{\lambda}$ only up to a permutation of the $\lambda_i$.
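The score of equation 4.27 then needs only the count vector and the rate vector, combining the type size (equation 4.7), the type mean (equation 4.15), and the type variance (equation 4.23). A self-contained sketch (all names are ours; it assumes a non-degenerate frame, since $\sigma_{T(\mathbf{n})}^2$ vanishes for uniform counts or rates):

```python
import math
from collections import Counter

def score(n, lam):
    # equation 4.27: S = log|T(n)| + mu_T
    #                    - (log p(n) - mu_T)^2 / (2 sigma_T^2)
    #                    - 0.5 * log(2 pi sigma_T^2)
    M, N = len(n), sum(n)
    log_p = sum(-l + k * math.log(l) - math.log(math.factorial(k))
                for k, l in zip(n, lam))
    L_n = sum(math.log(math.factorial(k)) for k in n)
    mu = -sum(lam) - L_n + (N / M) * sum(math.log(l) for l in lam)   # eq. 4.15
    def pop_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    sig2 = (M * M / (M - 1)                                          # eq. 4.23
            * pop_var(n) * pop_var([math.log(l) for l in lam]))
    log_T = (math.log(math.factorial(M))                             # eq. 4.7
             - sum(math.log(math.factorial(k)) for k in Counter(n).values()))
    return (log_T + mu - (log_p - mu) ** 2 / (2 * sig2)
            - 0.5 * math.log(2 * math.pi * sig2))

lam = [0.2, 0.6, 1.1, 0.4]      # illustrative rate pattern
s1 = score([0, 1, 2, 0], lam)
s2 = score([2, 1, 0, 0], lam)   # same type class, different bin arrangement
```

Two frames of the same type share $\log|T(\mathbf{n})|$, $\mu_{T(\mathbf{n})}$, and $\sigma_{T(\mathbf{n})}^2$; their scores differ only through the gaussian term in their model log-likelihoods.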
Figure 9: Decomposition of the model log-likelihood into sample types. The different line styles represent different families of sample types: the dotted types do not include more than one count in any bin, the solid types hold exactly one bin with two counts, and the dashed types have two bins with two counts.
4.5 Implementation Issues. The score computed in equation 4.27 can be used directly when trying to spot a neural template. Its use is derived from the maximum likelihood principle. The procedure includes the parceling of the neuronal data into frames. These frames would usually be successive and might also overlap. For each frame, the score is computed taking into consideration the sample type of the frame. The scores of all frames are compared to a threshold, and the frames whose score is above the threshold are declared to be the "spottings" of the algorithm. In the current implementation of the adversary background model, we use all the bins as though they were drawn from one source. When calculating the score for a given frame of data, the adversary background is blind to the identity of the cell a specific bin was drawn from; the only information is the type class, which counts the number of bins with no spike, with one spike, and so on. This implementation was motivated by the features of our neuronal data (as explained in section 5). We also refer to different adversary knowledge in the discussion. One important issue that must be addressed when using this adversary background model is how one goes about choosing the type classes, or how
broad they should be. One extreme is to define the type so broadly that (almost) all spike patterns are part of the same type; conversely, one can define the type so narrowly that a typical type has only one pattern. These extremes do not serve the purpose of this work, and it is clear that some intermediate value should be used. The two extremes described are mainly governed by the bin size used in parceling the neuronal template. An example of how to select an appropriate bin width is given in section 5.

5 Application: Detecting the "Reward Pattern" in the Basal Ganglia

5.1 Description of the Data Used. The application of the new detection method is shown on the reward signal of the tonically active neurons (TANs) in the basal ganglia. A more detailed description of the data used, together with more results, is given elsewhere (Gat, Raz, Abeles, Tishby, & Bergman, 2000). TANs have a spontaneous firing rate of 3-15 Hz and, after training, show strong and robust responses to cues predicting future rewards. This response is characterized by a reduction in firing rate (a pause), often flanked by brief elevations of the firing rate (Aosaki et al., 1994). This response will henceforth be referred to as the reward signature of the TANs. Lately, these cells have drawn much attention due to the fact that although the TANs represent only 1 to 5% of the total population of striatal neurons (Kawaguchi, Wilson, Augood, & Emson, 1995), they have a major role in the functions of the basal ganglia (Bolam & Bennet, 1995). Another interesting feature of these cells is that all the TANs show the same response simultaneously. The data used here consisted of recordings from two different monkeys (monkeys H and I; Cercopithecus aethiops aethiops) that were trained to perform a visual-motor task (Abeles et al., 1995). The activity of one to four TANs was recorded simultaneously while the monkeys performed the behavioral task.
The recordings were carried out in the Higher Brain Function Laboratory at the Haddassah Medical School. The experiments were conducted by E. Raz and A. Feingold as part of their Ph.D. work under the supervision of H. Bergman (Raz, Feingold, Zelanskaya, Vaadia, & Bergman, 1996). We defined the tone signaling the reward (monkey H) or the visual cue predicting the reward (monkey I) as the reward cue. The responses of the three to four simultaneously recorded TANs to the reward cues were used to create a rate template for a specific recording session (see Figure 10A). This rate template was computed over bins of equal width (20 ms in our example) using the peristimulus time histogram (PSTH) method. The selection of bin width (and therefore the broadness of our type definition) was based on previous work (Gat & Tishby, 1997), in which we examined the optimal bin for detecting a neuronal response. In general, it is important to choose a bin width that captures the changes in firing rate and enables a reasonable estimation of the firing rate. Selecting a bin width that
is too large may hide an important response; bins that are too short may result in poorly estimated neuronal templates. The rate template was used as the model and was sought over all the trials of the recording session.

5.2 Reward Signature Is Spotted Significantly. In the work presented here, we addressed the question of whether our detection procedure actually detected the reward signal. We found that the PSTH of the spotted reward signal was very similar to the original one (see Figure 10B). Thus, the detection scheme works properly. It is possible, however, that given enough data, one can detect any reasonable pattern. To address that question, we designed a significance computation procedure. In this procedure, we performed the same detection on shuffled spike trains; we used the fact that we had multicell recordings and shifted the spike trains of the cells relative to each other. The results of this significance test are shown in the third column of Table 1. The data there show that the number of detected reward signals is significantly greater than in the shuffled data. We then tried to detect two other signatures in the same data. One signature was constant (all bins had exactly the same value). The other signature was constructed from different segments of the recording session (the 2 seconds before the reward was signaled). The differences between the number of detected signals in the real data and in the shuffled data can be seen in the fourth and fifth columns of Table 1. It can be seen clearly that for these new signatures, there were no significant results whatsoever.

5.3 Reward Signal Is Not Confined to the Reward Cue. Previously it was hypothesized that the reward signature is a binary signal that can be found only in close proximity to the reward cue. This hypothesis was based on the fact that the reward signature was discovered after averaging many reward cue epochs.
Figure 10: Facing page. It is possible to detect the neural signature of reward even when it is not locked to external reward. (A) Example of a template used in the search for reward response. This template was calculated from the peristimulus histograms of four simultaneously recorded TANs. Reward was hinted at zero lag time (reward cue). The bin size used is 20 ms (130 trials were used to compute the histogram). The y-axis values are counts per bin per trial. (B) An example of a peristimulus histogram computed from the spotted events of the previously defined template. The spotted events do not include segments from the vicinity of the reward cue. Bin size and y-axis are as in A.

Using the detection method presented here, we were able to test this hypothesis directly by locating the reward signals in relation to the behavioral events of the trials. An example of such a location is shown in Figure 11. Figure 11A shows two interesting phenomena of the spotted template. Naturally, the frequency of spotted events is maximal in the epoch locked to
2702
Itay Gat and Naftali Tishby
Table 1: Significance of Spotting the Neural Reward Signal, (Found − Expected)/(Standard Deviation).

Session   Number of Cells   Reward Response   Flat PSTH   Before Reward
H-1       4                 15.3              −0.25       −0.91
H-2       3                 9.4               0.44        0.43
H-3       3                 9.9               −1.45       2.10
I-1       4                 18.2              −0.88       1.70
I-2       3                 11.8              0.96        −0.53
Notes: Column 1 gives the animal identity and the session number. The number of simultaneously recorded TANs in each session is given in column 2. We used several templates to verify the significance of spotting: the reward response (calculated separately for each of the recording sessions), a flat PSTH, and a PSTH computed from segments aligned 2 seconds before the reward cue. A significance level was computed for each of these experiments. This level was the difference between the number of detections in the original data and the average number of detections in the shuffled data. This difference was measured in standard deviations (SD) of the shuffled data. Differences greater than 3 SDs are considered significant deviations from the null hypothesis.
the reward cue. We also found that the number of spotted events was significant compared to the number expected by chance. However, the template is significantly spotted in all periods of the trial, and the majority of spotted events was not found in the vicinity of the reward cue. The frequency of the spontaneously emitted reward signal was not constant over all behavioral periods. As expected, this frequency reached a peak at the reward cue period, but remained high in the 2 to 4 seconds following the reward cue (see Figures 11A and 11B). Reverse correlation of the spotted signals with the other behavioral events in the task failed to reveal any time locking, further suggesting that the elevation of the frequency of the reward signal above the background level is associated only with the reward cue. A clear tendency toward reduction in this frequency is observed following erroneous trials. This tendency was observed in all recording sessions examined and is summarized in Table 2.

5.4 Comparison with Other "Type" Detection Method. We compared the results obtained by our method to those achieved using a different type-based method (Johnson, Gruner, Baggerly, & Seshagiri, 1998), henceforth referred to as the Jensen-Shannon (JS) divergence (Lin, 1991). In this technique, a statistic is computed. It has been shown elsewhere (Gutman, 1989) that this statistic is optimal in a certain sense. It was also shown that given enough training and testing data, this statistic will produce error probabilities having the same exponential decay rate as the likelihood ratio classifier that clairvoyantly knows the underlying statistical model for the training data.
[Figure 11 appears here: a time histogram; y-axis, number of spotted events; x-axis, time (ms) after reward.]

Figure 11: The frequency of the neural reward signal increased after the reward was indicated. The time histogram of the reward-like response. Each bar represents the number of times the reward signal occurred at a certain delay with respect to the reward cue. The bin size used was 800 ms.
Table 2: Deviation from Average Level in Correct Trials and Erroneous Ones.

Session   Correct Trials   Erroneous Trials
H-1       61.7             −100
H-2       22.3             −60.6
H-3       28.7             −57.4
I-1       44.7             −12.7
I-2       12.5             −54.2

Notes: The change from the average level of the reward signal is given both for the segment of 1 second starting 2 seconds after correct trials and for the same segment in erroneous ones. The change is given in percentage of change from the average level.

The scenario in this method is as follows. We are given a training set T_j, where the index j stands for different categories of the data. The data are made up of observations, and each observation can assume only a finite set of values: R_l \in \{r_1, \ldots, r_K\}. When given a test set R = R_1, \ldots, R_{L_R}, the
statistics are

G_j = L_T\, D\big(\hat P_{T_j} \,\|\, \bar P_{T_j,R}\big) + L_R\, D\big(\hat P_R \,\|\, \bar P_{T_j,R}\big),   (5.1)

\bar P_{T_j,R} = \frac{L_T\, \hat P_{T_j} + L_R\, \hat P_R}{L_T + L_R},   (5.2)

where \hat P is the histogram estimate of the probability mass function:

\hat P_{T_j}(r_k) = \frac{\text{number of times } r_k \text{ occurs in } T_j}{L_T}, \qquad \hat P_R(r_k) = \frac{\text{number of times } r_k \text{ occurs in } R}{L_R}.   (5.3)
The method can be implemented directly by building the histogram for all response values. (For more details, see Johnson et al., 1998.) The number of entries in the histogram (type) would, however, usually prevent us from such an implementation. Having N neurons whose responses are measured across B bins, the number of possible response values is 2^{NB}. In our case, N = 4 and B = 600, which results in the inability to estimate the probability for each of these values. In order to overcome the huge set of values, it is necessary to define stationary segments of data and quantize the possible values of responses. As a result, the optimality promise is abandoned, but the computation becomes feasible. In the examples presented here, we divided the 600 ms after the reward into six stationary 100 ms segments and computed the histogram for each such segment. Spotting by this method suffered from two crucial problems: the highly probable spotted segments were not locked to the reward cue, and the average detected segment was much lower than the original reward response, that is, similar to the assumption of a flat background (see Figure 4). We assume that the inability of this "type" method to serve as a feasible spotting method has to do with the nature of the data. Although the JS method is optimal given enough data, our case represents a dramatically different scenario. Our data are rather sparse, and the use of a parametric model such as the Poisson distribution seems to be highly beneficial. Furthermore, when the goal is trying to learn the true nature of the firing histogram, the empty bins (which form the bulk of the data) dominate the JS method.
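The statistic of equations 5.1 through 5.3 can be computed directly once the observations are quantized to a finite symbol set. A minimal sketch, assuming observations quantized to integer symbols 0, ..., n_values−1 (function and argument names are ours):

```python
import numpy as np

def js_statistic(train_obs, test_obs, n_values):
    """Type-based statistic of equations 5.1-5.3 for quantized observations."""
    L_T, L_R = len(train_obs), len(test_obs)
    p_T = np.bincount(train_obs, minlength=n_values) / L_T      # eq. 5.3
    p_R = np.bincount(test_obs, minlength=n_values) / L_R
    p_bar = (L_T * p_T + L_R * p_R) / (L_T + L_R)               # eq. 5.2
    def kl(p, q):
        m = p > 0  # symbols with p(r_k) = 0 contribute nothing
        return float(np.sum(p[m] * np.log(p[m] / q[m])))
    return L_T * kl(p_T, p_bar) + L_R * kl(p_R, p_bar)          # eq. 5.1
```

The statistic is zero when training and test histograms coincide and grows with their divergence, weighted by the sample sizes L_T and L_R.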
6 Discussion
In this work, we presented a novel method for pattern spotting in a harsh environment. Our goal was to be able to spot, within a single trial consisting of multiunit recordings, a specific pattern of activity. The environment we aimed at consists of multicellular recordings of spikes in the brain of behaving monkeys. Statistical modeling of such data is difficult due to the following features: The rate of spike discharge per unit is very low, around 5 to 10 spikes per second. The change of activity may be relatively high. A whole pattern of activity can be encapsulated within 600 ms, where this pattern consists of a rise in firing rate, a depression, and another rise. One should remember, however, that due to the low firing rate, one may expect to see only 6 spikes within this pattern. The background activity is not a uniform firing rate but rather supplies us with similar fluctuations in the activity. Furthermore, the average firing rate is similar to the pattern searched for. Due to these difficulties, the common method when using these kinds of data involves aligning several repetitions of the same behavioral event together, thereby obtaining more reliable statistics regarding the neural code. We have shown elsewhere (Gat & Tishby, 1997) the advantages of using single-trial modeling and the new questions that may be answered using this mechanism. In this work, we tried to confront spotting neuronal responses in a single trial. We have shown the importance of selecting a reasonable background model for the spotting algorithm. When inadequate background models were used, the spotting did not produce reasonable results. For example, in a situation where the threshold on the likelihood function assumes a constant-rate background, the result was frames of data in which the number of spikes was the smallest.
The core idea of the method presented here is that spotting of the neuronal pattern is attempted in an adversary environment in which all shuffles of a given pattern are equally probable. When implementing this decomposition, the sample type and its log-likelihood serve as sufficient statistics. Since we are interested in patterns with a large number of bins, the total log-likelihood distribution in each sample type can be well approximated by a gaussian, whose mean and variance can be analytically evaluated. Based on this gaussian approximation, we derive an exact expression for a score that does not rely on the asymptotic properties of the direct application of the method of types. While this decomposition, shown in Figure 9, clearly breaks the log-likelihood distribution into narrower pieces, the mean of each piece depends very weakly (see equation 4.15) on the model rate parameters.
The type weight, however, as given by equation 4.9, is exponential in the type log-likelihood mean. It is thus much more sensitive to the models' rates. For these reasons, the type-conditional (gaussian) distributions separate the model patterns from the background very well. We also examined a related solution to this problem based on the method of types (Gutman, 1989). Direct application of this technique for detecting multineural spike patterns can be carried out by treating the empirical type of every bin in a template model (PST histogram) of the training data (Johnson et al., 1998). In this case, one simply estimates the probability that the training data and the test sample are taken from the same source for every bin or combination of bins in the template model. As shown above, this method suffers from insufficient training data and therefore performed rather poorly on our data, despite its optimality promise. Our method was implemented on data from the TANs (cholinergic interneurons of the striatum), and an attempt was made to spot the neural pattern usually related to the reward mechanism. This article has presented some results of this spotting; a more thorough description of the data with the significance of the results can be found in Gat et al. (2000). We believe that these results could not have been obtained without this method of spotting the reward pattern on the basis of single trials. There are several possible extensions to the work presented here. One involves a slight change in the definition of T(n). In this article, the type T(n) was defined as a permutation over all the bins in all the cells together. For multicell recording, where one has cells with very different rates, it is better to choose a type where T(n) is defined as the subset of all vectors where the permutation is done for each cell separately. This definition makes use of the fact that the data come from N neurons from which one has measured B time bins.
In our example, however, we took advantage of the fact that TAN cells are known to respond in a similar manner (in both the response profile and their firing rates). By doing so, we were able to overcome the sparseness of our data and enrich our background model. This sparseness is probably the major cause of the failure of the JS divergence method presented here. A different extension is to change the types used here into types containing information on higher correlations of the data. In our application, we work against an adversary environment that consists of all the shuffles of the given pattern in single bins. A second-order correlation would have to be an adversary environment that also keeps the order within pairs of bins. In order to perform such spotting, we need to recompute the mean and variance inside each such new type. In general, we can work with each order of correlation and thus may be able to cope with more complex spotting cases. When we extend the correlation order, the background becomes increasingly similar to the original pattern under investigation. In the current work, we found it sufficient to introduce the concept of an adversary background model and show how it can be obtained for first
order. Fortunately, this model was quite appropriate for the data in hand: spotting the reward signal in TANs. We do not rule out the possibility that a higher-order adversary background model might be needed for different data modeling. In general, we believe that the idea of decomposing the likelihood into subparts could be useful. This could allow us to work in environments in which we cannot assume the usual flat distribution of the background. It would enable us to perform reliable spotting without building a specific model. In our case, the decomposition into types seems sufficient and produced the required results.

Acknowledgments
Enlightening discussions with William Bialek, Moshe Abeles, and Shai Fine are greatly appreciated. We thank Hagai Bergman and Aeyal Raz for providing us with the TAN cell recordings. This work was partially supported by a grant from the Israeli Science Foundation and a grant from the US-Israel Science Foundation (BSF).

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi-stationary states. Proc. Natl. Acad. Sci., 92, 8616–8620.
Aosaki, T., Tsubokawa, H., Ishida, A., Watanabe, K., Graybiel, A., & Kimura, M. (1994). Responses of tonically active neurons in the primate's striatum undergo systematic changes during behavioral sensorimotor conditioning. J. Neurosci., 14, 3969–3984.
Bolam, J. P., & Bennet, B. D. (1995). Molecular and cellular mechanisms of neostriatal function. Georgetown, TX: R. G. Landes Company.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Gat, I., Raz, A., Abeles, M., Tishby, N., & Bergman, H. (2000). The neural signature of reward expectation signal in the striatum of normal and parkinsonian primates. Preprint.
Gat, I., & Tishby, N. (1997). Comparative study of different supervised detection methods of simultaneously recorded spike trains. Preprint.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Graybiel, A. M., Aosaki, T., Flaherty, A. W., & Kimura, M. (1994). The basal ganglia and adaptive motor control. Science, 265, 1826–1830.
Gutman, M. (1989). Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Info. Th., 35, 401–408.
Johnson, D. H., Gruner, C., Baggerly, K., & Seshagiri, C. (1998, February). Information-theoretic analysis of the neural code. Available on-line at: http://www-dsp.rice.edu/ dhj.
Kawaguchi, Y., Wilson, C., Augood, S., & Emson, P. (1995). Striatal interneurons: Chemical, physiological and morphological characterization. TINS, 18, 527–535.
Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Trans. Info. Th., 37, 145–151.
Raz, A., Feingold, A., Zelanskaya, V., Vaadia, E., & Bergman, H. (1996). Neuronal synchronization of tonically active neurons in the striatum of normal and parkinsonian primates. J. Neurophysiol., 76, 2083–2088.
Ziv, J. (1988). On classification with empirically observed statistics and universal data compression. IEEE Trans. Info. Th., 34, 278–286.

Received August 26, 1999; accepted March 1, 2001.
LETTER
Communicated by Laurence Abbott
Intrinsic Stabilization of Output Rates by Spike-Based Hebbian Learning

Richard Kempter
[email protected]
Keck Center for Integrative Neuroscience, University of California at San Francisco, San Francisco, CA 94143-0732, U.S.A.

Wulfram Gerstner
Wulfram.Gerstner@epfl.ch
Swiss Federal Institute of Technology Lausanne, Laboratory of Computational Neuroscience, DI-LCN, CH-1015 Lausanne EPFL, Switzerland

J. Leo van Hemmen
[email protected]
Physik-Department, Technische Universität München, D-85747 Garching bei München, Germany

We study analytically a model of long-term synaptic plasticity where synaptic changes are triggered by presynaptic spikes, postsynaptic spikes, and the time differences between presynaptic and postsynaptic spikes. The changes due to correlated input and output spikes are quantified by means of a learning window. We show that plasticity can lead to an intrinsic stabilization of the mean firing rate of the postsynaptic neuron. Subtractive normalization of the synaptic weights (summed over all presynaptic inputs converging on a postsynaptic neuron) follows if, in addition, the mean input rates and the mean input correlations are identical at all synapses. If the integral over the learning window is positive, firing-rate stabilization requires a non-Hebbian component, whereas such a component is not needed if the integral of the learning window is negative. A negative integral corresponds to anti-Hebbian learning in a model with slowly varying firing rates. For spike-based learning, a strict distinction between Hebbian and anti-Hebbian rules is questionable since learning is driven by correlations on the timescale of the learning window. The correlations between presynaptic and postsynaptic firing are evaluated for a piecewise-linear Poisson model and for a noisy spiking neuron model with refractoriness.
While a negative integral over the learning window leads to intrinsic rate stabilization, the positive part of the learning window picks up spatial and temporal correlations in the input.
Neural Computation 13, 2709–2741 (2001) © 2001 Massachusetts Institute of Technology
1 Introduction
"Hebbian" learning (Hebb, 1949), that is, synaptic plasticity driven by correlations between pre- and postsynaptic activity, is thought to be an important mechanism for the tuning of neuronal connections during development and thereafter, in particular, for the development of receptive fields and computational maps (see, e.g., von der Malsburg, 1973; Sejnowski, 1977; Kohonen, 1984; Linsker, 1986; Sejnowski & Tesauro, 1989; MacKay & Miller, 1990; Wimbauer, Gerstner, & van Hemmen, 1994; Shouval & Perrone, 1995; Miller, 1996a; for a review, see Wiskott & Sejnowski, 1998). It is well known that simple Hebbian rules may lead to diverging synaptic weights, so that weight normalization turns out to be an important topic (Kohonen, 1984; Oja, 1982; Miller & MacKay, 1994; Miller, 1996b). In practice, normalization of the weights w_i is imposed by either an explicit rescaling of all weights after a learning step, a constraint on the summed weights (e.g., \sum_i w_i = c or \sum_i w_i^2 = c for some constant c), or an explicit decay term proportional to the weight itself. In recent simulation studies of spike-based learning (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999a; Song, Miller, & Abbott, 2000; Xie & Seung, 2000; Kempter, Leibold, Wagner, & van Hemmen, 2001), however, intrinsic normalization properties of synaptic plasticity have been found. Neither an explicit normalization step nor a constraint on the summed weights was needed. So we face the question: How can we understand these findings? Some preliminary arguments as to why intrinsic normalization occurs in spike-based learning rules have been given (Gerstner et al., 1998; Kempter et al., 1999a; Song et al., 2000). In this article, we study normalization properties in more detail.
In particular, we show that in spike-based plasticity with realistic learning windows, the procedure of continuously strengthening and weakening the synapses can automatically lead to a normalization of the total input strength to the postsynaptic neuron in a competitive self-organized process and, hence, to a stabilization of the output firing rate. In order to phrase the problem of normalization in a general framework, let us start with a rate-based learning rule of the form

\tau_w \frac{d}{dt} w_i(t) = a_0 + a_1^{in} \lambda_i^{in}(t) + a_1^{out} \lambda^{out}(t) + a_2^{corr} \lambda_i^{in}(t)\, \lambda^{out}(t) + a_2^{in} [\lambda_i^{in}(t)]^2 + a_2^{out} [\lambda^{out}(t)]^2,   (1.1)
which can be seen as an expansion of local adaptation rules up to second order in the input rates \lambda_i^{in} (1 \le i \le N) and the output rate \lambda^{out} (see Bienenstock, Cooper, & Munro, 1982; Linsker, 1986; Sejnowski & Tesauro, 1989). (A definition of rate will be given below.) The timescale of learning is set by \tau_w. The coefficients a_0, a_1^{in}, a_1^{out}, a_2^{corr}, a_2^{in}, and a_2^{out} may, and in general will, depend on the weights w_i. For example, in Oja's rule (Oja, 1982), we have
a_2^{corr} = 1 and a_2^{out}(w_i) = -w_i, while all other coefficients vanish; for linear neurons, the weight vector converges to a unit vector. Here we study learning rules of the form of equation 1.1, where the coefficients do not depend on the weights; we do, however, introduce upper and lower bounds for the weights (e.g., dw_i/dt = 0 if w_i \ge w_{\max} or w_i \le 0 for an excitatory synapse) so as to exclude runaway of individual weight values. As a first issue, we argue that the focus on normalization of the weights, be it in the linear form \sum_i w_i or in the quadratic form \sum_i w_i^2, is too narrow. A broad class of learning rules will naturally stabilize the output rate rather than the weights. As an example, let us consider a learning rule of the form

\tau_w \frac{d}{dt} w_i(t) = -\left[\lambda^{out}(t) - \lambda_c\right] \lambda_i^{in}   (1.2)
with constant rates \lambda_i^{in} (1 \le i \le N). Equation 1.2 is a special case of equation 1.1. For excitatory synapses, \lambda^{out} increases with w_i; more precisely, \lambda^{out}(w_1, \ldots, w_N) is an increasing function of each of the w_i. Hence learning stops if \lambda^{out} approaches \lambda_c > 0, and the fixed point \lambda^{out} = \lambda_c is stable. In the special case that all input lines i have an equal rate \lambda_i^{in} = \lambda^{in} independent of i, stabilization of the mean output rate implies the normalization of the sum \sum_i w_i. Section 3 will make this argument more precise. A disadvantage of the learning rule in equation 1.2 is that it has a correlation coefficient a_2^{corr} < 0. Thus, in a pure rate description, the learning rule would be classified as anti-Hebbian rather than Hebbian. As a second issue, we show in section 4 that for spike-based learning rules, the strict distinction between Hebbian and anti-Hebbian learning rules becomes questionable. A learning rule with a realistic time window (Levy & Stewart, 1983; Markram, Lübke, Frotscher, & Sakmann, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Bi & Poo, 1998; Feldman, 2000) can be anti-Hebbian in a time-averaged sense and still pick up positive "Hebbian" correlations between input and output spikes (Gerstner, Kempter, et al., 1996; Kempter et al., 1999a; Kempter, Gerstner, & van Hemmen, 1999b; Kempter, Gerstner, van Hemmen, & Wagner, 1996). These correlations between input and output spikes that enter the learning dynamics will be calculated in this article for a linear Poisson neuron and for a noisy spiking neuron with refractoriness. Spike-based rules open the possibility of a direct comparison of model parameters with experiments (Senn, Tsodyks, & Markram, 1997; Senn, Markram, & Tsodyks, 2001). A theory of spike-based learning rules has been developed in Gerstner, Ritz, and van Hemmen (1993), Häfliger, Mahowald, and Watts (1997), Ruf and Schmitt (1997), Senn et al.
(1997, 2001), Eurich, Pawelzik, Ernst, Cowan, and Milton (1999), Kempter et al. (1999a), Roberts (1999), Kistler and van Hemmen (2000), and Xie and Seung (2000); for a review, see van Hemmen (2000). Spike-based learning rules are closely related to rules for sequence learning (Herz, Sulzer, Kühn, & van Hemmen,
1988, 1989; van Hemmen et al., 1990; Gerstner et al., 1993; Abbott & Blum, 1996; Gerstner & Abbott, 1997), where the idea of asymmetric learning windows is exploited. In section 4 a theory of spike-based learning is outlined and applied to the problem of weight normalization by rate stabilization.

2 Input Scenario
We consider a single neuron that receives input from N synapses with weights w_i, where 1 \le i \le N is the index of the synapse. At each synapse i, input spikes arrive stochastically. In the spike-based description of section 4, we will model the input spike train at synapse i as an inhomogeneous Poisson process with rate \lambda_i^{in}(t). In the rate description of section 3, the input spike train is replaced by the continuous rate variable \lambda_i^{in}(t). Throughout the article, we focus on two different input scenarios: the static-pattern scenario and the translation-invariant scenario. In both scenarios, input rates are defined as statistical ensembles that we now describe.

2.1 Static-Pattern Scenario. The static-pattern scenario is the standard framework for the theory of unsupervised Hebbian learning (Hertz, Krogh, & Palmer, 1991). The input ensemble contains p patterns defined as vectors x^\mu \in R^N, where x^\mu = (x_1^\mu, x_2^\mu, \ldots, x_N^\mu)^T with pattern label 1 \le \mu \le p. Time is discretized in intervals \Delta t. In each time interval \Delta t, a pattern is chosen randomly from the ensemble of patterns and applied at the input; during application of pattern \mu, the input rate at synapse i is \lambda_i^{in} = x_i^\mu. Temporal averaging over a time T yields the average input rate at each synapse:
\bar\lambda_i^{in}(t) := \frac{1}{T} \int_{t-T}^{t} dt'\, \lambda_i^{in}(t').   (2.1)

For long intervals T \gg p\,\Delta t, the average input rate is constant, \bar\lambda_i^{in}(t) = \bar\lambda_i^{in}, and we find
\bar\lambda_i^{in} = \frac{1}{p} \sum_{\mu=1}^{p} x_i^{\mu}.   (2.2)
An overbar will always denote temporal averaging over long intervals T. The normalized correlation coefficient of the input ensemble is

C_{ij}^{static} = \frac{1}{p} \sum_{\mu=1}^{p} \frac{(x_i^{\mu} - \bar\lambda_i^{in})(x_j^{\mu} - \bar\lambda_j^{in})}{\bar\lambda_i^{in}\, \bar\lambda_j^{in}} = C_{ji}^{static}.   (2.3)
An explicit example of the static-pattern scenario is given in the appendix. Input correlations play a major role in the theory of unsupervised Hebbian learning (Hertz et al., 1991).
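For the static-pattern scenario, the mean rates of equation 2.2 and the normalized correlation coefficients of equation 2.3 can be computed directly from the pattern matrix. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def input_statistics(patterns):
    """Mean input rates (eq. 2.2) and normalized correlations (eq. 2.3)
    of a static-pattern ensemble; `patterns` has shape (p, N), one row
    per pattern x^mu."""
    p = patterns.shape[0]
    lam = patterns.mean(axis=0)                 # bar-lambda_i^in, eq. 2.2
    dev = patterns - lam                        # x_i^mu - bar-lambda_i^in
    C = (dev.T @ dev) / p / np.outer(lam, lam)  # C_ij^static, eq. 2.3
    return lam, C
```

The resulting matrix is symmetric, C_ij = C_ji, as stated in equation 2.3.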
2.2 Translation-Invariant Scenario. In developmental learning, it is often natural to assume a translation-invariant scenario (Miller, 1996a). In this scenario, the mean input rates are the same at all synapses and (again) constant in time:

\bar\lambda_i^{in}(t) = \bar\lambda^{in} \quad \text{for all } i.   (2.4)
At each synapse i, the actual rate \lambda_i^{in}(t) is time dependent and fluctuates on a timescale \Delta_{corr} around its mean. In other words, the correlation function, defined as

C_{ij}(s) := \frac{\overline{\Delta\lambda_i^{in}(t)\, \Delta\lambda_j^{in}(t-s)}}{\bar\lambda_i^{in}\, \bar\lambda_j^{in}} = C_{ji}(-s),   (2.5)
is supposed to vanish for |s| \gg \Delta_{corr}. Here \Delta\lambda_i^{in}(t) := \lambda_i^{in}(t) - \bar\lambda_i^{in} is the deviation from the mean. The essential hypothesis is that the correlation function is translation invariant, that is, C_{ij} = C_{|i-j|}. Let us suppose, for example, that the ensemble of stimuli consists of objects or patterns of some typical size spanning l input pixels so that the correlation function has a spatial width of order l. If the objects are moved randomly with equal probability in an arbitrary direction across the stimulus array, then the temporal part of the correlation function will be symmetric in that C_{|i-j|}(s) = C_{|i-j|}(-s). Using the static-pattern scenario or the translation-invariant scenario, we are able to simplify the analysis of Hebbian learning rules considerably (rate-based learning in section 3 and spike-based learning in section 4).
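The moving-object picture can be made concrete. An illustrative sketch (the gaussian-bump ensemble is our own construction, not from the text): placing an object of fixed shape with equal probability at every position on a ring of N input pixels yields correlations that depend only on |i − j|, that is, a translation-invariant scenario (equation 2.5 evaluated at s = 0).

```python
import numpy as np

# An "object" of typical size l, modeled as a gaussian bump, placed with
# equal probability at every position on a ring of N input pixels.
N = 16
pixel = np.arange(N)
bump = 1.0 + np.exp(-0.5 * ((pixel - N // 2) / 2.0) ** 2)  # rates > 0
patterns = np.array([np.roll(bump, k) for k in range(N)])  # all positions

lam = patterns.mean(axis=0)   # mean rate: identical at all pixels (eq. 2.4)
dev = patterns - lam
C = (dev.T @ dev) / N / np.outer(lam, lam)
# C[i, j] depends only on (i - j) mod N: translation invariance
```

Because every shift of the object is equally probable, the mean rate is uniform across pixels and C_ij is a function of the pixel separation alone.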
3 Plasticity in Rate Description
In this section, we start with some preliminary considerations formulated on the level of rates and then analyze the rate description proper. Our aim is to show that the learning dynamics defined in equation 1.1 can yield a stable fixed point of the output rate. In order to keep the discussion as simple as possible, we focus mainly on a linearized rate neuron model and use the input scenarios of section 2. The learning rule is that of equation 1.1 with a_2^{in} = a_2^{out} = 0. All synapses are assumed to be of the same type, so that the coefficients a_1^{in}, a_1^{out}, and a_2^{corr} do not depend on the synapse index i. As is usual in the theory of Hebbian learning, we assume that the slow timescale \tau_w of learning and the fast timescale of fluctuations in the input and neuronal output can be separated by the averaging time T. For instance, we assume \tau_w \gg T \gg p\,\Delta t in the static-pattern scenario or \tau_w \gg T \gg \Delta_{corr} in the translation-invariant scenario. A large \tau_w implies that weight changes within a time interval of duration T are small compared to typical values of weights. The right-hand side of equation 1.1 is then "self-averaging"
(Kempter et al., 1999a) with respect to both randomness and time; that is, synaptic changes are driven by statistically and temporally averaged quantities,

\tau_w \frac{d}{dt} w_i(t) = a_0 + a_1^{in} \bar\lambda_i^{in} + a_1^{out} \bar\lambda^{out}(t) + a_2^{corr} \bar\lambda_i^{in}\, \bar\lambda^{out}(t) + a_2^{corr}\, \overline{\Delta\lambda_i^{in}(t)\, \Delta\lambda^{out}(t)}.   (3.1)
"Hebbian" learning of a synapse should be driven by the correlations of pre- and postsynaptic neuron, that is, the last term on the right-hand side of equation 3.1 (cf. Sejnowski, 1977; Sejnowski & Tesauro, 1989; Hertz et al., 1991). This term is of second order in \Delta\lambda^{in}, and it might therefore seem that it is small compared to the term proportional to \bar\lambda_i^{in} \bar\lambda^{out}; but is it really? Let us speculate for the moment that in an ideal neuron, the mean output rate \bar\lambda^{out} is attracted toward an operating point,

\lambda_{FP}^{out} = -\frac{a_0 + a_1^{in} \bar\lambda^{in}}{a_1^{out} + a_2^{corr} \bar\lambda^{in}},   (3.2)
where \bar\lambda^{in} is the typical mean input rate that is identical at all synapses. Then the dominant term on the right-hand side of equation 3.1 would indeed be the covariance term \propto \overline{\Delta\lambda_i^{in} \Delta\lambda^{out}}, because all other terms on the right-hand side of equation 3.1 cancel each other. The idea of a fixed point of the rate \bar\lambda^{out} = \lambda_{FP}^{out} will be made more precise in the following two sections.

3.1 Piecewise-Linear Neuron Model. As a first example, we study the piecewise-linear neuron model with rate function
\lambda^{out}(t) = \left[ \lambda_0 + \frac{c_0}{N} \sum_{i=1}^{N} w_i(t)\, \lambda_i^{in}(t) \right]_+ .   (3.3)
Here, \lambda_0 is a constant, and c_0 > 0 is the slope of the activation function. The normalization by 1/N ensures that the mean output rate stays the same if the number N of synaptic inputs (with identical input rate \bar\lambda^{in}) is increased. In case \lambda_0 < 0, we must ensure explicitly that \lambda^{out} is nonnegative. To do so, we have introduced the notation [\,\cdot\,]_+ with [x]_+ = x for x > 0 and [x]_+ = 0 for x \le 0. In the following, we will always assume that the argument inside the square brackets is positive, drop the brackets, and treat the model as strictly linear. Only at the end of the calculation do we check for consistency, that is, for \lambda^{out} \ge 0. At the end of learning, we also check for lower and upper bounds, 0 \le w_i \le w_{\max} for all i. (For a discussion of the influence of lower and upper bounds for individual weights on the stabilization of the output rate, see section 5.3.)
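The rate function of equation 3.3, including the rectification [·]_+, is straightforward to evaluate. A minimal sketch (function and argument names are ours):

```python
import numpy as np

def output_rate(w, lam_in, lam0, c0):
    """Piecewise-linear rate function of equation 3.3."""
    # [.]_+ rectification: the rate is clipped at zero from below
    return max(0.0, lam0 + c0 * np.dot(w, lam_in) / len(w))
```

For example, with N = 4 unit weights, inputs at 10 Hz, lam0 = −5, and c0 = 1, the output rate is −5 + 10 = 5; setting the weights to zero rectifies the negative argument to 0.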
Since learning is assumed to be slow (\tau_w \gg \Delta_{corr}), the weights w_i do not change on the fast timescale of input fluctuations. Hence the correlation term \overline{\Delta\lambda_i^{in}\, \Delta\lambda^{out}} in equation 3.1 can be evaluated for constant w_i. Using equation 3.3, the definition \Delta\lambda^{out}(t) := \lambda^{out}(t) - \bar\lambda^{out}(t), and the definition of the input correlations C_{ij} as given by equation 2.5, we obtain

\overline{\Delta\lambda_i^{in}(t)\, \Delta\lambda^{out}(t)} = \frac{c_0}{N}\, \bar\lambda_i^{in} \sum_{j=1}^{N} w_j(t)\, \bar\lambda_j^{in}\, C_{ij}(0).   (3.4)
We now want to derive the conditions under which the mean output rate $\bar\lambda^{\rm out}$ approaches a fixed point $\lambda_{\rm FP}^{\rm out}$. Let us define, whatever $j$, the “average correlation”

$$C := \frac{\sum_{i=1}^{N}\left(\lambda_i^{\rm in}\right)^2 C_{ij}(0)}{\sum_{i=1}^{N}\left(\lambda_i^{\rm in}\right)^2}\,; \qquad (3.5)$$

that is, we require that the average correlation be independent of the index $j$. This condition may look artificial, but it is, in fact, a rather natural assumption. In particular, for the translation-invariant scenario, equation 3.5 always holds. We use equations 3.3 through 3.5 in equation 3.1, multiply by $c_0 N^{-1}\lambda_i^{\rm in}$, and sum over $i$. Rearranging the result, we obtain
$$\tau\,\frac{d}{dt}\,\bar\lambda^{\rm out}(t) = \lambda_{\rm FP}^{\rm out} - \bar\lambda^{\rm out}(t) \qquad (3.6)$$

with fixed point

$$\lambda_{\rm FP}^{\rm out} := \frac{\tau\,c_0}{\tau_w N}\left[a_0\sum_{i=1}^{N}\lambda_i^{\rm in} + \left(a_1^{\rm in} - C\,a_2^{\rm corr}\lambda_0\right)\sum_{i=1}^{N}\left(\lambda_i^{\rm in}\right)^2\right] \qquad (3.7)$$

and time constant

$$\tau := -\tau_w\left[\frac{c_0}{N}\left(a_1^{\rm out}\sum_{i=1}^{N}\lambda_i^{\rm in} + (1+C)\,a_2^{\rm corr}\sum_{i=1}^{N}\left(\lambda_i^{\rm in}\right)^2\right)\right]^{-1}. \qquad (3.8)$$
The fixed point $\lambda_{\rm FP}^{\rm out}$ is asymptotically stable if and only if $\tau > 0$. For the translation-invariant scenario, the mean input rates are the same at all synapses so that the fixed point 3.7 reduces to

$$\lambda_{\rm FP}^{\rm out} = -\frac{a_0 + a_1^{\rm in}\lambda^{\rm in} - C\,a_2^{\rm corr}\lambda_0\,\lambda^{\rm in}}{a_1^{\rm out} + a_2^{\rm corr}\lambda^{\rm in}(1+C)}\,. \qquad (3.9)$$
This is the generalization of equation 3.2 to the case of nonvanishing correlations.
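A quick numerical sanity check of this fixed point is possible by integrating the averaged learning dynamics for the translation-invariant case. The sketch below assumes uniform correlations, $C_{ij}(0) = C$ for all pairs, and invented parameter values (an instance of example 2 below: $a_2^{\rm corr} > 0$ with a sufficiently negative $a_1^{\rm out}$):

```python
# Euler integration (invented parameters) of the averaged rule in the
# translation-invariant scenario: all synapses see the same mean rate lam
# and uniform correlations C, so only the mean weight w matters.

a0, a1_in, a1_out, a2 = 0.0, 0.1, -0.5, 0.01   # a1_out < 0 makes tau > 0
lam0, c0, lam, C = 1.0, 1.0, 10.0, 0.2
tau_w, dt = 100.0, 0.1

# Fixed point predicted by equation 3.9:
lam_fp = -(a0 + a1_in * lam - C * a2 * lam0 * lam) / (a1_out + a2 * lam * (1 + C))

w = 0.5                                         # mean weight, arbitrary start
for _ in range(20000):
    lam_out = lam0 + c0 * w * lam               # equation 3.3, brackets dropped
    # covariance term via equation 3.4: C * lam * (lam_out - lam0)
    dw = (a0 + a1_in * lam + a1_out * lam_out
          + a2 * (lam * lam_out + C * lam * (lam_out - lam0))) / tau_w
    w += dt * dw

lam_final = lam0 + c0 * w * lam                 # converges to lam_fp
```

The integration settles on the closed-form value of equation 3.9, confirming that the rate, and hence the mean weight, is attracted to the fixed point.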
R. Kempter, W. Gerstner, and J. L. van Hemmen
We require $\lambda_{\rm FP}^{\rm out} > 0$ so as to ensure that we are in the linear regime of our piecewise-linear neuron model. For excitatory synapses, we require in addition that the weights $w_i$ are positive. Hence, a realizable fixed point should have a rate $\lambda_{\rm FP}^{\rm out} > \max\{0, \lambda_0\}$. On the other hand, the weight values being bounded by $w^{\max}$, the fixed point must be smaller than $\lambda_0 + c_0 w^{\max}\lambda^{\rm in}$, since otherwise it cannot be reached: the learning dynamics would stop once all weights have reached their upper bound. Finally, the stability of the fixed point requires $\tau > 0$. The above requirements define conditions on the parameters $a_0$, $a_1^{\rm in}$, $a_1^{\rm out}$, $a_2^{\rm corr}$, and $\lambda_0$ that we now explore in two examples. We note that a fixed point of the output rate implies a fixed point of the average weight $\bar w := \sum_j \lambda_j^{\rm in} w_j / \sum_j \lambda_j^{\rm in}$ (cf. equation 3.3).

Example 1 (no linear terms).
In standard correlation-based learning, there are usually no linear terms. Let us therefore set $a_0 = a_1^{\rm in} = a_1^{\rm out} = 0$. We assume that the average correlation is nonnegative, $C \ge 0$. Since we have $c_0 > 0$, the term inside the square brackets of equation 3.8 must be negative in order to guarantee asymptotic stability of the fixed point. Since $\lambda_i^{\rm in} > 0$ for all $i$, a stable fixed point $\lambda_{\rm FP}^{\rm out}$ can therefore be achieved only if $a_2^{\rm corr} < 0$, so that we end up with anti-Hebbian learning. The fixed point is

$$\lambda_{\rm FP}^{\rm out} = \lambda_0\,\frac{C}{1+C}\,. \qquad (3.10)$$
Since the fixed point should occur at a positive rate, we must require $\lambda_0 > 0$. In addition, because $\lambda_{\rm FP}^{\rm out} < \lambda_0$, some of the weights $w_i$ must be negative at the fixed point, which contradicts the assumption of excitatory synapses (see equation 3.3). We note that for $C \to 0$, the output rate approaches zero; that is, the neuron becomes silent. In summary, without linear terms, a rate model of correlation-based learning cannot show rate stabilization unless some weights are negative and $a_2^{\rm corr} < 0$ (anti-Hebbian learning).

Example 2 (Hebbian learning).
In standard Hebbian learning, the correlation term should have positive sign. We therefore set $a_2^{\rm corr} > 0$. As before, we assume $C > 0$. From $\tau > 0$ in equation 3.8, we then obtain the condition that $a_1^{\rm out}$ be sufficiently negative in order to make the fixed point stable. In particular, we must have $a_1^{\rm out} < 0$. Hence, firing-rate stabilization in rate-based Hebbian learning requires a linear “non-Hebbian” term $a_1^{\rm out} < 0$.

3.2 Nonlinear Neuron Model. So far we have focused on a piecewise-linear rate model. Can we generalize the above arguments to a nonlinear neuron model? To keep the arguments transparent, we restrict our analysis to the translation-invariant scenario and assume identical mean input rates, $\lambda_i^{\rm in} = \lambda^{\rm in}$ for all $i$. As a specific example, we consider a neuron model with a sigmoidal activation (or gain) function $\lambda^{\rm out} = g(u)$, where
Figure 1: Sigmoidal activation function $g$. We plot the output rate $\lambda^{\rm out}$ (solid line) as a function of the neuron’s input $u := N^{-1}\sum_i w_i\lambda_i^{\rm in}$. The spontaneous output rate at $u = 0$ is $\lambda(0)$, and the maximum output rate is $\lambda^{\max}$. The dashed line is a linearization of $g$ around a mean input $\bar u > 0$. We note that here $\lambda_0(\bar w) < 0$; cf. equation 3.11 and examples 1 and 4.
$u(t) = N^{-1}\sum_{i=1}^{N} w_i(t)\,\lambda_i^{\rm in}(t)$ and the derivative $g'(u)$ is positive for all $u$. For vanishing input, $u = 0$, we assume a spontaneous output rate $\lambda^{\rm out} = \lambda(0) > 0$. For $u \to -\infty$, the rate vanishes. For $u \to \infty$, the function $g$ approaches the maximum output rate $\lambda^{\rm out} = \lambda^{\max}$ (see Figure 1). Let us set $\bar w = N^{-1}\sum_j w_j$. For a given mean weight $\bar w$, we obtain a mean input $\bar u = \bar w\,\lambda^{\rm in}$. If the fluctuations $\Delta\lambda_i^{\rm in}(t) := \lambda_i^{\rm in}(t) - \lambda^{\rm in}$ are small, we can linearize $g(u)$ around the mean input $\bar u$, that is, $g(u) = g(\bar u) + g'(\bar u)(u - \bar u)$ (cf. Figure 1). The linearization leads us back to equation 3.3,
$$\lambda^{\rm out} = \lambda_0(\bar w) + c_0(\bar w)\,\frac{1}{N}\sum_{i=1}^{N} w_i\,\lambda_i^{\rm in}, \qquad (3.11)$$
with $\lambda_0(\bar w) = g(\bar u) - g'(\bar u)\,\bar u$ and $c_0(\bar w) = g'(\bar u)$. The discussion of output rate stabilization proceeds as in the linear case. Does a fixed point exist? The condition $d\bar\lambda^{\rm out}/dt = 0$ yields a fixed point that is given by equation 3.9 with $\lambda_0$ replaced by $\lambda_0(\bar w)$. Thus, equation 3.9 becomes an implicit equation for $\lambda_{\rm FP}^{\rm out}$. A sigmoidal neuron can reach the fixed point if the latter lies in the range $0 < \lambda_{\rm FP}^{\rm out} < \lambda^{\max}$. The local stability analysis does not change. As before, a fixed point of $\bar\lambda^{\rm out}$ implies a fixed point of the mean weight $\bar w$. The statement follows directly from equation 3.11 since $d\lambda^{\rm out}/dt = g'(\bar u)\,\lambda^{\rm in}\,N^{-1}\sum_i dw_i/dt$ and $g' > 0$. That is, stabilization of the output rate implies a stabilization of the weights.
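The linearization step can be made concrete with a toy sigmoid; the logistic form below is our own choice for illustration, since the paper only assumes $g' > 0$:

```python
# Effective offset lam0(w_bar) and slope c0(w_bar) of equation 3.11,
# computed for an illustrative logistic gain function (invented parameters).

import math

lam_max, theta = 100.0, 2.0

def g(u):
    """Logistic gain: 0 for u -> -inf, lam_max for u -> +inf."""
    return lam_max / (1.0 + math.exp(-(u - theta)))

def g_prime(u):
    s = g(u) / lam_max
    return lam_max * s * (1.0 - s)

u_bar = 3.0                                    # mean input for the linearization
lam0_eff = g(u_bar) - g_prime(u_bar) * u_bar   # lam0(w_bar)
c0_eff = g_prime(u_bar)                        # c0(w_bar)

# The tangent line reproduces g near u_bar:
for du in (-0.05, 0.0, 0.05):
    u = u_bar + du
    assert abs((lam0_eff + c0_eff * u) - g(u)) < 0.1
```

At $\bar u$ itself the tangent reproduces $g$ exactly; for small fluctuations around $\bar u$ it is a good approximation, which is all equation 3.11 requires.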
Let us summarize this section. The value of the fixed point $\lambda_{\rm FP}^{\rm out}$ is identical for both the linear and the linearized rate model and is given by equation 3.9. It is stable if $\tau$ in equation 3.8 is positive. If we kept higher-order terms in the expansion of equation 3.11, the fixed point might shift slightly, but as long as the nonlinear corrections are small, the fixed point remains stable (due to the continuity of the differential equation, 1.1). This highlights the fact that the learning rule, equation 3.1, with an appropriate choice of parameters achieves a stabilization of the output rate rather than a true normalization of the synaptic weights.

4 Plasticity Driven by Spikes
In this section, we generalize the concept of intrinsic rate stabilization to spike-time-dependent plasticity. We start by reviewing a learning rule where synaptic modifications are driven by the relative timing of pre- and postsynaptic spikes and defined by means of a learning window (Gerstner, Kempter, et al., 1996; Kempter et al., 1996); extensive experimental evidence supports this idea (Levy & Stewart, 1983; Bell, Han, Sugawara, & Grant, 1997; Markram et al., 1997; Zhang et al., 1998; Debanne et al., 1998; Bi & Poo, 1998, 1999; Egger, Feldmeyer, & Sakmann, 1999; Abbott & Munro, 1999; Feldman, 2000). The comparison of the spike-based learning rule developed below with the rate-based learning rule discussed above allows us to give the coefficients $a_1^{\rm in}$, $a_1^{\rm out}$, and $a_2^{\rm corr}$ in equation 3.1 a more precise meaning. In particular, we show that $a_2^{\rm corr}$ corresponds to the integral over the learning window. Anti-Hebbian learning in the above rate model (see example 1) can thus be identified with a negative integral over the learning window (Gerstner, Kempter, et al., 1996; Gerstner, Kempter, van Hemmen, & Wagner, 1997; Kempter et al., 1999a, 1999b).

4.1 Learning Rule. We are going to generalize equation 1.1 to spike-based learning. Changes in synaptic weights $w_i$ are triggered by input spikes, output spikes, and the time differences between input and output spikes. To simplify the notation, we write the sequence $\{t_i^1, t_i^2, \ldots\}$ of spike arrival times at synapse $i$ in the form of a “spike train” $S_i^{\rm in}(t) := \sum_m \delta(t - t_i^m)$ consisting of $\delta$ functions (real spikes of course have a finite width, but what counts here is the event “spike”). Similarly, the sequence of output spikes is denoted by $S^{\rm out}(t) := \sum_n \delta(t - t^n)$. These notations allow us to formulate the spike-based learning rule as (Kempter et al., 1999a; Kistler & van Hemmen, 2000; van Hemmen, 2000)
$$\tau_w\,\frac{d}{dt}\,w_i(t) = a_0 + a_1^{\rm in} S_i^{\rm in}(t) + a_1^{\rm out} S^{\rm out}(t) + S_i^{\rm in}(t)\int_{-\infty}^{t} dt'\,W(t - t')\,S^{\rm out}(t') + S^{\rm out}(t)\int_{-\infty}^{t} dt'\,W(-t + t')\,S_i^{\rm in}(t'). \qquad (4.1)$$
Intrinsic Stabilization of Output Rates
2719
As in equation 1.1, the coefficients $a_0$, $a_1^{\rm in}$, $a_1^{\rm out}$, and the learning window $W$ in general depend on the current weight value $w_i$. They may also depend on other local variables, such as the membrane potential or the calcium concentration. For the sake of clarity of presentation, we drop these dependencies and assume constant coefficients. We can, however, assume upper and lower bounds for the synaptic efficacies $w_i$; that is, weight changes are zero if $w_i > w^{\max}$ or $w_i < 0$ (see also the discussion in section 5.3). Again for the sake of clarity, in the remainder of this section we presume that all weights are within the bounds. How can we interpret the terms on the right in equation 4.1? The coefficient $a_0$ is a simple decay or growth term that we will drop. $S_i^{\rm in}$ and $S^{\rm out}$ are sums of $\delta$ functions. We recall that integration of a differential equation $dx/dt = f(x, t) + a\,\delta(t)$ with arbitrary $f$ yields a discontinuity at zero: $x(0^+) - x(0^-) = a$. Thus, an input spike arriving at synapse $i$ at time $t$ changes $w_i$ at that time by a constant amount $\tau_w^{-1} a_1^{\rm in}$ and a variable amount $\tau_w^{-1}\int_{-\infty}^{t} dt'\,W(t - t')\,S^{\rm out}(t')$, which depends on the sequence of output spikes occurring at times earlier than $t$. Similarly, an output spike at time $t$ results in a weight change $\tau_w^{-1}\,[a_1^{\rm out} + \int_{-\infty}^{t} dt'\,W(-t + t')\,S_i^{\rm in}(t')]$. Our formulation of learning assumes discontinuous and instantaneous weight changes at the times of occurrence of spikes, which may not seem very realistic. The formulation can be generalized to continuous and delayed weight changes that are triggered by pre- and postsynaptic spikes. Provided the gradual weight change terminates after a time that is small compared to the timescale of learning, all results derived below would be basically the same.
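An event-driven reading of equation 4.1 is easy to sketch: each spike triggers an instantaneous weight jump, with the window evaluated against all earlier spikes of the other type. The window shape and all constants below are illustrative, not fitted to data:

```python
# Event-driven sketch of the spike-based rule, equation 4.1 (a0 dropped).

import math

def W(s, A_plus=1.0, A_minus=-0.6, tau=10.0):
    """Two-phase learning window, s = t_pre - t_post in ms (invented shape)."""
    return A_plus * math.exp(s / tau) if s < 0 else A_minus * math.exp(-s / tau)

def run_rule(in_spikes, out_spikes, w=0.5, a1_in=0.0, a1_out=0.0, tau_w=100.0):
    """Apply the weight jumps of equation 4.1 in temporal order."""
    events = sorted([(t, 'in') for t in in_spikes] +
                    [(t, 'out') for t in out_spikes])
    for t, kind in events:
        if kind == 'in':    # pairs with earlier output spikes: s = t - t_o > 0
            w += (a1_in + sum(W(t - t_o) for t_o in out_spikes if t_o < t)) / tau_w
        else:               # pairs with earlier input spikes: s = t_i - t < 0
            w += (a1_out + sum(W(t_i - t) for t_i in in_spikes if t_i < t)) / tau_w
    return w

w_pot = run_rule([10.0], [15.0])   # pre 5 ms before post: positive lobe of W
w_dep = run_rule([15.0], [10.0])   # post before pre: negative lobe of W
```

A pre-before-post pairing falls in the positive lobe of $W$ and potentiates the synapse; the reversed pairing falls in the negative lobe and depresses it.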
Throughout what follows, we assume that discontinuous weight changes are small compared to typical values of synaptic weights, meaning that learning is by small increments, which can be achieved for large $\tau_w$. If $\tau_w$ is large, we can separate the timescale of learning from that of the neuronal dynamics, and the right-hand side of equation 4.1 is “self-averaging” (Kempter et al., 1999a; van Hemmen, 2000). The evolution of the weight vector (see equation 4.1) is then
$$\tau_w\,\frac{d}{dt}\,w_i(t) = a_1^{\rm in}\,\overline{\langle S_i^{\rm in}\rangle}(t) + a_1^{\rm out}\,\overline{\langle S^{\rm out}\rangle}(t) + \int_{-\infty}^{\infty} ds\,W(s)\,\overline{\langle S_i^{\rm in}(t)\,S^{\rm out}(t - s)\rangle}. \qquad (4.2)$$
Our notation with angular brackets plus horizontal bar is intended to emphasize that we average over both the spike statistics given the rates (angular brackets) and the ensemble of rates (horizontal bar). An example will be given in section 4.4. Because weights change slowly and the statistical input ensemble is assumed to have stationary correlations, the two integrals in equation 4.1 have been replaced by a single integral in equation 4.2.
The formulation of the learning rule in equation 4.2 allows a direct link to rate-based learning as introduced in equations 1.1 and 3.1. Indeed, the averaged spike trains $\overline{\langle S_i^{\rm in}\rangle}(t)$ and $\overline{\langle S^{\rm out}\rangle}(t)$ are identical to the mean firing rates $\lambda_i^{\rm in}(t)$ and $\bar\lambda^{\rm out}(t)$, respectively, as defined in sections 2 and 3. In order to interpret the correlation terms, we have to be more careful. The terms $\overline{\langle S_i^{\rm in} S^{\rm out}\rangle}$ describe the correlation between input and output on the level of spikes. In section 4.3, we will give a detailed treatment of the correlation term. In the next section, we simplify the correlation term under the assumption of slowly changing firing rates.

4.2 Correlations in Rate-Based Learning. In this section, we relate the correlation term $a_2^{\rm corr}$ in equation 3.1 to the integral over the learning window $W$ in equation 4.2. In order to make the transition from equation 4.2 to equation 3.1, two approximations are necessary (Kempter et al., 1999a, 1999b). First, we have to neglect correlations between input and output spikes apart from the correlations contained in the rates. That is, we make the approximation $\overline{\langle S_i^{\rm in}(t)\,S^{\rm out}(t-s)\rangle} \approx \overline{\lambda_i^{\rm in}(t)\,\lambda^{\rm out}(t-s)}$, the horizontal overbar denoting an average over the learning time. Second, we assume that these rates change slowly as compared to the width of the learning window $W$ (cf. Figure 2a). Consequently, we set $\lambda^{\rm out}(t-s) \approx \lambda^{\rm out}(t) - s\,[d\lambda^{\rm out}(t)/dt]$ in the integrand of equation 4.2. With the above two assumptions, we obtain
$$\tau_w\,\frac{d}{dt}\,w_i(t) = a_1^{\rm in}\lambda_i^{\rm in} + a_1^{\rm out}\bar\lambda^{\rm out}(t) + \overline{\lambda_i^{\rm in}(t)\,\lambda^{\rm out}(t)}\int_{-\infty}^{\infty} ds\,W(s) - \overline{\lambda_i^{\rm in}(t)\,\frac{d}{dt}\lambda^{\rm out}(t)}\int_{-\infty}^{\infty} ds\,s\,W(s). \qquad (4.3)$$
The last term on the right in equation 4.3 has been termed “differential-Hebbian” (Xie & Seung, 2000). Equation 4.3 is equivalent to equation 3.1 if we neglect the differential-Hebbian term and set $a_2^{\rm corr} = \int ds\,W(s)$ and $a_0 = 0$. We note, however, that $\overline{\lambda_i^{\rm in}(t)\,\lambda^{\rm out}(t)} = \lambda_i^{\rm in}\cdot\bar\lambda^{\rm out}(t) + \overline{\Delta\lambda_i^{\rm in}(t)\,\Delta\lambda^{\rm out}(t)}$ still contains the correlations between the fluctuations in the input and output rates (but we had to assume that these correlations are slow compared to the width of the learning window). The condition of slowly changing rates will be dropped in the next section. We saw in example 1 that with rate-based learning, intrinsic normalization of the output rate is achievable once $a_2^{\rm corr} < 0$. We now argue that $a_2^{\rm corr} < 0$, which has been coined “anti-Hebbian” in the framework of rate-based learning, can still be “Hebbian” in terms of spike-based learning. The intuitive reason is that the learning window $W(s)$ as sketched in Figure 2a has two parts, a positive and a negative one. Even if the integral over the learning window is negative, the size of weight changes in the positive part
[Figure 2 appears here: panel (a) plots $W(s)$ (arbitrary units) versus $s$ (ms); panel (b) plots firing rate (arbitrary units) versus $s$ (ms).]
Figure 2: (a) Learning window $W$ (arbitrary units) as a function of the delay $s = t_i^m - t^n$ between presynaptic spike arrival time $t_i^m$ and postsynaptic firing time $t^n$ (schematic). If $W(s)$ is positive (negative) for some $s$, the synaptic efficacy $w_i$ is increased (decreased). The increase of $w_i$ is most efficient if a presynaptic spike arrives a few milliseconds before the postsynaptic neuron starts firing. For $|s| \to \infty$, we have $W(s) \to 0$. The bar denotes the width of the learning window. The qualitative time course of the learning window is supported by experimental results (Levy & Stewart, 1983; Markram et al., 1997; Zhang et al., 1998; Debanne et al., 1998; Bi & Poo, 1998; Feldman, 2000; see also the reviews by Brown & Chattarji, 1994; Linden, 1999; Paulsen & Sejnowski, 2000; and Bi & Poo, 2001). The learning window can be described theoretically by a phenomenological model with microscopic variables (Senn et al., 1997, 2001; Gerstner, Kempter, van Hemmen, & Wagner, 1998). (b) Spike-spike correlations (schematic). The correlation function $\mathrm{Corr}_i(s, w)/\lambda_i^{\rm in}$ (full line) defined in equation 4.10 is the sum of two terms: (1) the causal contribution of an input spike at synapse $i$ at time $s$ to the output firing rate at time $s = 0$, given by the time-reverted EPSP, that is, $c_0 N^{-1} w_i\,\epsilon(-s)$ (dotted line); (2) the correlations between the mean firing rates, $\overline{\lambda_i^{\rm in}(t)\,\lambda^{\rm out}(t-s)}/\lambda_i^{\rm in}$, indicated by the dashed line. For independent inputs with stationary mean, the rate contribution would be constant and equal to $\bar\lambda^{\rm out}$. In the figure, we have sketched the situation where the rates themselves vary with time.
can be large enough to pick up correlations. In other words, the two approximations necessary to make the transition from equation 4.2 to equation 4.3 are, in general, not justified. An example is given in the appendix. In order to make our argument more precise, we have to return to spike-based learning.

4.3 Spike-Spike Correlations. In the previous section, we replaced $\overline{\langle S_i^{\rm in}(t)\,S^{\rm out}(t-s)\rangle}$ by the temporal correlation between slowly changing rates. This replacement leads to an oversimplification of the situation because it does not take into account the correlations on the level of spikes or fast changes of the rates. Here we present the correlation term in its general form. Just as in section 3, we require learning to be a slow process so that $\tau_w \gg p\,\Delta t$ or $\tau_w \gg \Delta^{\rm corr}$ (see section 2 for a definition of $p\,\Delta t$ and $\Delta^{\rm corr}$). The correlation term can then be evaluated for constant weights $w_i$, $1 \le i \le N$. For input $S_i^{\rm in}$ with some given statistical properties that do not change on the slow timescale of learning, the correlations between $S_i^{\rm in}(t)$ and $S^{\rm out}(t-s)$ depend on only the time difference $s$ and the current weight vector $w(t)$,

$$\mathrm{Corr}_i[s, w(t)] := \overline{\langle S_i^{\rm in}(t)\,S^{\rm out}(t-s)\rangle}. \qquad (4.4)$$
The horizontal bar refers, as before, to the temporal average introduced through the separation of timescales (cf. equation 2.1). Substituting equation 4.4 into equation 4.2, we obtain the dynamics of weight changes,

$$\tau_w\,\frac{d}{dt}\,w_i(t) = a_1^{\rm in}\lambda_i^{\rm in} + a_1^{\rm out}\bar\lambda^{\rm out}(t) + \int_{-\infty}^{\infty} ds\,W(s)\,\mathrm{Corr}_i[s, w(t)]. \qquad (4.5)$$
Equation 4.5 is completely analogous to equation 3.1 except that the correlations between rates have been replaced by those between spikes. We may summarize equation 4.5 by saying that learning is driven by correlations on the timescale of the learning window.

4.4 Poisson Neuron. To proceed with the analysis of equation 4.5, we need to determine the correlations $\mathrm{Corr}_i$ between input spikes at synapse $i$ and output spikes. The correlations depend strongly on the neuron model under consideration. As a generalization of the linear rate-based neuron model in section 3.1, we study an inhomogeneous Poisson model that generates spikes at a rate $\lambda_{S^{\rm in}}^{\rm out}(t)$. For a membrane potential $u(t) \le \vartheta$, the model neuron is quiescent. For $u > \vartheta$, the rate $\lambda_{S^{\rm in}}^{\rm out}(t)$ is proportional to $u - \vartheta$. We call this model the piecewise-linear Poisson neuron or, for short, the Poisson neuron.
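The claim of section 4.2, that a window with a negative integral can still act Hebbian when the input-output correlations are concentrated in its positive lobe, can be checked with a small numerical experiment. Window and correlation shapes below are invented for illustration:

```python
# A window whose total integral is negative still produces net potentiation
# when correlations peak at small negative s (pre just before post).

import math

def W(s):
    """Two-phase learning window (arbitrary units); integral is negative."""
    return math.exp(s / 10.0) if s < 0 else -0.5 * math.exp(-s / 25.0)

def corr(s):
    """Input-output correlations peaked at s = -5 ms (pre leads post)."""
    return math.exp(-((s + 5.0) ** 2) / 8.0)

ds = 0.01
grid = [-200.0 + k * ds for k in range(int(400.0 / ds))]
integral_W = sum(W(s) for s in grid) * ds       # about 10 - 12.5 < 0
drive = sum(W(s) * corr(s) for s in grid) * ds  # net effect on the weight: > 0

assert integral_W < 0 < drive
```

In rate-based terms this window would be classified as anti-Hebbian ($a_2^{\rm corr} = \int ds\,W(s) < 0$), yet correlations on the timescale of the window drive the weight upward.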
4.4.1 Definition. The input to the neuron consists of $N$ Poissonian spike trains with time-dependent intensities $\langle S_i^{\rm in}\rangle(t) = \lambda_i^{\rm in}(t)$, $1 \le i \le N$ (for the mathematics of an inhomogeneous Poisson process, we refer to appendix A of Kempter, Gerstner, van Hemmen, & Wagner, 1998). As in the rate model of section 3, the normalized correlation function for the rate fluctuations at synapses $i$ and $j$ is defined by equation 2.5. The input scenarios are the same as in section 2: the static-pattern scenario or the translation-invariant scenario.

A spike arriving at time $t_i^f$ at synapse $i$ evokes a postsynaptic potential (PSP) with time course $w_i\,\epsilon(t - t_i^f)$, which we assume to be excitatory (EPSP). We
impose the condition $\int_0^{\infty} ds\,\epsilon(s) = 1$ and require causality, that is, $\epsilon(s) = 0$ for $s < 0$. The amplitude of the EPSP is given by the synaptic efficacy $w_i \ge 0$. The membrane potential $u$ of the neuron is the linear superposition of all contributions,

$$u(t) = \frac{1}{N}\sum_{i=1}^{N} w_i(t)\sum_f \epsilon(t - t_i^f) = \frac{1}{N}\sum_{i=1}^{N} w_i(t)\int_{-\infty}^{+\infty} ds\,\epsilon(s)\,S_i^{\rm in}(t - s). \qquad (4.6)$$
Since $\epsilon$ is causal, only spike times $t_i^f < t$ count. The sums run over all spike arrival times $t_i^f$ of all synapses. $S_i^{\rm in}(t) = \sum_f \delta(t - t_i^f)$ is the spike train at synapse $i$. In the Poisson neuron, output spikes are generated stochastically with a time-dependent rate $\lambda_{S^{\rm in}}^{\rm out}$ that depends linearly on the membrane potential whenever $u$ is above the threshold $\vartheta = -\lambda_0/c_0$,

$$\lambda_{S^{\rm in}}^{\rm out}(t) = [\lambda_0 + c_0\,u(t)]_+ = \left[\lambda_0 + \frac{c_0}{N}\sum_{i=1}^{N} w_i(t)\sum_f \epsilon(t - t_i^f)\right]_+ . \qquad (4.7)$$
Here $c_0 > 0$ and $\lambda_0$ are constants. As before, the notation with square brackets $[\cdot]_+$ ensures that $\lambda_{S^{\rm in}}^{\rm out}$ be nonnegative. The brackets will be dropped in the following. The subscript $S^{\rm in}$ of $\lambda_{S^{\rm in}}^{\rm out}$ indicates that the output rate depends on the specific realization of the input spike trains. We note that the spike-generation process is independent of previous output spikes. In particular, the Poisson neuron model does not show refractoriness. We drop this restriction and include refractoriness in section 4.6.

4.4.2 Expectation Values. In order to evaluate equations 4.4 and 4.5, we first have to calculate the expectation values $\langle\cdot\rangle$ (i.e., perform averages over the spike statistics of both input and output given the rates) and then average over time. By definition of the rate of an inhomogeneous Poisson process, the expected input is $\langle S_i^{\rm in}\rangle = \lambda_i^{\rm in}$. Taking advantage of equations 4.6 and 4.7, we find that the expected output rate $\langle S^{\rm out}\rangle = \langle\lambda_{S^{\rm in}}^{\rm out}\rangle = \lambda^{\rm out}$, which is the rate contribution of all synapses to an output spike at time $t$, is given by

$$\langle S^{\rm out}\rangle(t) = \lambda^{\rm out}(t) = \lambda_0 + \frac{c_0}{N}\sum_{i=1}^{N} w_i(t)\int ds\,\epsilon(s)\,\lambda_i^{\rm in}(t - s), \qquad (4.8)$$

where the final term is a convolution of $\epsilon$ with the input rate.
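A minimal simulation of the Poisson neuron confirms the expected output rate of equation 4.8 for constant input rates. All parameters are invented for the example ($N = 50$ synapses at 20 Hz, unit weights, exponential EPSP kernel):

```python
# Discrete-time simulation of the piecewise-linear Poisson neuron,
# equations 4.6 and 4.7, with invented parameters.

import numpy as np

rng = np.random.default_rng(1)
N, lam_in, lam0, c0 = 50, 20.0, 2.0, 1.0   # rates in Hz, unit weights
dt, T, tau_eps = 2e-4, 50.0, 5e-3          # step, duration, EPSP tau (seconds)
steps = int(T / dt)

# Total input spike count per bin, summed over the N Poisson inputs.
counts = rng.poisson(N * lam_in * dt, size=steps)
draws = rng.random(steps)

u, decay, n_out = 0.0, float(np.exp(-dt / tau_eps)), 0
for k in range(steps):
    u = u * decay + counts[k] / (N * tau_eps)  # eq. 4.6, exponential eps
    rate = max(0.0, lam0 + c0 * u)             # eq. 4.7, rectified
    if draws[k] < rate * dt:                   # draw an output spike
        n_out += 1

rate_sim = n_out / T
rate_pred = lam0 + c0 * lam_in                 # eq. 4.8: unit weights, int(eps)=1
```

With unit weights and $\int\epsilon = 1$, equation 4.8 predicts $\lambda^{\rm out} = \lambda_0 + c_0\lambda^{\rm in} = 22$ Hz here; the simulated rate agrees to within discretization and sampling error.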
Next we consider correlations between input and output, $\overline{\langle S_i^{\rm in}(t)\,S^{\rm out}(t-s)\rangle}$, that is, the expectation of finding an input spike at synapse $i$ at time $t$ and an output spike at $t - s$. Since this expectation is defined as a joint probability density, it equals the probability density $\lambda_i^{\rm in}(t)$ for an input spike at time $t$ times the conditional probability density $\langle S^{\rm out}(t-s)\rangle_{\{i,t\}}$ of observing an output spike at time $t - s$ given the input spike at synapse $i$ at $t$; that is, $\langle S_i^{\rm in}(t)\,S^{\rm out}(t-s)\rangle = \lambda_i^{\rm in}(t)\,\langle S^{\rm out}(t-s)\rangle_{\{i,t\}}$. Within the framework of the linear Poisson neuron, the term $\langle S^{\rm out}(t-s)\rangle_{\{i,t\}}$ equals the sum of the expected output rate $\lambda^{\rm out}(t-s)$ in equation 4.8 and the specific contribution $c_0 N^{-1} w_i\,\epsilon(-s)$ of a single input spike at synapse $i$ (cf. equation 4.7). To summarize, we get (Kempter et al., 1999a)

$$\langle S_i^{\rm in}(t)\,S^{\rm out}(t-s)\rangle = \lambda_i^{\rm in}(t)\left[\frac{c_0}{N}\,w_i(t)\,\epsilon(-s) + \lambda^{\rm out}(t-s)\right]. \qquad (4.9)$$
Due to causality of $\epsilon$, the first term on the right, inside the square brackets of equation 4.9, must vanish for $s > 0$.

4.4.3 Temporal Average. In order to evaluate the correlation term, equation 4.4, we need to take the temporal average of equation 4.9, that is,

$$\mathrm{Corr}_i[s, w(t)] = \lambda_i^{\rm in}\,\frac{c_0}{N}\,w_i(t)\,\epsilon(-s) + \overline{\lambda_i^{\rm in}(t)\,\lambda^{\rm out}(t-s)}. \qquad (4.10)$$
For excitatory synapses, the first term gives for $s < 0$ a positive contribution to the correlation function, as it should be (see Figure 2b). We recall that $s < 0$ means that a presynaptic spike precedes postsynaptic firing. The second term on the right in equation 4.10 describes the correlations between the input and output rates. These rates and the correlations between them can, in principle, vary on an arbitrarily fast timescale (for modeling papers, see Gerstner, Kempter, et al., 1996; Xie & Seung, 2000). We assume, however, for reasons of transparency that the mean input rates $\lambda_i^{\rm in}(t) = \lambda_i^{\rm in}$ are constant for all $i$ (see also the input scenarios in section 2). The mean output rate $\bar\lambda^{\rm out}(t)$ and the weights $w_i(t)$ may vary on the slow timescale of learning.

4.4.4 Learning Equation. Substituting equations 2.5, 4.8, and 4.10 into 4.5, we find

$$\tau_w\,\frac{d}{dt}\,w_i(t) = a_1^{\rm in}\lambda_i^{\rm in} + a_1^{\rm out}\bar\lambda^{\rm out}(t) + \lambda_i^{\rm in}\cdot\bar\lambda^{\rm out}(t)\int_{-\infty}^{\infty} ds\,W(s) + \frac{c_0}{N}\sum_{j=1}^{N} w_j(t)\left[Q_{ij}\,\lambda_i^{\rm in}\lambda_j^{\rm in} + \delta_{ij}\,\lambda_i^{\rm in}\int_{-\infty}^{0} ds\,W(s)\,\epsilon(-s)\right], \qquad (4.11)$$
where

$$Q_{ij} := \int_{-\infty}^{\infty} ds\,W(s)\int_{0}^{\infty} ds'\,\epsilon(s')\,C_{ij}(s + s'). \qquad (4.12)$$
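Equation 4.12 can be evaluated by direct numerical quadrature. The kernels below (two-phase window with negative integral, exponential EPSP, Gaussian input correlations) are invented for illustration; they produce a situation of the kind discussed in the appendix, where the filtered correlation is positive although $\int ds\,W(s) < 0$:

```python
# Numerical quadrature of Q_ij (equation 4.12) for one synapse pair,
# with illustrative kernels W, eps, and C.

import math

def W(s):
    """Learning window with negative integral (arbitrary units)."""
    return math.exp(s / 10.0) if s < 0 else -0.5 * math.exp(-s / 25.0)

def eps(s):
    """Causal exponential EPSP kernel with unit area, tau = 5 ms."""
    return math.exp(-s / 5.0) / 5.0 if s >= 0 else 0.0

def C(t):
    """Symmetric input correlation function, width ~4 ms."""
    return 0.3 * math.exp(-t * t / 32.0)

ds = 0.1
s_grid = [-150.0 + k * ds for k in range(int(300.0 / ds))]
sp_grid = [k * ds for k in range(int(60.0 / ds))]

Q = sum(W(s) * sum(eps(sp) * C(s + sp) for sp in sp_grid) * ds
        for s in s_grid) * ds
W_int = sum(W(s) for s in s_grid) * ds

# The (W, eps)-filtered correlation is positive although the window
# integral (the rate-model a2_corr) is negative:
assert W_int < 0 < Q
```

The causal EPSP shifts the correlation mass toward negative $s$, into the potentiation lobe of $W$, which is why $Q$ can be positive for an "anti-Hebbian" window integral.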
The important factor in equation 4.12 is the input correlation function $C_{ij}$ (as it appears in equation 2.5), (low-pass) filtered by the learning window $W$ and the postsynaptic potential $\epsilon$. For all input ensembles with $C_{ij}(t) = C_{ij}(-t)$ (viz. temporal symmetry), we have $Q_{ij} = Q_{ji}$. Hence, the matrix $(Q_{ij})$ can be diagonalized and has real eigenvalues. In particular, this remark applies to the static-pattern scenario and the translation-invariant scenario introduced in section 2. It does not apply, for example, to sequence learning; cf. Herz et al. (1988, 1989), van Hemmen et al. (1990), Gerstner et al. (1993), Abbott and Blum (1996), and Gerstner and Abbott (1997). To get some preliminary insight into the nature of solutions to equation 4.11, let us suppose that all inputs have the same mean, $\lambda_i^{\rm in} = \lambda^{\rm in}$. As we will see, for a broad parameter range, the mean output rate will be attracted toward a fixed point so that the first three terms on the right-hand side of equation 4.11 almost cancel each other. The dynamics of pattern formation in the weights $w_i$ is then dominated by the largest positive eigenvalue of the matrix $[Q_{ij} + \delta_{ij}\,(\lambda^{\rm in})^{-1}\int ds\,W(s)\,\epsilon(-s)]$. The eigenvectors of this matrix are identical to those of the matrix $(Q_{ij})$. The eigenvalues of $(Q_{ij})$ can be positive even though $a_2^{\rm corr} = \int ds\,W(s)$ is negative. A simple example is given in the appendix. To summarize this section, we have solved the dynamics of spike-time-dependent plasticity defined by equation 4.2 for the Poisson neuron, a (piecewise) linear spiking neuron model. In particular, we have succeeded in evaluating the spike-spike correlations between presynaptic input and postsynaptic firing.

4.5 Stabilization in Spike-Based Learning. We are aiming at a stabilization of the output rate in analogy to equations 3.7 and 3.8. To arrive at a differential equation for $\bar\lambda^{\rm out}$, we multiply equation 4.11 by $\lambda_i^{\rm in} c_0/N$ and sum over $i$. Similarly to equation 3.5, we assume that
$$Q := \frac{1}{N}\sum_{i=1}^{N}\left(\lambda_i^{\rm in}\right)^2 Q_{ij} \qquad (4.13)$$
is independent of the index $j$. To keep the arguments simple, we assume in addition that all inputs have the same mean, $\lambda_i^{\rm in} = \lambda^{\rm in}$. For the translation-invariant scenario, both assumptions can be verified. With these simplifications, we arrive at a differential equation for the mean output rate $\bar\lambda^{\rm out}$,
which is identical to equation 3.6, that is, $\tau\,\frac{d}{dt}\,\bar\lambda^{\rm out} = \lambda_{\rm FP}^{\rm out} - \bar\lambda^{\rm out}$, with fixed point

$$\lambda_{\rm FP}^{\rm out} = \frac{\tau\,c_0}{\tau_w}\left[a_1^{\rm in}\left(\lambda^{\rm in}\right)^2 - \lambda_0\,(Q + b)\right] \qquad (4.14)$$

and time constant

$$\tau = -\frac{\tau_w}{c_0}\left[a_1^{\rm out}\lambda^{\rm in} + \left(\lambda^{\rm in}\right)^2\int_{-\infty}^{\infty} ds\,W(s) + Q + b\right]^{-1}, \qquad (4.15)$$

where

$$b := \frac{\lambda^{\rm in}}{N}\int_{-\infty}^{0} ds\,W(s)\,\epsilon(-s) \qquad (4.16)$$
is the contribution that is due to the spike-spike correlations between pre- and postsynaptic neuron. A comparison with equations 3.7 and 3.8 shows that we may set $a_0 = 0$ and $a_2^{\rm corr} = \int ds\,W(s)$. The average correlation is to be replaced by

$$C = \frac{Q + b}{\left(\lambda^{\rm in}\right)^2\int ds\,W(s)}\,. \qquad (4.17)$$
All previous arguments regarding intrinsic normalization apply. The only difference is the new definition of the factor $C$ in equation 4.17 as compared to equation 3.5. This difference is, however, important. First, the sign of $\int ds\,W(s)$ enters the definition of $C$ in equation 4.17. Furthermore, an additional term $b$ appears. It is due to the extra spike-spike correlations between presynaptic input and postsynaptic firing. Note that $b$ is of order $N^{-1}$. Thus, it is small compared to the first term in equation 4.17 whenever the number of nonzero synapses is large (Kempter et al., 1999a). If we neglect the second term ($b = 0$) and if we set $\epsilon(s) = \delta(s)$ and $W(s) = a_2^{\rm corr}\,\delta(s)$, equation 4.17 is identical to equation 3.5. On the other hand, $b$ is of order $\lambda^{\rm in}$ in the input rates, whereas $Q$ is of order $(\lambda^{\rm in})^2$. Thus, $b$ becomes important whenever the mean input rates are low. As shown in the appendix, $b$ is, for a realistic set of parameters, of the same order of magnitude as $Q$. Let us now study the intrinsic normalization properties for two combinations of parameters.

Example 3 (positive integral). We consider the case $\int_{-\infty}^{+\infty} ds\,W(s) > 0$ and assume that correlations in the input on the timescale of the learning window are positive so that $Q > 0$. Furthermore, for excitatory synapses, it is
natural to assume $b > 0$. In order to guarantee the stability of the fixed point, we must require $\tau > 0$, which yields the condition

$$a_1^{\rm out} < -\lambda^{\rm in}\int_{-\infty}^{\infty} ds\,W(s) - \frac{Q + b}{\lambda^{\rm in}} < 0. \qquad (4.18)$$
Hence, if $a_1^{\rm out}$ is sufficiently negative, output rate stabilization is possible even if the integral over the learning window has a positive sign. In other words, for a positive integral over the learning window, a non-Hebbian “linear” term is necessary.

Example 4 (no linear terms).
In standard Hebbian learning, all linear terms vanish. Let us therefore set $a_1^{\rm in} = a_1^{\rm out} = 0$. As in example 3, we assume $Q + b > 0$. To guarantee the stability of the fixed point, we must now require

$$\int_{-\infty}^{\infty} ds\,W(s) < -\frac{Q + b}{\left(\lambda^{\rm in}\right)^2} < 0. \qquad (4.19)$$
Thus, a stable fixed point of the output rate is possible if the integral over the learning window is sufficiently negative. The fixed point is

$$\lambda_{\rm FP}^{\rm out} = \lambda_0\,\frac{Q + b}{Q + b + \left(\lambda^{\rm in}\right)^2\int ds\,W(s)}\,. \qquad (4.20)$$
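The sign conditions of examples 3 and 4 follow directly from equation 4.15 and can be checked with plain arithmetic; all numbers below are invented:

```python
# Stability checks for examples 3 and 4 via the time constant of eq. 4.15
# (tau_w and c0 are invented; only the sign of tau matters).

def tau(a1_out, W_int, Q, b, lam_in, tau_w=100.0, c0=1.0):
    bracket = a1_out * lam_in + lam_in ** 2 * W_int + Q + b
    return -tau_w / (c0 * bracket)

Q, b, lam_in = 2.0, 0.5, 5.0

# Example 4 (no linear terms): stable iff W_int < -(Q + b)/lam_in^2 = -0.1.
assert tau(0.0, -0.2, Q, b, lam_in) > 0    # integral sufficiently negative
assert tau(0.0, 0.1, Q, b, lam_in) < 0     # positive integral: unstable

# Example 3: a positive integral is rescued by a sufficiently negative
# a1_out < -lam_in*W_int - (Q + b)/lam_in = -1 (equation 4.18).
assert tau(-2.0, 0.1, Q, b, lam_in) > 0

# Fixed point of equation 4.20 (needs lam0 < 0, cf. Figure 1):
lam0, W_int = -4.0, -0.2
lam_fp = lam0 * (Q + b) / (Q + b + lam_in ** 2 * W_int)
assert lam_fp > 0
```

With these numbers the denominator of equation 4.20 is negative, so a negative $\lambda_0$ yields a positive, reachable fixed point.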
Note that the denominator is negative because of equation 4.19. For a fixed point to exist, we need $\lambda_{\rm FP}^{\rm out} > 0$; hence, $\lambda_0 < 0$ (cf. Figure 1). We emphasize that in contrast to example 1, all weights at the fixed point can be positive, as should be the case for excitatory synapses. To summarize the two examples, a stabilization of the output rate is possible for a positive integral with a non-Hebbian term $a_1^{\rm out} < 0$, or for a negative integral and vanishing non-Hebbian terms.

4.6 Spiking Model with Refractoriness. In this section, we generalize the results to a more realistic model of a neuron: the spike response model (SRM) (Gerstner & van Hemmen, 1992; Gerstner, van Hemmen, & Cowan, 1996; Gerstner, 2000). Two aspects change with respect to the Poisson neuron. First, we include arbitrary refractoriness into the description of the neuron in equation 4.6. In contrast to a Poisson process, events in disjoint intervals are then not independent. Second, we replace the linear firing intensity in equation 4.7 by a nonlinear one (and linearize only later on).
Let us start with the neuronal membrane potential. As before, each input spike evokes an excitatory postsynaptic potential described by a response kernel $\epsilon(s)$ with $\int_0^{\infty} ds\,\epsilon(s) = 1$. After each output spike, the neuron undergoes a phase of refractoriness, which is described by a further response kernel $\eta$. The total membrane potential is

$$u(t \mid \hat t) = \eta(t - \hat t) + \frac{1}{N}\sum_{i=1}^{N} w_i(t)\sum_f \epsilon(t - t_i^f), \qquad (4.21)$$

where $\hat t$ is the last output spike of the postsynaptic neuron. As an example, let us consider the refractory kernel

$$\eta(s) = -\eta_0\,\exp\!\left(-\frac{s}{\tau_\eta}\right) H(s), \qquad (4.22)$$
where $\eta_0 = -\eta(0) > 0$ and $\tau_\eta > 0$ are parameters and $H(s)$ is the Heaviside step function, that is, $H(s) = 0$ for $s \le 0$ and $H(s) = 1$ for $s > 0$. With this definition of $\eta$, the SRM is related to the standard integrate-and-fire model. A difference is that in the integrate-and-fire model, the voltage is reset after each firing to a fixed value, whereas in equations 4.21 and 4.22, the membrane potential is reset at time $t = \hat t$ by an amount $-\eta_0 - \eta(\hat t - t')$, which depends on the time $t'$ of the previous spike. In analogy to equation 4.7, the probability of spike firing depends on the momentary value of the membrane potential. More precisely, the stochastic intensity of spike firing is a (nonlinear) function of $u(t \mid \hat t)$,

$$\lambda^{\rm out}(t \mid \hat t) = f[u(t \mid \hat t)], \qquad (4.23)$$
with $f(u) \to 0$ for $u \to -\infty$. For example, we may take $f(u) = \lambda_0\exp[c_0(u - \vartheta)/\lambda_0]$ with parameters $\lambda_0, c_0, \vartheta > 0$. For $c_0 \to \infty$, spike firing becomes a threshold process and occurs whenever $u$ reaches the threshold $\vartheta$ from below. For $\vartheta = 0$ and $|c_0 u/\lambda_0| \ll 1$, we are led back to the linear model of equation 4.7. A systematic study of escape functions $f$ can be found in Plesser and Gerstner (2000). We now turn to the calculation of the correlation function $\mathrm{Corr}_i$ as defined in equation 4.4. More precisely, we are interested in finding the generalization of equation 4.10 (for Poisson neurons) to spiking neurons with refractoriness. Due to the dependence of the membrane potential in equation 4.21 upon the last firing time $\hat t$, the first term on the right-hand side of equation 4.10 will become more complicated. Intuitively, we have to average over the firing times $\hat t$ of the postsynaptic neuron. The correct averaging can be found from the theory of population dynamics studied in Gerstner (2000). To keep the formulas as simple as possible, we consider constant input rates $\lambda_i^{\rm in}(t) \equiv \lambda^{\rm in}$ for all $i$. If many input spikes arrive on average
within the postsynaptic integration time set by the kernel $\epsilon$, the membrane potential fluctuations are small, and we can linearize the dynamics about the mean trajectory of the membrane potential:
$$\tilde u(t\,|\,\hat t) = \eta(t - \hat t) + \lambda^{\mathrm{in}} \frac{1}{N} \sum_{i=1}^{N} w_i. \qquad (4.24)$$
To linear order in the membrane potential fluctuations, the correlations are (Gerstner, 2000)
$$\mathrm{Corr}_i(s, w) = \lambda^{\mathrm{in}} P_i(-s) + \lambda^{\mathrm{in}} \lambda^{\mathrm{out}}, \qquad (4.25)$$
where
$$P_i(s) = \lambda^{\mathrm{out}} \frac{w_i}{N} \frac{d}{ds} \int_0^\infty dx\, L(x)\, \epsilon(s - x) + \int_0^s d\hat t\, f[u(s\,|\,\hat t)]\, S(s\,|\,\hat t)\, P_i(\hat t). \qquad (4.26)$$
The survivor function $S$ is defined as
$$S(s\,|\,\hat t) = \exp\!\left[ -\int_{\hat t}^{s} dt'\, f[u(t'\,|\,\hat t)] \right], \qquad (4.27)$$
the mean output rate is
$$\lambda^{\mathrm{out}}(t) = \left[ \int_0^\infty dt\, S(t\,|\,0) \right]^{-1}, \qquad (4.28)$$
and the "filter" $L$ is
$$L(x) = \int_x^\infty d\xi\, \frac{df}{du}[u(\xi\,|\,x)]\, S(\xi\,|\,0). \qquad (4.29)$$
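A rough numerical sketch of how equations 4.22, 4.23, 4.27, and 4.28 fit together (all parameter values here are invented for illustration): integrate the escape rate along the mean trajectory after a spike at $\hat t = 0$ to obtain the survivor function, then invert the integral of the survivor function to obtain the output rate.

```python
import math

eta0, tau_eta = 2.0, 5.0          # refractory kernel parameters (assumed)
lam0, c0, theta = 1.0, 2.0, 0.5   # escape-rate parameters (assumed)
u_mean = 1.0                      # lambda_in * mean weight (assumed constant)

def eta(s):
    """Refractory kernel of equation 4.22 (H(s) taken as 1 for s >= 0 here)."""
    return -eta0 * math.exp(-s / tau_eta) if s >= 0 else 0.0

def f(u):
    """Exponential escape rate, equation 4.23."""
    return lam0 * math.exp(c0 * (u - theta) / lam0)

def output_rate(dt=0.001, t_max=200.0):
    """lambda_out = [ integral_0^inf S(t|0) dt ]^(-1), equations 4.27-4.28."""
    integral_f = 0.0   # running integral of f[u(t'|0)], exponent of S
    integral_S = 0.0   # running integral of the survivor function
    t = 0.0
    while t < t_max:
        integral_f += f(u_mean + eta(t)) * dt
        integral_S += math.exp(-integral_f) * dt
        t += dt
    return 1.0 / integral_S

print(output_rate())
```

Immediately after a spike the potential is lowered by $\eta_0$, so the firing intensity is small; it recovers toward $f(\bar u)$ with time constant $\tau_\eta$, and the resulting mean rate lies below the refractoriness-free value $f(\bar u)$.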
In equations 4.26 through 4.29, the membrane potential $u$ is given by the mean trajectory $\tilde u$ in equation 4.24. Equations 4.25 and 4.26 are our main result. All considerations regarding normalization as discussed in the preceding subsection are valid with the function $\mathrm{Corr}_i$ defined in equations 4.25 and 4.26 and an output rate $\lambda^{\mathrm{out}}$ defined in equation 4.28. In passing, we note that equation 4.28 defines the gain function of the corresponding rate model, $\lambda^{\mathrm{out}} = g[\lambda^{\mathrm{in}} N^{-1} \sum_i w_i]$. Let us now discuss equation 4.26 in more detail. The first term on its right-hand side describes the immediate influence of the EPSP $\epsilon$ on the firing probability of the postsynaptic neuron. The second term describes the reverberation of the primary effect that occurs one interspike interval later. If the interspike-interval distribution is broad, the second term on the right is small and may be neglected (Gerstner, 2000). In order to understand the relation of equations 4.25 and 4.26 to equation 4.10, let us study two examples.

Example 5 (no refractoriness).
We want to show that in the limit of no refractoriness, the spiking neuron model becomes identical with a Poisson model. If we neglect refractoriness ($\eta \equiv 0$), the mean membrane potential in equation 4.24 is $\tilde u = \lambda^{\mathrm{in}} N^{-1} \sum_i w_i$. The output rate is $\lambda^{\mathrm{out}} = f(\tilde u)$ (cf. equation 4.23), and we have $S(s\,|\,0) = \exp[-\lambda^{\mathrm{out}} s]$ (see equation 4.27). In equation 4.29, we set $c_0 = df/du$ evaluated at $\tilde u$ and find $P_i(s) = \epsilon(s)\, w_i c_0 / N$. Hence, equation 4.25 reduces to equation 4.10, as it should.

Example 6 (high- and low-noise limit).
If we take an exponential escape rate $f(u) = \lambda_0 \exp[c_0 (u - \vartheta)/\lambda_0]$, the high-noise limit is found for $|c_0 (u - \vartheta)/\lambda_0| \ll 1$ and the low-noise limit for $|c_0 (u - \vartheta)/\lambda_0| \gg 1$. In the high-noise limit, it can be shown that the filter $L$ is relatively broad. More precisely, it starts with an initial value $L(0) > 0$ and decays slowly, $L(x) \to 0$ for $x \to \infty$ (Gerstner, 2000). Since the decay is slow, we set $(d/dx) L(x) \approx 0$ for $x > 0$. This yields $P_i(s) = w_i N^{-1} \lambda^{\mathrm{out}} L(0)\, \epsilon(s) + \int_0^s d\hat t\, f[u(s\,|\,\hat t)]\, S(s\,|\,\hat t)\, P_i(\hat t)$. Thus, the correlation function $\mathrm{Corr}_i$ contains a term proportional to $\epsilon(-s)$, as in the linear model. In the low-noise limit ($c_0 \to \infty$), we retrieve the threshold process, and the filter $L$ approaches a $\delta$-function, $L(x) = d\, \delta(x)$ with some constant $d > 0$ (Gerstner, 2000). In this case, the correlation function $\mathrm{Corr}_i$ contains a term proportional to the derivative of the EPSP, $d\epsilon(s)/ds$. To summarize this section, we have shown that it is possible to calculate the correlation function $\mathrm{Corr}_i(s, w)$ for a spiking neuron model with refractoriness. Once we have obtained the correlation function, we can use it in the learning dynamics, equation 4.5. We have seen that the correlation function depends on the noise level. This theoretical result is in agreement with experimental measurements on motoneurons (Fetz & Gustafsson, 1983; Poliakov, Powers, Sawczuk, & Binder, 1996). It was found that for high noise levels, the correlation function contained a peak with a time course roughly proportional to the postsynaptic potential. For low noise, however, the time course of the peak was similar to the derivative of the postsynaptic potential. Thus, the (piecewise-linear) Poisson neuron introduced in section 4.4 is a valid model in the high-noise limit.
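The Poisson limit of example 5 can be checked numerically: with $\eta \equiv 0$ the survivor function is a pure exponential, and equation 4.28 simply returns $f(\tilde u)$ itself (a minimal sketch; parameter values are illustrative).

```python
import math

lam0, c0, theta = 1.0, 2.0, 0.0
u_tilde = 0.3                      # mean membrane potential (assumed)

# Without refractoriness the firing intensity is constant in time.
f_u = lam0 * math.exp(c0 * (u_tilde - theta) / lam0)

# Equation 4.28: integrate the survivor function S(t|0) = exp(-f_u * t)
# and invert; for a Poisson process this returns f_u itself.
dt, t, integral_S = 0.001, 0.0, 0.0
while t < 50.0:
    integral_S += math.exp(-f_u * t) * dt
    t += dt
lambda_out = 1.0 / integral_S

print(f_u, lambda_out)   # the two values should agree
```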
Intrinsic Stabilization of Output Rates
2731
5 Discussion
In this article we have compared learning rules at the level of spikes with those at the level of rates. The learning rules can be applied to both linear and nonlinear neuron models. We discuss our results in the context of the existing experimental and theoretical literature.

5.1 Experimental Results. Rate normalization has been found in experiments performed by Turrigiano, Leslie, Desai, Rutherford, and Nelson (1998). They blocked GABA-mediated inhibition in a cortical culture, which initially raised activity. After about two days, the firing rate returned close to the control value, and at the same time, all synaptic strengths decreased. Conversely, a pharmacologically induced firing-rate reduction led to synaptic strengthening. Again, the output rate was normalized. Turrigiano et al. (1998) suggest that rate normalization is achieved by multiplying all weights by the same factor. In our ansatz, however, weights are normalized subtractively, that is, by addition or subtraction of a fixed amount from all weights. The discrepancy between the experiment and our model can be resolved by assuming, beyond upper and lower bounds for individual weights, that the learning coefficients themselves depend on the weights too. The extended ansatz exceeds, however, the scope of this article. For additional mechanisms contributing to the activity-dependent stabilization of firing rates, we refer to Desai, Rutherford, and Turrigiano (1999), Turrigiano and Nelson (2000), and Turrigiano (1999). In our model, several scenarios for intrinsic rate normalization are possible, depending on the choice of parameters for the Hebbian and non-Hebbian terms. Let us discuss each of them in turn.
5.1.1 Sign of Correlation Term. Rate normalization in our model is most easily achieved if the integral of the learning window $W$ is negative, even though this is not necessary (see examples 1 and 4). On the basis of the neurophysiological data known at present (Markram et al., 1997; Zhang et al., 1998; Debanne et al., 1998; Bi & Poo, 1998; Feldman, 2000), the sign of the integral over the learning window cannot be decided (see also the reviews by Linden, 1999; Bi & Poo, 2001). We have shown that a negative integral corresponds, in terms of rate coding, to a negative coefficient $a_2^{\mathrm{corr}}$ and hence to an anti-Hebbian rule. Nevertheless, the learning rule can pick up positive correlations if the input contains temporal correlations $C_{ij}$ on the timescale of the learning window. An example is given in the appendix.

5.1.2 Role of Non-Hebbian Terms. We saw in examples 1 and 4 that the linear terms in the learning rule are not necessary for rate normalization. Nevertheless, we saw from examples 2 and 3 that a negative coefficient $a_1^{\mathrm{out}}$ helps. A value $a_1^{\mathrm{out}} < 0$ means that in the absence of presynaptic activation, postsynaptic spiking alone induces heterosynaptic long-term depression.
2732
R. Kempter, W. Gerstner, and J. L. van Hemmen
This has been seen, for example, in hippocampal slices (Pockett, Brookes, & Bindman, 1990; Christofi, Nowicky, Bolsover, & Bindman, 1993; see also the reviews of Artola & Singer, 1993; Brown & Chattarji, 1994; Linden, 1999). For cortical slices, both $a_1^{\mathrm{out}} > 0$ and $a_1^{\mathrm{out}} < 0$ have been reported (Volgushev, Voronin, Chistiakova, & Singer, 1994). On the other hand, a positive effect of presynaptic spikes on the synapses in the absence of postsynaptic spiking ($a_1^{\mathrm{in}} > 0$) helps to keep the fixed point $\lambda^{\mathrm{out}}_{\mathrm{FP}}$ in the range of positive rates, $\lambda^{\mathrm{out}}_{\mathrm{FP}} > 0$ (see equation 3.7). Some experiments indeed suggest that presynaptic activity alone results in homosynaptic long-term potentiation ($a_1^{\mathrm{in}} > 0$) (Bliss & Collingridge, 1993; Urban & Barrionuevo, 1996; Bell et al., 1997; see also Brown & Chattarji, 1994).

5.1.3 Zero-Order Term. As we have seen from equation 3.7, spontaneous weight growth $a_0 > 0$ helps keep the fixed point of the rate in the positive range. Turrigiano et al. (1998) chronically blocked cortical culture activity and found an increase of synaptic weights, supporting $a_0 > 0$.

5.2 Spike-Time-Dependent Learning Window. In Figure 2a we indicated a learning window with two parts: a positive part (potentiation) and a negative one (depression). Such a learning window is in accordance with experimental results (Markram et al., 1997; Zhang et al., 1998; Debanne et al., 1998; Bi & Poo, 1998; Feldman, 2000). We may ask, however, whether there are theoretical reasons for this time dependence of the learning window. For excitatory synapses, a presynaptic input spike that precedes postsynaptic firing may be the cause of the postsynaptic activity or, at least, "takes part in firing it" (Hebb, 1949). Thus, a literal understanding of the Hebb rule suggests that for excitatory synapses, the learning window $W(s)$ is positive for $s < 0$, if $s$ is the difference between the times of occurrence of an input and an output spike.
(Recall that $s < 0$ means that a presynaptic spike precedes postsynaptic spiking; see also Figure 2b.) In fact, a positive learning phase (potentiation) for $s < 0$ has been found to be an important ingredient in models of sequence learning (Herz et al., 1988, 1989; van Hemmen et al., 1990; Gerstner et al., 1993; Abbott & Blum, 1996; Gerstner & Abbott, 1997). If the positive phase is followed by a negative phase (depression) for $s > 0$, as in Figure 2a, the learning rule acts as a temporal contrast filter enhancing the detection of temporal structure in the input (Gerstner, Kempter, et al., 1996; Kempter et al., 1996, 1999a; Song et al., 2000; Xie & Seung, 2000). A two-phase learning rule with both potentiation and depression was, to a certain degree, a theoretical prediction. The advantages of such a learning rule were recognized in models (Gerstner, Kempter, et al., 1996; Kempter et al., 1996) even before experimental results on the millisecond timescale became available (Markram et al., 1997; Zhang et al., 1998; Debanne et al., 1998; Bi & Poo, 1998; Feldman, 2000).
5.3 Structural Stability of Output Rates. We have seen that synaptic plasticity with an appropriate combination of parameters pushes the output rate $\lambda^{\mathrm{out}}$ toward a fixed point. To explicitly show the existence and stability of the fixed point, we had to require that the factor $C$ defined by equation 3.5 is independent of the synapse index $j$. The standard example that we have used throughout the article is an input scenario that has the same mean rate at each synapse ($\lambda_i^{\mathrm{in}} = \lambda^{\mathrm{in}}$ for all $i$) and is translation invariant, $C_{ij} = C_{|i-j|}$. In this case, rate normalization is equivalent to a stable fixed point of the mean weight $N^{-1} \sum_i w_i$ (see section 3.2). We may wonder what happens if equation 3.5 is not exactly true but holds only approximately. To answer this question, let us write the linear model of equation 4.11 in vector notation, $dw/dt = k_1 e_1 + M w$, with some matrix $M$ and the unit vector $e_1 = N^{-1/2}(1, 1, \ldots, 1)^T$. A stable fixed point of the mean weight implies that $w = e_1$ is an eigenvector of $M$ with negative eigenvalue. If such a negative eigenvalue $\lambda_M < 0$ exists under the condition 3.5, then, due to (local) continuity of solutions of the differential equation 4.11 in dependence on parameters, the eigenvalue will stay negative in some regime where equation 3.5 holds only approximately. The eigenvector corresponding to this negative eigenvalue will be close to but not identical with $e_1$. In other words, the output rate remains stable, but the weight vector $w$ is no longer exactly normalized. The normalization properties of our learning rule are thus akin to, but not identical with, subtractive normalization (Miller & MacKay, 1994; Miller, 1996b). From our point of view, the basic property of such a rule is a stabilization of the output rate. The (approximate) normalization of the mean weight $N^{-1} \sum_i w_i = c$ with some constant $c$ is a consequence.
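The continuity argument can be illustrated numerically. The sketch below uses a made-up matrix $M$ (not one derived in the text): with equal negative row sums, $e_1$ is an eigenvector of $M$ with a negative eigenvalue, and a small perturbation of $M$ leaves that eigenvalue negative and moves the eigenvector only slightly away from $e_1$.

```python
import numpy as np

# Sketch of dw/dt = k1*e1 + M w with an invented circulant-like M whose
# row sums are all equal and negative, mimicking condition 3.5.
rng = np.random.default_rng(0)
N = 50

row = rng.uniform(0.0, 0.1, size=N)
M = np.empty((N, N))
for i in range(N):
    M[i] = np.roll(row, i)                 # translation-invariant structure
M -= np.eye(N) * (row.sum() + 0.5)         # make every row sum equal -0.5

e1 = np.ones(N) / np.sqrt(N)
assert np.allclose(M @ e1, -0.5 * e1)      # e1 is eigenvector, eigenvalue -0.5

# Perturb M so that condition 3.5 holds only approximately.
M_pert = M + rng.normal(scale=1e-3, size=(N, N))
vals, vecs = np.linalg.eig(M_pert)
k = np.argmin(np.abs(vals - (-0.5)))       # eigenvalue closest to -0.5
v = np.real(vecs[:, k])
v *= np.sign(v @ e1)
print(np.real(vals[k]), abs(v @ e1))       # still negative, overlap near 1
```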
The normalization is exact if $C$ in equation 3.5 is independent of the synapse index $j$ and the mean input rates $\lambda_j^{\mathrm{in}}$ are the same at all synapses. Finally, the learning rule 1.1 with constant coefficients is typically unstable because the weights receiving the strongest enhancement grow without bounds at the expense of synapses receiving less or no enhancement. Unlimited growth can be avoided by explicitly introducing upper and lower bounds for individual weights. As a consequence, most of the weights saturate at these bounds. Saturated weights no longer participate in the learning dynamics, and the stabilization arguments of sections 3 and 4 can be applied to the set of remaining weights. For certain models, it can be shown explicitly that the fixed point of the output rate remains unchanged compared to the case without bounds (Kempter et al., 1999a). We simply have to check that the fixed point lies within the parameter range allowed by the bounds on the weights.

5.4 Principal Components and Covariance. Whereas the output rate converges to a fixed point so that the mean weight adapts to a constant value $N^{-1} \sum_i w_i = c$, some synapses may grow and others decay. It is this synapse-specific change that leads to learning in its proper sense (in contrast to mere adaptation). Spike-based learning is dominated by the eigenvector of $(Q_{ij})$ with the largest eigenvalue (Kempter et al., 1999a, 1999b). Thus, learning "detects" the principal component of the matrix $(Q_{ij})$ defined in equation 4.12. A few remarks are in order. First, we emphasize that for spike-time-dependent learning, a clear-cut distinction between Hebbian and anti-Hebbian rules is difficult. Since $(Q_{ij})$ is sensitive to the covariance of the input on the timescale of the learning window, the same rule can be considered as anti-Hebbian for one type of stimulus and as Hebbian for another (see the appendix). Second, in contrast to Oja's rule (Oja, 1982), it is not necessary to require that the mean input $\lambda_i^{\mathrm{in}}$ vanishes at each synapse. Intrinsic rate normalization implies that the mean input level does not play a role. After an initial adaptation phase, neuronal plasticity has automatically "subtracted" the mean input and becomes sensitive to the covariance of the input (see also the discussion around equations 3.2 and 3.9). Thus, after convergence to the fixed point, the learning rule becomes equivalent to Sejnowski's covariance rule (Sejnowski, 1977; Sejnowski & Tesauro, 1989; Stanton & Sejnowski, 1989).

5.5 Subthreshold Regime and Coincidence Detection. We have previously exploited the normalization properties of learning rules with a negative integral of the learning window in a study of the barn owl auditory system (Gerstner, Kempter, et al., 1996, 1997). Learning led to a structured delay distribution with submillisecond time resolution, albeit the width of the learning window was in the millisecond range. Intrinsic normalization of the output rate was used to keep the postsynaptic neuron in the subthreshold regime, where the neuron functions as a coincidence detector (Kempter et al., 1998). Song et al.
(2000) used the same mechanism to keep the neuron in the subthreshold regime where the neuron may show large output fluctuations. In this regime, rate stabilization induces stabilization of the coefficient of variation (CV) over a broad input regime. Owing to the stabilization of output rates, we speculate that a coherent picture of neural signal transmission may emerge, in which each neuron in a chain of several processing layers is equally active once averaged over the complete stimulus ensemble. In that case, a set of normalized input rates $\lambda_i^{\mathrm{in}}$ in one "layer" is transformed, on average, into a set of equal rates in the next layer. Ideally, the target value of the rates to which the plasticity rule converges would be in the regime where the neuron is most sensitive: the subthreshold regime. This is the regime where signal transmission is maximal (Kempter et al., 1998). The plasticity rule could thus be an important ingredient for optimizing the neuronal transmission properties (Brenner, Agam, Bialek, & de Ruyter van Steveninck, 1998; Kempter et al., 1998; Stemmler & Koch, 1999; Song et al., 2000).
Appendix
We illustrate the static-pattern scenario for the case of a learning window with negative integral. At each time step of length $\Delta t$, we present a pattern vector $x^\mu$ chosen stochastically from some database $\{x^\mu;\ 1 \le \mu \le p\}$. The input rates are
$$\lambda_i^{\mathrm{in}}(t) = x_i^\mu \quad \text{for } \mu\,\Delta t < t < (\mu + 1)\,\Delta t. \qquad \text{(A.1)}$$
The "static" correlation between patterns is defined to be
$$C_{ij}^{\mathrm{static}} = \frac{1}{p} \sum_{\mu=1}^{p} \frac{(x_i^\mu - \lambda_i^{\mathrm{in}})(x_j^\mu - \lambda_j^{\mathrm{in}})}{\lambda_i^{\mathrm{in}} \cdot \lambda_j^{\mathrm{in}}}, \qquad \text{(A.2)}$$
where $\lambda_i^{\mathrm{in}} := p^{-1} \sum_{\mu=1}^{p} x_i^\mu$ now denotes the average over all input patterns. We assume that there is no intrinsic order in the sequence of presentation of patterns. The "time-dependent" correlation in the input is then (see Figure 3)
$$C_{ij}(s) = \begin{cases} (1 - |s|/\Delta t)\, C_{ij}^{\mathrm{static}} & \text{for } -\Delta t \le s \le \Delta t, \\ 0 & \text{else.} \end{cases}$$
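A small sketch of equations A.1 and A.2 with a randomly generated pattern database (all numbers are invented for illustration): the static correlation is the normalized covariance over the database, and the triangular time-dependent correlation vanishes beyond one presentation interval.

```python
import random

random.seed(1)
p, N = 200, 5                        # number of patterns, number of synapses
x = [[random.uniform(5.0, 15.0) for _ in range(N)] for _ in range(p)]

# Mean input rate per synapse, averaged over all patterns (eq. below A.2).
lam_in = [sum(x[mu][i] for mu in range(p)) / p for i in range(N)]

def C_static(i, j):
    """Equation A.2: normalized covariance over the pattern database."""
    return sum((x[mu][i] - lam_in[i]) * (x[mu][j] - lam_in[j])
               for mu in range(p)) / (p * lam_in[i] * lam_in[j])

def C(i, j, s, dt=1.0):
    """Triangular time-dependent correlation, zero for |s| > Delta_t."""
    return (1.0 - abs(s) / dt) * C_static(i, j) if abs(s) <= dt else 0.0

print(C_static(0, 0), C(0, 0, 0.0), C(0, 0, 2.0))
```

At $s = 0$ the time-dependent correlation equals the static one, and it is symmetric in $i$ and $j$, as used for the Hermiticity of $(Q_{ij})$ below.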
Let us consider a postsynaptic potential $\epsilon(s) = \delta(s - \Delta t)$; that is, the form of the potential is approximated by a delayed localized pulse, a (Dirac) delta function. As a learning window we take (see Figure 3)
$$W(s) = \begin{cases} 1/\Delta t & \text{for } -2\Delta t < s \le 0, \\ -1/\Delta t & \text{for } 0 < s \le 3\Delta t, \end{cases}$$
Figure 3: (a) Example of a learning window W and (b) a correlation function Cij . Both are plotted as a function of the time difference s.
with $\int ds\, W(s) = -1 =: a_2^{\mathrm{corr}}$. Hence, the learning rule could be classified as "anti-Hebbian." On the other hand,
$$Q_{ij} = \int ds\, W(s) \int ds'\, \epsilon(s')\, C_{ij}(s + s') = C_{ij}^{\mathrm{static}} \qquad \text{(A.3)}$$
gives the static correlations. Since $C_{ij} = C_{ji}$, also $Q_{ij} = Q_{ji}$ holds, and the matrix $(Q_{ij})$ is Hermitian. Furthermore, $\sum_{ij} C_{ij}^{\mathrm{static}} c_i^* c_j > 0$ and $\int ds\, W(s)\,(1 - |s + \Delta t|/\Delta t) > 0$, so that $\sum_{ij} Q_{ij} c_i^* c_j > 0$ for all vectors $c = (c_i) \ne 0$. Hence, $(Q_{ij})$ is a positive-definite matrix whose eigenvalues are positive. The dynamics of learning is dominated by the eigenvector of $(C_{ij}^{\mathrm{static}})$ with the largest eigenvalue, just as in standard Hebbian learning (Hertz et al., 1991). We note that, in contrast to standard Hebbian learning, we need not impose $\lambda_i^{\mathrm{in}} = 0$; the spike-time-dependent learning rule automatically subtracts the mean. Let us now assume that both $\lambda_j^{\mathrm{in}}$ and $Q := (\lambda^{\mathrm{in}})^2 N^{-1} \sum_i Q_{ij}$ are independent of $j$ (see also section 4.5). We want to compare the value of $Q$ with that of $b = N^{-1} \lambda^{\mathrm{in}} \int_{-\infty}^{0} ds\, W(s)\, \epsilon(-s)$ in equation 4.17. We find
$$\frac{Q}{b} = N \lambda^{\mathrm{in}} \Delta t \left[ \frac{1}{N} \sum_{i=1}^{N} C_{ij}^{\mathrm{static}} \right]. \qquad \text{(A.4)}$$
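The order of magnitude of this ratio can be checked directly with the representative numbers quoted in the text ($\Delta t = 10$ ms, $\lambda^{\mathrm{in}} = 10$ Hz, $N = 1000$, 10% of channels correlated at $C^{\mathrm{static}} = 0.1$):

```python
# Order-of-magnitude check of equation A.4 with the numbers used in the text.
dt = 0.010            # learning-window width Delta_t in seconds
lam_in = 10.0         # mean input rate in Hz
N = 1000              # number of synapses
mean_C = 0.10 * 0.1   # N^-1 * sum_i C_ij: 10% of channels at 0.1, rest zero

Q_over_b = N * lam_in * dt * mean_C
print(Q_over_b)
```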
Let us substitute some numbers so as to estimate the order of magnitude of this fraction. The learning window has a duration (width) of, say, the order of $\Delta t = 10$ ms. The mean rate, averaged over the whole stimulus ensemble, is about $\lambda^{\mathrm{in}} = 10$ Hz. A typical number of synapses is $N = 1000$. Let us assume that each input channel is correlated with 10% of the others with a value of $C_{ij}^{\mathrm{static}} = 0.1$ and uncorrelated with the remaining 90%. Thus, $N^{-1} \sum_i C_{ij}^{\mathrm{static}} = 0.01$. This yields $Q/b = 1$. Hence, $b$ has the same order of magnitude as $Q$. For a lower value of the mean rate, $Q/b$ will decrease; for a larger number of synapses, it will increase.

Acknowledgments
We thank Werner Kistler for a critical reading of a first version of the manuscript for this article. R. K. has been supported by the Deutsche Forschungsgemeinschaft (FG Hörobjekte and Ke 788/1-1).

References

Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cereb. Cortex, 6, 406–416.
Abbott, L. F., & Munro, P. (1999). Neural information processing systems (NIPS). Workshop on spike timing and synaptic plasticity, Breckenridge, CO. Available on-line at: http://www.pitt.edu/~pwm/LTP LTD 99/.
Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends Neurosci., 16, 480–487.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, G.-q., & Poo, M.-m. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G.-q., & Poo, M.-m. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Bi, G.-q., & Poo, M.-m. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annu. Rev. Neurosci., 24, 139–166.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Brenner, N., Agam, O., Bialek, W., & de Ruyter van Steveninck, R. R. (1998). Universal statistical behavior of neuronal spike trains. Phys. Rev. Lett., 81(18), 4000–4003.
Brown, T. H., & Chattarji, S. (1994). Hebbian synaptic plasticity: Evolution of the contemporary concept. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 287–314). New York: Springer-Verlag.
Christofi, G., Nowicky, A. V., Bolsover, S. R., & Bindman, L. J. (1993). The postsynaptic induction of nonassociative long-term depression of excitatory synaptic transmission in rat hippocampal slices. J. Neurophysiol., 69, 219–229.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat. Neurosci., 2, 515–520.
Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nat. Neurosci., 2, 1098–1105.
Eurich, C. W., Pawelzik, K., Ernst, U., Cowan, J. D., & Milton, J. G. (1999). Dynamics of self-organized delay adaption. Phys. Rev. Lett., 82, 1594–1597.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fetz, E. E., & Gustafsson, B. (1983). Relation between shapes of postsynaptic potentials and changes in the firing probability of cat motoneurons. J. Physiol., 341, 387–410.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states and locking. Neural Comput., 12, 43–89.
Gerstner, W., & Abbott, L. F. (1997). Learning navigational maps through potentiation and modulation of hippocampal place cells. J. Comput. Neurosci., 4, 79–94.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking. Neural Comput., 8, 1689–1712.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1997). A developmental learning rule for coincidence tuning in the barn owl auditory system. In J. Bower (Ed.), Computational neuroscience: Trends in research 1997 (pp. 665–669). New York: Plenum Press.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1998). Hebbian learning of pulse timing in the barn owl auditory system. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 353–377). Cambridge, MA: MIT Press.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Häfliger, P., Mahowald, M., & Watts, L. (1997). A spike based learning neuron in analog VLSI. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 692–698). Cambridge, MA: MIT Press.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Herz, A. V. M., Sulzer, B., Kühn, R., & van Hemmen, J. L. (1988). The Hebb rule: Representation of static and dynamic objects in neural nets. Europhys. Lett., 7, 663–669.
Herz, A. V. M., Sulzer, B., Kühn, R., & van Hemmen, J. L. (1989). Hebbian learning reconsidered: Representation of static and dynamic objects in associative neural nets. Biol. Cybern., 60, 457–467.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999a). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999b). Spike-based compared to rate-based Hebbian learning. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 125–131). Cambridge, MA: MIT Press.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1996). Temporal coding in the sub-millisecond range: Model of barn owl auditory pathway. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 124–130). Cambridge, MA: MIT Press.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Comput., 10, 1987–2017.
Kempter, R., Leibold, C., Wagner, H., & van Hemmen, J. L. (2001). Formation of temporal-feature maps by axonal propagation of synaptic learning. Proc. Natl. Acad. Sci. U.S.A., 98, 4166–4171.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Comput., 12, 385–405.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Levy, W. B., & Stewart, D. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neurosci., 8, 791–797.
Linden, D. J. (1999). The return of the spike: Postsynaptic action potentials and the induction of LTP and LTD. Neuron, 22, 661–666.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. USA, 83, 7508–7512.
MacKay, D. J. C., & Miller, K. D. (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1, 257–297.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Miller, K. D. (1996a). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 55–78). New York: Springer-Verlag.
Miller, K. D. (1996b). Synaptic economics: Competition and cooperation in correlation-based synaptic plasticity. Neuron, 17, 371–374.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comput., 6, 100–126.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.
Paulsen, O., & Sejnowski, T. J. (2000). Natural patterns of activity and long-term synaptic plasticity. Curr. Opin. Neurobiol., 10, 172–179.
Plesser, H. E., & Gerstner, W. (2000). Noise in integrate-and-fire models: From stochastic input to escape rates. Neural Comput., 12, 367–384.
Pockett, S., Brookes, N. H., & Bindman, L. J. (1990). Long-term depression at synapses in slices of rat hippocampus can be induced by bursts of postsynaptic activity. Exp. Brain Res., 80, 196–200.
Poliakov, A. V., Powers, R. K., Sawczuk, A., & Binder, M. D. (1996). Effects of background noise on the response of rat and cat motoneurons to excitatory current transients. J. Physiol., 495, 143–157.
Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7, 235–246.
Ruf, B., & Schmitt, M. (1997). Unsupervised learning in networks of spiking neurons using temporal coding. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Proc. 7th Int. Conf. Artificial Neural Networks (ICANN'97) (pp. 361–366). Heidelberg: Springer-Verlag.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Sejnowski, T. J., & Tesauro, G. (1989). The Hebb rule for synaptic plasticity: Algorithms and implementations. In J. H. Byrne & W. O. Berry (Eds.), Neural models of plasticity: Experimental and theoretical approaches (pp. 94–103). San Diego: Academic Press.
Senn, W., Tsodyks, M., & Markram, H. (1997). An algorithm for synaptic modification based on exact timing of pre- and postsynaptic action potentials. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Proc. 7th Int. Conf. Artificial Neural Networks (ICANN'97) (pp. 121–126). Heidelberg: Springer-Verlag.
Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comput., 13, 35–67.
Shouval, H. Z., & Perrone, M. P. (1995). Post-Hebbian learning rules. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 745–748). Cambridge, MA: MIT Press.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.
Stanton, P. K., & Sejnowski, T. J. (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339, 215–218.
Stemmler, M., & Koch, C. (1999). How voltage-dependent conductances can adapt to maximize the information encoded by neurons. Nat. Neurosci., 2, 521–527.
Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends Neurosci., 22, 221–227.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
Turrigiano, G. G., & Nelson, S. B. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10, 358–364.
Urban, N. N., & Barrionuevo, G. (1996). Induction of Hebbian and non-Hebbian mossy fiber long-term potentiation by distinct patterns of high-frequency stimulation. J. Neurosci., 16, 4293–4299.
van Hemmen, J. L. (2000). Theory of synaptic plasticity. In F. Moss & S. Gielen (Eds.), Handbook of biological physics, Vol. 4: Neuro-informatics, neural modelling (pp. 749–801). Amsterdam: Elsevier.
van Hemmen, J. L., Gerstner, W., Herz, A., Kühn, R., Sulzer, B., & Vaas, M. (1990). Encoding and decoding of patterns which are correlated in space and time. In G. Dorffner (Ed.), Konnektionismus in Artificial Intelligence und Kognitionsforschung (pp. 153–162). Berlin: Springer-Verlag.
Volgushev, M., Voronin, L. L., Chistiakova, M., & Singer, W. (1994). Induction of LTP and LTD in visual cortex neurons by intracellular tetanization. NeuroReport, 5, 2069–2072.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Wimbauer, S., Gerstner, W., & van Hemmen, J. L. (1994). Emergence of spatiotemporal receptive elds and its application to motion detection. Biol. Cybern., 72, 81–92. Wiskott, L., & Sejnowski, T. (1998).Constrained optimization for neural map formation: A unifying framework for weight growth and normalization. Neural Comput., 10, 671–716. Xie, X., & Seung, S. H. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K.-R. Muller ¨ (Eds.), Advances in neural informationprocessingsystems,12 (pp. 199–205). Cambridge, MA: MIT Press. Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M.-m. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44. Received January 4, 2000; accepted March 1, 2001.
LETTER
Communicated by John Lisman
Information Transfer Between Rhythmically Coupled Networks: Reading the Hippocampal Phase Code

Ole Jensen
[email protected]
Brain Research Unit, Low Temperature Laboratory, Helsinki University of Technology, FIN-02015 Espoo, Finland

There are numerous reports on rhythmic coupling between separate brain networks. It has been proposed that this rhythmic coupling indicates exchange of information. So far, few computational models have been proposed that explore this principle and its potential computational benefits. Recent results on hippocampal place cells of the rat provide new insight: it has been shown that information about space is encoded by the firing of place cells with respect to the phase of the ongoing theta rhythm. This principle is termed phase coding and suggests that upcoming locations (predicted by the hippocampus) are encoded by cells firing late in the theta cycle, whereas current location is encoded by early firing in the theta cycle. A network reading the hippocampal output must inevitably also receive an oscillatory theta input in order to decipher the phase-coded firing patterns. In this article, I propose a simple physiologically plausible mechanism, implemented as an oscillatory network, that can decode the hippocampal output. By changing only the phase of the theta input to the decoder, qualitatively different information is transferred: the theta phase determines whether representations of current or upcoming locations are read by the decoder. The proposed mechanism provides a computational principle for information transfer between oscillatory networks and might generalize to brain networks beyond the hippocampal region.

1 Introduction
Neural Computation 13, 2743–2761 (2001) © 2001 Massachusetts Institute of Technology

Occasional frequency locking is often observed between two or more brain regions (for an extensive review, see Varela, Lachaux, Rodriguez, & Martinerie, 2001). This coherent oscillatory activity could indicate exchange and manipulation of information between networks. For instance, whisker twitching in the rat synchronizes the neural activity in the brain stem, thalamus, and primary somatosensory cortex at a 7–12 Hz rhythm (Nicolelis, Baccala, Lin, & Chapin, 1995). The olfactory system provides numerous examples of rhythmic coupling. In the cat, synchronized oscillations in the 35–50 Hz band were measured between the entorhinal cortex, posterior piriform cortex, and olfactory bulb during odor sampling (Boeijinga & Lopes da Silva, 1989). In the salamander, the presentation of odors elicits 10–20 Hz synchronized oscillations in the olfactory epithelium and olfactory bulb (Dorries & Kauer, 2000). When rats are performing demanding olfactory memory tasks (reversal learning), the hippocampal theta rhythm occasionally couples to the sniffing rhythm (Macrides, Eichenbaum, & Forbes, 1982). In humans, long-range oscillatory synchrony has been reported: an electroencephalogram (EEG) study demonstrated coherence in the 4–7 Hz band between frontal and posterior regions in a working memory task (Sarnthein, Petsche, Rappelsberger, Shaw, & von Stein, 1998). Beyond the work of Ahissar, Haidarliu, and Zacksenhouse (1997) and Ahissar, Sosnik, and Haidarliu (2000) on somatosensory processing, little theoretical work has been done to explore the potential computational advantages of information exchange between oscillatory brain networks.

The rat hippocampus and the regions to which it projects provide an excellent system for exploring theoretical ideas on oscillatory coupled networks. Of particular interest are the firing properties of hippocampal place cells (O'Keefe & Dostrovsky, 1971; Olton, Branch, & Best, 1978), which are strongly modulated by the theta rhythm: as a rat enters a place field, place cells fire late in the theta cycle. As the rat passes through the place field, the phase of firing advances systematically (O'Keefe & Recce, 1993; Skaggs, McNaughton, Wilson, & Barnes, 1996; Shen, Barnes, McNaughton, Skaggs, & Weaver, 1997). This effect is termed the theta phase precession or phase advance. Recently it was demonstrated that when reconstructing a rat's location from an ensemble of place cells, the error of reconstruction is significantly reduced when the theta phase of firing is taken into account (Jensen & Lisman, 2000). This result supports the case that information is encoded by the theta phase of firing: phase coding.
A network receiving the hippocampal firing pattern, for example, the entorhinal and cingulate cortex, can by phase decoding extract information beyond what is encoded in the firing rates. The decoding network must necessarily also receive information about the theta rhythm. Hence, if one accepts the principle of phase coding, there might be an advantage to oscillatory coupling between networks. To explore this idea, I will suggest a physiologically plausible oscillatory network that can read the hippocampal phase code. Section 2 is devoted to an improved implementation of the Jensen and Lisman oscillatory network model (Jensen & Lisman, 1996a), which, by sequence readout, can account for the theta phase precession and thereby the generation of phase-coded information (the phase encoder). The output produced by this network is then used to test the network receiving the hippocampal phase code (the phase decoder). By numerical simulations, I will demonstrate how qualitatively different information is transferred from one network to the other, depending on only the phase lag between the networks. I then discuss the brain regions that might perform the phase decoding and different principles according to which the phase decoding might work. The model framework results in a set of testable predictions.
2 Methods
The model presented in this article consists of two networks: one representing the hippocampal phase encoder (CA3) and the other the phase decoder. I describe each network individually. A diagram of the network model is shown in Figure 1.

2.1 The Hippocampal Phase Encoder. This network is an improved version of a previously proposed network model (Jensen & Lisman, 1996a),
Figure 1: A schematic illustration of the oscillatory networks responsible for performing the phase encoding and decoding. The pyramidal neurons of the CA3 region (left) receive informational input from the perforant path or the mossy fibers. The synchronized activity of the inhibitory network is represented by a single interneuron providing local GABAergic feedback. The theta drive is imposed externally on all pyramidal neurons. Sequence representations are encoded by the synapses of the recurrent CA3 collaterals. The pyramidal cells of the phase decoder (right) receive excitatory input from a subset of the pyramidal neurons and the theta input.
which can account for the phase precession by repeated sequence readout. In the original network, the model neurons were implemented without a membrane capacitance. This has been improved by implementing each neuron with a membrane capacitance so that they function as leaky integrate-and-fire units (Koch & Segev, 1998). The advantage of integrate-and-fire units is that they are computationally inexpensive, since the sodium, potassium, and calcium currents producing a spike are not modeled explicitly. The simplicity of the integrate-and-fire units is mainly justified by the inhibitory feedback from the interneurons. This feedback hyperpolarizes the cells in the network well below firing threshold after a subset of pyramidal neurons has fired. Thus, the detailed membrane dynamics following an action potential are of less importance. The representation of the interneuronal network responsible for generating the GABAergic feedback has also been improved: the interneuronal network is represented by an inhibitory integrate-and-fire unit rather than just as a function of the spike times of the pyramidal neurons. In the original model, each location was represented by only a single cell (Jensen & Lisman, 1996a). To simulate a location represented by the simultaneous firing of a group of neurons, five cells firing in synchrony now represent each location. The full hippocampal network has 45 pyramidal neurons and one inhibitory interneuron. The membrane potential of each pyramidal neuron is modeled by the equation:
$$-C\,\frac{dV}{dt} = I_L + I_{AHP} + I_{\theta,\mathrm{hippo}} + I_{\mathrm{syn,GABA_A}} + I_{\mathrm{syn,AMPA}} + I_{\mathrm{syn,RC}} + I_{\mathrm{noise}}. \tag{2.1}$$
The membrane potential is reset to $V_{\mathrm{rest}} = -65$ mV when it reaches the threshold $V_{\mathrm{threshold}} = -55$ mV. This event represents a spike or a short burst. The membrane capacitance is $C = 0.5\,\mu\mathrm{F/cm^2}$. The leakage current is defined as $I_L = g_L(V - E_L)$ with the reversal potential $E_L = -65$ mV and conductance $g_L = 0.03\,\mathrm{mS/cm^2}$. The conductance of the slow afterhyperpolarization (AHP) current is modeled by a decreasing exponential function, $g_{AHP}(t) = g^0_{AHP}\exp(-(t - t_{\mathrm{spike}})/\tau_{AHP})$, where $t_{\mathrm{spike}}$ is the time when the cell spikes, $g^0_{AHP} = 0.06\,\mathrm{mS/cm^2}$, and $\tau_{AHP} = 40$ ms. The AHP current is defined as $I_{AHP} = g_{AHP}(t)(V - E_{AHP})$, where $E_{AHP} = -70$ mV. The slow AHP current makes it less likely for a pyramidal cell to fire multiple times within a theta cycle. The septal input to the hippocampus that provides the drive at theta frequency is modeled as $I_{\theta,\mathrm{hippo}} = I^0_\theta \cos(2\pi f_\theta t)$, where the theta frequency is $f_\theta = 7$ Hz and $I^0_\theta = 0.18\,\mu\mathrm{A/cm^2}$.
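The membrane dynamics above can be sketched as a simple forward-Euler update. This is my own minimal illustration, not the author's code: the helper name `step`, the precomputed lumped synaptic current `i_syn`, and the omission of the noise term are assumptions; the sign convention follows equation 2.1 as printed.

```python
import math

# Sketch of one pyramidal cell from equation 2.1, integrated with forward
# Euler (dt = 0.1 ms, as in section 2.2). Parameter values are from the text;
# all synaptic and recurrent currents are passed in as one precomputed value,
# and the noise term is omitted for clarity.
C = 0.5                        # membrane capacitance, uF/cm^2
G_L, E_L = 0.03, -65.0         # leak conductance (mS/cm^2) and reversal (mV)
G0_AHP, TAU_AHP, E_AHP = 0.06, 40.0, -70.0
I0_THETA, F_THETA = 0.18, 7.0  # theta drive amplitude (uA/cm^2), frequency (Hz)
V_REST, V_THRESH = -65.0, -55.0

def step(v, t, t_spike, i_syn, dt=0.1):
    """Advance the membrane potential by dt (t in ms); return (v, spiked)."""
    g_ahp = G0_AHP * math.exp(-(t - t_spike) / TAU_AHP) if t_spike is not None else 0.0
    i_l = G_L * (v - E_L)
    i_ahp = g_ahp * (v - E_AHP)
    i_theta = I0_THETA * math.cos(2.0 * math.pi * F_THETA * t / 1000.0)
    # Equation 2.1: -C dV/dt = I_L + I_AHP + I_theta + I_syn (+ noise, omitted)
    v += -(i_l + i_ahp + i_theta + i_syn) / C * dt
    if v >= V_THRESH:
        return V_REST, True   # reset; the event stands for a spike or short burst
    return v, False
```

On a spike, the caller would record `t` as the new `t_spike` so that the AHP conductance decays from that moment.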
The informational input to the CA3 (from the perforant path or the mossy fibers) excites the pyramidal neurons synaptically by AMPA-receptor-mediated currents. The AMPA receptor conductances are modeled by alpha functions, $g_{AMPA}(t) = g^0_{AMPA}\,\frac{t - t_{\mathrm{input}}}{\tau_{AMPA}}\exp\left(-\frac{t - t_{\mathrm{input}}}{\tau_{AMPA}}\right)$, and the excitatory postsynaptic currents (EPSCs) are $I_{AMPA} = g_{AMPA}(t)(V - E_{AMPA})$. The constants are $\tau_{AMPA} = 3$ ms, $E_{AMPA} = 0$ mV, and $g^0_{AMPA} = 0.13\,\mathrm{mS/cm^2}$.

The conductance of the GABAergic input is implemented by an alpha function, $g_{GABA}(t) = g^0_{GABA}\,\frac{t - t_{\mathrm{spike,inh}}}{\tau_{GABA}}\exp\left(-\frac{t - t_{\mathrm{spike,inh}}}{\tau_{GABA}}\right)$, where $t_{\mathrm{spike,inh}}$ denotes the time of spiking of the neuron representing the interneuronal network. The inhibitory postsynaptic current (IPSC) is expressed as $I_{GABA} = g_{GABA}(t)(V - E_{GABA})$, where $\tau_{GABA} = 5$ ms, $E_{GABA} = -70$ mV, and $g^0_{GABA} = 0.15\,\mathrm{mS/cm^2}$.

The neuron representing the inhibitory network is modeled by a leaky integrate-and-fire unit:

$$-C\,\frac{dV}{dt} = I_L + I_{AHP} + I_{\mathrm{syn,AMPA}} + I_{\mathrm{noise}}. \tag{2.2}$$
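Both synaptic conductances above share the same alpha-function shape, differing only in amplitude, time constant, and reversal potential. A small sketch under the parameter values quoted in the text (the function names are mine):

```python
import math

def alpha_conductance(g0, tau, t, t_event):
    """Alpha-function conductance g(t) = g0 * ((t - t_event)/tau) * exp(-(t - t_event)/tau).
    Zero before the triggering event; peaks at g0/e when t - t_event == tau."""
    s = t - t_event
    if s < 0.0:
        return 0.0
    return g0 * (s / tau) * math.exp(-s / tau)

def i_ampa(v, t, t_input, g0=0.13, tau=3.0, e_rev=0.0):
    """AMPA EPSC, I = g(t) * (V - E_AMPA); constants from section 2.1."""
    return alpha_conductance(g0, tau, t, t_input) * (v - e_rev)

def i_gaba(v, t, t_spike_inh, g0=0.15, tau=5.0, e_rev=-70.0):
    """GABA_A IPSC, I = g(t) * (V - E_GABA); t_spike_inh is the interneuron spike time."""
    return alpha_conductance(g0, tau, t, t_spike_inh) * (v - e_rev)
```

The same helper serves the decoder's AMPA input in section 2.2 with a different amplitude.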
The parameters that differ from those of the excitatory neurons (see equation 2.1) are $C = 0.25\,\mu\mathrm{F/cm^2}$, $g^0_{AHP} = 0.60\,\mathrm{mS/cm^2}$, and $\tau_{AHP} = 5$ ms. All the pyramidal neurons are connected to the inhibitory neuron with AMPA synapses, $g^0_{AMPA} = 0.026\,\mathrm{mS/cm^2}$. With these settings, the interneuron will fire after four to five pyramidal cells fire synchronously. For a justification of the time constants and reversal potentials, see Jensen, Idiart, and Lisman (1996) and Jensen and Lisman (1996b).

The excitatory feedback of the CA3 recurrent collaterals produces the sequence readout creating the phase code. Since the excitatory feedback has to outlast at least the interval of the inhibitory feedback (one gamma cycle, 20–30 ms), it was suggested that NMDA receptors with a decay time of approximately 100 ms mediate the excitation (Jensen & Lisman, 1996a, 1996b). This hypothesis has become less plausible in the light of recent results demonstrating that place fields of well-learned environments are relatively unaffected by the NMDA antagonist CPP (Kentros et al., 1998). Several alternatives exist. One possibility is that the firing of CA3 pyramidal cells is sustained by reciprocal connections between the dentate and CA3 during each gamma cycle (Lisman, 1999). Another solution is that kainate receptors mediate the slow excitatory feedback. Kainate receptors are found postsynaptically at CA3 neurons and have a decay time of 50–100 ms (Castillo, Malenka, & Nicoll, 1997; Yamamoto, Sawada, & Ohno-Shosaku, 1998). Although most research has concentrated on kainate receptors at mossy fiber terminals, it is possible that they are also involved in mediating recurrent CA3 excitation. Further research is required to identify the details of the recurrent excitation creating the sequence readout. In this work, I assume that kainate receptors are responsible. Thus, the time-dependent conductance of the synapses of the CA3 recurrent collaterals (RC) at cell $i$ from cell $j$ (spiking at time $t^j_{\mathrm{spike}}$) is modeled by a double exponential function:

$$g^{ij}_{RC}(t) = g^0_{RC}\, W^{ij}_{CA3}\, \exp\left(-\frac{t - t^j_{\mathrm{spike}}}{\tau_{RC,\mathrm{decay}}}\right)\left(1 - \exp\left(-\frac{t - t^j_{\mathrm{spike}}}{\tau_{RC,\mathrm{rise}}}\right)\right), \tag{2.3}$$

where $W^{ij}_{CA3}$ is the synaptic weight from cell $j$ to cell $i$ and the common scaling constant is $g^0_{RC} = 0.0061\,\mathrm{mS/cm^2}$. The rise and decay times of the kainate receptors at room temperature have been estimated to be $\tau_{\mathrm{rise}} \approx 7$ ms and $\tau_{\mathrm{decay}} = 60$–$100$ ms, respectively. Since the kinetics are faster at body temperature, the constants are set to $\tau_{RC,\mathrm{rise}} = 5$ ms and $\tau_{RC,\mathrm{decay}} = 40$ ms. The model is fairly robust with respect to variations of these time constants. The total current to cell $i$ from the recurrent collaterals is $I_{\mathrm{syn,RC}} = \sum_j g^{ij}_{RC}(t)(V_i - E_{RC})$, where $E_{RC} = 0$ mV. The sequence of locations is encoded in the asymmetric weight matrix $W^{ij}_{CA3}$. The synaptic strengths from representation $N$ (cells $j = 1 + 5(N-1)$ to $5N$) to representation $N + K$ (cells $i = 1 + 5(N + K - 1)$ to $5(N + K)$) are $W^{ij}_{CA3} = \exp(-KT_c/\tau_{RC,\mathrm{decay}})(1 - \exp(-KT_c/\tau_{RC,\mathrm{rise}}))$. The time $T_c = 30$ ms is approximately the period of a gamma cycle. As a result, representation A ($N = 1$) is connected most strongly to representation B ($N = 2$), with weaker connections to representation C ($N = 3$), and so on. It has previously been demonstrated how an asymmetric synaptic weight matrix could emerge in a model with NMDA-dependent long-term potentiation (LTP) (Blum & Abbott, 1996; Jensen & Lisman, 1996b).
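The weight rule above can be made concrete. A sketch under my own assumptions: 0-based indexing, and weights of zero for $K \le 0$ (no backward or within-representation connections), which the text implies but does not state explicitly.

```python
import math

N_REPS, CELLS_PER_REP = 9, 5     # nine locations, five cells each (45 cells)
TC = 30.0                        # ms, roughly one gamma period
TAU_DECAY, TAU_RISE = 40.0, 5.0  # tau_RC,decay and tau_RC,rise from the text

def w_ca3(i, j):
    """Weight from cell j to cell i (0-based). Strength depends only on K,
    the number of representations that cell i's group lies ahead of cell j's."""
    k = i // CELLS_PER_REP - j // CELLS_PER_REP
    if k <= 0:
        return 0.0   # assumed: no backward or within-representation connections
    return math.exp(-k * TC / TAU_DECAY) * (1.0 - math.exp(-k * TC / TAU_RISE))

W_CA3 = [[w_ca3(i, j) for j in range(N_REPS * CELLS_PER_REP)]
         for i in range(N_REPS * CELLS_PER_REP)]
```

With these constants the weight falls off with $K$, so representation A drives B most strongly, C more weakly, and so on, as the text describes.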
2.2 The Phase Decoder. This network receives an afferent input from the CA3 network (see Figure 1). The membrane potential of each of the nine cells in the decoder is defined as

$$-C\,\frac{dV}{dt} = I_L + I_{\mathrm{syn,AMPA}} + I_{\theta,\mathrm{decoder}} + I_{\mathrm{noise}}. \tag{2.4}$$
The theta drive takes the form $I_{\theta,\mathrm{decoder}} = I^0_\theta \cos(2\pi f_\theta t + \varphi)$, where $\varphi$ defines the phase lag with respect to the hippocampal theta rhythm. The synaptic AMPA conductances to cell $k$ from hippocampal cell $i$ are defined as

$$g^{ik}_{\mathrm{syn,AMPA}} = W^{ik}_{\mathrm{decoder}}\,\frac{t - t^i_{\mathrm{spike}}}{\tau_{AMPA}} \exp\left(-\frac{t - t^i_{\mathrm{spike}}}{\tau_{AMPA}}\right), \tag{2.5}$$
where the matrix $W^{ik}_{\mathrm{decoder}}$ defines the connectivity from the hippocampus to the decoder. The first five cells in the hippocampal network are connected to the first cell of the decoder: $W^{i1}_{\mathrm{decoder}} = 0.006\,\mathrm{mS/cm^2}$ for $i = 1, \ldots, 5$.
Cells 6 to 10 are connected to the second cell of the decoder, $W^{i2}_{\mathrm{decoder}} = 0.006\,\mathrm{mS/cm^2}$, and so on. For the other connections, $W^{ik}_{\mathrm{decoder}} = 0$. Cells in the decoder will fire when they receive sufficiently strong synaptic input coinciding with a depolarizing theta drive. Constants that differ from those of the pyramidal neurons in the encoder are $I^0_\theta = 0.18\,\mu\mathrm{A/cm^2}$ and $g^0_{AMPA} = 0.006\,\mathrm{mS/cm^2}$. In the simulations, gaussian distributed noise, $I_{\mathrm{noise}}$, with a mean of 0 and a standard deviation of $0.1\,\mu\mathrm{A/cm^2}$ was applied to all the neurons in the two networks. The differential equations defining the networks were numerically integrated with the time step $\Delta t = 0.1$ ms.
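The decoder wiring and its phase-shifted theta drive can be sketched directly from equations 2.4 and 2.5. The 0-based indices and the helper names are my own:

```python
import math

GROUP = 5          # five hippocampal cells per location representation
W0 = 0.006         # connection strength from the text, mS/cm^2
I0_THETA, F_THETA = 0.18, 7.0   # uA/cm^2 and Hz, as in the encoder

def w_decoder(i, k):
    """Connectivity from hippocampal cell i to decoder cell k (0-based):
    cells 0-4 drive decoder cell 0, cells 5-9 drive cell 1, and so on."""
    return W0 if i // GROUP == k else 0.0

def theta_decoder(t_ms, phi_deg):
    """Theta drive to the decoder, I = I0 cos(2 pi f t + phi), with the
    phase lag phi given in degrees relative to the hippocampal rhythm."""
    return I0_THETA * math.cos(2.0 * math.pi * F_THETA * t_ms / 1000.0
                               + math.radians(phi_deg))
```

Varying only `phi_deg` reproduces the manipulation explored in the Results: it shifts which part of the hippocampal theta cycle coincides with the decoder's depolarized phase.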
3 Results
Several groups have suggested that the theta phase precession is produced by repeated readouts of time-compressed sequences (Jensen & Lisman, 1996a; Skaggs et al., 1996; Tsodyks, Skaggs, Sejnowski, & McNaughton, 1996; Wallenstein & Hasselmo, 1997). Figure 2A illustrates this principle. Assume that each location from A to I has a neuronal representation in the hippocampus. These representations are linked synaptically as a sequence in the CA3 region as a result of the temporal asymmetry in NMDA-dependent long-term potentiation (LTP) (Blum & Abbott, 1996; Jensen & Lisman, 1996b). When the rat arrives at location A, the sequence from B to E is recalled within a theta cycle. In the next theta cycle, the rat has advanced to location B, and the sequence from C to F is recalled. When recording from a place cell participating in the representation of location E, one observes a systematic advance in the theta phase of firing as the rat traverses the place field: the theta phase precession. A consequence of this model is that qualitatively different information is encoded in the phases of the theta cycle: predicted upcoming locations are represented by firing in the late phase of the theta cycle, whereas current locations are represented by early-phase firing.

It is problematic to make the sequence models recall the individual representations at a sufficiently slow rate (about seven representations per theta cycle). Jensen and Lisman (1996a, 1997) proposed a solution in which the gamma rhythm (30–80 Hz) clocks the sequence readout: within a theta cycle, each of the recalled representations is separated in time by a hyperpolarizing GABAergic drive from the interneuronal network. This principle is in agreement with experimental work showing that the gamma and theta rhythms coexist in the rat hippocampus with a frequency ratio of about 5 to 10 (Soltesz & Deschenes, 1993; Bragin et al., 1995). Thus, the theta phase of place cell firing advances about one gamma period per theta period.
As shown by Jensen and Lisman (1997), this scheme is in accordance with the experimental values for the phase precession and with the observation that a place cell is active for about 5 to 10 theta cycles as a rat crosses a place field (Skaggs et al., 1996). Another finding consistent with the sequence readout models is that the theta phase of firing advances as a function of location rather
than time (O'Keefe & Recce, 1993). This is a consequence of the sequence recall being probed by the rat's location. Finally, it was demonstrated that if the synaptic strength of the encoded sequences increases with the number of presentations (Jensen & Lisman, 1996a), the sequence model can account for the systematic expansion of place fields after multiple traversals of the same path, as observed experimentally (Mehta, Barnes, & McNaughton, 1997). Since the theta-gamma model is consistent with these experimental findings, it is a qualified candidate for a biophysically realistic mechanism that can produce phase-coded information.

3.1 Simulation of the Hippocampal Phase Encoder. The model of the hippocampal CA3 network (see Figure 1) is an improved version of a previously proposed model (Jensen & Lisman, 1996a). (For details on the improvements and implementation, see section 2.) The network has 45 pyramidal cells. The synchronous firing of 5 cells represents a location, allowing the network to represent up to nine different locations. Only voltage traces of 1 cell per representation are shown in Figure 2A (traces A–I). The last trace illustrates the theta drive. When the rat arrives at location A, the first 5 pyramidal neurons of the CA3 are excited by inputs from the dentate gyrus or by direct input from the entorhinal cortex (trace 1; only one neuron shown). The firing of these 5 cells activates the next representation (B, trace 2) by the CA3 recurrent collaterals. The firing pattern of representation B then activates representation C, and so forth. The inhibition from the interneuron (second to the last trace) serves to keep the representations in the sequence separate in time. Simulations demonstrated that in the presence of noise, the sequences were recalled with few errors. About 5% of the recalled patterns were represented by only 4 instead of 5 cells firing simultaneously. The incomplete representations were due to cells failing to fire or firing a gamma cycle too early or too late. Due to the redundancy of the representations, these errors did not prevent a correct recall of the following representations or cause errors in the phase decoding. Further simulations (not shown) demonstrated that the model can tolerate at least a 20% overlap between neighboring representations. As an example of theta phase precession, consider the voltage trace of a pyramidal cell participating in the representation of pattern E (marked by the arrow). As the rat runs through the linear maze, the firing occurs earlier and earlier in the theta cycle. It is the spatiotemporal firing pattern that carries the phase-coded information.

Figure 2: Facing page. Simulations demonstrating information transfer by the principle of phase encoding and decoding. (A) Hippocampal phase encoding by sequence readout. The top panel illustrates a rat traversing a path from left to right. Neuronal representations of locations A to E are received by the hippocampus as the rat advances. At each location, a time-compressed sequence of upcoming locations is recalled. In the simulations, each location is represented by the synchronous firing of five cells, but only the voltage trace of one cell per representation is shown (traces A–I). The second to the last trace is the voltage trace of the inhibitory interneuron, and the last trace is the hippocampal theta drive. (B) The voltage traces of the cells in the decoder and the theta inputs. The only parameter varied in the simulations is the phase $\varphi$ of the theta drive. In panel 1, the decoder receives an input at 125 degrees, resulting in a transfer of representations from the early phases of the hippocampal theta rhythm, that is, the rat's current location. In panel 2, the decoder receives a theta input in phase (0 degrees) with the hippocampus; thus upcoming near locations are transferred. In panel 3, where the phase is 270 degrees, representations from the late theta phase of the encoder are read out (i.e., upcoming distant locations). In panel 4, the decoder receives a theta input in antiphase with the input of the encoder, and the information transfer is blocked.

3.2 Simulation of the Phase Decoder. The oscillatory network model reading the hippocampal phase code is illustrated in the right-hand portion of Figure 1. The hippocampal output of the CA3 region might pass through several structures, among others the CA1, before arriving at a network that can perform the phase decoding. This work is not directly concerned with the processing in these intermediate regions. (For a discussion of how other hippocampal regions besides the CA3 might be involved in sequence readout, see Lisman, 1999.) Each of the 9 pyramidal cells in the decoder represents one of the locations from A to I. The first cell is synaptically connected to pyramidal cells 1 to 5 (representation A) of the CA3 network. The next cell is synaptically connected to cells 6 to 10 (representation B), and so on. Cells in the decoder fire if they receive sufficiently strong synaptic excitation coinciding with a depolarizing theta drive. The only parameter varied in the numerical simulations, shown in Figure 2B, is the phase of the theta drive.
The top panel represents the phase-coded input from the hippocampus. In panel 1, the phase decoder receives a theta input shifted 125 degrees from the hippocampal theta rhythm. The result is that representations from the early part of the hippocampal theta cycle activate cells in the decoder. For instance, the decoder cell representing location A fires shortly after representation A activates in the CA3 network. By this principle, information about the current location is transferred to the decoder. If the decoder receives a theta input that is in phase with the hippocampal theta drive (panel 2), upcoming nearby locations about two to three theta cycles ahead in time (about 8–13 cm ahead of the rat at a running speed of 30 cm/s) are activated in the decoder. In panel 3, the decoder receives a theta input shifted 270 degrees from the hippocampus. In this case, firing patterns from the very late phase of the hippocampal theta cycle are transferred. Hence, upcoming distant locations about four theta cycles ahead in time (about 17–18 cm ahead of the rat at a running speed of 30 cm/s) are represented in the decoder. If the theta input to the decoder is in antiphase (panel 4) with the hippocampal CA3 input, the cells in the decoder will not fire; no information is transferred. Nor will cells in the decoder fire if the amplitude of the theta input is zero (data not shown). Notice that in panels 2 and 3, the phase decoder makes a few errors. For instance, the cell representing D in panel 2 fires twice. These errors are due to the noise added in the simulations and would be of less importance if the decoding network included more cells per representation. When adjusting the parameters of the neurons in the decoding network, it is crucial that the sum of the synaptic input currents and the maximum current of the theta input be strong enough to bring the pyramidal cells just above firing threshold. When this condition is fulfilled, the model is relatively insensitive to the ratio of the synaptic current to the theta current. For instance, the results in Figure 2B are reproduced when the synaptic input conductance is decreased from $g^0_{AMPA} = 0.006\,\mathrm{mS/cm^2}$ to $0.0034\,\mathrm{mS/cm^2}$ while the theta current is increased from $I^0_\theta = 0.18\,\mu\mathrm{A/cm^2}$ to $0.27\,\mu\mathrm{A/cm^2}$.

The results of the simulations are summarized in the circle map (see Figure 3). The phase values in degrees denote the phase lag between the hippocampal network and the phase decoder. At phase lags of 125 degrees, current locations are decoded. As the phase difference is decreased, more and more distant locations are represented in the decoder. At phase lags of 270 degrees ($= -90$ degrees), the most distant locations are represented. At the portion where the theta drives are in antiphase (broken line), no information is transferred.

Figure 3: Qualitatively different information is decoded depending on the theta phase lag (angle in the circle map) between the hippocampus and the decoding network.
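As a quick consistency check on the look-ahead distances quoted above (my own arithmetic, not part of the paper's methods): with theta at 7 Hz and a running speed of 30 cm/s, each theta cycle of look-ahead corresponds to about 4.3 cm of path.

```python
F_THETA = 7.0    # theta frequency, Hz
SPEED = 30.0     # running speed, cm/s

def lookahead_cm(cycles_ahead):
    """Distance ahead of the rat corresponding to firing that leads
    the current location by `cycles_ahead` theta cycles."""
    return SPEED * cycles_ahead / F_THETA

# 2-3 cycles ahead gives roughly 8.6-12.9 cm (text: "about 8-13 cm");
# 4 cycles ahead gives roughly 17.1 cm (text: "about 17-18 cm").
```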
4 Discussion
I have proposed a simple but biophysically plausible mechanism for information transfer between rhythmically coupled networks. The mechanism was tested on a model of the CA3 region of the hippocampus, which can produce phase-coded spatial information by repeated readout of spatiotemporal firing patterns. By varying the phase of the rhythmic theta drive to the network receiving the phase-coded firing pattern, qualitatively different information is transferred. The phase difference determines whether representations of the current location or predicted upcoming representations are transferred to the decoder. Information transfer is blocked if the theta inputs to the two networks are in antiphase. A network receiving the hippocampal firing pattern, but not the theta input, would only be able to decipher the firing rates and would extract less detailed information. Hence, the rhythmic coupling is computationally advantageous since it allows transfer and decoding of phase-coded representations.

Which firing properties would be expected of the cells in regions performing the phase decoding of the hippocampal firing patterns? The summed activity of a large number of decoding cells, measured by field recordings, for instance, would produce a rhythmic theta signal. This signal would be coherent, but not necessarily in phase, with the hippocampal theta rhythm. The firing of the individual cells would correlate with the theta rhythm but not show phase precession, since the cells in the decoder fire at a fixed phase with respect to the theta rhythm (see Figure 2B). Like the firing of place cells, the firing of the decoding cells would be correlated with the position of the rat. However, the spatial extent of their place fields would be five to seven times smaller than that of the hippocampal place fields (see Figure 2B).

What brain areas could possibly make use of the hippocampal phase code? One candidate is the entorhinal cortex.
This structure has abundant connections both to and from the hippocampal areas through the perforant path. The firing of some entorhinal cells is strongly modulated by a theta rhythm, possibly paced by the medial septum. Recently the existence of an entorhinal theta generator relatively independent of the hippocampal CA3 theta generator was established (Kocsis, Bragin, & Buzsaki, 1999); the phase relationship between the two generators was not reported. The firing of entorhinal cells is correlated with the location of the rat. Contradicting the predictions of the model, the entorhinal place fields are larger than hippocampal place fields (Barnes, McNaughton, Mizumori, Leonard, & Lin, 1990). However, one cannot exclude the possibility that entorhinal place cells with small fields do exist but have not yet been identified. Another possibility is that the phase-coded information passes through the entorhinal cortex to the cingulate cortex. In this area, cells that fire rhythmically at the theta frequency have been identified in both urethane-anesthetized rats (Holsheimer, 1982) and freely moving rats (Borst, Leung, & MacFabe, 1987).
In an in vivo study, the activity of some cingulate cells was shown to be coherent with the hippocampal theta rhythm (Holsheimer, 1982). No consistent phase lag between pairs of coherent cells in the hippocampus and the cingulate cortex was found. As for the hippocampus, the theta activity of the cingulate cortex is thought to be modulated by input from the medial septum (Bland & Oddie, 1998). Interestingly, one study demonstrated cases in which the cingulate theta rhythm persists whereas the hippocampal theta rhythm is abolished following lesions of the medial septum (Borst et al., 1987). Also, the cingulate cortex is well connected to many cortical areas (Vogt & Miller, 1983). Thus, both the entorhinal and the cingulate cortex are candidates for networks that can use the hippocampal phase code. A recent study in which the firing of prefrontal cortical neurons was phase-locked to the hippocampal theta rhythm opens the possibility that the prefrontal cortex makes use of the phase code (Siapas, Lee, Lubenov, & Wilson, 2000). Further research is required to characterize the prefrontal cells with respect to the rat's position and phase precession.

The regions reading the hippocampal phase code can work according to at least two different principles, as illustrated in Figure 4. According to the first principle, different regions of the phase decoder receive theta inputs at different phases. An example is shown in Figure 4A. The left-hand part of the phase decoder receives a theta input at a 125-degree phase relative to that of the hippocampus, corresponding to panel 1 in Figure 2B. The activity in this region will represent information about the current location of the rat. The rightmost region of the phase decoder receives a theta input at 270 degrees, corresponding to panel 3 of Figure 2B. This region represents upcoming distant locations. By this principle, phase-coded firing patterns are spatially segregated into different regions.
The maximum number of regions required to represent the degrees of prediction is determined by the phase resolution of the hippocampal code. Since the theta phase of firing advances about one gamma cycle per theta cycle, it is the number of gamma cycles per theta cycle that determines the degrees of prediction. Thus, a maximum of about 5 to 10 subregions in the decoder is sufficient for extracting the complete information of the phase code. The proposed scheme requires that theta inputs with different phases be available at different subregions of the decoder. The septum could potentially provide drives at different theta phases, since different cells in the septum fire at different phases of the theta rhythm (Brazhnik & Fox, 1997; King, Recce, & O'Keefe, 1998). Thus, the proposed mechanism provides a purpose for septal cells' firing at different theta phases, a finding that is paradoxical if one considers that the only role of the septum is to pace the hippocampal theta rhythm. Interestingly, cells in the supramammillary nucleus have also been found to fire at theta frequency. Like the cells in the septum, the individual neurons have different preferred phases with respect to the hippocampal theta rhythm (Kocsis & Vertes, 1997). This suggests that the supramammillary nucleus and the structures to which it projects could be involved in reading the hippocampal phase code.
2756
Ole Jensen
Figure 4: Two alternative principles for the transfer of phase-decoded information between oscillatory networks. (A) Different regions in the phase decoder receive the theta drive at different phases. This allows the phase decoder to extract qualitatively different information simultaneously. The left-hand part of the phase decoder extracts representations corresponding to the current location, and the right-hand part representations of upcoming distant locations. (B) The phase decoder receives the hippocampal phase-coded information and the theta drive from the septum. The phase of the theta input is adjustable and determines whether the decoder extracts representations of current or of predicted upcoming locations.
Another possibility is that the phase relationship between the hippocampus and the phase decoder changes dynamically (see Figure 4B). This mechanism requires about 5 to 10 times fewer cells than the previously proposed mechanism. When the rat is in a situation where predictions are important, for instance, navigating in a complicated but well-learned maze, the phase difference might be 270 degrees, corresponding to panel 3 of Figure 2B. If the rat is in a situation where no predictions are possible, for instance, foraging in an unknown environment, the phase difference might be 125 degrees, corresponding to panel 1 of Figure 2B. By this principle, qualitatively different information is transferred depending on the relative phase of the theta input to the decoder.

How can the phase relationship between the hippocampus and the decoder change dynamically? Again, the cells firing at different theta phases in the septum provide a possible solution: the drive from septal cells firing at a particular phase could be weighted according to the situation. The model proposed in Figure 4A requires more cells than the one in Figure 4B. Clearly, more experimental work is needed to investigate the phase relation between the hippocampus and the regions potentially performing the phase decoding in order to distinguish between the two mechanisms.

The proposed model for decoding is quite sensitive to the level of excitability; it is important that the sum of the synaptic input current and the current from the theta input is strong enough to bring the membrane potential of the cells just above threshold. However, the model is quite robust with respect to the ratio of theta current to synaptic current. Experimental work has revealed several physiological mechanisms promoting homeostatic stability of neuronal excitability (Bear, 1995; Desai, Rutherford, & Turrigiano, 1999; Turrigiano, 1999).
Through these mechanisms, excitability is regulated by changes in intrinsic properties of the neurons and adjustments of excitatory and inhibitory synaptic inputs. I propose that similar physiological feedback mechanisms are responsible for adjusting the parameters to reasonable levels in the phase-decoding network. Further research is required to explore how such feedback mechanisms can help to adjust the network parameters to a robust regime.

Another class of models addressing the transfer of information encoded by spike timing relative to an ongoing rhythm has been proposed by Ahissar (1998). This framework deals primarily with the decoding of the spike trains from single cells, not the decoding of a spatiotemporal code from an ensemble of cells. In these models, an oscillating network converts temporally encoded information to a rate code. The frequency of the oscillating decoder (the rate-controlled oscillator) is determined by the mean frequency of the incoming spike train. The decoding network compares the timing expectation defined according to the rate-controlled oscillator with the timing of an incoming spike. The difference in timing (phase information) is then converted to a rate code. This decoding scheme has been shown to be consistent with several experimental observations in the somatosensory cortex
(Ahissar et al., 1997, 2000). Although this framework has several components in common with the model proposed in this article, such as phase detection, it does not apply directly to reading the hippocampal phase code. The main reason is that, due to the theta phase precession, the interspike intervals of place cells are shorter than a theta period (see Figure 2A). Thus, it is not possible to set the phase and frequency of the rate-controlled oscillator by the spike trains from individual place cells. However, it is possible that the decoding scheme proposed by Ahissar (1998) could apply to the hippocampal phase code if it were modified such that the frequency and phase of the rate-controlled oscillator were set according to the mean firing rate of the ensemble of place cells.

The theoretical framework developed in this article underscores the importance of oscillatory brain activity and of the idea of phase coding. The principles of phase coding and of information transfer between oscillatory networks can be generalized beyond the rat hippocampus. It has recently been proposed that the hippocampal sequence readout within theta cycles serves as a general mechanism for the encoding and recall of episodic memory in humans as well as in animals (Lisman, 1999). New methods are being developed for the purpose of localizing coherent oscillatory sources from human data recorded by magnetoencephalography and electroencephalography (Gross et al., 2001). Future research applying such approaches, as well as intracranial recordings in which single-unit activity is related to the ongoing rhythmic activity, would be needed in order to test these ideas on information transfer between rhythmically coupled networks.

Acknowledgments
I thank Claudia D. Tesche, John E. Lisman, and Freya M. Johnson for valuable comments on the manuscript for this article. This research was supported by the Danish Medical Research Council.

References

Ahissar, E. (1998). Temporal-code to rate-code conversion by neuronal phase-locked loops. Neural Comput., 10(3), 597–650.
Ahissar, E., Haidarliu, S., & Zacksenhouse, M. (1997). Decoding temporally encoded sensory input by cortical oscillations and thalamic phase comparators. Proc. Natl. Acad. Sci. U.S.A., 94(21), 11633–11638.
Ahissar, E., Sosnik, R., & Haidarliu, S. (2000). Transformation from temporal to rate coding in a somatosensory thalamocortical pathway. Nature, 406(6793), 302–306.
Barnes, C. A., McNaughton, B. L., Mizumori, S. J., Leonard, B. W., & Lin, L. H. (1990). Comparison of spatial and temporal characteristics of neuronal activity in sequential stages of hippocampal processing. Prog. Brain Res., 83, 287–300.
Bear, M. F. (1995). Mechanism for a sliding synaptic modification threshold. Neuron, 15(1), 1–4.
Bland, B. H., & Oddie, S. D. (1998). Anatomical, electrophysiological and pharmacological studies of ascending brainstem hippocampal synchronizing pathways. Neurosci. Biobehav. Rev., 22(2), 259–273.
Blum, K. I., & Abbott, L. F. (1996). A model of spatial map formation in the hippocampus of the rat. Neural Comput., 8(1), 85–93.
Boeijinga, P. H., & Lopes da Silva, F. H. (1989). Modulations of EEG activity in the entorhinal cortex and forebrain olfactory areas during odour sampling. Brain Res., 478(2), 257–268.
Borst, J. G., Leung, L. W., & MacFabe, D. F. (1987). Electrical activity of the cingulate cortex. II. Cholinergic modulation. Brain Res., 407(1), 81–93.
Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsaki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15(1 Pt 1), 47–60.
Brazhnik, E. S., & Fox, S. E. (1997). Intracellular recordings from medial septal neurons during hippocampal theta rhythm. Exp. Brain Res., 114(3), 442–453.
Castillo, P. E., Malenka, R. C., & Nicoll, R. A. (1997). Kainate receptors mediate a slow postsynaptic current in hippocampal CA3 neurons. Nature, 388(6638), 182–186.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat. Neurosci., 2(6), 515–520.
Dorries, K. M., & Kauer, J. S. (2000). Relationships between odor-elicited oscillations in the salamander olfactory epithelium and olfactory bulb. J. Neurophysiol., 83(2), 754–765.
Gross, J., Kujala, J., Hämäläinen, M., Timmermann, L., Schnitzler, A., & Salmelin, R. (2001). Dynamic imaging of coherent sources: Studying neural interactions in the human brain. Proc. Natl. Acad. Sci. U.S.A., 98, 694–699.
Holsheimer, J. (1982). Generation of theta activity (RSA) in the cingulate cortex of the rat. Exp. Brain Res., 47(2), 309–312.
Jensen, O., Idiart, M. A., & Lisman, J. E. (1996). Physiologically realistic formation of autoassociative memory in networks with theta/gamma oscillations: Role of fast NMDA channels. Learn. Mem., 3, 243–256.
Jensen, O., & Lisman, J. E. (1996a). Hippocampal CA3 region predicts memory sequences: Accounting for the phase precession of place cells. Learn. Mem., 3(2–3), 279–287.
Jensen, O., & Lisman, J. E. (1996b). Theta/gamma networks with slow NMDA channels learn sequences and encode episodic memory: Role of NMDA channels in recall. Learn. Mem., 3(2–3), 264–278.
Jensen, O., & Lisman, J. E. (1997). The importance of hippocampal gamma oscillations for place cells. In J. M. Bower (Ed.), Computational neuroscience (pp. 683–689). New York: Plenum Press.
Jensen, O., & Lisman, J. E. (2000). Position reconstruction from an ensemble of hippocampal place cells: Contribution of theta phase coding. J. Neurophysiol., 83, 2602–2609.
Kentros, C., Hargreaves, E., Hawkins, R. D., Kandel, E. R., Shapiro, M., & Muller, R. V. (1998). Abolition of long-term stability of new hippocampal place cell maps by NMDA receptor blockade. Science, 280(5372), 2121–2126.
King, C., Recce, M., & O'Keefe, J. (1998). The rhythmicity of cells of the medial septum/diagonal band of Broca in the awake freely moving rat: Relationships with behaviour and hippocampal theta. Eur. J. Neurosci., 10(2), 464–477.
Koch, C., & Segev, I. (1998). Methods in neuronal modeling. Cambridge, MA: MIT Press.
Kocsis, B., Bragin, A., & Buzsaki, G. (1999). Interdependence of multiple theta generators in the hippocampus: A partial coherence analysis. J. Neurosci., 19(14), 6200–6212.
Kocsis, B., & Vertes, R. P. (1997). Phase relations of rhythmic neuronal firing in the supramammillary nucleus and mammillary body to the hippocampal theta activity in urethane anesthetized rats. Hippocampus, 7(2), 204–214.
Lisman, J. E. (1999). Relating hippocampal circuitry to function: Recall of memory sequences by reciprocal dentate-CA3 interactions. Neuron, 22(2), 233–242.
Macrides, F., Eichenbaum, H. B., & Forbes, W. B. (1982). Temporal relationship between sniffing and the limbic theta rhythm during odor discrimination reversal learning. J. Neurosci., 2(12), 1705–1717.
Mehta, M. R., Barnes, C. A., & McNaughton, B. L. (1997). Experience-dependent, asymmetric expansion of hippocampal place fields. Proc. Natl. Acad. Sci. U.S.A., 94(16), 8918–8921.
Nicolelis, M. A., Baccala, L. A., Lin, R. C., & Chapin, J. K. (1995). Sensorimotor encoding by synchronous neural ensemble activity at multiple levels of the somatosensory system. Science, 268(5215), 1353–1358.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34(1), 171–175.
O'Keefe, J., & Recce, M. L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3(3), 317–330.
Olton, D. S., Branch, M., & Best, P. J. (1978). Spatial correlates of hippocampal unit activity. Exp. Neurol., 58(3), 387–409.
Sarnthein, J., Petsche, H., Rappelsberger, P., Shaw, G. L., & von Stein, A. (1998). Synchronization between prefrontal and posterior association cortex during human working memory. Proc. Natl. Acad. Sci. U.S.A., 95(12), 7092–7096.
Shen, J., Barnes, C. A., McNaughton, B. L., Skaggs, W. E., & Weaver, K. L. (1997). The effect of aging on experience-dependent plasticity of hippocampal place cells. J. Neurosci., 17(17), 6769–6782.
Siapas, A. G., Lee, A. K., Lubenov, E. V., & Wilson, M. A. (2000). Prefrontal phase-locking to hippocampal theta oscillations. Soc. Neurosci. Abstr., 26, 467.1.
Skaggs, W. E., McNaughton, B. L., Wilson, M. A., & Barnes, C. A. (1996). Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus, 6(2), 149–172.
Soltesz, I., & Deschenes, M. (1993). Low- and high-frequency membrane potential oscillations during theta activity in CA1 and CA3 pyramidal neurons of the rat hippocampus under ketamine-xylazine anesthesia. J. Neurophysiol., 70(1), 97–116.
Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., & McNaughton, B. L. (1996). Population dynamics and theta rhythm phase precession of hippocampal place cell firing: A spiking neuron model. Hippocampus, 6(3), 271–280.
Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends Neurosci., 22(5), 221–227.
Varela, F., Lachaux, J. P., Rodriguez, E., & Martinerie, J. (2001). The brainweb: Phase synchronization and large-scale integration. Nat. Rev. Neurosci., 2(4), 229–239.
Vogt, B. A., & Miller, M. W. (1983). Cortical connections between rat cingulate cortex and visual, motor, and postsubicular cortices. J. Comp. Neurol., 216(2), 192–210.
Wallenstein, G. V., & Hasselmo, M. E. (1997). GABAergic modulation of hippocampal population activity: Sequence learning, place field development, and the phase precession effect. J. Neurophysiol., 78(1), 393–408.
Yamamoto, C., Sawada, S., & Ohno-Shosaku, T. (1998). Distribution and properties of kainate receptors distinct in the CA3 region of the hippocampus of the guinea pig. Brain Res., 783(2), 227–235.

Received July 19, 2000; accepted February 13, 2001.
LETTER
Communicated by Emilio Salinas
Gaussian Process Approach to Spiking Neurons for Inhomogeneous Poisson Inputs

Ken-ichi Amemori
[email protected]
Shin Ishii
[email protected]
Nara Institute of Science and Technology, Ikoma-shi, Nara 630-0101, Japan, and CREST, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan

This article presents a new theoretical framework for considering the dynamics of a stochastic spiking neuron model with a general membrane response to input spikes. We assume that the input spikes obey an inhomogeneous Poisson process. The stochastic process of the membrane potential then becomes a gaussian process. When a general type of membrane response is assumed, the stochastic process becomes a Markov-gaussian process. We present a calculation method for the membrane potential density and the firing probability density. Our new formulation is an extension of the existing formulation based on the diffusion approximation. Although the single Markov assumption of the diffusion approximation simplifies the stochastic process analysis, the calculation is inaccurate when the stochastic process involves a multiple Markov property. We find that the variation of the shape of the membrane response, which has often been ignored in existing stochastic process studies, significantly affects the firing probability. Our approach can also treat the reset effect, which has been difficult to deal with in analyses based on the first passage time density.

1 Introduction
Spiking neuron models aim to describe neuronal activities based on the spikes that pass through the neuron. The spike response model (Gerstner, 1998, 2000; Kistler, Gerstner, & van Hemmen, 1997) is one such spiking neuron model. In that model, the membrane potential of a neuron is defined by the weighted summation of the postsynaptic potentials, each of which is induced by a single input spike. When the membrane potential becomes larger than a certain firing threshold, the neuron emits a spike to connecting neurons.

The stochastic nature of spikes has been suggested in recent physiological studies. The expression of cerebellar Purkinje cells is based on ensemble rate coding, and each spike is a temporally variable stochastic event. The ensemble spike rate obtained by averaging over trials is correlated with the motor command (Shidara, Kawano, Gomi, & Kawato, 1993; Gomi et al., 1998). Stochastic spikes are also seen in the cerebral cortex. Spike variability that seemingly obeys a Poisson process is observed at a low, spontaneous firing rate, 2–5 Hz, and even at a high firing rate of up to 100 Hz (Softky & Koch, 1993). In spite of this variability, stimulus-induced spike modulation can be observed in temporal patterns of spikes averaged over trials. Spike histograms over trials are called poststimulus time histograms (PSTHs). Middle temporal (MT) neurons of a behaving monkey exhibit spike modulation among background spikes of 3–4 Hz, and the modulation is almost invariant over different trials (Bair & Koch, 1996). According to Arieli, Sterkin, Grinvald, and Aertsen (1996), neuronal responses vary a lot even when they are evoked by the same stimulus. By adding the response averaged over trials to the initial neuron states, however, the actual single response can be well approximated. This implies that the trial-averaged spikes, or firing probability density, may be the expression of information conveyed in the neuronal network.

The physiological studies above suggest that spikes are stochastic events, while their temporally variable frequency represents the temporal information. Such a stochastic nature of spikes can be theoretically modeled by an inhomogeneous Poisson process, where the Poisson intensity corresponds to the represented information. In this article, we adopt two assumptions: that a neuron can be described by the spiking neuron model and that spikes input to the neuron are described by an inhomogeneous Poisson process. We believe that these two assumptions are fairly plausible, especially in the cortical areas whose spike rates are low. Under these assumptions, the dynamics of the membrane potential can be expressed by a gaussian process. Moreover, it often involves a Markov property.

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 2763–2797 (2001).
In this case, the dynamics can be formulated as a Markov-gaussian process. This article presents a new theoretical approach to the Markov-gaussian process that can precisely calculate the dynamics of the membrane potential and the firing probability.

Many stochastic process studies have dealt with integrate-and-fire neuron models. The membrane response to a single input spike is approximated as an exponentially decaying function (Stein, 1967; Tuckwell, 1988, 1989). The existing studies based on the diffusion approximation have adopted the same approximation (Tuckwell & Cope, 1980; Tuckwell, 1989; Ricciardi & Sato, 1988; many other Ornstein-Uhlenbeck process studies). This approximation significantly simplifies the stochastic process analysis. When one considers a more general type of membrane response, however, these studies cannot accurately describe the dynamics of the membrane potential. This article shows that consideration of the membrane response could be important for neuronal information processing. On the other hand, there have been studies based on the backward resolution of the Volterra integral equation (Plesser & Tanaka, 1997; Burkitt & Clark, 1999, 2000). These studies deal with the general type of membrane response. Since they aim mainly to calculate the first passage time density, however, they are not suitable for considering a situation where the neuron may fire many times (i.e., multiple firings). Because we assume the neuronal information is represented by the temporally variable rate, this is a flaw. Compared with the existing studies, our approach has several important advantages. First, the calculation is precise. Second, the general type of membrane response of the spiking neuron model can be considered. Third, multiple firings can be considered. Our approach is thus a general framework to deal with the spiking neuron model.

This article is organized as follows. Section 2 introduces the spiking neuron model. In section 3, we show that the probability density of the membrane potential is described as a Markov-gaussian process. In section 4, numerical simulations are shown. The theoretical results agree very well with those of Monte Carlo simulations. In sections 4.1 and 4.2, we also show that the single Markov assumption, which corresponds to the diffusion approximation, is not accurate for some types of the membrane response. In section 4.3, our method is extended so as to deal with multiple firings. Section 5 is devoted to discussion. Section 5.1 suggests the importance of the general type of the membrane response. Section 5.2 compares our study with the existing stochastic process studies. Section 6 summarizes this article.

2 Spiking Neuron Model

2.1 Integrate-and-Fire Neuron Model. The membrane potential of an integrate-and-fire neuron can be defined using the hardware circuit illustrated in Figure 1. The circuit consists of a resistor R and a capacitor C. The input current to the circuit is denoted by I(t). The membrane potential of this neuron, v(t), is formulated by I(t) = v(t)/R + C dv(t)/dt. Defining the membrane time constant by τ_m = RC, this equation becomes

τ_m dv(t)/dt = −v(t) + R I(t).   (2.1)

Figure 1: Hardware circuit for the integrate-and-fire neuron model (the figure shows the input current I(t), the RC circuit, the membrane potential v(t), and the firing threshold θ).
The input current is described as

I(t) = Σ_{j=1}^{N} c_j Σ_{f=1}^{n_j(t)} Q(t − t_j^f),   (2.2)

where N is the number of synapses projecting to the neuron, n_j(t) is the number of spikes input through synapse j until time t, c_j is the synaptic coefficient from presynaptic neuron j, and t_j^f is the time at which the f-th spike enters synapse j. The response function Q(s) is often defined by an exponentially decaying function,

Q(s) = (1/τ_s) exp(−s/τ_s) H(s),   (2.3)

where H(s) is the Heaviside (step) function defined by

H(s) = 1 (s ≥ 0), 0 (s < 0),   (2.4)

and τ_s is the synaptic time constant. An integrate-and-fire neuron emits a spike when the membrane potential exceeds a certain spiking threshold θ, and after that, the membrane potential is reset to v_reset = 0. The threshold introduces nonlinearity.

2.2 Spiking Neuron Model. The solution of equations 2.1 through 2.3
becomes

v(t) = Σ_{j=1}^{N} w_j Σ_{f=1}^{n_j(t)} u(t − t_j^f),   (2.5)

where w_j = R c_j / τ_m and

u(s) = [τ_m/(τ_m − τ_s)] [exp(−s/τ_m) − exp(−s/τ_s)] H(s) ≡ q (e^{−αs} − e^{−βs}) H(s),   (2.6)

where α ≡ 1/τ_m and β ≡ 1/τ_s. The factor q = β/(β − α) is a normalization term chosen so that the integral ∫_{−∞}^{∞} u(s) ds is set to 1/α = τ_m. The function u(s), called the spike response function, defines the membrane response to a single input spike. When the membrane potential is described by a linear sum of the response function, the neuron model is called the spike response model or the spiking neuron model (Gerstner, 1998).
Figure 2: Shape of the response function u(s), where α = 0.1 (1/ms). The solid line denotes function 2.8, and the dash-dotted line denotes function 2.7. The dashed lines denote function 2.6, where β = 3.0, 1.0, 0.5, 0.3, 0.2, and 0.14 (1/ms) from left to right. All of the functions are normalized so that their integrals are 1/α.
When β → α, function 2.6 approaches

u(s) = (s/τ_m) exp(−s/τ_m) H(s) = q s e^{−αs} H(s),   (2.7)

which is called the alpha function. When β → ∞, function 2.6 approaches

u(s) = exp(−s/τ_m) H(s) = q e^{−αs} H(s).   (2.8)

The normalization term is set at q = 1 for equation 2.8 and q = α for equation 2.7. The shapes of the response functions are shown in Figure 2. Response functions 2.6 and 2.7 are solutions of a second-order linear differential equation. In this article, we assume that the spike response function has the form 2.8, 2.7, or 2.6.

3 Density of the Membrane Potential for Stochastic Inputs
In this article, we examine stochastic properties of a single neuron defined by the spiking neuron model. We assume that the input spikes, {t_j^f | f = 1, ..., n_j(t)}, are stochastic events. Subsequently, random variables are denoted by math fonts in boldface type. Given the spike response function u(s), we examine the property of the membrane potential:

v(t) = Σ_{j=1}^{N} w_j Σ_{f=1}^{n_j(t)} u(t − t_j^f) ≡ Σ_{j=1}^{N} w_j x_j(t),   (3.1)
Figure 3: Sample paths of the membrane potential. (Top) The membrane potential when the absorbing threshold is assumed. (Middle) The membrane potential when the reset after firing is considered. The input spikes obey an inhomogeneous Poisson process whose intensity is described in the bottom part of the figure. θ = 15 mV, α = 0.1, and β = 0.3 (1/ms). Other parameters are defined in section 4.
where w_j x_j(t) is the membrane response (postsynaptic potential, PSP) corresponding to synapse j. This neuron generates a spike when v(t) > θ. When a spike is generated, the membrane potential is reset to v_reset = 0. Sample paths of this stochastic process are shown in Figure 3.

3.1 Free Membrane Potential Density. Here we examine the dynamics of the density of the membrane potential when the threshold is ignored. The effect of the threshold will be discussed in section 3.2. We assume that the input spikes obey an inhomogeneous Poisson process. If the spike frequency, which is indicated by the Poisson intensity, is large enough in comparison to the decay of the membrane potential, the density of the membrane potential can be represented by a gaussian distribution due to the central limit theorem (Rice, 1944). For the time being, synapse index j and transmission efficacy w_j are omitted. The probability density of the membrane response for a single synapse, x(t) = Σ_{f=1}^{n(t)} u(t − t^f), is described as (Papoulis, 1984)

g(x^t) = ⟨δ(x^t − x(t))⟩ = ∫_{−∞}^{∞} (dξ/2π) exp(iξx^t) ⟨exp(−iξx(t))⟩,   (3.2)
where i = √(−1) and ⟨·⟩ denotes an ensemble average over trials. Time [0, t] is partitioned into intervals [s_l, s_{l+1}] (l = 0, ..., T − 1) whose period is Δs, namely, s_l = lΔs, s_0 = 0, and s_T = TΔs = t. If the time step Δs is small enough, the change of the membrane response at time t, which is caused by the spikes input within [s_l, s_{l+1}], can be approximated as Δx_l(t) ≈ u(t − s_l)Δn_l. Here, Δn_l denotes the number of spikes input within [s_l, s_{l+1}]. Then the membrane response x(t) can be described by the summation of statistically independent random variables {Δx_l(t) | l = 0, ..., T − 1}, and the characteristic function of x(t) becomes

⟨exp(−iξx(t))⟩ ≈ Π_{l=0}^{T−1} ⟨exp(−iξ u(t − s_l) Δn_l)⟩
  = exp( Σ_{l=0}^{T−1} ln⟨exp(−iξ u(t − s_l) Δn_l)⟩ )
  = exp( −iξ Σ_{l=0}^{T−1} u(t − s_l)⟨Δn_l⟩ − (ξ²/2) Σ_{l=0}^{T−1} u²(t − s_l)(⟨Δn_l²⟩ − ⟨Δn_l⟩²) + O(ξ³) ).   (3.3)
Denoting the Poisson intensity at time s by λ(s), we obtain ⟨Δn_l⟩ ≈ λ(s_l)Δs and ⟨Δn_l²⟩ − ⟨Δn_l⟩² ≈ λ(s_l)Δs. The mean and the variance are given by (Amemori & Ishii, 2000)

γ^o(t) = ∫_0^t u(t − s) λ(s) ds,   (σ^o)²(t) = ∫_0^t u²(t − s) λ(s) ds.   (3.4)
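Equation 3.4 reduces the problem to two quadratures, which are easy to evaluate numerically. A sketch in Python follows; the exponential kernel, the constant 100 Hz intensity, and the helper names are our illustrative choices:

```python
import numpy as np

tau_m = 10.0  # membrane time constant (ms)

def u_exp(s):
    """Exponential spike response of eq. 2.8."""
    return np.exp(-np.maximum(s, 0.0) / tau_m) * (s >= 0)

def mean_var_response(t, lam, u, n=4001):
    """gamma^o(t) and (sigma^o)^2(t) of eq. 3.4, by the trapezoidal rule."""
    s = np.linspace(0.0, t, n)
    h = s[1] - s[0]
    w = np.full(n, h)
    w[0] = w[-1] = h / 2.0                  # trapezoidal weights
    ker = u(t - s)
    return float(np.sum(ker * lam(s) * w)), float(np.sum(ker**2 * lam(s) * w))

# Constant intensity of 100 Hz, written as 0.1 spikes/ms so that the
# article's unit factor 1e-3 is absorbed into lambda itself.
lam_const = lambda s: np.full_like(s, 0.1)
g, s2 = mean_var_response(200.0, lam_const, u_exp)
# For this kernel the stationary values are tau_m*lam = 1.0 and
# tau_m*lam/2 = 0.5, which g and s2 approach for t >> tau_m.
```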
The spike response u(t) decays with α = 1/τ_m. When 10⁻³ λ(t) τ_m ≫ 1, a number of spike responses are accumulated within their decay time τ_m.¹ In this case, the density of the membrane potential can be well estimated by a gaussian distribution due to the central limit theorem; namely, the components of O(ξ³) can be omitted. Then we obtain the following equation for Δs → 0:

g(x^t) = ∫_{−∞}^{∞} (dξ/2π) exp( iξx^t − iξγ^o(t) − (ξ²/2)(σ^o)²(t) )
       = [1/√(2π(σ^o)²(t))] exp( −(x^t − γ^o(t))² / (2(σ^o)²(t)) ).   (3.5)

When the input spikes are statistically independent with respect to the synapse indices j, the density of the integrated membrane potential becomes

g(v^t) = [1/√(2πσ²(t))] exp( −(v^t − γ(t))² / (2σ²(t)) ),
γ(t) = Σ_{j=1}^{N} w_j γ_j^o(t),   σ²(t) = Σ_{j=1}^{N} (w_j)² (σ_j^o)²(t).   (3.6)

¹ The coefficient 10⁻³ is added because λ(t) is in Hz and τ_m is in ms.
γ_j^o(t) and (σ_j^o)²(t) are the mean and the variance of the membrane response evoked by the jth synapse, whose input spike intensity is λ_j(t). They are given similarly to equation 3.4. When the spikes are independent with respect to the synapse index j, the condition for a gaussian distribution can be written as 10⁻³ Σ_{j=1}^{N} λ_j(t) τ_m ≫ 1. Even if the input intensity λ_j(t) is small in comparison to the decay constant α = 1/τ_m, the density becomes a gaussian distribution for a large number of synapses. This corresponds to the well-known gaussian approximation (Burkitt & Clark, 1999).

3.2 Membrane Potential Density with Threshold. In this section, we examine the density of the membrane potential when the absorbing threshold is assumed; once the membrane potential reaches the threshold, the sample is removed from the density. It is assumed that the neuron behaviors are observed at every Δt time step along a discretized time sequence {t_0 = 0, t_1 = Δt, ..., t_m = mΔt = t}, where Δt is a sufficiently small time interval. Our description approaches that of the continuous-time stochastic process as Δt → 0. This will be discussed in section A.4 in the appendix. We denote the membrane potential at time t as v^t. For simplicity of expression, v^{t_m} is abbreviated as v^m, and the sequence of variables {v^m, ..., v^1} will be denoted as {v}_1^m.
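Before the threshold is introduced, the free-density result of section 3.1 (equation 3.6) can be checked by direct Monte Carlo sampling of equation 3.1. A sketch with homogeneous Poisson inputs and identical synapses follows; all parameter values are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
tau_m = 10.0          # membrane time constant (ms)
T = 200.0             # observation time (ms), >> tau_m
lam = 0.1             # intensity per synapse: 0.1 spikes/ms = 100 Hz
N, w = 20, 0.2        # number of synapses, identical weights

def u_exp(s):
    return np.exp(-np.maximum(s, 0.0) / tau_m) * (s >= 0)

def sample_v(t):
    """One Monte Carlo draw of v(t) from eq. 3.1."""
    v = 0.0
    for _ in range(N):
        n_spikes = rng.poisson(lam * T)
        spikes = rng.uniform(0.0, T, n_spikes)  # homogeneous Poisson times
        v += w * u_exp(t - spikes).sum()
    return v

vs = np.array([sample_v(T) for _ in range(2000)])
# Eq. 3.6 predicts (in the stationary limit) mean = N*w*tau_m*lam = 4.0
# and variance = N*w**2*lam*tau_m/2 = 0.4; vs.mean() and vs.var()
# should match these up to sampling error.
```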
3.2.1 Time Evolution of the Membrane Potential Before Firing. The time evolution of the neuron ensemble can be determined by the joint probability density of the membrane potential,

g({v}_1^m) ≡ ⟨δ(v^m − v(t_m)) ··· δ(v^1 − v(t_1))⟩.   (3.7)

The density is decomposed as

g({v}_1^m) = g(v^m | {v}_1^{m−1}) g(v^{m−1} | {v}_1^{m−2}) ··· g(v^1),   (3.8)

where g(v^m | {v}_1^{m−1}) = g({v}_1^m) / g({v}_1^{m−1}) is the conditional probability density of the membrane potential, which is subsequently called the transition probability density. Let g^0(v^m) be the density at time t_m whose neuron samples have not fired until time t_{m−1}. Then g^0(v^m) becomes (van Kampen, 1992)

g^0(v^m) = ∫_{−∞}^{θ} dv^{m−1} ··· ∫_{−∞}^{θ} dv^1 g(v^m | {v}_1^{m−1}) g(v^{m−1} | {v}_1^{m−2}) ··· g(v^1).   (3.9)
Gaussian Process Approach to Spiking Neurons
Figure 4: Density of the membrane potential at time $t_m$. $g(v^m)$ is the density of the membrane potential ignoring the threshold, where $v^m$ is the membrane potential at time $t_m$. $g_u(v^m)$ is the density of the membrane potential with the threshold. $g_f(v^m) = g(v^m) - g_u(v^m)$ is the density of the membrane potential that has already fired. $g_u(v^m)$ is continuously zero at $\theta$ when $\Delta t \to 0$. (Left) The case in which the threshold $\theta$ is high in comparison to the mean $\eta(t_m)$ of the free density $g(v^m)$. (Right) The case in which the threshold $\theta$ is relatively low.
By using $g_0(v^m)$, the density of the samples that have not fired until time $t_m$ is given by
\[
g_u(v^m) = \begin{cases} 0 & v^m \ge \theta \\ g_0(v^m) & v^m < \theta. \end{cases} \tag{3.10}
\]
The density of the samples that have fired until time $t_m$ is given by
\[
g_f(v^m) = g(v^m) - g_u(v^m). \tag{3.11}
\]
The relation of these densities is illustrated in Figure 4. The density $g(v^m)$, which is the free density in the absence of the threshold, always forms a gaussian distribution. The probability that a neuron sample fires for the first time at time $t_m$ is given by
\[
\rho(t_m) = \int_{\theta}^{\infty} g_0(v^m)\, dv^m. \tag{3.12}
\]
$\psi(t) \equiv \rho(t)/\Delta t$ is called the first passage time density. This threshold effect in continuous time is discussed in section A.4.

3.2.2 Transition Probability Densities. In order to obtain the density before firing (see equation 3.9 or 3.10), we need the transition probability density $g(v^m \mid \{v\}_1^{m-1})$ at each time step. Here, we develop a method that incrementally calculates the transition probability density.
Ken-ichi Amemori and Shin Ishii
Joint probability density. Since the transition probability density is given in terms of the joint probability density as $g(v^m \mid \{v\}_1^{m-1}) = g(\{v\}_1^m)/g(\{v\}_1^{m-1})$, we first consider $g(\{v\}_1^m)$. If $\lambda(t)$ and $\tau_m$ are not small, the joint probability density can be well approximated by a gaussian distribution:
\[
g(\{v\}_1^m) = \frac{1}{\sqrt{(2\pi)^m |\Sigma^m|}} \exp\left( -\frac{1}{2} (\vec{v} - \vec{\eta})'\, (\Sigma^m)^{-1}\, (\vec{v} - \vec{\eta}) \right), \tag{3.13}
\]
where $\vec{v} = (v^1, \ldots, v^m)'$ and $\vec{\eta} = (\eta(t_1), \ldots, \eta(t_m))'$. A prime ($'$) denotes a transpose. $\eta(t_m)$ is the mean of $v(t_m)$, which is given by equation 3.6. $\Sigma^m \equiv (\Sigma^m_{kl})$ is the $m$-by-$m$ autocorrelation matrix of $\vec{v}$, whose components are $\Sigma^m_{kl} \equiv \langle v(t_k)\, v(t_l) \rangle - \langle v(t_k) \rangle \langle v(t_l) \rangle$. Equation 3.13 is derived in section A.1. When the joint probability density is represented by a gaussian distribution, the stochastic process is called a gaussian process (Hida, 1980).

Time evolution of the gaussian process. $(\Sigma^m)^{-1}$ can be incrementally calculated as follows. Because the synaptic inputs are statistically independent of each other, $\Sigma^m_{kl}$ can be described by the weighted summation of the autocorrelations of single synaptic inputs:
\[
\Sigma^m_{kl} = \sum_{j=1}^{N} w_j^2 \int_0^{\min(t_k, t_l)} u(t_k - s)\, u(t_l - s)\, \lambda_j(s)\, ds. \tag{3.14}
\]
For simplicity, we discuss the case where the input intensity is common to all of the synapses. In this case,
\[
\Sigma^m_{kl} = w^2 \int_0^{\min(t_k, t_l)} u(t_k - s)\, u(t_l - s)\, \lambda(s)\, ds, \tag{3.15}
\]
where $w = \sqrt{\sum_{j=1}^{N} w_j^2}$. In order to see the time-evolving characteristics of the autocorrelation matrix $\Sigma^m$, its incremental (recursive) definition along the discretized time sequence $\{t_1, \ldots, t_m\}$ is derived here. At time $t_m$, $\Sigma^m$ is an $m$-by-$m$ matrix:
\[
\Sigma^m_{kl} = w^2 \sum_{n=1}^{\min(k,l)} u(t_k - t_n)\, u(t_l - t_n)\, \lambda(t_n)\, \Delta t. \tag{3.16}
\]
Because S m is symmetric and positive denite, it can be decomposed into ( U m ) 0 U m , where U m D ( u kl ) is an upper triangular matrix. From equation 3.16, the components of Um are described as » u kl D
p w ¢ u( tl ¡ tk ) l(tk ) D t 0
k·l k > l.
(3.17)
Later we will see that these components are important for describing the time evolution of the gaussian process. By using $U^{m-1}$ and $u^m \equiv (u_{1m}, \ldots, u_{m-1\,m})'$, $U^m$ can be obtained as
\[
U^m = \begin{pmatrix} U^{m-1} & u^m \\ \mathbf{0}' & u_{mm} \end{pmatrix}, \tag{3.18}
\]
where $\mathbf{0} = (0, \ldots, 0)'$ has dimension $m-1$. Let $C^m \equiv (C^m_{kl})$ be the inverse of $U^m$, which is given by
\[
C^m = \begin{pmatrix} C^{m-1} & -\dfrac{1}{u_{mm}}\, C^{m-1} u^m \\ \mathbf{0}' & \dfrac{1}{u_{mm}} \end{pmatrix}. \tag{3.19}
\]
The $m$th column of $C^m$ is constructed by the Gram-Schmidt orthogonalization. The inverse of $\Sigma^m$ is then given by
\[
(\Sigma^m)^{-1} = C^m (C^m)' = \begin{pmatrix} C^{m-1}(C^{m-1})' + \dfrac{(C^{m-1}u^m)(C^{m-1}u^m)'}{(u_{mm})^2} & -\dfrac{1}{(u_{mm})^2}\, C^{m-1}u^m \\[2mm] -\dfrac{1}{(u_{mm})^2}\, (C^{m-1}u^m)' & \dfrac{1}{(u_{mm})^2} \end{pmatrix}. \tag{3.20}
\]
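The block structure of equation 3.20 can be checked numerically against a direct matrix inverse. This sketch uses an arbitrary random upper-triangular factor (the size and seed are assumptions):

```python
import numpy as np

# Sketch: numerically verify the block form of eq. 3.20 on an arbitrary
# upper-triangular factor (size and random seed are assumptions).
rng = np.random.default_rng(0)
m = 6
U = np.triu(rng.uniform(0.5, 1.5, (m, m)))   # U^m with a positive diagonal
Um1 = U[:m - 1, :m - 1]                      # U^{m-1}
um = U[:m - 1, m - 1]                        # last column u^m
umm = U[m - 1, m - 1]

Cm1 = np.linalg.inv(Um1)                     # C^{m-1} = (U^{m-1})^{-1}
b = Cm1 @ um                                 # C^{m-1} u^m

# assemble (Sigma^m)^{-1} block by block, following eq. 3.20
inv = np.zeros((m, m))
inv[:m - 1, :m - 1] = Cm1 @ Cm1.T + np.outer(b, b) / umm**2
inv[:m - 1, m - 1] = -b / umm**2
inv[m - 1, :m - 1] = -b / umm**2
inv[m - 1, m - 1] = 1.0 / umm**2

Sigma = U.T @ U                              # Sigma^m = (U^m)' U^m
assert np.allclose(inv, np.linalg.inv(Sigma))
```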
Equation 3.20 gives an incremental calculation of the inverse of the autocorrelation matrix, $(\Sigma^m)^{-1}$, along the time sequence. When one of the response functions 2.8, 2.7, or 2.6 is used, the inverse autocorrelation matrix $(\Sigma^m)^{-1}$ becomes simple. This simplicity also implies that the transition probability density is strictly determined by a multiple Markov process along the discretized time sequence. The order of this Markov process for each response function is discussed in section A.2.

Transition probability density. We now obtain the transition probability density of the gaussian process. From equation 3.13, the transition probability density $g(v^m \mid \{v\}_1^{m-1}) = g(\{v\}_1^m)/g(\{v\}_1^{m-1})$ becomes
\[
g(v^m \mid \{v\}_1^{m-1}) = \frac{1}{\sqrt{2\pi\, |\Sigma^m| / |\Sigma^{m-1}|}} \exp\left( -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} f^m_{ij}\, \bar{v}^i \bar{v}^j + \frac{1}{2} \sum_{i=1}^{m-1} \sum_{j=1}^{m-1} f^{m-1}_{ij}\, \bar{v}^i \bar{v}^j \right)
\]
\[
= \frac{1}{\sqrt{2\pi c^m}} \exp\left( -\frac{\left( \bar{v}^m - \sum_{i=1}^{m-1} k^m_i\, \bar{v}^i \right)^2}{2 c^m} \right), \tag{3.21}
\]
where $(f^m_{ij}) \equiv (\Sigma^m)^{-1}$ and $\bar{v}^i \equiv v^i - \eta(t_i)$. From equations 3.19 and 3.20, the conditional variance $c^m$ and the correlation coefficients $k^m_i$ are given by
\[
c^m = \frac{|\Sigma^m|}{|\Sigma^{m-1}|} = (u_{mm})^2, \qquad k^m_i = -(u_{mm})^2 f^m_{im}. \tag{3.22}
\]
In the general case, an incremental calculation along the time sequence is necessary for obtaining $f^m_{im}$ in equation 3.22. If we can assume a Markov property, however, the calculation becomes drastically simplified, as we discuss below.

3.3 Markov Property. When the spiking neuron model employs the spike response function 2.8, 2.7, or 2.6, the transition probability density of the neuron ensemble has a Markov property. This feature is discussed in section A.2. If we know that the stochastic process has an $n$-ple Markov property, the transition probability density is theoretically given as follows.
3.3.1 Time Evolution of the Multiple Markov Process. If the conditional probability density has an $n$-ple Markov property, $g(v^m \mid \{v\}_1^{m-1}) = g(v^m \mid \{v\}_{m-n}^{m-1})$. Under this assumption, $g_0(v^m)$ is given by
\[
g_0(v^m) = \int_{-\infty}^{\theta} dv^{m-1} \int_{-\infty}^{\theta} dv^{m-2} \cdots \int_{-\infty}^{\theta} dv^1\; g(v^m \mid \{v\}_{m-n}^{m-1})\, g(v^{m-1} \mid \{v\}_{m-n-1}^{m-2}) \cdots g(v^{n+1} \mid \{v\}_1^n)\, g(\{v\}_1^n). \tag{3.23}
\]
Let $g_0(\{v\}_{m-n}^{m-1})$ be the joint probability density between $t_{m-n}$ and $t_{m-1}$ of the samples that have not reached the threshold until time $t_{m-n}$. Then,
\[
g_0(\{v\}_{m-n}^{m-1}) \equiv \int_{-\infty}^{\theta} dv^{m-n-1} \cdots \int_{-\infty}^{\theta} dv^1\; g(\{v\}_1^{m-1}). \tag{3.24}
\]
$g_0(\{v\}_{m-n}^{m-1})$ can be incrementally calculated by
\[
g_0(\{v\}_{m-n+1}^{m}) = \int_{-\infty}^{\theta} dv^{m-n}\; g(v^m \mid \{v\}_{m-n}^{m-1})\, g_0(\{v\}_{m-n}^{m-1}). \tag{3.25}
\]
From equations 3.23 and 3.24, $g_0(v^m)$ is given by the $n$-dimensional integral:
\[
g_0(v^m) = \int_{-\infty}^{\theta} dv^{m-1} \cdots \int_{-\infty}^{\theta} dv^{m-n}\; g(v^m \mid \{v\}_{m-n}^{m-1})\, g_0(\{v\}_{m-n}^{m-1}). \tag{3.26}
\]
The firing probability at time $t_m$ is given by
\[
\rho(t_m) = \int_{\theta}^{\infty} g_0(v^m)\, dv^m. \tag{3.27}
\]
As examples, two special cases are described.
Under the Markov property $g(v^m \mid \{v\}_1^{m-1}) = g(v^m \mid v^{m-1})$, the density $g_0(v^m)$ is calculated by very simple equations:
\[
g_0(v^1) = g(v^1); \qquad g_0(v^m) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1})\, g_0(v^{m-1})\, dv^{m-1} \quad (m = 2, 3, \ldots). \tag{3.28}
\]
If we can calculate $g(v^m \mid v^{m-1})$ (which is discussed in section 3.3.2), we obtain the firing probability density from equations 3.27 and 3.28. Under the double Markov property $g(v^m \mid \{v\}_1^{m-1}) = g(v^m \mid \{v\}_{m-2}^{m-1})$, the density $g_0(v^m)$ is calculated by
\[
g_0(v^1) = g(v^1), \qquad g_0(v^2) = \int_{-\infty}^{\theta} g(v^2, v^1)\, dv^1, \qquad g_0(v^2, v^1) = g(v^2, v^1);
\]
\[
g_0(v^m) = \int_{-\infty}^{\theta} \int_{-\infty}^{\theta} g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})\, dv^{m-1}\, dv^{m-2},
\]
\[
g_0(v^m, v^{m-1}) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})\, dv^{m-2} \quad (m = 3, 4, \ldots). \tag{3.29}
\]
3.3.2 Firing Probability of the Markov-Gaussian Process. We have seen that the transition probability density of the gaussian process is given by a gaussian distribution, equation 3.21. Here, we see how equation 3.21 is simplified under the Markov property. If an $n$-ple Markov process is assumed, $k^m_i = 0$ ($i = 1, \ldots, m-n-1$). In this case, the transition probability density is described by $\lambda(t_m)$, $u_{mm}, \ldots, u_{m-n\,m}$, and $\eta(t_m), \ldots, \eta(t_{m-n})$; that is, it is determined by the present intensity and the membrane properties over the previous $n$ steps. In the Markov-gaussian process, the transition probability density $g(v^m \mid v^{m-1})$ is calculated as
\[
g(v^m \mid v^{m-1}) = \frac{1}{\sqrt{2\pi c^m}} \exp\left( -\frac{\left( \bar{v}^m - k^m_{m-1}\, \bar{v}^{m-1} \right)^2}{2 c^m} \right), \tag{3.30}
\]
where $\bar{v}^m = v^m - \eta(t_m)$, and $c^m$ and $k^m_{m-1}$ are simply obtained from equation 3.20 as
\[
c^m = (u_{mm})^2, \qquad k^m_{m-1} = \frac{u_{m-1\,m}}{u_{m-1\,m-1}}. \tag{3.31}
\]
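For the exponential response function $u(s) = q e^{-as} H(s)$ used later in section 4.1, equation 3.31 yields a constant coefficient directly from the components of equation 3.17. A minimal sketch (parameter values are assumptions):

```python
import math

# Sketch: for the exponential response u(s) = q*exp(-a*s) (case 1 of
# section 4.1), eq. 3.31 follows directly from the components of eq. 3.17.
# Parameter values are assumed for illustration.
w, q, a, dt, lam = 2.0, 1.0, 0.05, 0.5, 3.0
u = lambda s: q * math.exp(-a * s)

u_m1_m = w * u(dt) * math.sqrt(lam * dt)     # u_{m-1,m}: response one step later
u_m1_m1 = w * u(0.0) * math.sqrt(lam * dt)   # u_{m-1,m-1}: response at s = 0
k = u_m1_m / u_m1_m1                         # eq. 3.31

assert abs(k - math.exp(-a * dt)) < 1e-12    # constant k^m_{m-1} = e^{-a dt}
```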
The definition of $u_{kl}$ is given by equation 3.17. The firing probability is determined by equations 3.27 and 3.28 as
\[
\rho(t_m) = \int_{\theta}^{\infty} dv^m \int_{-\infty}^{\theta} g(v^m \mid v^{m-1})\, g_0(v^{m-1})\, dv^{m-1}
= \int_{-\infty}^{\theta} \frac{1}{2} \operatorname{erfc}\left( \frac{\theta - \eta(t_m) - k^m_{m-1}\, \bar{v}^{m-1}}{\sqrt{2 c^m}} \right) g_0(v^{m-1})\, dv^{m-1}, \tag{3.32}
\]
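Equations 3.28, 3.30, and 3.32 together give a forward scheme: discretize the potential below the threshold, propagate $g_0$ with the gaussian kernel, and read $\rho(t_m)$ off the erfc tail. The sketch below is a minimal illustration with assumed constants (a fixed mean $\eta$ and fixed $c^m$), not the paper's full simulation:

```python
import math
import numpy as np

# Minimal sketch (assumed constants: fixed mean eta0 and conditional
# variance c) of the forward scheme of eqs. 3.28, 3.30, and 3.32.
theta, eta0 = 12.0, 10.0        # threshold and mean potential (mV), assumed
a, dt = 0.05, 0.5               # decay rate (1/ms) and time step (ms)
k = math.exp(-a * dt)           # k^m_{m-1} for u(s) = q e^{-a s} (case 1)
c = 0.4                         # conditional variance w^2 q^2 lambda dt, assumed

edges = np.linspace(-30.0, theta, 601)
dv = edges[1] - edges[0]
v = edges[:-1] + dv / 2         # cell-centered grid strictly below theta
vbar = v - eta0
g0 = np.exp(-0.25 * vbar**2)
g0 /= g0.sum() * dv             # initial gaussian density with unit mass

# eq. 3.30: gaussian transition kernel (constant here because eta is constant)
kernel = np.exp(-0.5 * (vbar[:, None] - k * vbar[None, :])**2 / c) \
         / math.sqrt(2 * math.pi * c)
# eq. 3.32: erfc tail of the kernel, i.e. the per-step crossing probability
tail = 0.5 * np.array([math.erfc((theta - eta0 - k * x) / math.sqrt(2 * c))
                       for x in vbar])

rho = []
for _ in range(200):
    rho.append(float(np.sum(tail * g0) * dv))   # firing probability rho(t_m)
    g0 = kernel @ g0 * dv                       # eq. 3.28: one propagation step

rho = np.array(rho)
assert np.all((rho >= 0) & (rho <= 1))
assert g0.sum() * dv <= 1.0 + 1e-3   # mass only leaks across the threshold
```

The absorbing threshold appears only through the truncation of the integration domain at $\theta$; the mass that crosses is exactly the erfc tail.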
where $\operatorname{erfc}(x) = 1 - \frac{2}{\sqrt{\pi}} \int_0^x e^{-y^2}\, dy$, and $g_0(v^{m-1})$ is incrementally calculated by using equations 3.28 and 3.30. Although we cannot avoid the numerical integration over $v^{m-1}$, this integration does not cost much computation. In the double Markov-gaussian process, the transition probability density $g(v^m \mid v^{m-1}, v^{m-2})$ is calculated as
\[
g(v^m \mid v^{m-1}, v^{m-2}) = \frac{1}{\sqrt{2\pi c^m}} \exp\left( -\frac{\left( \bar{v}^m - k^m_{m-1}\, \bar{v}^{m-1} - k^m_{m-2}\, \bar{v}^{m-2} \right)^2}{2 c^m} \right), \tag{3.33}
\]
where $\bar{v}^m = v^m - \eta(t_m)$, and $k^m_{m-1}$ and $k^m_{m-2}$ are theoretically obtained from equation 3.20 as
\[
k^m_{m-1} = \frac{u_{m-1\,m}}{u_{m-1\,m-1}}, \qquad k^m_{m-2} = \frac{u_{m-2\,m}}{u_{m-2\,m-2}} - \frac{u_{m-2\,m-1}}{u_{m-2\,m-2}} \frac{u_{m-1\,m}}{u_{m-1\,m-1}}, \tag{3.34}
\]
and $c^m = (u_{mm})^2$. The firing probability is given by
\[
\rho(t_m) = \int_{\theta}^{\infty} dv^m \int_{-\infty}^{\theta} dv^{m-1} \int_{-\infty}^{\theta} dv^{m-2}\; g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})
\]
\[
= \int_{-\infty}^{\theta} \int_{-\infty}^{\theta} \frac{1}{2} \operatorname{erfc}\left( \frac{\theta - \eta(t_m) - k^m_{m-1}\, \bar{v}^{m-1} - k^m_{m-2}\, \bar{v}^{m-2}}{\sqrt{2 c^m}} \right) g_0(v^{m-1}, v^{m-2})\, dv^{m-1}\, dv^{m-2}, \tag{3.35}
\]
where $g_0(v^{m-1}, v^{m-2})$ is incrementally calculated by equations 3.29 and 3.33. This numerical integration is fairly expensive; a simplification of the integration is discussed in section 4.2. It should be noted that under the Markov assumption, there is no need for the incremental calculation of equation 3.20.

4 Examples of Stochastic Spiking Neuron Models
As a simplified model of the cerebral cortex, we consider a network of excitatory and inhibitory neurons. The numbers of excitatory inputs and
inhibitory inputs are set at $N_E = 6000$ and $N_I = 1500$, respectively. We also set the synaptic efficacy values to be the same within each of the excitatory and inhibitory input groups; they are given by $w_E = 0.21$ and $w_I = 0.63$, respectively. These experimental conditions follow Amit and Brunel (1997b). The total weight $w$, which is defined in section 3.2.2, is $w^2 = N_E w_E^2 + N_I w_I^2$. We compare the results obtained by our method with those of the Monte Carlo simulation. As an example of the inhomogeneous Poisson process, we assume that the Poisson intensity $\lambda(t)$ (Hz) changes with time according to
\[
\lambda(t) = \lambda_0 + \bar{\lambda} \left( \sin\left( 2\pi (t + \bar{t})/\bar{t} \right) + \sin^2\left( 2\pi (t + \bar{t})/\bar{t} \right) \right), \tag{4.1}
\]
where $\lambda_0 = 3.0$ Hz, $\bar{\lambda} = 1.0$ Hz, and $\bar{t} = 0.2$ s. The numerical calculation is done with the time step $\Delta t = 0.5$ ms.

4.1 Variation of the Synaptic Time Constant and the Firing Threshold.
This section shows the results of applying our method to the spike response functions 2.8, 2.7, and 2.6. Since these response functions imply the Markov or double Markov property, the calculation becomes very simple: $c^m$ and $k^m_i$, whose definitions are given by equations 3.31 or 3.34, are obtained in very simple closed forms. Moreover, $k^m_i$ becomes constant. The transition probability density then depends only on $\lambda(t)$ and $\eta(t)$.

Case 1: Equation 2.8, $u(s) = q e^{-as} H(s)$. In this case, the Markov assumption can be adopted (see section A.2). From equations 3.17 and 3.31, the correlation coefficient is $k^m_{m-1} = e^{-a \Delta t}$, and the conditional variance is $c^m = w^2 q^2 \lambda(t_m)\, \Delta t$. The result obtained by the calculation of equations 3.28, 3.30, and 3.32 and its comparison to the Monte Carlo simulation are shown in Figure 5. Our method exhibits very good agreement with the Monte Carlo simulation.

Case 2: Equation 2.7, $u(s) = q s e^{-as} H(s)$.
In this case, the double Markov assumption can be adopted (see section A.2). From equations A.9 and 3.34, the correlation coefficients are given by $k^m_{m-1} = 2 e^{-a \Delta t}$ and $k^m_{m-2} = -e^{-2a \Delta t}$. The conditional variance is $c^m = w^2 q^2 e^{-2a \Delta t} \lambda(t_m)\, (\Delta t)^3$.

Case 3: Equation 2.6, $u(s) = q (e^{-as} - e^{-bs}) H(s)$.
In this case, the double Markov assumption can be adopted (see section A.2). From equations A.9 and 3.34, the correlation coefficients are given by
\[
k^m_{m-1} = \frac{e^{-2a\Delta t} - e^{-2b\Delta t}}{e^{-a\Delta t} - e^{-b\Delta t}}, \qquad
k^m_{m-2} = \frac{e^{-3a\Delta t} - e^{-3b\Delta t}}{e^{-a\Delta t} - e^{-b\Delta t}} - \left( \frac{e^{-2a\Delta t} - e^{-2b\Delta t}}{e^{-a\Delta t} - e^{-b\Delta t}} \right)^2. \tag{4.2}
\]
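The coefficients of equation 4.2 reduce to those of cases 1 and 2 in the limits $b \to \infty$ and $b \to a$, which gives a useful numerical sanity check (parameter values are assumptions):

```python
import math

# Sketch: the double-Markov coefficients of eq. 4.2 for
# u(s) = q (exp(-a s) - exp(-b s)) H(s), checked against the
# single-exponential (case 1) and alpha-function (case 2) limits.
def coeffs(a, b, dt):
    e = lambda r: math.exp(-r * dt)
    r1 = (e(2 * a) - e(2 * b)) / (e(a) - e(b))
    k1 = r1
    k2 = (e(3 * a) - e(3 * b)) / (e(a) - e(b)) - r1**2
    return k1, k2

a, dt = 0.05, 0.5                      # assumed decay rate and time step

# b -> infinity: recovers case 1 (k1 -> e^{-a dt}, k2 -> 0)
k1, k2 = coeffs(a, 1e6, dt)
assert abs(k1 - math.exp(-a * dt)) < 1e-6
assert abs(k2) < 1e-6

# b -> a: recovers case 2 (k1 -> 2 e^{-a dt}, k2 -> -e^{-2 a dt})
k1, k2 = coeffs(a, a + 1e-7, dt)
assert abs(k1 - 2 * math.exp(-a * dt)) < 1e-4
assert abs(k2 + math.exp(-2 * a * dt)) < 1e-4
```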
Figure 5: Membrane properties for the spike response function 2.8: $u(s) = q e^{-as} H(s)$. (a) Firing probability density. (b) Ensemble mean $\pm$ standard deviation (SD) of $g(v^t)$ when firing is not considered; the center line denotes the mean, and the upper and lower lines denote the mean $\pm$ SD. (c) Probability density $g_u(v^t)$ of the samples that have not yet fired; when a sample neuron fires, it is removed from the ensemble. (d) Input intensity $\lambda(t)$ given by equation 4.1. In a and b, the solid and dashed lines denote theoretical results and Monte Carlo simulation results obtained from 160,000 trials, respectively. The maximum difference between the theoretical and Monte Carlo results is 0.8 for a and 0.02 for b; these errors are too small to be observed, so the two kinds of lines overlap in each panel.
The conditional variance is $c^m = w^2 q^2 (e^{-a\Delta t} - e^{-b\Delta t})^2 \lambda(t_m)\, \Delta t$. The calculation of equations 3.29, 3.33, and 3.35 is done for $b = 1/\tau_s = 3.0$, 1.0, 0.5, 0.3, 0.2, and 0.14. The results are shown in Figure 6. It should be noted that our method does not introduce any approximation except the gaussian approximation discussed in section 3.1. Therefore, the results of the Monte Carlo simulation agree with the theoretical results exactly as the number of Monte Carlo trials increases, as seen in Figure 7.

4.2 Simplification of the Double Markov to a Single Markov Process.
Here, we examine whether the double Markov process can be simplified into a single Markov process. The double Markov process is replaced by a single Markov process whose first- and second-order statistics are the same as those of the double Markov process. The reduced Markov process is
Figure 6: Variation of the firing probability density due to variation of the synaptic time constant $\tau_s = 1/b$. (a–e) Firing probability density when the threshold is 20 (a), 18 (b), 16 (c), 14 (d), and 12 (e). In each panel, the lines are for $b = \infty$ (function 2.8), 3.0, 1.0, 0.5, 0.3, 0.2, 0.14, and $a$ (function 2.7), from top to bottom. (f) Comparison with Monte Carlo simulation results in the case of $\theta = 12$ mV; the $b \to \infty$ and $b \to a$ cases are considered, with solid and dotted lines denoting theoretical and Monte Carlo results, respectively. (g) Input intensity $\lambda(t)$ given by equation 4.1.
described in section A.3. Figure 7 shows the results when this approximation is applied to the response function, equation 2.6. As seen in the left part of the figure, the simplification works fairly well when the threshold is low and the firing probability density is high. However, the error induced by the simplification may not be negligible when the threshold is high and the firing probability density is low, as seen in the right part of the figure. The error becomes large especially when $b$ is small. Note that the approximation is exact when $b \to \infty$, because the case of $b \to \infty$ corresponds to the response function, equation 2.8. Accordingly, the simplification to the single Markov process is not valid in the case of a high threshold value and a large $\tau_s = 1/b$ value.

4.3 Stochastic Spiking Neurons with Resets. Conventional studies (Plesser & Tanaka, 1997; Burkitt & Clark, 1999; and others) based on the backward resolution of the Volterra equation have focused on the first passage time density (FPTD): the event that the neuron
Figure 7: Difference between the double (solid line) and single (dashed line) Markov models, both of which have the same ensemble mean and variance. Input conditions are the same as those used in Figure 6. Monte Carlo simulation results are overwritten by dotted lines. (1a–h) Firing probability density when the threshold is 12; $b$ increases from top to bottom, and as $b$ increases, the error of the simplified model becomes small. (2a–h) Firing probability density when the threshold is 20; the error becomes large as $b$ decreases.
sample reaches the firing threshold for the first time. In those studies, when a sample neuron fires, it is removed from the ensemble; this is called the absorbing threshold (see Figure 3). When one assumes that neuron samples may fire many times, which corresponds to the introduction of a reset after firing, an extension of the FPTD-based analysis can be used only when the input spikes obey a homogeneous Poisson process. The distributions of neuron samples for the first passage and for the second passage are different when the inputs are inhomogeneous. Since the FPTD-based analysis cannot consider multiple crossings of the threshold, it is necessary to prepare many distributions whose Poisson intensities are different (Plesser & Geisel, 1999). Due to this difficulty, the backward approach is inappropriate for dealing with the reset after firing. On the other hand, forward calculation methods, such as our method and the studies based on the Fokker-Planck equation, can consider the reset after firing. In the forward approach, the density of the samples that have fired at time $t_{m-1}$ is moved at time $t_m$ to the density whose potential is $v_{\mathrm{reset}}$ ($= 0$). Although the transition probability density depends on the inhomogeneous spike rate, the density of the membrane potential does not depend on it. Under the single Markov assumption, the firing probability $\rho(t_m)$ and the density of the membrane potential $g_0(v^m)$, which were given by equations 3.27 and 3.28, are modified into
\[
\rho(t_m) = \int_{\theta}^{\infty} g_0(v^m)\, dv^m, \qquad
g_0(v^m) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1})\, g_0(v^{m-1})\, dv^{m-1} + \rho(t_{m-1})\, \delta(v^m), \tag{4.3}
\]
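A minimal sketch of equation 4.3 with assumed constants: the same propagation as in the single-Markov scheme of equation 3.28, but the fired mass $\rho(t_{m-1})$ re-enters as a delta at $v_{\mathrm{reset}} = 0$, so the total ensemble mass stays near 1:

```python
import math
import numpy as np

# Sketch (assumed constants) of eq. 4.3: single-Markov propagation plus
# the delta term that returns the fired mass rho(t_{m-1}) to v_reset = 0.
theta, eta0 = 12.0, 10.0        # threshold and mean potential (mV), assumed
a, dt = 0.05, 0.5
k = math.exp(-a * dt)           # k^m_{m-1} for the exponential response
c = 0.4                         # conditional variance, assumed

edges = np.linspace(-30.0, theta, 601)
dv = edges[1] - edges[0]
v = edges[:-1] + dv / 2         # cell-centered grid strictly below theta
vbar = v - eta0
reset_idx = int(np.argmin(np.abs(v)))   # grid point closest to v_reset = 0
g0 = np.exp(-0.25 * vbar**2)
g0 /= g0.sum() * dv

kernel = np.exp(-0.5 * (vbar[:, None] - k * vbar[None, :])**2 / c) \
         / math.sqrt(2 * math.pi * c)
tail = 0.5 * np.array([math.erfc((theta - eta0 - k * x) / math.sqrt(2 * c))
                       for x in vbar])

for _ in range(100):
    rho = float(np.sum(tail * g0) * dv)   # firing probability, eq. 4.3
    g0 = kernel @ g0 * dv
    g0[reset_idx] += rho / dv             # delta term rho(t_{m-1}) delta(v^m)

mass = g0.sum() * dv
assert abs(mass - 1.0) < 0.02             # reset conserves the ensemble mass
```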
where $\delta(\cdot)$ is the Dirac delta function. The second term of $g_0(v^m)$ is the distribution of the samples that have fired at $t_{m-1}$ and have potential $v^m = 0$ at time $t_m$. Note that the meaning of $g_0(v^m)$ has changed: in sections 3.2.1 and 3.3.1, it denoted the distribution of samples that had not fired until time $t_{m-1}$; here, it denotes the distribution of samples that do not fire in the interval $[t_{m-n-1}, t_{m-1}]$, where $n$ is the Markov order. Under the double Markov assumption, the firing probability $\rho(t_m)$ and the density $g_0(v^m)$, which were given by equations 3.27 and 3.29, are modified into
\[
\rho(t_m) = \int_{\theta}^{\infty} g_0(v^m)\, dv^m, \qquad
g_0(v^m) = \int_{-\infty}^{\theta} \int_{-\infty}^{\theta} g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})\, dv^{m-1}\, dv^{m-2},
\]
\[
g_0(v^m, v^{m-1}) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1}, v^{m-2})\, g_0(v^{m-1}, v^{m-2})\, dv^{m-2} + \rho(t_{m-1})\, g(v^m, v^{m-1}). \tag{4.4}
\]
The joint density of the reset states, $g(v^m, v^{m-1})$, is given by equation 3.13. Figure 8 shows the results of these calculations. For the single Markov process, equations 3.30, 3.31, and 4.3 are used. For the double Markov process, equations 3.33, 3.34, and 4.4 are used. The results are compared with those of the Monte Carlo simulations. The response function is given by equation 2.6, 2.7, or 2.8. This figure shows that our method can precisely calculate the membrane potential density and the firing probability density even in the multiple firing situation.

5 Discussion

5.1 Dependency on the Synaptic Time Constant. Figures 5 and 8 show that the firing probability density significantly depends on the synaptic time
Figure 8: Variation of the firing probability density due to variation of the synaptic time constant $\tau_s \equiv 1/b$. (a–c) Firing probability density when iterative firings and reset are considered; the threshold is 18 (a), 15 (b), and 12 (c). In each panel, the solid lines are for $b = \infty$ (function 2.8), 3.0 (function 2.6), and $a$ (function 2.7), from top to bottom. (d) Means of the membrane potential when firing is not considered; $\tau_s$ is not very important for the mean. (e) Standard deviations (SDs) of the membrane potential when firing is not considered; the SDs decrease as $\tau_s$ increases. (a–e) The results of Monte Carlo simulations over 160,000 trials are overwritten by dashed lines in all of the subfigures. The difference between the theoretical results (solid) and the Monte Carlo results (dashed) is too small to observe in each subfigure; the maximum difference is about 2.0 for the firing probability density and 0.02 for the mean and SD, which will decrease as the number of Monte Carlo trials increases. (f) Input intensity $\lambda(t)$ given by equation 4.1.
constant $\tau_s$. This dependency is prominent especially when the threshold $\theta$ is large and hence the firing probability density is low. The change of the firing probability density is mainly due to the change of the potential variance $\sigma^2(t)$. The mean $\eta(t)$ does not change very much because the integral of the response function is fixed, while the variance $\sigma^2(t)$ decreases as $\tau_s$ increases. Conventionally, the change of the synaptic efficacy $w_j$ has been considered important for neuronal information processing, as in Hebb's prescription. Our theoretical result implies another possibility for neuronal information processing: the modulation of the firing probability density by changing the synaptic time constant.
When the threshold is relatively low in comparison to the intensity of the input spikes, the firing probability density is mainly determined by the input intensity, and the change of the synaptic time constant is not very important. When the threshold is relatively high, on the other hand, the modulation is possible. In cortical areas in which the firing rate is low and coincidental spikes often occur, neurons might process their information by temporally changing the shapes of PSPs. Physiologically, the decrease of the parameter $b = 1/\tau_s$ may correspond to the effect of dopaminergic (DA) modulation (Durstewitz, Seamans, & Sejnowski, 2000). In vitro experiments find that DA reduces the excitatory postsynaptic potential amplitude while prolonging its duration (Cepeda, Radisavljevic, Peacock, Levine, & Buchwald, 1992; Kita, Oda, & Murase, 1999).

5.2 Comparison with Other Methods. Here, we compare our method with existing stochastic process methods. Many stochastic process studies are based on the first-order Fokker-Planck equation. Section 5.2.1 discusses the relation between our method and the existing studies based on the Fokker-Planck equation (Ricciardi & Sato, 1988; Brunel & Hakim, 1999; Brunel, 2000; and many others); in these studies, the stochastic spiking neuron is assumed to employ the response function, equation 2.8. On the other hand, a useful backward resolution method of the Volterra integral equation has been developed by Plesser and Tanaka (1997; see also Burkitt & Clark, 1999). Section 5.2.2 discusses the relation between our approach and the backward approach.
5.2.1 Continuous Description and Diffusion Approximation. The stochastic spiking neuron whose response function is defined by equation 2.8 ($u(s) = q e^{-as} H(s)$) is called Stein's model (Stein, 1967). By using the diffusion approximation, Stein's model, which includes discontinuous jumps in the spike response, can be expressed by the Ornstein-Uhlenbeck (OU) process. The OU process is primarily a continuous stochastic process that assumes infinitesimal jumps in the spike response. Section A.4 is devoted to the formulation of the continuous case whose response function is defined by equation 2.8. We find that our single Markov-gaussian process formulation with response function 2.8 is equivalent to the diffusion approximation of Stein's model, that is, the OU process approach. However, the OU process approach cannot deal with the response function 2.6 or 2.7, which involves a multiple Markov process. In order to consider such higher-order response functions, higher-order time derivatives such as $\partial_t^2 g(v^t)$ would be necessary. In this respect, our formulation is a more general framework than the OU process formulation. In section A.4, we also find that the definition of the firing probability, equation 3.32, is equivalent to that of the Fokker-Planck equation with the boundary condition $g_u(\theta) = 0$ in the continuous-time limit ($\Delta t \to 0$).
Our method is more accurate than the methods based on the Fokker-Planck equation, even in the single Markov case. In order to calculate the density $g_u(v^t)$, other numerical integration methods for the Fokker-Planck equation can be used instead of equation 3.28, and equation 3.32 can be replaced by equation A.39. According to our simulations, however, the error produced by the numerical calculation of equation A.39 is larger than that of equation 3.32.² Under the same experimental condition as in Figure 5, the maximum error produced by equation 3.32 is about 0.8 Hz, while the maximum error by equation A.39 is about 3 Hz. Although both error values decrease as $\Delta t$ becomes small, the error by equation 3.32 is smaller than that by equation A.39 even with a smaller integration time interval, such as $\Delta t = 0.05$ ms. Therefore, our method is more accurate in obtaining the membrane density and the firing probability than the Fokker-Planck equation method. Moreover, according to our simulation, our method using equations 3.28 and 3.32 exhibits good results even if $\Delta t$ is as large as $\Delta t = 0.5$ ms; our method is fairly insensitive to the integration time interval.

5.2.2 Comparison with the Backward Approach. In order to compare our approach with the backward approach (Plesser & Tanaka, 1997), we assume that the density of the samples that fire at time $t_n$ can be described by $\delta(v^n - \theta)\, \rho(t_n)$. This means that the membrane potential at every firing is exactly equal to $\theta$; this assumption is valid when $\Delta t \to 0$. The density after firing, $g_f(v^m)$, which is defined by equation 3.11, is given by
\[
g_f(v^m) = \sum_{n=0}^{m} \int_{-\infty}^{\infty} dv^{m-1} \cdots \int_{-\infty}^{\infty} dv^n\; g(v^m \mid \{v\}_n^{m-1})\, g(v^{m-1} \mid \{v\}_n^{m-2}) \cdots g(v^{n+1} \mid v^n)\, \delta(v^n - \theta)\, \rho(t_n)
= \sum_{n=0}^{m} g(v^m \mid v^n = \theta)\, \psi(t_n)\, \Delta t, \tag{5.1}
\]
where y ( t) ´ r ( t) / D t. In D t ! 0, we nd Z g f ( vt ) D
t 0
0
g ( vt | vt D h ) y ( t0 ) dt0 .
(5.2) 0
Because $g_f(v^t) = g(v^t)$ for $v^t \ge \theta$, and $g(v^t)$ and $g(v^t \mid v^{t'} = \theta)$ are known in the gaussian process, $\psi(t)$ can be obtained by resolving the Volterra integral equation 5.2 from time $t$ to 0 in a backward fashion over the discretized time sequence (see Burkitt & Clark, 2000, and Press, Teukolsky, Vetterling, & Flannery, 1988, for details). The results obtained by our forward-type calculation method become identical to those of the backward approach when $\Delta t \to 0$. Both the forward and backward approaches can deal with spike response functions involving a multiple Markov property (e.g., function 2.6 or 2.7). However, our method is advantageous when we consider the reset effect after firing, as discussed in section 4.3.

Here, the computational complexity is compared. The calculation of $\psi(t)$ from equation 5.2 requires discretizing the time interval $[0, t]$. Let $K$ be the number of discretization steps and $t = K \Delta t_v$. The computational complexity of the backward approach is $O(K^2)$. According to our numerical calculation, the permissible value for $\Delta t_v$ is 0.05 ms. On the other hand, the computational complexity of our forward approach is $O(m M^2)$, where $m$ is the number of discretized time steps, that is, $t = m \Delta t$, and $M$ is the number of numerical integration points for $dv^{m-1}$ in equation 3.28. The permissible value for $\Delta t$ is 0.5 ms, and an appropriate value for $M$ is 1000. Therefore, if we need the calculation for more than 5 s, our forward approach is more efficient. In the double Markov case, however, our forward approach needs a complexity of $O(m M^3)$; the backward approach is more efficient in this case if the reset effects are ignored.

² In both cases, equation 3.28 is used.

5.3 Plausibility of the Stochastic Input. In section 1, we introduced physiological studies suggesting that the information provided to a neuron is expressed by the input spike rates and that the spikes are stochastic events. However, the assumption that the spikes are inhomogeneous Poisson events, that is, that they are mutually independent, is stronger than what the physiological studies suggest. Here, the plausibility of the Poisson assumption is discussed. In the cerebral cortex, one of the important characteristics of the spikes is their irregularity.
According to the existing studies (Gerstein & Mandelbrot, 1964; Shadlen & Newsome, 1994, 1998; Tsodyks & Sejnowski, 1995), the irregularity is considered to be a result of a balanced state. Even when the excitatory inputs are large, the inhibitory inputs increase proportionally to them, so that the mean of the membrane potential is suppressed. By assuming the balanced state, the spike irregularity at high firing rates up to 100–200 Hz (Softky & Koch, 1993) can be explained (Shadlen & Newsome, 1994, 1998). From a physiological data analysis based on a stochastic process method, a random spontaneous firing state is plausible and structurally stable (Amit & Brunel, 1997a). A deterministic mechanism, such as the complexity of connections in neuronal circuits, can also produce the irregularity, similarly to the balanced states. For example, a mean-field approximation theoretically reveals that the asynchronous McCulloch-Pitts model with random and sparse connections and inhibitory recurrents produces Poisson-like spike sequences (van Vreeswijk & Sompolinsky, 1996, 1998). A large-scale computer simulation of an integrate-and-fire neuron model shows that random and sparse connections produce Poisson-like spikes even if the inputs to the neuron are deterministic and stationary (Kubo, Saito, & Amemori, 2000). These studies suggest that a large-scale network like the cerebral cortex is in a balanced state, and hence the spikes are likely to lose their correlations. The response of neurons in such a random but structurally stable balanced state (Tsodyks & Sejnowski, 1995) is very fast in comparison to the response determined by the membrane integration time constant; such rapid responses are considered to be physiologically important (Thorpe, Fize, & Marlot, 1996). The Poisson assumption is not valid when a neuron bursts, since burst spikes are correlated with each other even in a large-scale network. Thus, we believe that our Poisson assumption is valid when the neurons fire tonically and their firing rate is as large as the spontaneous firing rate.

5.4 Extension of the Stochastic Spiking Neuron Model. Here, we briefly discuss an extension of our approach to the analysis of a network of neurons. The spike intensity input from a presynaptic neuron $i$ to a postsynaptic neuron $j$ can be represented by the firing rate of neuron $i$, $\psi_i(t) \equiv \rho(t)/\Delta t$. In the network, $\psi_i(t)$ may vary with time; the firing rate is inhomogeneous. Moreover, in order to observe temporal behaviors of the network, the assumption that the element neurons may fire many times is important. Therefore, we consider our approach suitable for analyzing interesting behaviors of network dynamics, including the stability of the balanced state (Amit & Brunel, 1997a), synaptic plasticity (Kempter, Gerstner, & van Hemmen, 1999; Tsodyks & Markram, 1997), the synfire chain (Riehle, Grün, Diesmann, & Aertsen, 1997; Diesmann, Gewaltig, & Aertsen, 1999), and so on.

Two important issues have not been fully considered in this article. The first is the reduction of a higher-order Markov process into a single one.
As discussed in section 5.2, our approach to a higher-order response function is more expensive than the existing backward approach. The simple reduction scheme discussed in section A.3 is inaccurate especially when the firing probability is low and \beta = 1/\tau_s is small. Second is the extension of the model itself. Based on the assumption of the spike response model (Gerstner, 1998), we have considered the neuron model employing the response function 2.6, 2.7, or 2.8. However, our spike response model may not be sufficient for modeling the neurons in the brain. Recent studies have suggested that taking into account the reversal potentials for excitatory and inhibitory inputs and the partial reset effect after firing will change the stability of the balanced state (Troyer & Miller, 1997). Our future work will include the introduction of such effects into our approach.
Gaussian Process Approach to Spiking Neurons
2787
6 Conclusion
In this article, we theoretically examined the responses of spiking neurons to spike sequences that obeyed an inhomogeneous Poisson process. When inhomogeneous Poisson inputs are provided to the neuron, the stochastic process of the membrane potential becomes a gaussian process. Assuming a Markov property, we derived a calculation method that could precisely obtain the dynamics of the membrane potential and the firing probability density. With frequently used response functions, a single or double Markov property can be assumed. Since we did not introduce any approximation, the theoretical results exhibited good agreement with the results of Monte Carlo simulations. We also found that the firing probability was significantly dependent on the synaptic time constant.

Appendix

A.1 Joint Probability Density. The joint probability density, 3.7, is defined by

g(\{v\}_1^m) \equiv \langle \delta(v^m - v(t_m)) \cdots \delta(v^1 - v(t_1)) \rangle
= \int \frac{d\xi_1}{2\pi} \cdots \int \frac{d\xi_m}{2\pi} \exp\left( i \sum_{k=1}^{m} \xi_k v^k \right) \left\langle \exp\left( -i \sum_{k=1}^{m} \xi_k v(t_k) \right) \right\rangle.   (A.1)
Let the logarithm of the joint characteristic function,

F(\xi_1, \dots, \xi_m; t_1, \dots, t_m) \equiv \left\langle \exp\left( -i \sum_{k=1}^{m} \xi_k v(t_k) \right) \right\rangle,   (A.2)
be expanded to second order in \xi_k. Then we obtain

\ln F(\xi_1, \dots, \xi_m; t_1, \dots, t_m)
= \ln \left\langle 1 - i \sum_{k=1}^{m} \xi_k v(t_k) - \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{m} \xi_k \xi_l\, v(t_k) v(t_l) + O(\xi_k^3) \right\rangle
= -i \sum_{k=1}^{m} \xi_k\, \eta(t_k) - \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{m} \xi_k \xi_l\, \Sigma_{kl} + O(\xi_k^3),   (A.3)
where
\Sigma_{kl} \equiv \langle v(t_k)\, v(t_l) \rangle - \langle v(t_k) \rangle \langle v(t_l) \rangle.
(A.4)
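Equations A.3 and A.4 say that for a gaussian process the cumulant expansion of ln F terminates exactly at second order. As a sanity check, the following sketch (with hypothetical values of \eta and \Sigma for m = 2, not taken from the paper) compares the empirical characteristic function of gaussian samples with the second-order prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-point mean and covariance (stand-ins for eta(t_k), Sigma_kl).
eta = np.array([0.3, -0.1])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.5]])

# Sample v = (v(t_1), v(t_2)) from the gaussian process at two times.
v = rng.multivariate_normal(eta, Sigma, size=400_000)

xi = np.array([0.4, -0.7])          # test frequencies xi_1, xi_2

# Empirical characteristic function F = <exp(-i xi . v)>.
F_emp = np.mean(np.exp(-1j * (v @ xi)))

# Second-order cumulant prediction: ln F = -i xi' eta - (1/2) xi' Sigma xi.
F_theory = np.exp(-1j * (xi @ eta) - 0.5 * (xi @ Sigma @ xi))

print(abs(F_emp - F_theory))        # small: higher cumulants vanish for a gaussian
```

For non-gaussian inputs, the same comparison would expose the size of the neglected O(\xi^3) term.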
When \lambda(t) and \tau_m are not small, the third-order term O(\xi_k^3) can be omitted due to the central limit theorem. Then the joint probability density is approximated as

g(\{v\}_1^m) \approx \int_{-\infty}^{\infty} \frac{d\xi_1}{2\pi} \cdots \int_{-\infty}^{\infty} \frac{d\xi_m}{2\pi} \exp\left( i \sum_{k=1}^{m} \xi_k \left( v^k - \eta(t_k) \right) - \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{m} \xi_k \xi_l\, \Sigma_{kl} \right)
= \int_{-\infty}^{\infty} \frac{d\xi_1}{2\pi} \cdots \int_{-\infty}^{\infty} \frac{d\xi_m}{2\pi} \exp\left( i\, \vec{\xi}\,' (\vec{v} - \vec{\eta}) - \frac{1}{2} \vec{\xi}\,' \Sigma\, \vec{\xi} \right),   (A.5)
where \vec{v} = (v^{t_1}, \dots, v^{t_m})', \vec{\eta} = (\eta(t_1), \dots, \eta(t_m))', \vec{\xi} = (\xi_1, \dots, \xi_m)', and \Sigma is an m-by-m autocorrelation matrix of \vec{v}. The prime (') denotes a transpose. By defining \vec{\xi} = \vec{x} + i\vec{y}, we obtain

g(\{v\}_1^m) = \int_{-\infty}^{\infty} \frac{d\vec{\xi}}{(2\pi)^m} \exp\left( -\frac{1}{2} \vec{x}\,' \Sigma\, \vec{x} - (\vec{v} - \vec{\eta})'\, \vec{y} + \frac{1}{2} \vec{y}\,' \Sigma\, \vec{y} + i\, \vec{x}\,' (\vec{v} - \vec{\eta} - \Sigma \vec{y}) \right),   (A.6)
because \Sigma is positive definite. When \vec{y} = \Sigma^{-1} (\vec{v} - \vec{\eta}), the integral is taken only over the real part \vec{x}. Then,

g(\{v\}_1^m) = \exp\left( -\frac{1}{2} (\vec{v} - \vec{\eta})' \Sigma^{-1} (\vec{v} - \vec{\eta}) \right) \int_{-\infty}^{\infty} \frac{d\vec{x}}{(2\pi)^m} \exp\left( -\frac{1}{2} \vec{x}\,' \Sigma\, \vec{x} \right)
= \frac{1}{\sqrt{(2\pi)^m |\Sigma|}} \exp\left( -\frac{1}{2} (\vec{v} - \vec{\eta})' \Sigma^{-1} (\vec{v} - \vec{\eta}) \right),   (A.7)
where |\Sigma| is the determinant of \Sigma.

A.2 Examples of Multiple Markov Transition Probability Density. In this appendix, the Markov order of the transition probability density for the response functions 2.6, 2.7, and 2.8 is obtained through the calculation of (\Sigma^m)^{-1}. From equation 3.22, the correlation coefficients are \kappa_i^m = (u_{mm})^2 \phi_{im}^m, where (\phi_{ij}^m) \equiv (\Sigma^m)^{-1}.

Case 1: Equation 2.8, u(s) = q e^{-\alpha s} H(s).
From the incremental calculation of equation 3.19, the jth column vector of matrix C^m is calculated as

C_j^m = \frac{1}{q \sqrt{w\, \lambda(t_j)\, \Delta t}} \left( 0, \dots, 0,\; 1,\; -e^{-\alpha \Delta t},\; 0, \dots, 0 \right)'.   (A.8)
The component-wise vector notation expresses that the jth and (j+1)th components are 1 and -e^{-\alpha \Delta t}, respectively, and the others are 0. From equation A.8, we see that \Sigma^{-1} = C C' is a tridiagonal matrix. Therefore, \kappa_i^m = 0 for i < m - 1. This means that the gaussian process defined by \Sigma becomes a Markov process.

Case 2: Equation 2.7, u(s) = q s e^{-\alpha s} H(s).
Since the diagonal elements of U^m, u_{ii} in equation 3.17, are zeros, \Sigma^m becomes singular. In order to avoid the singularity, we redefine the components of U^m as

u_{ij} = \begin{cases} \sqrt{w\, \lambda(t_i)\, \Delta t}\; u(t_{j+1} - t_i), & i \le j \\ 0, & i > j. \end{cases}   (A.9)
From the incremental calculation of equation 3.19, the jth column vector of the inverse of matrix U^m is strictly calculated as

C_j^m = \frac{1}{q \sqrt{w\, \lambda(t_{j-1})\, \Delta t}} \left( 0, \dots, 0,\; e^{\alpha \Delta t},\; -2,\; e^{-\alpha \Delta t},\; 0, \dots, 0 \right)',   (A.10)

where the three nonzero entries occupy the jth, (j+1)th, and (j+2)th positions.
From equation A.10, we see that (\Sigma^m)^{-1} becomes a band diagonal matrix whose bandwidth is five. The transition probability density of this gaussian process is then determined by two past components; the process becomes a double Markov process.

Case 3: Equation 2.6, u(s) = q (e^{-\alpha s} - e^{-\beta s}) H(s).
Similarly to the previous case, the diagonal elements of U^m are zeros, and we use definition A.9. From the incremental calculation of equation 3.19, the jth column vector of the inverse of matrix U^m is strictly calculated as

C_j^m = \frac{1}{q \sqrt{w\, \lambda(t_{j-1})\, \Delta t}} \left( 0, \dots, 0,\; \frac{1}{e^{-\alpha \Delta t} - e^{-\beta \Delta t}},\; -\frac{e^{-\alpha \Delta t} + e^{-\beta \Delta t}}{e^{-\alpha \Delta t} - e^{-\beta \Delta t}},\; \frac{e^{-\alpha \Delta t}\, e^{-\beta \Delta t}}{e^{-\alpha \Delta t} - e^{-\beta \Delta t}},\; 0, \dots, 0 \right)',   (A.11)

where the three nonzero entries occupy the jth, (j+1)th, and (j+2)th positions.
(\Sigma^m)^{-1} becomes a band diagonal matrix whose bandwidth is five, and the stochastic process becomes a double Markov process.

A.3 Simplification Method from Multiple to Single Markov Process. Here, the simplification of the membrane dynamics from a multiple Markov process to a single Markov process is considered. By assuming the single
Markov property, the membrane dynamics are given by

g_0(v^1) = g(v^1);
g_0(v^m) = \int_{-\infty}^{\theta} g(v^m \mid v^{m-1})\, g_0(v^{m-1})\, dv^{m-1} \qquad (m = 2, 3, \dots),   (A.12)

where

g(v^m \mid v^{m-1}) = \frac{1}{\sqrt{2\pi \hat{c}_m}} \exp\left( -\frac{\left( v^m - \eta(t_m) - \hat{\kappa}_{m-1}^m \left( v^{m-1} - \eta(t_{m-1}) \right) \right)^2}{2 \hat{c}_m} \right).   (A.13)
Here, \hat{\kappa}_{m-1}^m and \hat{c}_m are defined as

\hat{\kappa}_{m-1}^m = \frac{\Sigma_{m,m-1}}{\Sigma_{m-1,m-1}}, \qquad \hat{c}_m = \frac{\Sigma_{mm}\, \Sigma_{m-1,m-1} - (\Sigma_{m,m-1})^2}{\Sigma_{m-1,m-1}},   (A.14)
where \Sigma is the autocorrelation matrix of the membrane potential, whose definition is given by equation A.4. In order to simplify a multiple Markov process to a single Markov process, therefore, the mean \eta(t) and the autocorrelation \Sigma should be estimated. Here, an estimation method only for the response function 2.6 is shown, but the cases of equations 2.7 and 2.8 can be treated similarly. For the response function u(s) = q(e^{-\alpha s} - e^{-\beta s}) H(s), the mean of the membrane potential is incrementally calculated as

\eta(t_m) = e^{-\alpha \Delta t} A(t_{m-1}) - e^{-\beta \Delta t} B(t_{m-1}),   (A.15)
where

A(t_m) = w_0 q \int_0^{t_m} e^{-\alpha (t_m - s)} \lambda(s)\, ds, \qquad B(t_m) = w_0 q \int_0^{t_m} e^{-\beta (t_m - s)} \lambda(s)\, ds.   (A.16)

Here, w_0 = \sum_{j=1}^{N} w_j. A(t_m) can be incrementally calculated as

A(t_m) = e^{-\alpha \Delta t} A(t_{m-1}) + w_0 q\, \lambda(t_m)\, \Delta t,   (A.17)
and B(t_m) can be calculated similarly. The autocorrelation matrix \Sigma is calculated by the decomposition
\Sigma_{m,m-1} = e^{-\alpha \Delta t} D(t_{m-1}) - \left( e^{-\alpha \Delta t} + e^{-\beta \Delta t} \right) E(t_{m-1}) + e^{-\beta \Delta t} G(t_{m-1}),   (A.18)
where

D(t_m) = w_2 q^2 \int_0^{t_m} e^{-2\alpha (t_m - s)} \lambda(s)\, ds,
G(t_m) = w_2 q^2 \int_0^{t_m} e^{-2\beta (t_m - s)} \lambda(s)\, ds,
E(t_m) = w_2 q^2 \int_0^{t_m} e^{-(\alpha + \beta)(t_m - s)} \lambda(s)\, ds.   (A.19)
Here, w_2 = \sum_{j=1}^{N} w_j^2. D(t), E(t), and G(t) can be incrementally calculated in a manner similar to equation A.17.

A.4 Continuous Formulation in First-Order Case. Here, we consider the continuous formulation in the first-order case; the response function u(s) = q e^{-\alpha s} H(s) is employed. In this case, our formulation approaches that based on the Ornstein-Uhlenbeck process as the time step \Delta t approaches zero. We confirm the equivalence and compare with the standard Fokker-Planck equation approach. Introducing a higher-order time derivative \partial_t^n g(v^t), the time evolution of the probability density in the higher-order case can also be derived.
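The incremental update of equation A.17 (and its analogs for B, D, E, and G in equations A.18 and A.19) can be verified against the direct discretized integral. A minimal sketch, with a hypothetical intensity \lambda(t) and parameter values chosen only for illustration:

```python
import numpy as np

alpha, q, w0 = 0.5, 1.0, 1.0                 # hypothetical parameters
dt = 0.01
t = np.arange(1, 501) * dt
lam = 5.0 + 4.0 * np.sin(2.0 * np.pi * t)    # hypothetical intensity lambda(t)

# Direct evaluation: A(t_m) = w0 q sum_k exp(-alpha (t_m - t_k)) lambda(t_k) dt.
A_direct = np.array([
    w0 * q * np.sum(np.exp(-alpha * (tm - t[:m + 1])) * lam[:m + 1]) * dt
    for m, tm in enumerate(t)
])

# Incremental form of equation A.17:
# A(t_m) = e^{-alpha dt} A(t_{m-1}) + w0 q lambda(t_m) dt.
A_inc = np.zeros_like(t)
A_inc[0] = w0 * q * lam[0] * dt
for m in range(1, len(t)):
    A_inc[m] = np.exp(-alpha * dt) * A_inc[m - 1] + w0 * q * lam[m] * dt

print(np.max(np.abs(A_direct - A_inc)))      # agreement to machine precision
```

The recursion is exact for the discretized sum because t_m - t_k = (m - k)\Delta t, so each step only multiplies the accumulated history by e^{-\alpha \Delta t} and adds the newest term.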
Density Dynamics. The time derivative of the expectation of any function f(v^t) with respect to the density g(v^t) can be defined (Gardiner, 1985) by

\partial_t \int dv^t\, f(v^t)\, g(v^t)
\equiv \lim_{\Delta t \to 0} \frac{1}{\Delta t} \left\{ \int_{-\infty}^{\infty} dv^{t+\Delta t}\, f(v^{t+\Delta t})\, g(v^{t+\Delta t}) - \int_{-\infty}^{\infty} dv^t\, f(v^t)\, g(v^t) \right\}
= \lim_{\Delta t \to 0} \frac{1}{\Delta t} \left\{ \int_{-\infty}^{\infty} dv^t\, g(v^t) \left[ \int_{-\infty}^{\infty} dv^{t+\Delta t}\, f(v^{t+\Delta t})\, g(v^{t+\Delta t} \mid v^t) - f(v^t) \right] \right\}.   (A.20)
In order to clarify the time dependence, we rewrite v^t as v_0, v^{t+\Delta t} as v_1, g(v^t) as g(v_0, t), and g(v^{t+\Delta t}) as g(v_1, t+\Delta t). For the calculation of

\int_{-\infty}^{\infty} dv^{t+\Delta t}\, f(v^{t+\Delta t})\, g(v^{t+\Delta t} \mid v^t) \equiv \int_{-\infty}^{\infty} dv_1\, f(v_1)\, g(v_1, t+\Delta t \mid v_0, t) \equiv I(v_0, t),   (A.21)
we use a moment expansion,

f(x) = \sum_{n=0}^{\infty} \frac{1}{n!} (x - \eta)^n\, \partial_x^n f(\eta),
\int_{-\infty}^{\infty} g(x \mid z)\, f(x)\, dx = \sum_{n=0}^{\infty} \frac{1}{n!}\, \partial_x^n f(\eta) \int_{-\infty}^{\infty} (x - \eta)^n\, g(x \mid z)\, dx,   (A.22)
where 0! = 1 by definition and \eta is the mean of g(x \mid z) with respect to x. If g(x \mid z) is a gaussian distribution with mean \eta and variance \sigma^2, g(x \mid z) is a symmetric function on both sides of \eta. In this case, the odd terms in equation A.22 vanish, and we obtain

\int_{-\infty}^{\infty} g(x \mid z)\, f(x)\, dx = f(\eta) + \frac{\sigma^2}{2}\, \partial_x^2 f(\eta) + O(\sigma^4).   (A.23)

When we use the response function u(s) = q e^{-\alpha s} H(s), the transition probability density, 3.30, becomes

g(v_1, t+\Delta t \mid v_0, t) = \frac{1}{\sqrt{2\pi B(t) \Delta t}} \exp\left[ -\frac{\left( v_1 - v_0 - A(t, v_0)\, \Delta t \right)^2}{2 B(t) \Delta t} \right],   (A.24)
where A(t, v_0) = \partial_t \eta(t) - \alpha (v_0 - \eta(t)) and B(t) = w_2 q^2 \lambda(t). Then integral A.21 is given by

I(v_0, t) = \int_{-\infty}^{\infty} dv_1\, g(v_1, t+\Delta t \mid v_0, t)\, f(v_1)
= f(v_0 + A(t, v_0)\, \Delta t) + \frac{B(t)\, \Delta t}{2}\, \partial_{v_0}^2 f(v_0) + O((\Delta t)^2).   (A.25)
By using equation A.25, equation A.20 becomes

\partial_t \int dv_0\, f(v_0)\, g(v_0, t) = \int_{-\infty}^{\infty} dv_0\, g(v_0, t) \left[ A(v_0, t)\, \partial_{v_0} f(v_0) + \frac{B(t)}{2}\, \partial_{v_0}^2 f(v_0) \right],   (A.26)
and by integrating by parts,

\int_{-\infty}^{\infty} dv_0\, f(v_0)\, \partial_t g(v_0, t) = \int_{-\infty}^{\infty} dv_0\, f(v_0) \left[ -\partial_{v_0} \left[ A(v_0, t)\, g(v_0, t) \right] + \frac{B(t)}{2}\, \partial_{v_0}^2 g(v_0, t) \right] + \text{constant}.   (A.27)
Finally, the time evolution of the probability density follows:

\partial_t g(v, t) = -\partial_v \left[ A(v, t)\, g(v, t) \right] + \frac{B(t)}{2}\, \partial_v^2 g(v, t).   (A.28)
Because \partial_t \eta(t) \propto e^{-\alpha t}, this term can be ignored when t \gtrsim 1/\alpha = \tau_m ms. Then, equation A.28 becomes

\partial_t g(v^t) = \alpha\, \partial_{v^t} \left[ (v^t - \eta(t))\, g(v^t) \right] + \frac{1}{2} w_2 q^2 \lambda(t)\, \partial_{v^t}^2 g(v^t).   (A.29)
This equation corresponds to the first-order Fokker-Planck equation for the OU process. The corresponding stochastic differential equation can be written as

dv^t = -\alpha (v^t - \eta(t))\, dt + w q \sqrt{\lambda(t)}\, dW(t),   (A.30)
where W(t) is the Wiener process. The straightforward stochastic differential equation of Stein's model is

dv^t = -\alpha (v^t - \eta(t))\, dt + w q\, dn(t),   (A.31)
where n(t) is the Poisson process whose intensity is \lambda(t). The replacement of equation A.31 by equation A.30 is called the diffusion approximation. If the gaussian approximation discussed in section 3.1 is valid, equation A.30 has the same statistics as equation A.31, even though they produce different sample paths.

Firing Probability.
The standard firing probability estimation from the Fokker-Planck equation,

\partial_t g_u(v^t) = \alpha\, \partial_{v^t} \left[ (v^t - \eta(t))\, g_u(v^t) \right] + \frac{1}{2} w_2 q^2 \lambda(t)\, \partial_{v^t}^2 g_u(v^t),   (A.32)
is obtained with the absorbing boundary condition g_u(\theta) = 0 (Abbott & van Vreeswijk, 1993; Brunel & Hakim, 1999). The Fokker-Planck equation can be interpreted as the continuity equation (Brunel, 2000),

\partial_t g_u(v, t) = -\partial_v S(v, t),   (A.33)

where S(v, t) is the probability current through v at time t and takes the form

S(v, t) = -\frac{B(t)}{2}\, \partial_v g_u(v, t) + A(v, t)\, g_u(v, t).   (A.34)
Because the probability current S(\theta, t) can be considered as the firing probability density y(t) \equiv \rho(t)/\Delta t, the firing probability can be calculated from the boundary condition g_u(\theta) = 0 as

\rho(t) = -\frac{B(t)\, \Delta t}{2} \left. \partial_v g_u(v, t) \right|_{v = \theta}.   (A.35)

Here, it is shown that the firing probability \rho(t) obtained by our method is equivalent to equation A.35 when \Delta t \to 0. The firing probability we define is written as

\rho(t; \Delta t) \equiv \int_{\theta}^{\infty} dv_1 \int_{-\infty}^{\theta} g(v_1, t+\Delta t \mid v_0, t)\, g_u(v_0)\, dv_0.   (A.36)
Using the moment expansion with respect to v_0,

I(v_1, \Delta t) \equiv \int_{-\infty}^{\theta} g(v_1 \mid v_0; \Delta t)\, g_u(v_0)\, dv_0
= e^{\alpha \Delta t} \left[ g_u(\eta_0) + \frac{(\sigma_0)^2}{2}\, \partial_v^2 g_u(\eta_0) + O((\sigma_0)^4) \right],   (A.37)
where (\sigma_0)^2 = e^{2\alpha \Delta t} B(t)\, \Delta t and \eta_0 = v_1 + \alpha \Delta t (v_1 - \eta). In the limit \Delta t \to 0, we find
I(v_1, \Delta t) = (1 + \alpha \Delta t)\, g_u(v_1) + \alpha \Delta t\, (v_1 - \eta(t))\, \partial_{v_1} g_u(v_1) + \frac{B(t)\, \Delta t}{2}\, \partial_{v_1}^2 g_u(v_1) + O((\Delta t)^2).   (A.38)
Considering g_u(v_1) = 0 for v_1 \ge \theta, the direct integral with respect to v_1 becomes

\rho(t) = \int_{\theta}^{\infty} I(v_1, \Delta t)\, dv_1 = -\frac{B(t)\, \Delta t}{2} \left[ \partial_{v_1} g_u(v_1) \right]_{v_1 = \theta},   (A.39)

which is equivalent to equation A.35 based on the Fokker-Planck equation.

Acknowledgments
We thank the referees and the editor for their insightful comments in improving the quality of this article. We appreciate helpful comments and suggestions from Masayoshi Kubo, Yoichiro Matsuno, and Yuichi Sakumura.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Physical Review E, 48, 1483–1490.
Amemori, K., & Ishii, S. (2000). Ensemble average and variance of a stochastic spiking neuron model. IEICE Transactions on Fundamentals, E83-A, 575–578.
Amit, D., & Brunel, N. (1997a). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network: Computation in Neural Systems, 8, 373–404.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8, 1185–1202.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. Journal of Computational Neuroscience, 8, 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Burkitt, A. N., & Clark, G. M. (1999). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output. Neural Computation, 11, 871–901.
Burkitt, A. N., & Clark, G. M. (2000). Calculation of interspike intervals for integrate-and-fire neurons with Poisson distribution of synaptic inputs. Neural Computation, 12, 1789–1820.
Cepeda, C., Radisavljevic, Z., Peacock, W., Levine, M. S., & Buchwald, N. A. (1992). Differential modulation by dopamine of responses evoked by excitatory amino acids in human cortex inputs. Synapse, 11, 330–341.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–532.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Dopamine-mediated stabilization of delay-period activity in a network of prefrontal cortex. Journal of Neurophysiology, 83, 1733–1750.
Gardiner, C. W. (1985). Handbook of stochastic methods for physics, chemistry and the natural sciences. Berlin: Springer-Verlag.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Gerstner, W. (1998). Spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 3–49). Cambridge, MA: MIT Press.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Gomi, H., Shidara, M., Takemura, A., Inoue, Y., Kawano, K., & Kawato, M. (1998). Temporal firing patterns of Purkinje cells in the cerebellar ventral paraflocculus during ocular following responses in monkeys I: Simple spikes. Journal of Neurophysiology, 80, 818–831.
Hida, T. (1980). Brownian motion. Berlin: Springer-Verlag.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Physical Review E, 59, 4498–4514.
Kistler, W., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1045.
Kita, H., Oda, K., & Murase, K. (1999). Effects of dopamine agonists and antagonists on optical responses evoked in rat frontal cortex slices after stimulation of the subcortical white matter. Experimental Brain Research, 125, 383–388.
Kubo, M., Saito, K., & Amemori, K. (2000). High variance of ISIs in balanced dynamics of cortical network models (unpublished report). Kyoto: Kyoto University.
Papoulis, A. (1984). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Plesser, H. E., & Geisel, T. (1999). Markov analysis of stochastic resonance in a periodically driven integrate-and-fire neuron. Physical Review E, 59, 7008–7017.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Physics Letters A, 225, 228–234.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1988). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Ricciardi, L. M., & Sato, S. (1988). First-passage-time density and moments of the Ornstein-Uhlenbeck process. Journal of Applied Probability, 25, 43–57.
Rice, S. O. (1944). Mathematical analysis of random noise. Bell System Technical Journal, 23, 282–332.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18, 3870–3896.
Shidara, M., Kawano, K., Gomi, H., & Kawato, M. (1993). Inverse-dynamics encoding of eye movement by Purkinje cells in the cerebellum. Nature, 365, 50–52.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Stein, R. B. (1967). Some models of neuronal variability. Biophysical Journal, 7, 37–68.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proceedings of the National Academy of Sciences U.S.A., 94, 719–723.
Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network: Computation in Neural Systems, 6, 111–124.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Volume 2, Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Tuckwell, H. C., & Cope, D. K. (1980). Accuracy of neuronal interspike times calculated from a diffusion approximation. Journal of Theoretical Biology, 83, 377–387.
van Kampen, N. G. (1992). Stochastic processes in physics and chemistry (2nd ed.). Amsterdam: North-Holland.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371.
Received April 28, 2000; accepted February 13, 2001.
LETTER
Communicated by Ad Aertsen
Dual Information Representation with Stable Firing Rates and Chaotic Spatiotemporal Spike Patterns in a Neural Network Model

Osamu Araki
School of Knowledge Science, Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa 923-1292, Japan

Kazuyuki Aihara
Department of Complexity Science and Engineering, University of Tokyo, Bunkyo-ku, Tokyo 113-8656, Japan, and CREST, Japan Science and Technology Corporation, Kawaguchi, Saitama 332-0012, Japan

Although various means of information representation in the cortex have been considered, the fundamental mechanism for such representation is not well understood. The relation between neural network dynamics and properties of information representation needs to be examined. We examined spatial pattern properties of mean firing rates and spatiotemporal spikes in an interconnected spiking neural network model. We found that whereas the spatiotemporal spike patterns are chaotic and unstable, the spatial patterns of mean firing rates (SPMFR) are steady and stable, reflecting the internal structure of synaptic weights. Interestingly, the chaotic instability contributes to fast stabilization of the SPMFR. Findings suggest that there are two types of network dynamics behind neuronal spiking: internally driven dynamics and externally driven dynamics. When the internally driven dynamics dominate, spikes are relatively more chaotic and independent of external inputs; the SPMFR are steady and stable. When the externally driven dynamics dominate, the spiking patterns are relatively more dependent on the spatiotemporal structure of external inputs. These emergent properties of information representation imply that the brain may adopt a dual coding system. Recent experimental data suggest that internally driven and externally driven dynamics coexist and work together in the cortex.

1 Introduction
Neural Computation 13, 2799–2822 (2001) © 2001 Massachusetts Institute of Technology

Although various possible means of information representation in the cortex have been considered, the fundamental generation mechanism for such representation is not well understood. Various kinds of temporal coding such as synchrony, oscillation, and correlation have been suggested based on the relation between observed behaviors and neuronal spiking patterns
(Ahissar, Ahissar, Bergman, & Vaadia, 1992; Arieli, Sterkin, Grinvald, & Aertsen, 1996; Eckhorn et al., 1988; Gray & Di-Prisco, 1997; Gray & Singer, 1989; Matsumura, Chen, Sawaguchi, Kubota, & Fetz, 1996; Riehle, Grün, Diesmann, & Aertsen, 1997; Sakurai, 1996; Vaadia et al., 1995). How does each of these representation systems encode and decode information? What are the relations among the different kinds of coding? It is difficult to answer these questions because little is known about the fundamental mechanism that, if it were understood, would undoubtedly clarify the dynamical function of information representation in the brain. To clarify the fundamental mechanism, the relation between the dynamics of neural networks and the properties of the information representation they generate needs to be examined. This is because each coding system can be understood to emerge from the dynamics of neural networks, which include input patterns from the outside, interactions among neurons, and properties peculiar to the network. Some experimental data show, for example, that olfactory recognition corresponds to dynamic states in the olfactory system (Freeman, 1987). Further, there are theoretical studies in realistic neural network models on the relation between neural network dynamics and emergent information representation properties. Abeles (1991) and Abeles, Bergman, Margalit, and Vaadia (1993), for example, showed that synchronous firing of synfire chains emerges and remains stable in a layered convergent-divergent neural network. Aertsen and Gerstein (1991) and Aertsen, Erb, and Palm (1994) showed that cell assemblies (Hebb, 1949) are generated through functional connectivity between neurons in a spiking neural network model. The concept of the cell assembly has been discussed recently in the context of dynamic information representation (Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996; Sakurai, 1996; Watanabe, Aihara, & Kondo, 1998).
The purpose of this study was to clarify the information representation properties of spatial patterns of mean firing rates (SPMFR) and spatiotemporal spike patterns in an interconnected spiking neural network model. Since this concerns the fundamental mechanism by which spatial patterns of firing rates emerge from spatiotemporal spike patterns, we also examined the information structure carried by each spatial pattern of firing rates and the mechanism of generating firing rates from spatiotemporal spike patterns. We examined spatiotemporal spiking patterns and their spatial patterns of firing rates in an interconnected neural network model composed of spiking neurons. When Poisson process asynchronous patterns or constant inputs were introduced into our model, the spiking patterns of each neuron were aperiodic and chaotic. Steady and stable SPMFR were also observed, however. It is remarkable that the spatial patterns had a high correlation with an internal structure pattern, that is, the pattern of the sum of effective synaptic weights for each neuron. When the frequency of external Poisson process inputs was low, however, neither chaotic spatiotemporal spike patterns nor stable spatial patterns of firing rates appeared.
This suggests there are two types of network dynamics behind neuronal spiking, which we term internally driven dynamics and externally driven dynamics. When the decay time constant of the membrane potential \tau increases, the decay time constant of the refractory effect \tau_{ref} decreases, or the frequency of external inputs increases, the internal activity of each neuron increases easily. Since the internal activity thus reaches its threshold more quickly, the neurons fire more frequently, and, due to mutual interactions, the output pattern of spikes becomes increasingly independent of the external inputs. We consider the internally driven dynamics to be dominant in this scenario. When internally driven dynamics are dominant, the SPMFR are steady and stable despite chaotic spatiotemporal spike patterns. In contrast, when \tau decreases, \tau_{ref} increases, or the frequency of external inputs decreases, the spiking patterns become more dependent on the external input patterns. The effects of mutual interactions between neurons are not as strong as when the internally driven dynamics are dominant. We consider the externally driven dynamics to be dominant in this scenario. These emergent properties of information representation imply that the neural network may adopt a dual coding system. Recent experimental data suggest that internally driven and externally driven dynamics may coexist and work together in the cortex.

2 A Spiking Neural Network Model
We investigated a simple neural network model composed of spiking neurons to analyze the nonlinear dynamics of spatiotemporal spiking patterns and mean firing rates. This model is based on the chaotic neural network model (Aihara, Takabe, & Toyoda, 1990), but the output function is the Heaviside function. As such, it is an application and extension of the Nagumo-Sato neuron model (Nagumo & Sato, 1972; Aihara et al., 1990). Each neuron receives input spikes from presynaptic neurons and sends spikes to all of its postsynaptic neurons when it fires. After it fires, the neuron becomes more refractory to further firing for a time. Assuming that x_i represents the internal activity of each neuron and y_i represents an output value, the dynamics of this model are represented by the following equations:
x_i(t+1) = \sum_{j=1}^{N} \sum_{r=0}^{t - \Delta_{ij}} \exp\left( -\frac{r}{\tau} \right) s_j\, w_{ij}\, y_j(t - r - \Delta_{ij}) + \sum_{r=0}^{t} \exp\left( -\frac{r}{\tau} \right) A_i(t - r) - a \sum_{r=0}^{t} \exp\left( -\frac{r}{\tau_{ref}} \right) y_i(t - r) - \theta,   (2.1)
y_i(t) = \begin{cases} 1 & \text{if } x_i(t) > 0 \\ 0 & \text{otherwise,} \end{cases}   (2.2)
where N is the total number of cells, \tau is the decay time constant of the membrane potential, s_j is the discrimination parameter as to whether cell #j is excitatory or inhibitory (s_j = 1.0 or -3.0, respectively), w_{ij} is the synaptic weight from cell #j to cell #i, \Delta_{ij} is the delay time of spike transmission from cell #j to cell #i, A_i(t) is the external input to cell #i, a is the scaling parameter of the refractory effect, \tau_{ref} is the decay time constant of the refractory effect, and \theta is the resting threshold for firing. For simplicity, time is assumed to be discrete. Each cell has either an excitatory or an inhibitory property, which determines the value of s_j. In the simulations, about 80% of all cells are excitatory and the rest (20%) are inhibitory, as in the cortex (Abeles, 1991; Douglas & Martin, 1998).

The network features of our model are as follows. The probability of forming more than one synapse between two pyramidal cells in the cortex is generally estimated to be between 0.1 and 0.3 (Liley & Wright, 1994; Braitenberg & Schüz, 1991). Thus, in our model, the neurons are assumed to be partially, not fully, connected, as in the cortex. The rate of connections is about 20% in the computer simulations; that is, each neuron receives synaptic inputs from about 20% of all neurons except itself (w_{ii} = 0). Transmission delays are randomly selected from limited, discrete values, from 1 to 6 here. The synaptic weights are generally asymmetrical, that is, w_{ij} \ne w_{ji}, and take random values uniformly distributed in [0.8, 1). For simplicity, synaptic plasticity is not considered here; thus, the synaptic weights do not change. Our model does not assume any structure in the synaptic weights, since we are primarily concerned with emergent properties without special considerations such as an autoassociative matrix in the synaptic weights.

3 Computer Simulations
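The model of equations 2.1 and 2.2 can be transcribed directly into a simulation. The following sketch is a deliberately literal (and therefore slow) implementation for a small hypothetical instance; the network size, input amplitude, and simulation length are chosen only for brevity, while \tau, \tau_{ref}, a, and \theta follow the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small hypothetical instance of equations 2.1-2.2.
N, T = 50, 150
tau, tau_ref, a, theta = 5.0, 3.0, 50.0, 4.0
s = np.where(rng.random(N) < 0.8, 1.0, -3.0)           # ~80% excitatory, 20% inhibitory
w = rng.uniform(0.8, 1.0, (N, N)) * (rng.random((N, N)) < 0.2)
np.fill_diagonal(w, 0.0)                               # w_ii = 0: no self-connections
delay = rng.integers(1, 7, (N, N))                     # delays Delta_ij in {1,...,6}
A_ext = np.full((T, N), 5.0)                           # constant input above theta (hypothetical)

y = np.zeros((T, N))                                   # outputs y_i(t)
for t in range(T - 1):
    x = np.zeros(N)
    for r in range(t + 1):
        # recurrent term: spikes y_j(t - r - Delta_ij), weighted by exp(-r/tau) s_j w_ij
        tj = t - r - delay                             # (N, N) source times
        valid = tj >= 0
        past = np.where(valid, y[np.clip(tj, 0, None), np.arange(N)], 0.0)
        x += np.exp(-r / tau) * np.sum(w * s[None, :] * past, axis=1)
        x += np.exp(-r / tau) * A_ext[t - r]           # external-input term
        x -= a * np.exp(-r / tau_ref) * y[t - r]       # refractory term
    y[t + 1] = (x - theta > 0).astype(float)           # Heaviside output, equation 2.2

print(y.mean())                                        # mean firing rate of the network
```

In practice the full-history sums would be replaced by recursively updated exponential traces (as in equation A.17 of the preceding article), which turns the per-step cost from O(t) into O(1).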
Information representation properties of SPMFR and spatiotemporal spike patterns in the model are rendered by computer simulation.

3.1 Aperiodic Spatiotemporal Firing. Even when the input to the neural network model is constant, the spatiotemporal structure of the generated spikes appears aperiodic. Parameter values are as follows: N = 225, a = 50.0, \tau = 5.0, \tau_{ref} = 3.0, and \theta = 4.0. These values are used unless otherwise indicated. Concerning the value of \tau, an order of 10 ms has been reported as the membrane time constant in cortical neurons (König, Engel, & Singer, 1996). Thus, \tau = 5 assumes that one unit of time in this simulation corresponds to a few milliseconds.

We studied the effect of a small perturbation to measure the instability of these dynamics. Assuming a state-space of 1800 dimensions, namely, the product of 225 neurons and 8 states, we calculated the rate of expansion
after a small perturbation was given. The 8 states consist of the refractory effect (the third term in equation 2.1), the internal activity (the other terms in equation 2.1), and the delay queues of 6 units of time. When a perturbation d_0(0) was introduced to the initial state in the state-space, we represented the difference vector between the perturbed orbit and the unperturbed standard orbit after time t_0 as d_0(t_0). We defined the next perturbation d_1(0) as

d_1(0) = |d_0(0)| \cdot \frac{d_0(t_0)}{|d_0(t_0)|}.
The rate of expansion was calculated as

\lambda = \frac{1}{n t_0} \sum_{i=0}^{n-1} \log \frac{|d_i(t_0)|}{|d_i(0)|},
where n is the number of iterations. This indicator corresponds to the largest Lyapunov exponent in dynamical systems (Wolf, Swift, Swinney, & Vastano, 1985). We assume t0 = 1, n = 2000 (the first 1000 of 3000 iterations are omitted as a transient period), and |d0(0)| = 1.0. Two types of patterns were used as external inputs in our model: Poisson process inputs and constant inputs. If the interspike intervals of a spike train follow an exponential probability density in continuous time, the train follows a Poisson process. The term Poisson process inputs denotes a temporal series of input signals in which the continuous-time interspike intervals of a Poisson process are rounded up to the smallest integers not less than the corresponding values. Poisson process inputs are considered because most cortical neurons fire with high variability (Softky & Koch, 1993). To trigger spikes, the value of Ai is set above the resting threshold when the inputs follow Poisson process patterns. We refer to the interval between external input spikes simply as the interspike interval (ISI). For simplicity, the mean interval and the amplitude of each signal are uniform over all neurons in each simulation. The term constant inputs indicates that Ai(t) is constant for all i and t. Since there is no spatiotemporal structure in constant inputs, the effects of the spatiotemporal structure of Poisson process inputs can be recognized by comparing the results from these two types of inputs. Figure 1 shows mean values of λ for different pairs of networks and input patterns as t or tref changes. When t is changed, tref is fixed at 3.0, and when tref is changed, t is fixed at 5.0. Across networks, the sets of synaptic weights and transmission delays differ because they are chosen randomly in independent procedures. For Poisson process inputs, the input patterns are also chosen independently.
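As a concrete sketch of this perturbation-renormalization procedure (Wolf et al., 1985), the following applies it to a one-dimensional map rather than to the 1800-dimensional network state; the function name and the logistic-map example are illustrative, not from the paper:

```python
import numpy as np

def largest_lyapunov(f, x0, d0=1e-8, t0=1, n=2000, transient=1000):
    """Estimate the largest Lyapunov exponent of the map `f` by the
    perturbation-renormalization method described in the text: evolve a
    reference orbit and a perturbed orbit for t0 steps, record the log
    growth of their separation, renormalize the perturbation back to
    size d0, and average."""
    x, xp = x0, x0 + d0
    logs = []
    for i in range(transient + n):
        for _ in range(t0):
            x, xp = f(x), f(xp)
        diff = xp - x
        d = abs(diff)
        if d == 0.0:                      # orbits collided numerically
            diff, d = d0, d0
        if i >= transient:                # discard the transient period
            logs.append(np.log(d / d0))
        xp = x + d0 * diff / d            # renormalize to size d0
    return np.mean(logs) / t0

# fully chaotic logistic map, whose exact exponent is log 2 ~ 0.693
lam = largest_lyapunov(lambda x: 4.0 * x * (1.0 - x), x0=0.3)
```

The same loop applies to the network model once `f` is replaced by one step of the network update acting on the full state vector.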
Mean values are shown, and the error bars indicate the standard deviation (SD) (n = 100). The positive values of λ indicate the orbital instability of this network model. As shown in Figure 1, the instability increases when t increases, but it decreases toward zero when tref increases, suggesting that each neuron's integrate-and-fire effect contributes toward instability and its refractoriness contributes toward stability.

Figure 1: The mean values of λ (n = 100) for two kinds of input patterns: Poisson process (average ISI = 50, Ai = 5) and constant (Ai = 1.0 for all i).

Not every perturbation expands in these network dynamics. If a given perturbation is too small to affect the spatiotemporal structure of spikes, λ becomes less than zero because the effects of the perturbation, as well as the internal activities, decay. We therefore cannot conclude that the unstable behavior is due strictly to sensitive dependence on initial conditions, which is peculiar to deterministic chaos. In this simulation, we assumed that the perturbation is large enough to affect the internal activities of the neurons, which leads to changes in the spatiotemporal spike patterns (|di(0)| = 1). Random fluctuation in the somatic membrane potential is commonly known as synaptic noise in actual nervous systems. We expect that this perturbation amplitude is not too large in comparison to the synaptic noise in the brain. Therefore, we consider the property of instability triggered by such a perturbation to be plausible for a neural network model. The Nagumo-Sato neuron model (Nagumo & Sato, 1972) almost always generates a periodic attractor in response to a constant input. A network consisting of many such neurons, however, may show a very long period of transient chaos that increases hyperexponentially with the number of cells (Crutchfield & Kaneko, 1988). Thus, the chaotic behaviors in our model are fundamentally transient phenomena.

3.2 Stable Spatial Pattern of Mean Firing Rates. Despite chaotic spatiotemporal spike patterns, stable mean firing rates were observed in our network model. When different Poisson process patterns triggered by different random seeds are input to a neural network with the same parameter values, the SPMFR are similar (see Figure 2, right).
To show this similarity quantitatively, correlation coefficients between SPMFR were calculated when different patterns were input into the same networks and when the same patterns were input into different networks (see Figure 3).
Figure 2: Two examples of spatiotemporal firing patterns and spatial patterns of mean firing rates, in response to two kinds of Poisson process input patterns (average ISI = 20 and Ai = 5.0 for both inputs). The mean firing rates are calculated as {number of spikes} / {mean period (= 200)}. The spatial patterns of mean firing rates look similar (the correlation coefficient is 0.975).
When a neural network structure, that is, the synaptic weights and transmission delays, is fixed, the mean correlation coefficients of SPMFR are approximately 0.9 for Poisson process inputs and greater than 0.98 for constant inputs. Each correlation coefficient is calculated between one standard SPMFR and the SPMFR at a different average ISI of the Poisson process inputs or a different amplitude Ai of a constant input. In the case of Poisson process inputs (Ai = 5), the standard point is at a mean ISI of 50, and mean values and deviations are calculated (see Figure 3A-a). In the case of constant inputs, correlation coefficients between the spatial pattern at Ai = 2.5 and those at other values (Ai = 1, 2, . . . , 7) for a fixed network are calculated repeatedly (see Figure 3A-b). The period during which the firing rates are calculated is T = 300. This value is used for the mean period of SPMFR unless otherwise indicated. As seen in Figure 3A-a, the variance marked by the error bars increases gradually as the average ISI increases. This may be because the entire firing-rate level is lowered and the structure of the network may not be influential. Spatial averages of the mean firing rates (MFR) are shown at the bottom of Figure 3A and indicate that the spiking activity increases when the input frequency or input amplitude increases. The mean period of MFR is T = 300; the same value is used in the computer simulations below.

Figure 3: Correlation coefficients of spatial patterns of mean firing rates (SPMFR). (A) Correlations for different input patterns. (B) Correlations in different networks with the same input pattern. The arithmetic mean of the correlation coefficients (n = 100) is shown, and SD is indicated by the error bars at each mean ISI value or Ai value.

For Poisson process inputs, while the mean activity decreases 42.7% from an average ISI of 10 to 100, the mean correlation coefficients decrease only 8.2%. For constant inputs, the mean values increase 2.6 times, and the correlation coefficients change only slightly. In other words, the correlation is nearly maintained even though the firing rates change greatly. This suggests that information is encoded in the spatial pattern of the firing rates rather than in each separate firing rate. To summarize, for both input patterns, correlation coefficients are close to 1 when the networks are the same, even if the frequency, magnitude, or Poisson process patterns of the inputs differ. This suggests that the SPMFR are similar even though the spatiotemporal external information differs.
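The similarity measure used throughout this section is the ordinary Pearson correlation between two N-dimensional vectors of per-neuron mean firing rates. A minimal sketch (the function names are ours, not the paper's):

```python
import numpy as np

def mean_firing_rates(spikes, T):
    """Spatial pattern of mean firing rates (SPMFR): per-neuron spike
    count over a window of T time steps, divided by T.
    spikes: (T, N) binary spike array."""
    return spikes[:T].sum(axis=0) / T

def spmfr_correlation(rates_a, rates_b):
    """Pearson correlation coefficient between two SPMFR vectors,
    the similarity measure used throughout the paper."""
    return float(np.corrcoef(rates_a, rates_b)[0, 1])
```

Comparing the SPMFR of two runs then reduces to one call of `spmfr_correlation` on their rate vectors.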
Figure 4: Correlation coefficients between the spatial pattern of mean firing rates and the Wi pattern. The arithmetic mean of the correlation coefficients (n = 100) is shown, and SD is indicated by the error bars.
When networks are different and input patterns are the same, however, the correlation coefficients are close to zero. At each average ISI or input amplitude Ai, the correlation coefficients of SPMFR with the same input to different networks are calculated repeatedly (n = 100). Figure 3B-a shows that the correlation coefficients for Poisson process inputs are also close to zero except when the average ISI is large enough; the higher correlation coefficients around an ISI of 100 are due to the relatively greater effect of the same input pattern at the lower input frequency. The correlation coefficients for constant inputs are close to zero for different amplitudes Ai (see Figure 3B-b). The implication is that the spatial pattern of average firing rates depends on the network structure, such as the synaptic weights and transmission delays. If the similarity of SPMFR depends on the network, we can expect the SPMFR to reflect the network structure. We show below that the SPMFR mainly reflects the spatial pattern of the sum of synaptic weights, Wi (i = 1, . . . , N), defined as

\[ W_i = \sum_j s_j w_{ij}, \tag{3.1} \]
where sj and wij are the same variables as in equation 2.1. One hundred pairs of different networks and input patterns were randomly chosen, and correlation coefficients between Wi (i = 1, . . . , N) and the SPMFR were calculated (see Figure 4). The correlations are as high as those in Figure 3A, and they increase as the input frequency (see Figure 4a) or the amplitude Ai (see Figure 4b) increases. When Ai increases, the value of the correlation coefficient gradually saturates, as shown in Figure 4b. Thus, we can conclude that the correlations between SPMFR are mainly due to the similarity between the SPMFR and the Wi pattern. The fact that correlations with the Wi pattern are lower than those between SPMFR suggests the existence of other factors that contribute to the similarity in SPMFR. Since similar tendencies are observed both in the correlations among SPMFR and in those between the SPMFR and the Wi pattern, we show below only the correlations with the Wi pattern.

3.3 Effects of Integrate-and-Fire Mechanism. In the simulations, we observed steady and stable SPMFR even though the spatiotemporal spike patterns were chaotic. How are such steady and stable SPMFR generated? At the neuron level, each neuron can work as both an integrator and a coincidence detector (Abeles, 1982), depending on the value of t in the leaky-integrator mechanism. Here, to approach the dynamics of activity in each neuron, we examine the variation in SPMFR when the value of the membrane constant t is changed. Figure 5A shows the correlation coefficients between the SPMFR and the Wi pattern driven by Poisson process inputs (mean ISI = 50, Ai = 5) or a constant input (Ai = 1) for various values of t. The arithmetic mean of the correlation coefficients (n = 100) is shown, and SD is indicated by the error bars. Three error bars for each t indicate different mean periods (T = 50, T = 100, T = 300) during which the firing rates are calculated. In the case of Poisson process inputs, when t decreases, the correlation decreases too. When t is small, the correlation remains low irrespective of the mean period. With the constant input, no firing occurs when t is too small (t = 1, 2, and 3, for example). When t increases beyond a threshold, mean firing rates suddenly become high. This is because the constant input is common to all neurons, so every neuron tends to fire if a single neuron can be fired. To clarify whether the lower correlation with the Wi pattern is due to lower integrate-and-fire effects or to lower firing frequencies, we examine the correlation as the input frequency increases, fixing the value of t at 1.
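In matrix form, equation 3.1 is a single product; a minimal sketch, assuming `W[i, j]` holds w_ij and `s[j]` the excitatory/inhibitory sign:

```python
import numpy as np

def wi_pattern(W, s):
    """Equation 3.1: W_i = sum_j s_j * w_ij, the signed sum of synaptic
    weights onto each cell.  W[i, j] is the weight from cell j to cell i;
    s[j] is +1.0 (excitatory) or -3.0 (inhibitory)."""
    return W @ s
```

The correlation of this vector with an SPMFR vector is then an ordinary Pearson coefficient (e.g., via np.corrcoef).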
As shown in Figure 5B-a, the correlation with the Wi pattern decreases (or increases) as the input frequency decreases (or increases) in spite of a small t. This implies that the lower correlation in Figure 5A is due mainly to the lower firing frequencies. Although higher frequencies (shorter ISI) produce higher correlations, the frequency does not remarkably affect the convergence speed as the mean period changes. In the case of constant inputs, increasing Ai hardly hastens the convergence either (see Figure 5B-b). Figure 5A also shows that when t is greater, the spatial pattern of firing rates converges more quickly to the Wi pattern (see how the correlation coefficients vary with the mean period), and when t is smaller, it converges more slowly.

Figure 5: Correlation coefficients between each spatial pattern of firing rates and each Wi pattern in different (pairs of) networks (and input patterns in the case of Poisson process inputs). (A) Correlation coefficients as the value of t changes. (B) Correlation coefficients as the input changes when t = 1. The bottom figures indicate the spatial average of the mean firing rates (MFR). The points and bars are shifted slightly horizontally to avoid overlap.

3.4 Effects of Refractoriness. Refractoriness gives rise to complex behaviors in our model, as conventional chaotic neural network models show (Aihara et al., 1990; Adachi & Aihara, 1997). To clarify the relation between refractoriness and the stability of SPMFR, we examined the effects of changing the value of the parameter tref. The value of this parameter indicates the strength of the refractory effect on each neuron and also affects the stability of the dynamics, as shown in Figure 1 (right). The correlation coefficients between the SPMFR and the Wi patterns driven by Poisson process inputs (mean ISI = 50, Ai = 5) or a constant input (Ai = 1) for various values of tref are shown in Figure 6A. Three error bars for each tref indicate different mean periods (T = 50, T = 100, T = 300) during which the firing rates are calculated. The difference in correlations between T = 50 and T = 300 increases monotonically with tref, which means that the speed of convergence lessens as tref increases. We also examined the correlation as the input frequency increases, fixing the value of tref (see Figure 6B). Although higher frequencies yield higher correlations, the frequency does not remarkably affect the convergence speed.

Figure 6: Correlation coefficients between each spatial pattern of firing rates and each Wi pattern. (A) Correlation coefficients for different values of tref. (B) Correlation coefficients for different values of (a) the input frequency or (b) the amplitude when tref is fixed: (a) tref = 5; (b) tref = 7, respectively. The bottom figures indicate the spatial average of the mean firing rates (MFR). The points and bars are shifted slightly horizontally to avoid overlap.
Summarizing, the speed of convergence increases when t increases or tref decreases; likewise, the dynamics become more unstable when t increases or tref decreases, as shown in Figure 1. We can conclude that the stability affects the convergence; that is, fast convergence needs the instability of the network dynamics. The chaotically unstable behavior contributes to quick representation of internal information in a spatial pattern of firing rates.

3.5 In the Case of Low Input Frequency. When the frequency of the external inputs is low, neither chaotic spatiotemporal spike patterns nor stable spatial patterns of firing rates appear. Here, correlation coefficients between the SPMFR and the Wi pattern are examined when the frequency of the external Poisson process inputs is low. When the frequency decreases, the similarity between the SPMFR and the Wi pattern decreases, and the variance becomes greater. The correlation coefficients between the SPMFR and the Wi pattern for 100 different pairs of networks and Poisson process input patterns are shown in Figure 7A. The values of the correlations continue from those shown in Figure 4a as the average ISI increases beyond 100. An output pattern of this network in response to a lower-frequency Poisson process input seems to be a correspondingly transformed spatiotemporal pattern in the sense described below. For example, when low-frequency Poisson process input patterns with lengths of 100 are repeatedly applied, the output spikes in the same period also tend to repeat. The autocorrelations of the output spikes and raster plots of the externally input spikes and output spikes for two cases (mean ISI = 220 and 120) are shown in Figure 7B. This autocorrelation is calculated as follows:
1. Fix the spatiotemporal template pattern within a time window from t = 200 to 300, and slide it along time forward and backward by Δt = ±200.

2. Count a coincidence whenever a spike in the slid pattern coincides with another spike at each time point.

3. Normalize the counts by the number of spikes within the time window.

Figure 7: Effects of low average frequencies of Poisson process input patterns. (A) Correlation coefficients between the spatial pattern of mean firing rates and the Wi pattern. (B) Autocorrelations of firing spikes and raster plots of external inputs and output spikes. The method of calculation is described in the text.

It should be noted that the periodicity is due not only to spikes generated directly by the periodic external inputs but also to other responding spikes (see Figure 7B, bottom). In the case shown in Figure 7B-a, the ratio of spikes triggered directly by the external inputs is 88.6% of all the output spikes. In the case shown in Figure 7B-b, this ratio is 23.1%. These repetitive output patterns suggest that an input spatiotemporal pattern is transformed into a corresponding spatiotemporal pattern with some preservation of the input structure. The corresponding spatiotemporal pattern does not depend much on the past; we may be able to call this property context free. When the input frequency increases, however, the mapping function does not work well (see Figure 7B-b). This is due to the internal dynamics of the network activated by a higher-frequency input. These findings suggest that context-dependent behaviors and context-free behaviors can be mixed among spatiotemporal spiking patterns in response to higher-frequency inputs.

4 Discussion

4.1 Internal Mechanism. Here, we discuss the fundamental mechanism suggested by our simulation results. Let us first summarize the results. As t (or tref) increases, the correlation of SPMFR with the Wi pattern tends to increase (or decrease) gradually. When t increases or tref decreases, λ tends to increase, and the speed of convergence of SPMFR increases. In addition,
when the input frequency is low, the correlation of SPMFR with the Wi pattern is also low. These phenomena lead us to hypothesize two kinds of network dynamics behind spatiotemporal neuronal spikes. We call these internally driven dynamics and externally driven dynamics. The former refers to the dynamics of spikes driven by the internal activities of neurons in the focused network; the latter refers to the dynamics driven directly by inputs that come from outside the focused network. When t increases, tref decreases, or the frequency of external inputs increases, the internal activity increases in each neuron. Since each neuron becomes better able to fire the other neurons, the neurons fire more frequently through mutual interactions rather than through external inputs. Consequently, the output spikes become increasingly independent of the external inputs. We assume dynamics that evoke these frequent mutual interactions between neurons, and we call them internally driven dynamics. When the internally driven dynamics are active, the spikes look more chaotic and independent of external inputs, with steady and stable spatial patterns of mean firing rates. These features can be considered to emerge from the internally driven dynamics. When t decreases, tref increases, or the frequency of external inputs decreases, the spiking patterns become more dependent on the external input patterns. We assume that this is due to the externally driven dynamics. However, when the frequency of the external inputs increases, the internally driven dynamics strengthen. This is because each neuron receives more coincident spikes from the other neurons, and the internal activity increases even if t is small. This explains why high correlations of SPMFR with the Wi pattern are obtained in spite of small t (see Figure 5B).
To examine the tendency of inputs before the generation of spikes, we observe a reverse correlation (Mainen & Sejnowski, 1995; Softky, 1995), which is a spike-triggered average of the internal activity in response to external inputs and presynaptic spikes. Figure 8 shows reverse correlations for t = 3, 5, and 7, when a set of Poisson process inputs is given. The broken line represents the reverse correlation from input stimuli coming only from the other neurons (excluding direct external inputs). When t is small, the external input effect is so strong that a sharp peak evokes firing. Since the populational firing rates increase as t increases, the amplitude of the reverse correlations tends to increase. The reverse correlation before firing shows gradual and then rapid increases when t = 5 and 7. This is because the mutual interactions among neurons make the input patterns fluctuate, which affects the generation of firing. The two "blunt hills" before and after firing (see Figures 8b and 8c) suggest that there are groups of neurons that give stimuli to and take stimuli from each other despite the aperiodic firing, which means that each neuron's firing affects its own inputs indirectly. The "valley" between the hills (see Figure 8c) is presumably due to the refractory effect of presynaptic neurons.
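A reverse correlation of this kind is a spike-triggered average; the following is a minimal sketch (array layout and window sizes are illustrative, chosen to match the −30..+20 time axis of Figure 8):

```python
import numpy as np

def reverse_correlation(drive, spikes, before=30, after=20):
    """Spike-triggered average of an input drive: for every spike of every
    neuron, collect that neuron's drive from `before` steps before the
    spike to `after` steps after it, and average over all spikes.
    drive, spikes: (T, N) arrays; spikes is binary."""
    T, N = spikes.shape
    segments = []
    for t, i in zip(*np.nonzero(spikes)):
        if before <= t < T - after:            # keep only full windows
            segments.append(drive[t - before:t + after + 1, i])
    return np.asarray(segments).mean(axis=0)   # length before + after + 1
```

Passing only the recurrent component of the drive as `drive` gives the broken-line curve of Figure 8; passing the total drive gives the solid line.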
Figure 8: Reverse correlations (solid lines) for t = 3, 5, and 7. The external inputs are Poisson process with mean ISI = 50. The broken line represents the reverse correlation of the internal activity excluding the external inputs.
The correlation coefficients between the SPMFR and the Wi pattern are 0.42, 0.89, and 0.82 for t = 3, 5, and 7, respectively. Thus, this fluctuation effect can be expected to increase the similarity between the SPMFR and the Wi pattern, as the chaotic dynamics do. In other words, the sharp peak represents the stimuli that produce the externally driven dynamics, and the blunt hill before firing represents those that generate the internally driven dynamics. Figure 9 shows reverse correlations depending on tref for a constant input (the rate of connections is 5%). Although the external input is constant, we observe a slight increase immediately before firing, which is presumably due to the internally driven dynamics. When the value of tref is larger, the internal activity before firing tends to be suppressed because of the longer refractory effect of presynaptic neurons. The externally driven dynamics do not appear explicitly for constant inputs, while both the externally driven and internally driven dynamics appear as the sharp and blunt peaks for Poisson process inputs (see Figure 8). This may be part of the reason that remarkable differences in the reverse correlations were not observed for constant inputs even when the correlation coefficients differed. These reverse correlations show that changing the parameter values of t and tref changes the dynamics and thereby the relation between the inputs and the generation of firing. This result suggests that the network dynamics affect the input-response mechanism of each neuron in this model.

Figure 9: Reverse correlations with tref = 3, 4, and 5. The external input is a constant input (Ai = 1).

4.2 Related Analytical Results. There have been interesting analytical studies using stochastic assumptions in integrate-and-fire models (Abbott & van Vreeswijk, 1993; Amit & Brunel, 1997; Brunel & Hakim, 1999; Brunel, 2000; Gerstner & van Hemmen, 1993; Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996; Stein, 1967; Tuckwell, 1988; van Vreeswijk & Sompolinsky, 1996, 1998). The stochastic assumptions make it possible to apply useful approximations such as mean-field theory to understand populational properties of spiking neural network models. On the basis of mean-field theory, the temporal evolution of populational activities of the neurons and the stability of synchronous and asynchronous activities have been successfully analyzed. In this article, we have observed each neuron's firing rates and chaotic properties in a deterministic spiking neural network model without stochastic assumptions. To examine the properties of the deterministic dynamics of neural networks, we adopt a deterministic model. From this viewpoint, one of the most interesting results is that the chaotic dynamics seem to stabilize the firing rates, much as stochastic phenomena do.
On the other hand, it is difficult to apply the analytical methods to our model, although some analytical studies are possible (Chen & Aihara, 1997, 1999, 2000; Komuro & Aihara, 2001), because some requisite assumptions of the methods are not satisfied in our simulations. For example, we cannot assume that the correlations between the synaptic inputs of different neurons are negligible, which is a main assumption of mean-field theory. This is mainly because the rate of connectivity is not so small compared with the number of neurons. Thus, our study differs from the former analytical studies in standpoint (deterministic versus stochastic model), methods (computer simulations versus mathematical analysis), and focus (each neuron's versus populational activity). The reverse correlations depending on the dynamics (see Figures 8 and 9) suggest that these properties may not be realized by a simple stochastic assumption. This is because the shapes of the reverse correlations are more complicated than those of a leaky integrate-and-fire neuron model with a Poisson process (Bugmann, Christodoulou, & Taylor, 1997; Softky, 1995). However, the results do not seem to contradict the former analytical ones. For example, it would be an interesting problem to compare the chaotic properties of the dynamics we found (see also
Watanabe & Aihara, 1997) with the asynchronous activity found by Abbott and van Vreeswijk (1993) and Brunel (2000). Mathematical analysis of our model is one of our future problems.

4.3 Information Representation. In this section, we discuss possible functions of information representation in the model. The simulation results suggest that the spatiotemporal spiking patterns and firing rates of the internally driven dynamics and the externally driven dynamics have different properties. Concerning the internally driven dynamics, the spatiotemporal spike patterns are very sensitive to fluctuations, with a kind of chaotic instability. However, the spatial pattern of firing rates is steady and stable, mainly reflecting the internal structure, or the Wi pattern. Since the internally driven dynamics are influenced by the internal structure of the neural network module, information processing under this representation should concern its memory or internal information structure. The externally driven dynamics, which are independent of the history of the internal dynamics, seem to stably transform an input pattern into a spatiotemporal pattern. Since the transformation implicitly reflects the internal structure of the network, the output should include both external and internal information. Therefore, similarity with the internal structure is not as salient as it is in the internally driven dynamics. If different neural network modules have different values of t and tref, they may be able to play different roles in information representation. Even in the same module, if the external input frequency changes dynamically, each representation can change roles, raising the possibility of dynamic dual coding systems in the brain.

4.4 Related Experimental Data. Recent experimental data suggest that dual coding systems may exist in the brain. Here, we interpret and discuss two sets of reported experimental data on the basis of our model. First, Arieli et al.
(1996) observed two kinds of activities, evoked and ongoing spatiotemporal patterns, by real-time optical imaging in the visual cortex (areas 17 and 18) of cats. The former is the reproducible response, which is estimated by trial averaging, and the latter is the fluctuating ongoing activity approximated by the initial state. The fact that each response in a trial can be predicted for tens of milliseconds by summing the reproducible response and each ongoing activity suggests a dual coding system. Assuming that both internally driven and externally driven dynamics exist in the observed module, we show how the data can be interpreted. We assume two kinds of dynamically changing spikes originating from the internally driven and externally driven dynamics. Since the activity from the internally driven dynamics is similar to the Wi pattern, the averaged reproducible responses may include the activity from the internally driven dynamics as well as a reproducible component of the activity from the externally driven dynamics. We can interpret the initial state as being generated by the fluctuating externally driven dynamics. If the externally driven activity is stable within a trial and fluctuates trial by trial, results similar to those of Arieli et al. (1996, Figure 4) are obtained in a computer simulation.

Figure 10: Spatial patterns of firing rates when fluctuating external inputs are given. (a) Initial states of two trials. (b) Observed responses at (A) t = 60 and (B) t = 50 after the initial state. (c) Predicted responses at (A) t = 60 and (B) t = 50.

Figure 10 shows two examples of trials: (a) the initial state, (b) the observed response at (A) t = 60 or (B) t = 50 after the initial state, and (c) the predicted response at (A) t = 60 or (B) t = 50. The gray scale in the contour figure denotes normalized mean firing rates within a period of 20; darker intensity indicates higher mean firing rates. For the fluctuating external input of each trial, five kinds of simple spatiotemporal patterns were input sequentially and periodically. Each trial, consisting of the external input for a period of T = 80 and no stimuli for T = 20, was repeated 10 times. Each predicted response was calculated by summing each initial state pattern and the response at t = 50 or t = 60 averaged over all trials. The correlation coefficients between Figures 10b and 10c are 0.77 (A) and 0.71 (B), respectively. The suggestion is that both internally driven and externally driven dynamics coexist and generate spatiotemporal responses with reproducible and nonreproducible SPMFR in the visual cortex.
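The prediction scheme used for Figure 10, in which each single-trial response is approximated as that trial's initial (ongoing) state plus the trial-averaged (reproducible) response, can be sketched as follows; the function names and array layout are illustrative:

```python
import numpy as np

def predict_responses(initial_states, observed):
    """Arieli-style prediction: predicted response of trial k =
    that trial's initial (ongoing) state + the response averaged over
    all trials (the reproducible component).
    initial_states, observed: (trials, N) arrays of rate patterns."""
    reproducible = observed.mean(axis=0)
    return initial_states + reproducible[None, :]

def prediction_quality(predicted, observed):
    """Per-trial correlation between predicted and observed patterns."""
    return np.array([np.corrcoef(p, o)[0, 1]
                     for p, o in zip(predicted, observed)])
```

When the ongoing fluctuation is genuinely additive, per-trial correlations close to those reported above (0.7-0.8) arise naturally from this sum.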
Second, Riehle et al. (1997) showed that some neurons in the primary motor cortex synchronize in response to internal events, while the firing rate of each neuron remains almost constant. Since the primary motor cortex has reciprocal connections with the supplementary motor cortex, the premotor cortex, and the thalamus (Kandel, Schwartz, & Jessell, 1991), it is not easy to explain this experimental result from the mechanism of the neural networks. Here we do not discuss the mechanistic details but propose to explain the experimental data qualitatively on the basis of our dual dynamics framework. When an internal event occurs, we assume that the neurons in the primary motor cortex receive information originating in higher functional areas, such as the prefrontal cortex. If the focused module receives spatiotemporal spike patterns as external inputs, synchronous spikes can occur via the externally driven dynamics. At the same time, the mean firing rates may remain stable in the module, including its cortico-cortical recurrent connections, via the internally driven dynamics. Thus, synchronous spikes without rate modulations can occur in the mixed internally driven and externally driven dynamics (for example, see Figure 7B-b, bottom). The internally driven and externally driven dynamics together can play a dynamic role in dual information representation in the motor cortex.

4.5 Related Issues. There are several problems for future study. First, other properties of information representation in the spatiotemporal patterns should be examined. These properties involve temporal information such as correlation, synchrony, and oscillation. In addition, temporal or spatial structures in the external inputs should be examined, because the spatiotemporal structure of the external inputs strongly affects the properties of information representation. A second problem is the effect of synaptic plasticity on information representation.
Since the spatial pattern of mean firing rates reflects the Wi pattern, synaptic plasticity such as LTP/LTD (Bliss & Collingridge, 1993) should affect at least the SPMFR. Synaptic plasticity is known to be affected by the fine temporal timing of presynaptic and postsynaptic spikes (Markram, Lubke, Frotscher, & Sakmann, 1997). Thus, how interactions between the spatiotemporal dynamics and synaptic plasticity affect the properties of information representation should be investigated. A third problem is biological plausibility. We need experimental data that confirm the simulation results and our representation of the mechanism. Experiments using multi-electrodes or optical recording techniques may be necessary to confirm the stability of mean spatial firing rates with chaotic spatiotemporal spike patterns. 5 Conclusion
With an interconnected neural network model composed of spiking neurons, spatiotemporal spiking patterns and their spatial patterns of firing
Dual Information Representation with Stable Firing Rates
2819
rates were examined. As shown in the computer simulations, although the spatiotemporal firing patterns are chaotic and unstable, the SPMFR are steady and stable, reflecting the internal structure of the synaptic weights Wi. The effects of the values of τ and τref suggest that the chaotic instability hastens the speed of convergence to steady SPMFR. By contrast, when the frequency of external inputs is low, neither chaotic spatiotemporal spike patterns nor stable spatial patterns of firing rates appear. The input patterns seem to be transformed into spatiotemporal patterns without context in this case. Two kinds of network dynamics behind neuronal spikes are suggested: internally driven dynamics and externally driven dynamics. The internally driven dynamics are dominant when τ is large, τref is small, or the frequency of the external input is high. The internally driven dynamics make spikes more chaotic and independent of external inputs. Since these dynamics depend on the internal structure of the network, spatial patterns of mean firing rates are steady and stable. When τ is small, τref is large, or the frequency of the external input is low, however, the externally driven dynamics are dominant. The externally driven dynamics make the spiking patterns more dependent on the spatiotemporal structure of external inputs. These two dynamics can coexist, playing their respective roles in the neural network. Such different properties of the dual information dynamics suggest that the brain may adopt a flexible way to represent information. Since the dynamics in the same module can be changed by the external frequency, the information representation of the neural system can be changed dynamically and drastically. In addition, it may be possible to represent some information in multiple ways simultaneously with the mixed dynamics. Our internally driven and externally driven dynamics framework can explain recent experimental data that have suggested dual coding systems.
This representation implies that internally driven and externally driven dynamics coexist in the cortex and that these dynamics themselves can be dynamically changed. Acknowledgments
We thank Masataka Watanabe, University of Tokyo, for his valuable suggestions. We also thank the anonymous referees for their helpful comments. This research is partially supported by the grant on Research for the Future Program from JSPS (the Japan Society for the Promotion of Science) (No. 96I00104). References Abeles, M. (1982). Role of cortical neuron: Integrator or coincidence detector? Israel Journal of Medical Sciences, 18, 83–92.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press. Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. Journal of Neurophysiology, 70(4), 1629–1638. Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Physical Review E, 48(2), 1483–1490. Adachi, M., & Aihara, K. (1997). Associative dynamics in a chaotic neural network. Neural Networks, 10(1), 83–98. Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128. Aertsen, A., & Gerstein, G. L. (1991). Dynamic aspects of neuronal cooperativity: Fast stimulus-locked modulations of effective connectivity. In J. Krüger (Ed.), Neuronal cooperativity (pp. 52–67). Berlin: Springer-Verlag. Ahissar, M., Ahissar, E., Bergman, H., & Vaadia, E. (1992). Encoding of sound-source location and movement: Activity of single neurons and interactions between adjacent neurons in the monkey auditory cortex. Journal of Neurophysiology, 67(1), 203–215. Aihara, K., Takabe, T., & Toyoda, M. (1990). Chaotic neural networks. Physics Letters A, 144(6,7), 333–340. Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252. Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871. Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39. Braitenberg, V., & Schüz, A. (1991). Anatomy of the cortex. Berlin: Springer-Verlag. Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons.
Journal of Computational Neuroscience, 8, 183–208. Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671. Bugmann, G., Christodoulou, C., & Taylor, J. G. (1997). Role of temporal integration and fluctuation detection in the highly irregular firing of a leaky integrator neuron model with partial reset. Neural Computation, 9, 985–1000. Chen, L., & Aihara, K. (1997). Chaos and asymptotical stability in discrete-time neural networks. Physica D, 104, 286–325. Chen, L., & Aihara, K. (1999). Global searching ability of chaotic neural networks. IEEE Trans. CAS-I, 46(8), 974–993. Chen, L., & Aihara, K. (2000). Strange attractors in chaotic neural networks. IEEE Trans. CAS-I, 47(10), 1455–1468. Crutchfield, J. P., & Kaneko, K. (1988). Are attractors relevant to turbulence? Physical Review Letters, 60(26), 2715–2718. Douglas, R., & Martin, K. (1998). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 459–509). New York: Oxford University Press.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism for feature linking in the visual cortex? Biological Cybernetics, 60, 121–130. Freeman, W. J. (1987). Simulation of chaotic EEG patterns with a dynamic model of the olfactory system. Biological Cybernetics, 56, 139–150. Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(12), 1303–1350. Gerstner, W. (1995). Time structure of the activity in neural network models. Physical Review E, 51(1), 738–758. Gerstner, W., & van Hemmen, J. L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Physical Review Letters, 71(3), 312–315. Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Computation, 8, 1653–1676. Gray, C. M., & Di Prisco, G. V. (1997). Stimulus-dependent neuronal oscillations and local synchronization in striate cortex of the alert cat. Journal of Neuroscience, 17(9), 3239–3253. Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, USA, 86, 1698–1702. Hebb, D. O. (1949). The organization of behavior. New York: Wiley. Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (1991). Principles of neural science (3rd ed.). New York: Elsevier. Komuro, M., & Aihara, K. (2001). Hierarchical structure among invariant subspaces of chaotic neural networks. Japan J. Industrial and Applied Mathematics, 18(2). König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends in Neurosciences, 19, 130–137. Liley, D. T. J., & Wright, J. J. (1994). Intracortical connectivity of pyramidal and stellate cells: Estimates of synaptic densities and coupling symmetry.
Network: Computation in Neural Systems, 5, 175–189. Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506. Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215. Matsumura, M., Chen, D.-f., Sawaguchi, T., Kubota, K., & Fetz, E. E. (1996). Synaptic interactions between primate precentral cortex neurons revealed by spike-triggered averaging of intracellular membrane potentials in vivo. Journal of Neuroscience, 16(23), 7757–7767. Nagumo, J., & Sato, S. (1972). On a response characteristic of a mathematical neuron model. Kybernetik, 10, 155–164. Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Sakurai, Y. (1996). Population coding by cell assemblies—what it really is in the brain. Neuroscience Research, 26, 1–16. Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5, 239–247. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13(1), 334–350. Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophysical Journal, 7, 797–826. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press. Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515–518. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371. Watanabe, M., & Aihara, K. (1997). Chaos in neural networks composed of coincidence detector neurons. Neural Networks, 10(8), 1353–1359. Watanabe, M., Aihara, K., & Kondo, S. (1998). A dynamic neural network with temporal coding and functional connectivity. Biological Cybernetics, 78, 87–93. Wolf, A., Swift, J. B., Swinney, H. L., & Vastano, J. A. (1985). Determining Lyapunov exponents from a time series. Physica D, 16, 285–317. Received January 4, 2000; accepted February 21, 2001.
LETTER
Communicated by Susanna Becker
Neurons with Two Sites of Synaptic Integration Learn Invariant Representations Konrad P. Körding [email protected] Peter König [email protected] Institute of Neuroinformatics, ETH/University Zürich, 8057 Zürich, Switzerland Neurons in mammalian cerebral cortex combine specific responses with respect to some stimulus features with invariant responses to other stimulus features. For example, in primary visual cortex, complex cells code for the orientation of a contour but ignore its position to a certain degree. In higher areas, such as the inferotemporal cortex, translation-invariant, rotation-invariant, and even viewpoint-invariant responses can be observed. Such properties are of obvious interest to artificial systems performing tasks like pattern recognition. It remains to be resolved how such response properties develop in biological systems. Here we present an unsupervised learning rule that addresses this problem. It is based on a neuron model with two sites of synaptic integration, allowing qualitatively different effects of input to basal and apical dendritic trees, respectively. Without supervision, the system learns to extract invariance properties using temporal or spatial continuity of stimuli. Furthermore, top-down information can be smoothly integrated in the same framework. Thus, this model lends a physiological implementation to approaches of unsupervised learning of invariant-response properties. 1 Introduction 1.1 Invariant Response Properties in the Visual System. The textbook view of the mammalian visual system describes a series of processing steps (Hubel & Wiesel, 1998). Starting with light-sensitive photoreceptors in the retina, signals are relayed by the lateral geniculate nucleus and primary visual cortex toward higher visual areas. Along this pathway, neurons acquire increasingly complex response properties.
Whereas an appropriately placed dot of light on a dark background is sufficient to strongly activate neurons in the lateral geniculate nucleus, cells in primary visual cortex respond best to elongated stimuli and oriented edges. At higher levels, complex arrangements of features are the optimal stimuli (Kobatake & Tanaka, 1994; Wang, Tanaka, & Tanifuji, 1996). Finally, in inferotemporal cortex of monkeys, neurons are tuned to rather complex objects, like faces or toys the monkey subject played with (Perrett et al., 1991; Rolls, 1992; Booth & Rolls, 1998). © 2001 Massachusetts Institute of Technology. Neural Computation 13, 2823–2849 (2001)
2824
Konrad P. Körding and Peter König
However, a description of the visual system as a hierarchy of more and more complex filters is incomplete. In parallel to the increasing sophistication of receptive field properties, other aspects of the visual stimulus cease to affect firing rates (Rolls & Treves, 1997). In primary visual cortex, one major neuron type, simple cells, is highly sensitive to the contrast of the stimulus; if an oriented edge effectively activates a neuron, the contrast-reversed stimulus usually does not (Hubel & Wiesel, 1962). Other neurons, complex cells, show a qualitatively different behavior. If an oriented edge effectively activates a neuron, the contrast-reversed or phase-changed stimulus is usually equally efficient. Thus, the response of complex neurons is invariant with respect to the phase polarity of contrast of the stimulus (Hubel & Wiesel, 1962). Along similar lines, neurons in inferotemporal cortex show some invariance with respect to translation, scaling, rotations, and changes in contrast (Rolls, 1992). An even more extreme combination of specificity and invariance can be found in premotor cortex. Neurons may respond with high specificity to a stimulus, irrespective of its being heard, seen, or felt (Graziano & Gross, 1998). Thus, a highly specific response to one variable, the position, is combined with invariance with respect to another variable, the modality. The higher an area is located in the cortical hierarchy (Felleman & Van Essen, 1991), the more complex its neurons' receptive field properties are. Simultaneously, translation, scaling, and viewpoint invariance are more pronounced. 1.2 The Computational Role of Invariances. As invariant response properties are such a ubiquitous property of sensory systems, what are their computational advantages? In many categorization tasks, the output should be unchanged—or invariant—when the input is subject to various transformations. An important example is the classification of objects in two-dimensional images.
A particular object should be assigned the same classification even if it is rotated, translated, or scaled within the image (Bishop, 1995). Invariant response properties are especially important because they counteract the combinatorial explosion problem: highly specific neuronal responses imply small receptive fields in stimulus space. For a complete coverage of stimulus space, vast numbers of units are needed. In fact, the number of representative elements needed rises exponentially with the dimensionality of the problem. By obtaining invariant responses to some variables, receptive fields are enlarged in the respective dimensions, and the total number of neurons needed to describe a stimulus stays manageable. Systems obtain invariant representations by several means (Barnard & Casasent, 1991). First, appropriate preprocessing can supply a neuronal network with invariant input data. Another option is to design the structure of the system so that invariances gradually increase. Finally, a system can learn invariances from the presented stimuli following principles of supervised or unsupervised learning.
Invariance Learning with Two Sites of Integration
2825
1.3 Invariances by Preprocessing. Due to the advantages they offer to recognition systems, invariances are frequently built in during preprocessing for neural systems (Bishop, 1995). Applying a Fourier transformation and discarding the phase information produces data that are invariant with respect to translations of the whole image. Even transformations that generate translation-, scaling-, and rotation-invariant representations are used (e.g., Bradski & Grossberg, 1995). Alternatively, the input might be scaled and translated before being processed by a neuronal network. As the generation of invariances by preprocessing is explicit, it is guaranteed that the network generalizes over the desired dimensions. Such invariances thus present a form of a priori knowledge about the kind of problems encountered. With this information, the network performs better, faster, and more reliably. However, this approach is limited to operations that can be explicitly defined. For example, it seems extremely difficult, if not impossible, to implement viewpoint invariance in this fashion. 1.4 Mixing Invariance Generation with Processing. In one class of networks, called weight-sharing systems, both processes—the generation of invariances and the feature extraction—are done step by step in the same network (Fukushima, 1980, 1988, 1999; Le Cun et al., 1989; Riesenhuber & Poggio, 1999). In the well-known Neocognitron (Fukushima, 1988), neurons at one level of the hierarchy have identical receptive fields but are translated in the visual field. Combining such weight sharing with converging connections leads to response properties at the next level that are somewhat translation invariant. The main drawback is that, similar to the case of generation of invariances by preprocessing, the constructor of the network needs to specify the desired invariances explicitly and thus to have specific a priori knowledge about which variables the processing is supposed to be invariant of. 1.5 Supervised Learning of Invariances.
The obvious way to avoid the need to put in such specific a priori knowledge is the use of learning algorithms. Here, supervised and unsupervised approaches have to be differentiated. The former need labeled training data, and for large networks, a lot of them. With such data at hand, it is possible to learn invariant recognition, for example, by training the network with a variant of the backpropagation algorithm (Hinton, 1987). To alleviate the problem of getting enough training data, the training set can be enlarged by applying appropriate transformations to individual examples. However, then we are back to an a priori specification of the invariance operation. 1.6 Unsupervised Learning of Invariances. Following this observation, several researchers started to investigate unsupervised learning of invariances. At this point we have to ask, When should two different stimuli be classified as instantiations of the same "real" thing? Let us consider a visual system for object recognition acting in the real world. A few principles about our world seem obvious: 1. If an object is present right now, it is likely to be present the next moment. 2. If an object covers a certain part of the retina, it is likely to cover its vicinity. 3. Objects tend to influence several modalities. Several algorithms have been put forward for learning invariances in neural systems corresponding to these principles. Most notable are studies where variables are extracted from the input that vary smoothly in time (principle 1, proposed by Hinton, 1989; Földiák, 1991; Stone & Bray, 1995) or space (principle 2, Becker & Hinton, 1992; Stone & Bray, 1995; Phillips, Kay, & Smyth, 1995; cf. Becker, 1996). Principle 3 has been used by de Sa and Ballard (1998) but is also often considered a special case of principle 2; for example, the auditory, visual, and somatosensory systems all allow a spatial localization. Still, this principle is more general and could enhance learning further. These systems can be compared to the networks described above that mix invariance generation with processing. They share the advantage of needing few weights and combine it with the ability to learn those invariances in a way optimal to the task. However, these mechanisms do not seem to map straightforwardly onto biological principles (see section 4). In this article, we address this issue. First, we summarize recent electrophysiological results. Based on these, we define a suitably abstracted yet realistic learning rule. Finally, we demonstrate that it allows unsupervised learning of invariances exploiting spatial as well as temporal continuity. Thus, this learning rule is able to lend a physiological implementation to the algorithms described above. 1.7 Relevant Physiological Results. The most abundant type of neuron in cerebral cortex, the pyramidal cell, is characterized by its prominent apical dendrite.
Recent research on the properties of layer V pyramidal neurons suggests that the apical dendrite acts, in addition to the soma, as a second site of synaptic integration (Larkum, Zhu, & Sakmann, 1999; Körding & König, 2000a). Each site integrates input from a set of synapses defined by their anatomical position and is able to generate regenerative potentials (Schiller, Schiller, Stuart, & Sakmann, 1997). The two sites exchange information in well-characterized ways (see Figure 1A). First, signals originating at the soma are transmitted to the apical dendrite by actively backpropagating dendritic action potentials (see Figures 1A and 1B; Amitai, Friedman, Connors, & Gutnick, 1993; Stuart & Sakmann, 1994; Buzsaki & Kandel, 1998) or passive current flow. Second, signals from the apical dendrite to the
Invariance Learning with Two Sites of Integration
2827
soma are sent via actively propagating slow regenerative calcium spikes (see Figures 1D and 1E), which have been observed in vitro (Schiller et al., 1997) and in vivo (Hirsch, Alonso, & Reid, 1995; Helmchen, Svoboda, Denk, & Tank, 1999). These calcium spikes are initiated in the apical dendrites and cause a strong and prolonged depolarization, typically leading to bursts of action potentials (see Figures 1D and 1E; Stuart, Schiller, & Sakmann, 1997; Larkum, Zhu, & Sakmann, 1999). Experimental studies support the view that excitation to the apical dendrite is strongly attenuated on its way to the soma unless calcium spikes are induced (see Figure 1A; Schiller et al., 1997; Stuart & Spruston, 1998; Larkum, Zhu, & Sakmann, 1999). In conclusion, a subset of synapses on the apical dendrite is able to induce rare discrete events of strong, prolonged depolarization combined with bursts. Experiments on hippocampal slices by Pike, Meredith, Olding, and Paulsen (1999) support the idea that postsynaptic bursting is essential for the induction of long-term potentiation. Furthermore, independent experiments support the idea that strong postsynaptic activity is necessary for the induction of Hebbian learning (Artola, Bröcher, & Singer, 1990; Dudek & Bear, 1991). Integrating this with research on apical dendrites, we conclude that whenever the apical dendrite receives strong activation, calcium spikes are induced and synapses are modified according to a Hebbian rule (Hebb, 1949). The generation of calcium spikes is very sensitive to local inhibitory activity. Even the activity of a single inhibitory neuron can effectively block calcium spikes (Larkum, Zhu, & Sakmann, 1999; Larkum, Kaiser, & Sakmann, 1999). Thus, it seems reasonable to assume that the number of neurons generating calcium spikes on presentation of a stimulus is limited on the scale of tangential inhibitory interactions. Here we allow only one neuron per stream and layer to feature a calcium spike in one iteration.
To complete the picture, we have to consider which afferents target the apical and basal dendritic trees. The anatomy of a cortical column is complicated; nevertheless, some regular patterns can be discerned. The apical dendrites of the layer 5 pyramidal cells receive local inhibitory projections and long-range cortico-cortical projections (Zeki & Shipp, 1988; Cauller & Connors, 1994). Top-down projections from areas higher in the hierarchy of the sensory system usually terminate in layer 1, where many apical tufts can be observed (cf. Salin & Bullier, 1995). This supports the idea that top-down connections from higher to lower areas preferentially terminate on the apical dendrites. The basal dendrites of the considered neurons receive direct subcortical afferents (e.g., the koniocellular pathway in visual cortex) in addition to projections from layer 4 spiny stellate cells. These are the main recipients of afferents from the sensory thalamus and areas lower in the cortical hierarchy. Therefore, we use the approximation that the bottom-up input targets the basal dendritic tree (see Figure 1F).
2 Methods
The whole simulation is described by this pseudocode:

For all iterations:
    Calculate stimuli.
    Calculate activities A going from lower areas to higher areas.
    Calculate dendritic potentials D going from higher areas to lower areas.
    Determine in which neurons learning, and thus calcium spikes, are triggered.
    Update the weights of those neurons according to Hebbian learning.

Every neuron is described by two main variables, corresponding to the two sites of integration (see Figure 1F): A is referred to as the activity of the neuron, and D represents the average potential at the apical dendrite. We simulate a rate-coding neural network where a unit's output is a real number representing the average firing rate. Figure 1: Facing page. The neuron model. (A) Experimental setup. A neuron is patched simultaneously at three positions: at the soma (recordings in solid line), 0.450 mm (recordings in dashed lines), and 0.780 mm (recordings in dotted lines) out on the apical dendrite. (B) Effect of inducing an action potential by current injection into the soma. The action potential travels not only anterogradely into the axon but also retrogradely into the apical dendrite, where it leads to a slow and delayed depolarization. (C) Injection of small currents at the apical dendrite does not trigger regenerative potentials and induces only a barely noticeable depolarization at the soma. (D) Injection of larger currents into the apical dendrite elicits a regenerative event with a long depolarization (dotted line). This slow potential is less attenuated and reaches the soma with a delay of only a few milliseconds (solid line). There, a series of action potentials is triggered, riding on the slow depolarization. (E) Combining a subthreshold current injection into the apical dendrite with a somatic action potential leads to a calcium spike and a burst of action potentials. Thus, somatic activity decreases the threshold for calcium spike induction. (F) Essential features incorporated into the model. Action potentials interact nonlinearly with calcium spike generation at the apical dendrite. Calcium spikes induce burst firing, gating plasticity of all active synapses. (In vitro data kindly supplied by M. E. Larkum (MPI für medizinische Forschung, Heidelberg), modified after a figure in Larkum, Zhu, & Sakmann, 1999.)
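The per-iteration loop of the pseudocode above might be sketched in Python as follows. This is a minimal, non-authoritative sketch: the layer sizes follow the three-layer simulations described below, but the placeholder stimulus, the use of the mean drive as the layer average, and the running-average bookkeeping are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical three-layer network: 10x10 input, 50 SUM units, 4 MAX units.
N1, N2, N3 = 100, 50, 4
W2 = rng.random((N2, N1))      # basal weights, layer 1 -> 2 (SUM layer)
W3 = rng.random((N3, N2))      # basal weights, layer 2 -> 3 (MAX layer)
A_run2 = np.full(N2, 0.5)      # running averages <A_i>_t (initial value assumed)
A_run3 = np.full(N3, 0.5)

def heaviside(x):
    return (x > 0).astype(float)

def sum_activity(W, A_pre, A_run):
    drive = W @ A_pre                         # standard scalar product (eq. 2.1)
    # the layer average <A_j>_j is approximated here by the mean drive (assumption)
    return heaviside((drive - drive.mean()) / (len(A_pre) * A_run**2))

def max_activity(W, A_pre, A_run):
    drive = (W * A_pre[None, :]).max(axis=1)  # most effective input only (eq. 2.2)
    return heaviside((drive - drive.mean()) / (len(A_pre) * A_run**2))

for step in range(100):                       # 40,000 iterations in the paper
    stimulus = rng.random(N1)                 # placeholder for a bar stimulus
    # bottom-up pass: activities from lower to higher layers
    A2 = sum_activity(W2, stimulus, A_run2)
    A3 = max_activity(W3, A2, A_run3)
    # exponential running averages with a time constant of 1000 iterations
    A_run2 += (A2 - A_run2) / 1000.0
    A_run3 += (A3 - A_run3) / 1000.0
    # top-down pass (eq. 2.3) and the Hebbian update (eq. 2.4) would follow here
```

The Heaviside nonlinearity makes each unit's output binary on a given iteration; the real-valued rates referred to in the text then arise as averages over iterations.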
Depending on the layer, two different transfer functions are used that define the activity A_i^{(j)} of the neuron i in layer j:

Sum:

    A_i = H\left(\left(\operatorname{sum}\!\left(A_{\mathrm{pre}} \mathbin{.*} W_i\right) - \langle A_j \rangle_j\right) / \left(N_{\mathrm{pre}}\, \langle A_i \rangle_t^2\right)\right)    (2.1)

Max:

    A_i = H\left(\left(\max\!\left(A_{\mathrm{pre}} \mathbin{.*} W_i\right) - \langle A_j \rangle_j\right) / \left(N_{\mathrm{pre}}\, \langle A_i \rangle_t^2\right)\right),    (2.2)

where H is the Heaviside function, .* is the element-wise product, A_pre is the presynaptic activity vector, N_pre is the number of neurons presynaptic
to the basal dendrite, ⟨A_j⟩_j is the average activity within the layer, and ⟨A_i⟩_t is the running average of the neuron's activity with exponential decay and a time constant of 1000 iterations. Depending on the layer, the activity of a neuron results from the standard scalar product (sum case) or is determined by the most effective input only (max case). These types of activation functions have been discussed by Riesenhuber and Poggio (1999). The ⟨A_j⟩_j term mediates a linear inhibition. Summarizing, the neuron's activity is a linear threshold process with soft activity normalization and an activation function that is specific for each layer. The apical potential D is calculated according to:

    D_i = W_i A_{\mathrm{pre}} + \alpha A_i    (2.3)
Due to the different connectivity of the basal and apical dendrites, the apical and basal dendrites have different A_pre and thus also different W_i; A_pre of the third-layer basal dendrite is 50-dimensional (the number of second-layer neurons), and A_pre of the apical dendrite is 4-dimensional (the number of third-layer neurons). D has two components: the input from other neurons and an effect of the neuron itself. For both the apical and the basal dendrites, the weight change is calculated as:

    \Delta W_i = \begin{cases} \eta \left(A_{\mathrm{pre}} + C_{\mathrm{pre}} - W_i\right) + \Theta \left(t / N_{i,\mathrm{pre}} - 0.5\right) & \text{if } D_i = \max(D) \\ 0 & \text{otherwise,} \end{cases}    (2.4)
where η is the learning rate, N_{i,pre} is the number of neurons presynaptic to the apical dendrite, t is the number of iterations since the neuron last learned, and Θ is a constant. The Θ term ensures that no neuron can stop learning and thus avoids the occurrence of so-called dead units. C = 1 for the one presynaptic neuron with the highest D, and thus a calcium spike associated with learning, and C = 0 for the others. Only the neuron with the highest D learns; neurons that do not learn for a large number of iterations increase their probability of learning. Initially, all weights are chosen randomly in the interval [0, 1). The default parameters are η = 0.002, Θ = 0.00005, α = 1. The system is simulated for 40,000 iterations unless stated differently. All networks examined have three layers unless otherwise indicated. The network itself is organized into streams. Each stream receives input from a disjoint set of neurons. Only neurons of the highest layer receive contextual information at their apical dendrites by self-feedback or feedback from other areas. Otherwise, the network is purely feedforward, with connections targeting the basal dendrites. Coupled layers are fully connected; there are no connections of any layer onto itself.
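The winner-take-all weight update of equation 2.4 might be sketched as follows. This is a hedged illustration, not the authors' code: the `since_learned` counter used to track t, and the way the calcium-spiking presynaptic neuron is picked (the presynaptic cell with the highest apical potential), are my readings of the text.

```python
import numpy as np

def hebbian_update(W, A_pre, D, D_pre, since_learned,
                   eta=0.002, theta=0.00005):
    """Sketch of eq. 2.4: only the neuron with the highest apical potential D
    learns; the Theta term raises the effective learning drive of neurons
    that have not learned for many iterations (avoiding dead units)."""
    i = int(np.argmax(D))                  # winner: D_i = max(D)
    C_pre = np.zeros_like(A_pre)
    # C = 1 for the presynaptic neuron with the highest D (calcium spike)
    C_pre[int(np.argmax(D_pre))] = 1.0
    t = since_learned[i]                   # iterations since neuron i last learned
    n_pre = len(A_pre)                     # neurons presynaptic to this dendrite
    W[i] += eta * (A_pre + C_pre - W[i]) + theta * (t / n_pre - 0.5)
    since_learned += 1                     # bookkeeping for the Theta term
    since_learned[i] = 0
    return i

# usage sketch with random data (shapes follow the three-layer description)
rng = np.random.default_rng(1)
W = rng.random((4, 50))      # third-layer basal weights
A_pre = rng.random(50)       # second-layer activities
D = rng.random(4)            # apical potentials of the third layer
D_pre = rng.random(50)       # apical potentials of the presynaptic layer
since = np.zeros(4)
winner = hebbian_update(W, A_pre, D, D_pre, since)
```

Note that the η term pulls the winner's weights toward A_pre + C_pre, so repeated wins make the weight vector resemble the inputs that co-occur with calcium spikes.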
To explore different properties of the system, two sets of simulations were performed. In the first set (three-layer simulations), stimuli are rectangular, similar to those used in physiological experiments. The second set (two-layer simulations) uses sparse, preprocessed input. In the three-layer simulations, the first layer (input) consists of 10 by 10 neurons; the second layer of 50 neurons uses the SUM activation function; and the third layer of 4 neurons uses MAX. Stimuli resemble "bars" as used in physiological experiments. Their luminance has a Gaussian profile orthogonal to the long axis with a length constant of 1. They are described by two parameters: the orientation of the bar and the position when projected on a line that is perpendicular to the bar's main axis. The stimuli presented to each stream are correlated in orientation and position, as described in section 3. We quantify the results of these simulations using different measures. For each stimulus, characterized by its orientation (ϑ) and position (r), the average response is calculated during the second half of the simulation (iteration 20,000 to 40,000). Stimulus parameters ϑ and r, which are drawn from a continuous distribution, are binned onto a 20 × 20 grid covering the complete stimulus space. The responses to all stimuli that fall into the same bin are averaged and plotted in the ϑ-r diagram. Orientation- and position-specific receptive fields have a single peak in stimulus space, which drops along both dimensions. Neurons with translation-invariant responses are characterized by anisotropic ϑ-r diagrams. A layer's responses to orientation and position can be quantified by a bar specificity index calculated as follows. The ϑ-r diagrams are summed along the position axis, and the result is divided by its mean. Its standard deviation (over orientation) is averaged over all cells of the layer, yielding the orientation specificity s(orientation).
Position specificity s(position) is calculated analogously, summing along the orientation axis and calculating the standard deviation along the position axis. To compare representations on the highest level in both streams, a cross-correlation coefficient CC of the activity patterns of both third layers is calculated:

CC = Σ_ij ⟨A_i^(1) A_j^(2)⟩_t² / √( Σ_ij ⟨A_i^(1) A_j^(1)⟩_t² · Σ_ij ⟨A_i^(2) A_j^(2)⟩_t² ),  (2.5)

where superscripts denote the number of the stream, i and j in the sums run over all combinations from 1 through 4, and ⟨·⟩_t denotes the average over the considered iterations. This is a measure of the coherence of the variables extracted by the two streams. If neurons in both streams have extracted identical variables, CC is 1. For the control in Figure 3D, a special type of simulation is performed. The set of neurons with two sites of synaptic integration in the third layer is exchanged for a set of neurons with just one site of synaptic integration.
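Equation 2.5 can be checked with a short script. This is a sketch; the array shapes and the estimation of the time averages from a finite sample are our choices.

```python
import numpy as np

def cross_stream_cc(A1, A2):
    """Cross-correlation coefficient CC of equation 2.5.
    A1, A2: arrays of shape (T, 4) with the third-layer activities of
    stream 1 and stream 2 over T iterations."""
    T = A1.shape[0]
    C12 = A1.T @ A2 / T        # <A_i^(1) A_j^(2)>_t for all i, j
    C11 = A1.T @ A1 / T        # within-stream terms for the normalization
    C22 = A2.T @ A2 / T
    return np.sum(C12**2) / np.sqrt(np.sum(C11**2) * np.sum(C22**2))
```

If both streams extract identical variables, even with the neurons in a different order, CC evaluates to 1, as stated above.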
Konrad P. Körding and Peter König
Both the second layer of the same stream and the third layer of the other stream are presynaptic to the same cell. We thus calculate the activities in the following iterative way:

I^i_input(0) = max( A_pre,same .* W^i ),  (2.6)

A_i(0) = H( I^i_input(0) − ⟨I^j_input(0)⟩_j ) / ( N_pre ⟨A_i⟩_t² ),  (2.7)

I^i_input(n + 1) = A_i(0) + μ · A_pre,other · W^i,  (2.8)

A_i(n + 1) = H( I^i_input(n + 1) − ⟨I^j_input(n + 1)⟩_j ),  (2.9)
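A minimal sketch of this fixed-point iteration for the two reciprocally coupled third layers follows. Two assumptions are made for brevity: H is taken as half-wave rectification, and the normalization constant of equation 2.7 is folded into the weights.

```python
import numpy as np

def relu(x):
    """H(.) modeled as half-wave rectification (an assumption)."""
    return np.maximum(x, 0.0)

def coupled_one_site(A0_1, A0_2, W12, W21, mu, n_iter=20):
    """Iterate equations 2.8-2.9 for the one-integration-site control.
    A0_1, A0_2: the fixed feedforward activities A_i(0) of each stream
    (equations 2.6-2.7); W12, W21: cross-stream weights; mu: scaling
    factor. Twenty updates approximate the convergence value A_i(20)."""
    A1, A2 = A0_1.copy(), A0_2.copy()
    for _ in range(n_iter):
        I1 = A0_1 + mu * (W12 @ A2)   # eq. 2.8: the feedforward part stays A_i(0)
        I2 = A0_2 + mu * (W21 @ A1)
        A1 = relu(I1 - I1.mean())     # eq. 2.9: subtract the mean input, rectify
        A2 = relu(I2 - I2.mean())
    return A1, A2
```

For μ well below 1 the updates settle quickly; near μ = 1 the reciprocal amplification between the streams can diverge, matching the behavior reported in the text.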
where n is the number of the iteration, A_pre,same is the cell's input from the second layer, A_pre,other is the input from the other stream's third layer, and μ is a scaling factor. Since A changes depending on the other stream's A and vice versa, the activities change over subsequent updates. We use A_i(20) as an approximation of the convergence value. Twenty iterations are sufficient for the process to converge to values very near the final value (data not shown), unless the process diverges to infinite activities (at μ of about 1). After this process, the neuron with the highest activity learns with the same ΔW described above. In the two-layer simulations, the first layer consists of 12 neurons and the second layer of 4 neurons. The main difference from the three-layer simulations is that the inputs are already preprocessed; the input layer of the two-layer simulations can be compared to the second layer of the three-layer simulations. The stimuli are 2D maps of size 4 × 3. One axis is labeled class (4 neurons) and the other instantiation (3 neurons). To generate a stimulus, we first select a number of active classes, which are identical for all streams. The probability of n classes being active at the same iteration is proportional to a parameter p_c to the power of n. Then for each class, we set an instantiation, which is chosen independently in all streams and thus can also differ. All of the chosen elements of the input map are set to 1, and the rest are set to 0. These stimuli have the following properties: at low p_c, the stimuli are very similar to those of the three-layer simulations; class is perfectly correlated, and instantiation is not. The higher p_c, the larger the spurious correlation in the class. p_c is thus a measure of the difficulty of the task. We used this more abstract setting to be able to perform extensive explorations in larger systems with several streams.
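The stimulus generation just described can be sketched as follows. The admissible range of n, here 1 to 4, is an assumption; the text only states the proportionality to p_c^n.

```python
import numpy as np

def make_stimuli(n_streams=2, n_class=4, n_inst=3, pc=0.2, rng=None):
    """One two-layer-simulation stimulus per stream: a binary
    n_class x n_inst map. Active classes are shared by all streams;
    the instantiation of each class is drawn independently per stream."""
    rng = rng or np.random.default_rng()
    # P(n classes active) proportional to pc**n, n assumed in 1..n_class
    p = np.array([pc**n for n in range(1, n_class + 1)])
    n_active = rng.choice(np.arange(1, n_class + 1), p=p / p.sum())
    classes = rng.choice(n_class, size=n_active, replace=False)
    stim = np.zeros((n_streams, n_class, n_inst))
    for s in range(n_streams):
        for c in classes:                       # independent instantiation per stream
            stim[s, c, rng.integers(n_inst)] = 1.0
    return stim
```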
To quantify properties of a whole layer of neurons, a class specificity index is calculated. If a class is assigned to a neuron, we can quantify how specific the neuron is to that class by calculating its average activity for that class minus its average activity for all other classes: S = ⟨A⟩_class − ⟨A⟩_nonclass, where ⟨·⟩_x is the average over a number of iterations under the condition
that the stimulus is chosen from x. The class specificity is defined as the maximum of that specificity measure over all neuron-class assignments, where every class is assigned to exactly one neuron. If the class specificity is large, neurons are specific to one class and ignore the rest. It is small if several neurons code for the same class.

3 Results
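The assignment search in this definition can be sketched directly. Two details are our assumptions: the per-class measure S is averaged over classes before maximizing, and assignments are enumerated as one-to-one mappings of classes onto neurons.

```python
import numpy as np
from itertools import permutations

def class_specificity(A, labels, n_class=4):
    """Class specificity index. A: (T, n_neurons) activities over T
    iterations; labels: (T,) class presented at each iteration.
    For an assignment class c -> neuron m[c], S_c is that neuron's mean
    activity when class c is shown minus its mean activity for the other
    classes; the index is the best mean S over all assignments."""
    labels = np.asarray(labels)
    # conditional mean activity: M[c, i] = <A_i | class c shown>
    M = np.array([A[labels == c].mean(axis=0) for c in range(n_class)])
    best = -np.inf
    for m in permutations(range(A.shape[1]), n_class):
        S = [M[c, m[c]] - np.delete(M[:, m[c]], c).mean() for c in range(n_class)]
        best = max(best, float(np.mean(S)))
    return best
```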
Several simulations are performed to characterize the system (see equations 2.1–2.4) and the effects of different inputs to the basal and apical dendrites. We start with a system that implements an approximation to the spatial smoothness criterion. Next, we move on to a system that approximates the temporal smoothness criterion and explore the interaction with top-down signals. Finally, we show how lateral connectivity improves learning in the system.

3.1 Spatial Continuity. We explore a system with two input streams (see Figure 2A), which are treated as analyzing spatially neighboring but nonoverlapping parts of the visual scene. In this simulation, we first address principle 1 from above: the spatial continuity of stimuli. The input to each stream consists of rectangular stimuli characterized by their orientation and position. The orientations of the stimuli presented to the two streams are perfectly correlated, whereas stimulus position is not correlated. As a control, these assumptions are relaxed in additional simulations, and the results are reported further down. The activity of neurons in the first layer is directly set by the luminance of the stimulus. At the third level, the output of each stream projects to the apical dendrites in the other stream. Receptive fields of three neurons randomly chosen from the second layer are shown in Figure 2B and closely resemble individual input stimuli. They are position and orientation specific and thus localized in stimulus space (see Figure 2C). This can be understood from the absence of specific excitatory projections to the apical dendritic tree. As a consequence, calcium spikes are triggered by the depolarization induced by backpropagating action potentials only, and the learning rule applied is effectively equivalent to competitive Hebbian learning. The calcium spike dynamics here have the effect of providing a mechanism for competition.
Inhibitory interactions in the network enforce that different neurons learn different stimuli (Körding & König, 2000b). Thus, these neurons tend to cover the input space uniformly. Third-layer neurons show qualitatively different response properties (see Figure 2D). Each neuron responds selectively to stimuli of a single orientation, regardless of position. Reciprocal projections between the two streams targeting the apical dendrites make correlated stimuli more likely to trigger calcium spikes. As stimulus orientation is correlated across streams, neurons extract this variable and discard information on the precise position. Thus,
the "supervision" of each stream by the other leads to invariant response properties and approximates criterion 2.

3.2 Controls. To analyze the properties of the proposed system further, we investigated the coverage of stimulus space by neurons in layers 2 and 3. Figure 3A demonstrates that the standard deviations of total activity in layers 2 and 3 are small compared to the mean (5.3% and 6.5% in layers 2 and 3, respectively). Thus, the competition implemented by the inhibitory action on calcium bursts is sufficient in the second as well as the third layer to guarantee an even coverage of stimulus space.
As a next step, we compare the effects of different learning rates (see Figure 3B). To quantify convergence, the coherence CC (see equation 2.5) of responses between streams is determined as described in section 2. A low learning rate (g = 0.0005) leads to slow convergence (≈13,500 stimulus presentations until reaching CC = 0.75), but CC reaches high levels, saturating at 0.96 (mean over the last fourth of the simulation), very close to the theoretical maximum of 1. When 4 and 16 times higher learning rates are used (the 4 times larger learning rate is used in the other simulations), convergence is faster by a factor of 2 and 4 (≈7000 and ≈4000 stimulus presentations, respectively). However, CC is lower than before (0.94 and 0.88 for the 4 times and 16 times higher learning rates, respectively). We want to state explicitly that the system converges rather slowly, and theoretically it should be possible to learn invariances more quickly. Nevertheless, the system is robust enough to support extraction of coherent variables and development of invariant responses in a reasonable time. An important question is how sensitive the model is with respect to mixing the bottom-up input into the lateral learning signal. To investigate this question, we systematically changed the ratio (a) of the part of the learning signal determined by the stimulus to the part defined by the contextual input. In Figure 3C, it can be seen that at an a of about 1, implying that the effect of the cell's own activity on learning is of the order of the effect of the context cells, invariance generation breaks down. This furthermore leads to an experimentally testable hypothesis: the lateral and the somatic effects on calcium spike generation can be distinguished experimentally. It is also necessary to control that the same results could not be obtained using a neuron model with just one site of synaptic integration.
Figure 2: Facing page. Basic mode of processing. (A) The network consists of two separate streams. Only neurons on the highest layer have cross-stream connections. An example stimulus as given to the input layer is shown in gray scale. (B) The resulting receptive fields (columns of the weight matrix from the input layer) of three neurons in the second layer are shown gray-scale coded. Light shades indicate sensitive regions; dark areas indicate insensitive spots. (C) Response strength (firing rate A) as a function of stimulus position and orientation for layer 2 neurons. For stimuli with orientation given by the ordinate and position given by the abscissa, the activity of the corresponding neuron is shown on a gray scale (see section 2). The same gray scale is used for all three units. The neurons analyzed in C are identical to those in B. (D) The corresponding diagram as in C is shown for neurons in layer 3. Note the homogeneity of receptive fields along the horizontal axis, indicating translation-invariant response properties.

To test this, we constructed a simulation where the third-layer cells were exchanged for cells with just one site of synaptic integration (see equations 2.6–2.9). The relative contribution of the input from the other stream is gauged by the scaling parameter μ. At a scaling parameter μ of about 1, the activities
start to diverge, and no valid values can be obtained. Figure 3D shows that in the one-site model with small μ, the cells do not become properly invariant. For a higher μ of 0.6, the cells become more invariant, albeit still far less invariant than in the two-site model. But at μ = 0.6, 60 percent of a cell's activity is determined by the other stream, so the neurons are no longer local feature detectors. In learning networks of the architecture investigated here, where neurons perform only one synaptic integration, a trade-off thus seems to exist: either neurons act as local detectors and do not properly learn invariant responses, or they learn correctly but are no longer local detectors. This is why we argue that two sites of synaptic integration indeed improve processing. In the simulations described above, we assumed a perfect correlation of the orientation of stimuli presented to the two streams and a complete lack of correlation of position. Indeed, in naturally occurring stimuli, orientation is correlated much more strongly than position (Betsch, Einhäuser, Körding, & König, 2001). Nevertheless, the assumptions made in the above simulation might be too strong. Therefore, we investigated the performance of the learning rule while varying the correlation of orientation and position. The orientation of the stimulus presented to the first stream (ϑ₁) is randomly selected in the interval [0, π) as before. The orientation of the other stimulus is chosen uniformly distributed in the interval [ϑ₁ − π(1 − r)/2, ϑ₁ + π(1 − r)/2]. The parameter r allows the correlation of stimulus orientation to change smoothly, with r = 1 resulting in perfect correlation as used in most simulations presented here and r = 0 resulting in zero correlation. In Figure 3E, the dotted lines show that, depending on the strength of coupling between the two sites a, that is, when learning is determined to a large extent by the context, even big stimulus intervals (r = 0.25), corresponding to a small correlation, still allow learning of invariances.

Figure 3: Facing page. Controls. (A) The diagram shows total activity in layer 2 (left) and layer 3 (right) as a function of stimulus orientation and position. The small variations in shading indicate that the network activity is comparable for all stimuli. (B) The cross-stream correlation is shown (see section 2) for different learning rates. (C) The coupling of the two integration sites (a) is varied; measures of the bar specificity (see section 2) for position, s(position) (solid line), and for orientation, s(orientation) (dotted line), are plotted. (D) As a control, a simulation was performed where the neurons of the third layer exhibit just one site of synaptic integration (see section 2). The influence of lateral inputs to the cell is gauged by a parameter μ (low μ means weak lateral effects). The orientation specificity s(orientation) and position specificity s(position) are plotted against the influence parameter μ. (E) As in C, the bar specificities s(position) and s(orientation) are plotted, changing the position of stimuli between both streams (left column) or changing the orientation of stimuli between both streams (right column). The correlation of stimulus properties across the two streams (r) is varied for two values of a (a = 1, strong coupling of the integration sites, and a = 0.1, medium coupling). The solid and dotted lines show data from simulation runs where the correlation of stimulus position or orientation was varied, respectively. In either case, r = 0 corresponds to the parameters used in the other simulations. (F) Response properties of layer 3 neurons are shown when identical stimuli are presented to both streams.
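The orientation sampling for the two streams can be sketched as follows; taking the second orientation modulo π (orientations being circular with period π) is our assumption.

```python
import numpy as np

def draw_orientation_pair(r, rng=None):
    """theta1 uniform in [0, pi); theta2 uniform in
    [theta1 - pi*(1-r)/2, theta1 + pi*(1-r)/2], taken modulo pi.
    r = 1: perfect correlation; r = 0: no correlation."""
    rng = rng or np.random.default_rng()
    theta1 = rng.uniform(0.0, np.pi)
    half_width = np.pi * (1.0 - r) / 2.0
    theta2 = rng.uniform(theta1 - half_width, theta1 + half_width) % np.pi
    return theta1, theta2
```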
Along similar lines, we analyzed the influence of varying degrees of correlation in position on the learning of invariances. Again we observe a robust behavior of the learning rule, tolerating significant variations of the parameter r, here describing the correlation of position. In fact, it turns out that, at least for a strong coupling of the two integration sites (a = 1), a finite correlation of stimulus position is helpful in learning invariances. This can be understood by investigating the dynamics of learning. Weak correlation of position leads to a symmetry breaking, and the system leaves metastable states more quickly (data not shown). Obviously, if both variables are perfectly correlated, learning of invariances is no longer possible. Then neurons develop receptive fields specific in both space and orientation (see Figure 3F). Receptive fields are large, as they have to be in order for the set of neurons to cover stimulus space. But more important, the response properties of all neurons are selective for orientation as well as for position. Variables that are correlated across streams are considered relevant according to criterion 2. Uncorrelated variables induce invariant responses, and thus any information pertaining to these feature dimensions is discarded.

3.3 Temporal Continuity. In the simulations above, we used a pseudorandom sequence of stimuli with the time constants of the relevant variables set to zero. Thus, no interaction between subsequent stimuli occurred. On the other hand, the temporal continuity of objects in the real world also allows generating invariant responses. Here we analyze a system operating on this principle, consisting of one stream only (see Figure 4A).
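The ingredients of this simulation, a leaky trace at the apical dendrite and a slowly drifting stimulus orientation, can be sketched as follows; the recursive trace update is an equivalent form of the exponential sum defined in the next paragraph.

```python
import numpy as np

def leaky_trace(A_seq, tau_D=10):
    """D(t) = sum_{n>=0} A(t-n) * (1 - 1/tau_D)^n, in recursive form."""
    decay = 1.0 - 1.0 / tau_D
    D = 0.0
    for A in A_seq:
        D = A + decay * D
    return D

def orientation_walk(T, step=0.1, rng=None):
    """theta_new = (theta_old + step * pi * (rand - 0.5)) mod pi."""
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0.0, np.pi)
    thetas = []
    for _ in range(T):
        theta = (theta + step * np.pi * (rng.random() - 0.5)) % np.pi
        thetas.append(theta)
    return np.array(thetas)
```

With a constant input of 1, the trace saturates at the geometric-series limit τ_D, which motivates calling τ_D the time constant.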
The time constant of the potential at the apical dendrite is set to a finite value (τ_D = 10 iterations), so that D(t) = Σ_{n≥0} A(t − n) · (1 − 1/τ_D)ⁿ, and the stimulus orientation changes with the same time constant: ϑ_new = (ϑ_old + 0.1 · π · (rand − 0.5)) mod π, where rand is uniformly drawn from [0, 1). Figure 4B shows that neurons in the third layer acquire orientation-selective but position-invariant receptive fields. Thus, the effect of a finite time constant is comparable to the "supervising" input from a second stream, and the simulation using two streams can equally be interpreted as the slow temporal dynamics unfolded in space.

Figure 4: Learning from temporally smoothly varying patterns. (A) The network consists of one stream only. The apical dendrites have a finite time constant, and the orientation of the stimulus changes slowly. Note the similarities of the delayed interaction with the two-stream setup in Figure 3. (B) Response strength of layer 3 neurons shown as a function of stimulus orientation and position.

3.4 Top-Down Contextual Effects. The simulations described exploit temporal or spatial regularities of stimuli. Thus, they represent a bottom-up approach to the generation of invariances. As a next step, we explore whether top-down signals can exploit the same mechanism. Similar to the simulation above, a single processing stream is used (see Figure 5A). A set of 10 units is added, representing an independent source of information. Their response properties are specified to be position selective and orientation invariant, and the union of their receptive fields evenly covers stimulus space. Training the network with a pseudo-random sequence of stimuli as before leads to an interesting result. Receptive fields of neurons in layer 3 are specific with
respect to position but invariant with respect to orientation (see Figure 5B). This demonstrates that the type of invariance is not solely defined by the stimulus set; the neurons learn to transmit the part of the information that is correlated with the activity of the other set of neurons. Thus, this represents a special case of the "maximization of relevant information" principle proposed in Körding and König (2000a). As a next step, a combination of top-down and bottom-up invariance extraction is investigated. The network consists of two coupled streams, one of them receiving an additional projection from one position-selective neuron (see Figure 5C). This neuron can be activated by stimuli of any orientation, but from just one-quarter of the positions. The network is trained with stimuli correlated in orientation but not in position. One neuron in the left module of the third layer learned to extract the information relevant for the position-selective neuron. The other neurons picked up the information correlated across streams (see Figure 5D). However, these two types of input
to the apical dendrite show some undesired interaction. Due to the strong competition implemented by the local inhibitory interaction on the calcium spikes, each stimulus is learned by one neuron only. As the top-down input induces one orientation-invariant, position-selective receptive field, this position is cut out of the receptive fields of the remaining orientation-selective, position-invariant neurons. Thus, they are actually only partly position invariant. It remains a major issue for future research how to extract several
variables from several such principles simultaneously (Phillips et al., 1995). Nevertheless, given that we deliberately chose conflicting demands of top-down and bottom-up information, the network behaves reasonably well and shows that learning of invariance by spatial coherence can be complemented by top-down information on the relevance of different features. Apart from this obvious problem, this simulation demonstrates that lateral connections suffice to learn invariances and that these processes can be refined using top-down information.

3.5 Multiple Streams. In the simulations above, at most two streams were involved. This is, of course, a gross simplification; in the biological system, each cortical patch analyzing part of the visual field has many more neighbors. Therefore, we extend the network to include multiple streams and vary their number and the degree of correlation of the inputs to these streams. To allow more extensive simulations, we concentrate on the generation of invariant receptive fields in the step from layer 2 to layer 3 of the previous simulations. Thus, we do not deal with the feature extraction from layer 1 to layer 2 but set the activity levels of layer 2 neurons directly. The input is a preprocessed two-dimensional map. A stimulus consisting of a bar would activate mainly one layer 2 neuron. It is therefore represented by a binary pattern, with all entries set to zero except the one neuron with the corresponding orientation and position, which is set to one. Presentation of several overlapping bars is coded accordingly. Because feature extraction is not dealt with in this simulation, we drop the terms orientation and position and use the neutral descriptors class and instantiation, respectively. In the simulation, instantiations (positions) are not correlated between different streams, and the correlation between classes (orientations) is regulated by the parameter p_c.
Figure 5: Facing page. Effect of the relevant infomax. (A) The network matches a single stream as shown in Figure 2, with another set of neurons added above the higher layer of that stream. (B) The resulting responses of third-layer neurons as a function of orientation and position are shown. (C) The second network is the same as in Figure 2 except for another set of neurons added above the higher layer of the left stream. The neurons on the third layer receive input from the third layer of the other stream and from the highest layer. (D) Results with relevant infomax. The responses of the left stream's third-layer neurons are shown. The neurons on the right side develop position-selective and orientation-invariant response properties, whereas the other neurons respond in an orientation-specific and translation-invariant manner.

p_c represents the amount of spurious correlation of orientation due to the presence of multiple oriented stimuli (see section 2). At small p_c, one unit of a class is activated in all streams. At larger p_c, spurious correlations between different classes are introduced, making the learning task harder. In natural scenes, typically a large number of objects and features is seen at
the same time. We therefore consider these simulations an important step toward being able to deal with natural images. We use the class specificity measure (see section 2) to characterize the performance of the system. The specificity indices converge to a rather high value and stabilize there (see Figure 6B for p_c = 0.2, four streams). Thus, neurons in all streams typically learn to represent one class. Occasionally, two neurons in one stream do not code for a single class, but instead code for two classes each (see Figure 6C, p_c = 0.2, four streams). Such a metastable state can last for prolonged periods before typically collapsing into the correct solution (see Figure 6D, third panel). The convergence behavior and
receptive fields are shown for the typical case in Figure 6C. To analyze the behavior further, we varied p_c (0.01, 0.2, and 0.4) and determined the effect of the number of streams (see Figure 6D). Increasing the number of streams speeds up convergence. The number of effective streams in visual cortex can be estimated by the ratio of the extent of long-range connections to the minimal distance of neurons with nonoverlapping receptive fields. For the cat, we obtain a (linear) ratio of about 7 mm / 2 mm = 3.5 (Salin & Bullier, 1995), giving an estimate (π r² ≈ π · 3.5² ≈ 38) for the total number of streams. This high number places the cortical architecture in the parameter range where lateral connectivity allows much faster and more stable learning.

Figure 6: Facing page. Multiple streams. (A) Each stream consists of two layers: an input layer, where preprocessed sparse activity is present, projecting to the second layer. The modules of the second layer are connected to all other modules' second layers. The number of streams is varied and is either 2, 4, or 6. (B) The left panel shows the class specificity index (see section 2) in the typical case as a function of the iteration. The other panels show the corresponding receptive fields for each neuron. (C) The class specificity is shown for a case where not every neuron was associated with exactly one class. (D) The class specificity is shown for each stream, averaged over four runs. Each line represents one stream. The left-hand panel shows the average class specificity as a function of the iteration and the number of streams at p_c = 0.01. Note that the four-stream and the six-stream lines largely overlap. The middle panel shows that introducing spurious correlations (p_c = 0.2) slows the convergence. This effect is more pronounced for a small number of streams. The right-hand panel (p_c = 0.4) shows that the effect also holds at very strong false correlation.

4 Discussion

We have shown that a reasonable neuron model with two sites of synaptic integration can be used to implement the learning principles of spatial and temporal continuity. Furthermore, within the same framework, it can be combined with principles that approximate maximization of relevant information. Finally, lateral connections, as commonly seen in mammalian neocortex, speed up the learning process.

4.1 Approximations. In our implementation, several physiological aspects are obviously simplified. First, experiments on the physiological properties of dendrites of pyramidal neurons reveal a high degree of complexity (Johnston, Magee, Colbert, & Christie, 1996; Segev & Rall, 1998; Helmchen et al., 1999). Some theoretical studies argue for a superlinear interaction of postsynaptic signals in the dendrite (Softky, 1994). Other studies imply linear summation (Bernander, Koch, & Douglas, 1994; Cash & Yuste, 1999) or provide evidence for sublinear dendritic properties (Mel, 1993). Furthermore, the effectiveness of input via
the apical dendritic tree is debated. Obviously, it is not possible to subscribe to all these views simultaneously. In a gross simplification, we assume linear summation of all inputs to the basal dendritic tree, and the function of the apical dendrite is reduced to a threshold mechanism triggering calcium spikes. This choice avoids extreme views on the unresolved matters described above and also avoids obscuring the article with too many details and parameters. Second, in this article, we did not include an effect of calcium spikes on postsynaptic activity. However, it is known that calcium spikes can induce postsynaptic bursting activity (Larkum, Zhu, & Sakmann, 1999; Williams & Stuart, 1999; see Figure 1C). Within the framework of lateral and top-down projections terminating on the apical dendrite, this would result in contextual information not only gating learning but also influencing signal processing itself (Phillips et al., 1995; Kay, Floreano, & Phillips, 1998). Indeed, it has been shown that top-down projections can enhance the processing of sensory signals (Siegel, Körding, & König, 2000). Third, the inhibitory neurons are reduced to a linear effect on somatic activity and act as mediators of the winner-take-all mechanism on calcium spikes. Larkum, Zhu, and Sakmann (1999) found a dramatic effect of inhibition on the calcium spike generation mechanism. The activity of a single inhibitory neuron seems to be sufficient to prevent the triggering of calcium spikes. This effectively describes a winner-take-all mechanism, which we implemented algorithmically. (For an alternative implementation, see Körding & König, 2000b.) Fourth, in our simulations, we use a normalization scheme for the network activity. Such a procedure is often used and may be implemented by superlinear inhibition. As currents mediated by GABA_B receptors are observed at high presynaptic firing rates (Kim, Sanchez-Vives, & McCormick, 1997), this is a promising candidate mechanism.
Finally, in this work, a rate-coding system was implemented, ignoring the precise timing of action potentials. Indeed, in several of the experiments cited above (Markram, Lübke, Frotscher, & Sakmann, 1997; Larkum, Kaiser, & Sakmann, 1999), synaptic plasticity has been found to depend on the relative timing of afferent and backpropagating action potentials. Evidence is available that the mean firing rate of cortical neurons maps directly onto the relative timing of action potentials (König et al., 1995). Thus, our approximation appears reasonable, and from existing knowledge, we can be optimistic that the effects observed here hold up in a more detailed simulation study.

4.2 Comparison with Other Unsupervised Systems That Learn Invariances. As discussed in section 1, several possible criteria can be used to
learn invariances. The spatial and the temporal criteria are commonly used (Hinton, 1989; Földiák, 1991; Becker & Hinton, 1992; Stone & Bray, 1995; cf. Becker, 1996, 1999; Kay et al., 1998). Those studies at first sight do not seem to map directly onto known physiology. We do not claim to outperform those studies with respect to convergence speed or the maximization of a goal function, but rather investigate physiological mechanisms that may be used by the brain to implement algorithms like these. Nevertheless, we do hope that investigating mechanisms used by the brain will eventually help us conceive algorithms that outperform today's. The main virtue of the learning rule proposed here is that it explicitly addresses the existence of a trade-off between correctly representing local stimulus features and correctly learning invariant representations. Furthermore, it allows smooth integration of top-down information with contextually guided feature extraction, resulting in the extraction of relevant information. Learning invariant representations from natural stimuli is an important line of research that is currently becoming feasible (Hyvärinen & Hoyer, 2000). O'Reilly and Johnson (1994) indicate that at least the temporal smoothness criterion could be implemented in a model with one site of synaptic integration, where delayed recurrent excitatory activity takes the role of supplying the signal necessary for learning. In their approach, the same trade-off is likely to hold: with weak delayed excitation, the learning rule is effectively Hebbian, and with strong excitation, neurons are not local feature detectors but have temporal low-pass characteristics. Eisele (1997) presents an interesting approach extending the temporal smoothness criterion to cases in which transitions are not symmetric. Two types of dendrites represent states a system can go to or can come from. These different kinds of properties are then mapped onto apical and basal dendrites. Rao and Sejnowski (2000) propose a system where temporally predictive coding in a recurrent neural network leads to the generation of invariant-response properties.
This model, from our point of view, uses a variant of the temporal smoothness criterion, and we assume it to be subject to the trade-off described above as well. Mel, Ruderman, and Archie (1998) investigate the development of complex receptive fields. Modeling cortical pyramidal neurons, they present bars of identical orientation at different positions and apply Hebbian learning. Their approach thus needs a mechanism ensuring that neurons learn only when the same orientation is shown. Thus, a mechanism to gate learning is needed; our approach may be considered a physiological implementation of such a mechanism. We must note, though, that due to its derivation from physiological results, it is not directly amenable to mathematical analysis, and thus it is not possible to write down its goal function explicitly. By virtue of the physiological implementation, we nevertheless obtain a unified framework allowing us to use temporal continuity and spatial continuity for the development of invariant-response properties and to combine this with a bias toward relevant information supplied by top-down signals.

4.3 Cortical Building Block. Analyzing the mammalian neocortex shows an astonishing degree of similarity across areas as well as across animal species. It is a shared feeling of many neuroscientists that with this anatomical similarity should go a common system for computing and learning. "The
2846
Konrad P. Körding and Peter König
typical wiring of the cortex, which is invariant irrespective of local functional specialization, must be the substrate of a special kind of operation which is typical for the cortical level” (Braitenberg, 1978). The principles proposed for supporting unsupervised learning of invariances should be part of such a common cortical building block; these principles in turn may shed some light on the significance of a prominent property of the architecture of cortical columns, the apical dendrite of pyramidal neurons.

Acknowledgments
We thank Matthew E. Larkum for kindly supplying Figure 1. We also thank the Boehringer Ingelheim Fonds and the Swiss National Fonds for financial support.

References

Amitai, Y., Friedman, A., Connors, B. W., & Gutnick, M. J. (1993). Regenerative activity in apical dendrites of pyramidal cells in neocortex. Cereb. Cortex, 3, 26–38.
Artola, A., Bröcher, S., & Singer, W. (1990). Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature, 347, 69–72.
Barnard, E., & Casasent, D. P. (1991). Invariance and neural nets. IEEE Transactions on Neural Networks, 2, 489–508.
Becker, S. (1996). Models of cortical self-organisation. Network: Comput. Neural Syst., 7, 7–31.
Becker, S. (1999). Implicit learning in 3D object recognition: The importance of temporal context. Neur. Comput., 11, 347–374.
Becker, S., & Hinton, G. E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163.
Bernander, O., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic input to cortical pyramidal cells. J. Neurophysiol., 72, 2743–2753.
Betsch, B. Y., Einhäuser, W., Körding, P. K., & König, P. (2001). What a cat sees—statistics of natural videos. Manuscript submitted for publication.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Booth, M. C., & Rolls, E. T. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cereb. Cortex, 8, 510–523.
Bradski, G., & Grossberg, S. (1995). Fast-learning viewnet architectures for recognizing three-dimensional objects from multiple two-dimensional views. Neural Networks, 8, 1053–1800.
Braitenberg, V. (1978). Cortical architectonics: General and areal. In M. A. B. Brezier & H. Petsch (Eds.), Architectonics of the cerebral cortex. New York: Raven.
Invariance Learning with Two Sites of Integration
2847
Buzsaki, G., & Kandel, A. (1998). Somadendritic backpropagation of action potentials in cortical pyramidal cells of the awake rat. J. Neurophysiol., 79, 1587–1591.
Cash, S., & Yuste, R. (1999). Linear summation of excitatory inputs by CA1 pyramidal neurons. Neuron, 22, 383–394.
Cauller, L. J., & Connors, B. W. (1994). Synaptic physiology of horizontal afferents to layer I in slices of rat SI neocortex. J. Neurosci., 14, 751–762.
de Sa, V. R., & Ballard, D. H. (1998). Category learning through multimodality sensing. Neur. Comp., 10, 1097–1117.
Dudek, S. M., & Bear, M. F. (1991). Homosynaptic long term depression in area CA1 of hippocampus and the effects on NMDA receptor blockade. Proc. Natl. Acad. Sci. U.S.A., 89, 4363–4367.
Eisele, M. (1997). Unsupervised learning of temporal constancies by pyramidal-type neurons. In S. W. Ellacott, J. C. Mason, & I. J. Anderson (Eds.), Mathematics of neural networks (pp. 171–175). Norwell, MA: Kluwer.
Felleman, D., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex, 1, 1–47.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Comput., 3, 194–200.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36, 193–202.
Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1, 119–130.
Fukushima, K. (1999). Self-organization of shift-invariant receptive fields. Neural Networks, 12, 791–801.
Graziano, M. S., & Gross, C. G. (1998). Spatial maps for the control of movement. Curr. Opin. Neurobiol., 8, 195–201.
Hebb, D. O. (1949). The organisation of behaviour. New York: Wiley.
Helmchen, F., Svoboda, K., Denk, W., & Tank, D. W. (1999). In vivo dendritic calcium dynamics in deep-layer cortical pyramidal neurons. Nat. Neurosci., 2, 989–996.
Hinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In G. Goos & J. Hartmanis (Eds.), PARLE: Parallel architectures and languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Hirsch, J. A., Alonso, J. M., & Reid, R. C. (1995). Visually evoked calcium action potentials in cat striate cortex. Nature, 378, 612–616.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Hubel, D. H., & Wiesel, T. N. (1998). Early exploration of the visual cortex. Neuron, 20, 401–412.
Hyvarinen, A., & Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Comput., 12, 1705–1720.
Johnston, D., Magee, J. C., Colbert, C. M., & Cristie, B. R. (1996). Active properties of neuronal dendrites. Ann. Rev. Neurosci., 19, 165–186.
Kay, J., Floreano, D., & Phillips, W. A. (1998). Contextually guided unsupervised learning using local multivariate binary processors. Neural Networks, 11, 117–140.
Kim, U., Sanchez-Vives, M. V., & McCormick, D. A. (1997). Functional dynamics of GABAergic inhibition in the thalamus. Science, 278, 130–134.
Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71, 856–867.
König, P., Engel, A. K., Roelfsema, P. R., & Singer, W. (1995). How precise is neuronal synchronization? Neural Comp., 7, 469–485.
Körding, K. P., & König, P. (2000a). Learning with two sites of synaptic integration. Network: Comput. Neural Syst., 11, 1–15.
Körding, K. P., & König, P. (2000b). A learning rule for dynamic recruitment and decorrelation. Neural Networks, 13, 1–9.
Larkum, M. E., Kaiser, K. M., & Sakmann, B. (1999). Calcium electrogenesis in distal apical dendrites of layer 5 pyramidal cells at a critical frequency of backpropagating action potentials. Proc. Natl. Acad. Sci. USA, 96, 14600–14604.
Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398, 338–341.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput., 1, 541–551.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70, 1086–1101.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant tuning in visual “complex” cells could derive from intradendritic computations. J. Neurosci., 18, 4325–4334.
O'Reilly, R. C., & Johnson, M. H. (1994). Object recognition and sensitive periods: A computational analysis of visual imprinting. Neur. Comp., 6, 357–389.
Perrett, D. I., Oram, M. W., Harries, M. H., Bevan, R., Hietanen, J. K., Benson, P. J., & Thomas, S. (1991). Viewer-centred and object-centred coding of heads in the macaque temporal cortex. Exp. Brain. Res., 86, 159–173.
Phillips, W. A., Kay, J., & Smyth, D. (1995). The discovery of structure by multi-stream networks of local processors with contextual guidance. Network: Comput. Neural Syst., 6, 225–246.
Pike, F. G., Meredith, R. M., Olding, A. W. A., & Paulsen, O. (1999). Postsynaptic bursting is essential for “Hebbian” induction of associative long-term potentiation at excitatory synapses in rat hippocampus. J. Phys., 518, 571–576.
Rao, R., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.).
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat. Neurosci., 2, 1019–1025.
Rolls, E. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Phil. Trans. R. Soc. Lond. B, 335, 11–21.
Rolls, E., & Treves, A. (1997). Neural networks and brain function. Oxford: Oxford University Press.
Salin, P. A., & Bullier, J. (1995). Corticocortical connections in the visual system: Structure and function. Physiol. Rev., 75, 107–154.
Schiller, J., Schiller, Y., Stuart, G., & Sakmann, B. (1997). Calcium action potentials restricted to distal apical dendrites of rat neocortical pyramidal neurons. J. Physiol. (Lond.), 505, 605–616.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines: Earlier theoretical insights elucidate recent direct observations. Trends Neurosci., 21, 453–460.
Siegel, M., Körding, K. P., & König, P. (2000). Integrating bottom-up and top-down sensory processing by somato-dendritic interactions. J. Comput. Neurosci., 8, 161–173.
Softky, W. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Stone, J. V., & Bray, A. J. (1995). A learning rule for extracting spatio-temporal invariances. Network: Comput. Neural Syst., 6, 429–436.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72.
Stuart, G. J., Schiller, J., & Sakmann, B. (1997). Action potential initiation and propagation in rat neocortical pyramidal neurons. J. Physiol. (Lond.), 505, 617–632.
Stuart, G. J., & Spruston, N. (1998). Determinants of voltage attenuation in neocortical pyramidal neuron dendrites. J. Neurosci., 18, 3501–3510.
Wang, G., Tanaka, K., & Tanifuji, M. (1996). Optical imaging of functional organization in the monkey inferotemporal cortex. Science, 272, 1665–1668.
Williams, S. R., & Stuart, G. J. (1999). Mechanisms and consequences of action potential burst firing in rat neocortical pyramidal neurons. J. Physiol. (Lond.), 521, 467–482.
Zeki, S., & Shipp, S. (1988). The functional logic of cortical connections. Nature, 335, 311–317.

Received March 23, 2000; accepted March 7, 2001.
LETTER
Communicated by Yann LeCun
Linear Constraints on Weight Representation for Generalized Learning of Multilayer Networks Masaki Ishii [email protected] Itsuo Kumazawa [email protected] Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8552, Japan In this article, we present a technique to improve the generalization ability of multilayer neural networks. The proposed method introduces linear constraints on weight representation based on the invariance natures of training targets. We propose a learning method that introduces effective linear constraints into an error function as a penalty term. Furthermore, introduction of such constraints leads to reduction of the VC dimension of neural networks. We show bounds on the VC dimension of the neural networks with such constraints. Finally, we demonstrate the effectiveness of the proposed method by some experiments. 1 Introduction
In applications of multilayer neural networks, their generalization ability, that is, how well they work for novel inputs, is important. In general, neural networks are trained to realize the input-output relations of a training set using the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). However, it is not always ensured that they work correctly for novel inputs not contained in the set. A number of methods have been investigated to improve the generalization ability of neural networks. Among such methods, regularization techniques have been extensively studied (Plaut, Nowlan, & Hinton, 1986; Weigend, Rumelhart, & Huberman, 1991; Ishikawa, 1996). Furthermore, it has been considered that better generalization ability can be expected by removing irrelevant weights and hidden units; from this idea, pruning methods that remove irrelevant weights were proposed (LeCun, Denker, & Solla, 1990; Hassibi & Stork, 1993). In addition, one approach to improving generalization ability is to construct networks whose outputs are invariant with respect to some transformations of the input patterns. For example, in the case of character recognition, character patterns should be classified to the same category even if they are translated, dilated, or rotated. A regularization method to construct networks whose outputs are invariant with respect to small distortions of input patterns

Neural Computation 13, 2851–2863 (2001) © 2001 Massachusetts Institute of Technology
2852
Masaki Ishii and Itsuo Kumazawa
was proposed (Simard, Victorri, LeCun, & Denker, 1992). Furthermore, relationships between the approach of such a regularization method and the approach of enhancing the training patterns by adding transformed patterns were discussed (Leen, 1995). In this article, we take the same approach to improving generalization ability; that is, we construct neural networks whose outputs are invariant with respect to some transformations of the input patterns. We use linear transformations as approximations of such distortions of the input patterns. We show that the invariance natures are formulated as linear constraints on weight representation. If the distortions of the input patterns are small, the condition on weight representation can also be derived using the idea given in Simard et al. (1992). Then we give a learning method that introduces such requirements on weight representation as a penalty term of an error function. A special case of our method can be considered as weight decay (Plaut et al., 1986). Furthermore, for the restricted case of linear transformations, bounds on the VC dimension (Vapnik, 1982) of neural networks can be derived theoretically. We show upper and lower bounds on the VC dimension of the neural networks with such constraints. Finally, in order to verify the effectiveness of the proposed method, we show some experimental results.

2 Network Formulation
Let us consider a neural network with n inputs, a hidden layer of h units, and an output layer of s units. By x = (1, x_1, …, x_n)^t, which includes a constant input 1, we denote an input pattern, where x^t is the transpose of x. Let w_i = (w_{i0}, w_{i1}, …, w_{in})^t ∈ R^{n+1} be the weight vector of the ith hidden unit; among the weights, w_{i0} corresponds to a threshold value. The output of the ith hidden unit is defined as φ(w_i^t x), where φ(·) is the activation function of the units. The network outputs are formulated as

\[ f_j(x) = \varphi\!\left( v_{j0} + \sum_{i=1}^{h} v_{ji}\, \varphi(w_i^t x) \right), \qquad j = 1, \dots, s, \tag{2.1} \]
where v_{j0} is the threshold value of the jth output unit and v_{ji} is the weight between the jth output unit and the ith hidden unit. Let {x_p}_{p=1}^{N} be a set of training patterns; we denote it by X. Neural networks are generally trained by minimizing an error function given by

\[ J = \sum_{x \in X} \sum_{j=1}^{s} \big( t_j(x) - f_j(x) \big)^2, \tag{2.2} \]

where t_1(x), …, t_s(x) are the desired outputs for x ∈ X. This error function, however, does not ensure that neural networks can work correctly for novel
Linear Constraints on Weight Representation for Generalization
2853
input patterns. In order to improve generalization ability, we consequently need another criterion or strategy that leads to correct behavior on nontraining patterns.

3 Proposed Method
As one way to improve generalization ability, we can construct neural networks whose outputs are invariant with respect to transformations that should not affect them. In this article, we assume that there exist some invariance natures of the training targets, formulated by an operator g, such that f_j(x) = f_j(g(x, α)) holds; g(x, α) transforms x as a function of α, with g(x, 0) = x. We approximate g as follows. Given that g is analytic in the parameter α, the Taylor expansion of g around α = 0 can be written as

\[ g(x, \alpha) = g(x, 0) + \alpha \left. \frac{\partial g(x, \alpha)}{\partial \alpha} \right|_{\alpha = 0} + O(\alpha^2) \tag{3.1} \]

or

\[ g(x, \alpha) = x + \alpha A(x, \alpha) + O(\alpha^2). \tag{3.2} \]

Let us suppose that A(x, α) can be expressed as A x, where A is a constant matrix. By H_α we denote the linear transformation that approximates g, defined as

\[ H_\alpha = E + \alpha A, \tag{3.3} \]
where E is the unit matrix. In order to construct networks whose outputs are invariant with respect to H_α, the error function defined as

\[ \sum_{x \in X} \sum_{j=1}^{s} \left( \big( t_j(x) - f_j(x) \big)^2 + \sum_{H_\alpha} \big( f_j(x) - f_j(H_\alpha x) \big)^2 \right) \tag{3.4} \]

needs to be minimized. The second term evaluates the invariance of the network outputs with respect to H_α. To evaluate this second term without transformed patterns, we utilize the following condition on weight representation. A sufficient condition under which the network outputs are invariant with respect to H_α, that is, f_j(x) = f_j(H_α x) for any x, is to satisfy

\[ \big( H_\alpha^t - E \big) w_i = O, \qquad i = 1, \dots, h. \tag{3.5} \]
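This condition is easy to check numerically. The following sketch is a toy illustration (the matrix A, the dimension, and all values are invented, not taken from the text): it verifies that a hidden weight vector w in the null space of A^t, and hence satisfying (H_α^t − E) w = O, yields a net input w^t x that is unchanged by H_α = E + αA.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3                                   # toy input dimension (invented)
# A rank-1 constraint matrix A (invented values, for illustration only).
A = np.outer([1.0, 2.0, -1.0], [0.5, -1.0, 2.0])

# Choose a hidden weight vector w in the null space of A^t, so that
# (H_a^t - E) w = a * A^t w = O holds for every a.
_, _, Vt = np.linalg.svd(A.T)
w = Vt[-1]                              # right singular vector with singular value 0
assert np.allclose(A.T @ w, 0.0)

E = np.eye(n)
for a in (0.1, 0.5, 2.0):               # several magnitudes of the distortion
    H = E + a * A                       # the linear transformation H_a = E + a A
    x = rng.normal(size=n)              # an arbitrary input pattern
    # The hidden-unit net input w^t x is unchanged by the transformation,
    # so the unit's output, and hence f_j, is invariant as well.
    assert np.isclose(w @ x, w @ (H @ x))
```

Because the check holds for every α, the single algebraic condition A^t w_i = O replaces the whole family of transformed-pattern terms.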
In effect, if this condition is satisfied, then w_i^t x = w_i^t H_α x, i = 1, …, h, hold for any x, and therefore f_j(x) = f_j(H_α x) holds. From equations (3.3) and (3.5), we have

\[ A^t w_i = O, \qquad i = 1, \dots, h. \tag{3.6} \]
Equation (3.6) means that the invariance natures of the training targets can be formulated as linear constraints among the weights of the hidden units. It can also be regarded as a way of restricting the class of functions. Let r be the rank of A; then r indicates the degree of the constraints. In case the distortions of the input patterns are small and φ(·) is differentiable, this condition can also be derived using the idea of the tangent prop algorithm (Simard et al., 1992). If f_j(x) is differentiable, then the condition under which the network outputs are invariant with respect to g(x, α) is ∂f_j(g(x, α))/∂α = 0. Moreover, ∂φ(w_i^t g(x, α))/∂α = 0, i = 1, …, h, also leads to the invariances. If the distortions of the input patterns are small, then we have

\[ \left. \frac{\partial \varphi(w_i^t g(x, \alpha))}{\partial \alpha} \right|_{\alpha = 0} = \varphi'(w_i^t x) \lim_{\alpha \to 0} \frac{w_i^t H_\alpha x - w_i^t x}{\alpha - 0} = \varphi'(w_i^t x)\, w_i^t A x = 0. \tag{3.7} \]
From this, the same condition as equation (3.6) is derived. Now we are ready to define an error function. We evaluate the second term of equation (3.4) by the sum of the squared norms of equation (3.6), that is,

\[ K = \sum_{i=1}^{h} \| A^t w_i \|^2. \tag{3.8} \]

By replacing the second term of equation (3.4) with K, the error function is given by

\[ J_K = J + \gamma K, \tag{3.9} \]
where γ is a positive constant. By training neural networks through minimizing J_K, we improve their generalization ability. Here, A models volumes around each point in the input space. This is less specific than a priori tangent vectors, and it has more regularizing power, because the space generated by A is larger than the space generated by the typically few tangent vectors.

4 The VC Dimension of Neural Networks with Linearly Constrained Weights
When the linear constraints on weight representation are fully satisfied, it is theoretically possible to derive bounds on the VC dimension of the neural
networks (Ishii & Kumazawa, 2000). The VC dimension is related to the generalization ability of neural networks. In this section, we consequently show bounds on the VC dimension of neural networks in the case that K is zero. Although in practical cases K cannot be ideally zero, the theoretical evaluation in this section provides some idea of the effectiveness of introducing the linear constraints on weight representation. Let us consider neural networks with linear threshold units and only one output unit. In order to derive bounds on the VC dimension of the neural networks with the linear constraints defined by equation (3.6), we construct other neural networks that have no dependencies among their weights but have memorizing capability equivalent to that of the networks with constrained weights. Then, by applying existing approaches to the converted networks, we derive bounds on the VC dimension. Let F be the class of all functions mapping R^n to {0, 1} that are realized by the neural networks with the linear constraints among weights. As the neural networks with equivalent memorizing capability, let us consider neural networks with n − r inputs, a single layer of h hidden units, and only one output unit. Let G be the class of all functions realized by such neural networks. Moreover, let V be an (n − r + 1) × (n + 1) matrix, and let Y = {y = V x | x ∈ X} be a set of input patterns, where the first component of y is fixed to be 1. Under these notations, we have the following lemma:

Lemma 1. The number of distinct dichotomies of X induced by f ∈ F is equal to that of Y induced by g ∈ G.

As for the VC dimension of F, we have the following theorem from lemma 1:

Theorem 1. Let F and G be as in lemma 1. Then the VC dimension of F is equal to that of G.

This theorem means that we may derive the VC dimension of G instead of that of F. By applying existing approaches (Baum & Haussler, 1989; Sakurai, 1992), the VC dimension of G is bounded by

\[ (n - r)h + 1 \;\le\; \mathrm{VCdim}(G) \;\le\; 2 W' \log(eU), \tag{4.1} \]

where e is the base of the natural logarithm, log is the logarithm to base 2, U is the total number of linear threshold units, and W' = W − rh (W is the total number of weights and threshold values of f). The value of r indicates the degree of reduction of the VC dimension of the neural networks with linear constraints. Consequently, introducing linear constraints based on the invariance natures of the training targets is equivalent to reducing the VC dimension of the neural networks.
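The bounds of equation (4.1) are straightforward to evaluate for a concrete architecture. The sketch below uses hypothetical sizes; the counts of W and U assume a single hidden layer and one output unit, as in the text.

```python
import math

def vc_bounds(n, h, r):
    """Evaluate the bounds of equation (4.1) for a network with n inputs,
    h hidden linear threshold units, one output unit, and linear
    constraints of rank r on each hidden weight vector."""
    W = h * (n + 1) + (h + 1)        # all weights and threshold values
    U = h + 1                        # total number of linear threshold units
    W_prime = W - r * h              # W' = W - rh: free parameters remaining
    lower = (n - r) * h + 1
    upper = 2 * W_prime * math.log2(math.e * U)
    return lower, upper

# Hypothetical example: 10 inputs, 4 hidden units, constraints of rank 3.
lo_free, hi_free = vc_bounds(10, 4, 0)   # unconstrained network
lo_con, hi_con = vc_bounds(10, 4, 3)     # constrained network
```

Both the lower and the upper bound shrink as the rank r of the constraints grows, which is the sense in which the constraints reduce the VC dimension.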
5 Learning Algorithm
In this section, we give the weight-update rule. Let φ(·) be the sigmoid function φ(u) = 1/(1 + exp(−u)). Then, applying the gradient-descent method to minimize J_K, we have the following weight-update rule:

\[ \Delta v_{ji} = -\epsilon \frac{\partial J_K}{\partial v_{ji}} = -\epsilon \frac{\partial J}{\partial v_{ji}}, \tag{5.1} \]

\[ \Delta w_i = -\epsilon \nabla_{w_i} J_K = -\epsilon \nabla_{w_i} J - 2\epsilon' A A^t w_i, \tag{5.2} \]
where ε is the learning-rate parameter and ε′ = εγ. ∇_{w_i} J_K and ∇_{w_i} J are the gradients of J_K and J with respect to w_i, respectively. A special case of the proposed method corresponds to weight decay (Plaut et al., 1986), one of the regularization methods: if A is an orthogonal matrix, then Δw_i = −ε ∇_{w_i} J − 2ε′ w_i. If weight decay is applied, the weight parameters are encouraged to be small; in the proposed method, on the other hand, no constraints are introduced if w_i ∈ N(A^t), where N(A^t) is the null space of A^t.

6 Simulations
In order to verify the effectiveness of our method, we show some experimental results in this section.

6.1 Handwritten Digits Recognition. We first show the results of a fundamental experiment, in which we applied our method to a handwritten digits recognition problem. Each pattern is a binary image of 32 × 32 pixels and is classified to one of 10 categories. We intended to construct a network whose outputs are invariant with respect to horizontal translations by one or two pixels. We used a set of 200 patterns for training and evaluated the generalization performance using two test sets: a set of 800 patterns generated by horizontally translating the training patterns by one or two pixels, and a joint set of 1000 nontraining patterns and 4000 patterns generated by horizontally translating them by one or two pixels. In this experiment, we did not use the approximation of g defined in equation (3.3); instead, we used four linear transformations that shift every row of an image to the left or right by one or two pixels. The linear transformations are expressed by

\[ \begin{pmatrix} M & & O \\ & \ddots & \\ O & & M \end{pmatrix}, \tag{6.1} \]
where M is a square matrix that circularly shifts a row of an image. Let {H_(k)}_{k=1}^{4} be the matrices that correspond to these linear transformations. In this case, the sufficient condition under which f_j(x) = f_j(H_(k) x) holds for any x is to satisfy (H_(k)^t − E) w_i = O, so we used the error function given by

\[ J_K = J + \gamma \sum_{k=1}^{4} \sum_{i=1}^{h} \left\| \left( H_{(k)}^t - E \right) w_i \right\|^2. \tag{6.2} \]
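The four transformations and the penalty of equation (6.2) can be assembled as in the following sketch. The image size and the number of hidden units are toy values, and the threshold weights are omitted; this is an illustration of the construction, not the authors' code.

```python
import numpy as np

def circular_shift_matrix(width, s):
    """Permutation matrix that circularly shifts a row of `width` pixels by s."""
    M = np.zeros((width, width))
    for j in range(width):
        M[(j + s) % width, j] = 1.0
    return M

def row_shift_transform(height, width, s):
    """Block-diagonal matrix diag(M, ..., M) of equation (6.1): shifts every
    row of a height-by-width image (row-major flattening) by s pixels."""
    return np.kron(np.eye(height), circular_shift_matrix(width, s))

def shift_penalty(W_hidden, transforms):
    """Penalty of equation (6.2): the sum over the transformations H_(k) and
    the hidden units i of ||(H_(k)^t - E) w_i||^2."""
    n = W_hidden.shape[1]
    K = 0.0
    for H in transforms:
        D = H.T - np.eye(n)
        K += np.sum((W_hidden @ D.T) ** 2)   # row i of W_hidden is w_i^t
    return K

# Toy setting: 4x4 images, two hidden units, shifts by one or two pixels.
height, width = 4, 4
rng = np.random.default_rng(1)
Hs = [row_shift_transform(height, width, s) for s in (-2, -1, 1, 2)]

W_random = rng.normal(size=(2, height * width))
# A weight vector that is constant within each pixel row is unaffected by
# horizontal circular shifts, so it incurs zero penalty.
W_const = np.repeat(rng.normal(size=(2, height)), width, axis=1)
```

Random weight vectors incur a positive penalty, while row-constant (horizontally shift-invariant) weight vectors incur none, which is exactly the behavior the penalty term is meant to enforce during training.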
The backpropagation algorithm and the tangent prop algorithm were also applied to this problem for comparison. In the tangent prop algorithm, the tangent vectors were computed for each of the training patterns. We used a neural network with 1024 inputs, one layer of 20 hidden units, and 10 output units; each output unit corresponded to one of the 10 categories. Each algorithm used the same network structure as the proposed method. Ten runs were performed using the same training patterns but different initial weight configurations. The performance of the networks trained by the proposed method and by the tangent prop algorithm is shown in Figure 1. Here, the regularization parameter of tangent prop is denoted by γ_t. The simulation results are summarized in Table 1. We can see from Table 1 that better generalization ability with respect to horizontal translations is acquired by the proposed method.

6.2 Simulations Using Statistical Techniques. We next show some experimental results in which the linear constraints are determined based on knowledge obtained by statistical techniques when the exact transformations
Figure 1: Performance of the proposed method and the tangent prop algorithm on the handwritten digits recognition problem. (Left) Average recognition rates by the proposed method. (Right) Average recognition rates by the tangent prop algorithm, where γ_t is the regularization parameter of tangent prop. Circles denote average recognition rates on test set 1, and squares denote average recognition rates on test set 2.
Table 1: Summary of Generalization Performance of Three Techniques on the Handwritten Digits Recognition Problem.

Learning Algorithm          % Error on Test Set 1 (a)   % Error on Test Set 2 (b)
Backpropagation             9.28 ± 0.46%                14.23 ± 0.72%
Tangent prop (γ_t = 20)     2.61 ± 0.36%                9.61 ± 0.63%
Proposed method (γ = 1)     0.63 ± 0.18%                8.65 ± 0.66%

Notes: (a) A set of 800 patterns generated by horizontally translating training patterns. (b) A joint set of nontraining patterns and ones generated by translating them.
are not known. We use discriminant analysis (DA) and principal component analysis (PCA). By applying them, feature vectors that exist in the input space are obtained; such vectors are often used to reduce the input dimension. In this article, we make the assumption that recognition results do not change under small distortions that are perpendicular to the subspace spanned by the feature vectors, and we determine the linear constraints based on this knowledge. We define H_α as

\[ H_\alpha = E + \alpha P, \tag{6.3} \]

where

\[ P = \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & & \\ 0 & & P_r \end{pmatrix} \tag{6.4} \]

and P_r is the orthogonal projection onto the orthogonal complement of the subspace spanned by the n-dimensional feature vectors. The error function consequently is given by

\[ J_K = J + \gamma \sum_{i=1}^{h} \| P w_i \|^2. \tag{6.5} \]
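As a minimal sketch of the PCA variant (the data are synthetic and all sizes are invented): the top-m eigenvectors of the sample covariance stand in for the feature vectors, P_r is the projection defined above, and the penalty of equation (6.5) is evaluated on the input part of the hidden weights.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 200 patterns in a 10-dimensional input space whose
# variance is concentrated along the first two coordinate directions.
n, m = 10, 2
scales = np.array([5.0, 4.0] + [0.3] * (n - m))
data = rng.normal(size=(200, n)) * scales

# The top-m eigenvectors of the sample covariance matrix stand in for
# the feature vectors obtained by PCA.
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))  # ascending order
e = eigvecs[:, -m:]                                            # e_1, ..., e_m

# P_r = E - sum_k e_k e_k^t: orthogonal projection onto the orthogonal
# complement of the subspace spanned by the feature vectors.
P_r = np.eye(n) - e @ e.T

def pca_penalty(W_hidden):
    """sum_i ||P_r w_i||^2 over the input part of the hidden weight vectors
    (the first row and column of P in equation (6.4) are zero, so the
    threshold weights w_i0 are left unconstrained)."""
    return float(np.sum((W_hidden @ P_r) ** 2))  # P_r is symmetric

# A weight vector lying in the feature subspace incurs no penalty,
# whereas a generic weight vector does.
w_feature = e[:, 0][None, :]
w_random = rng.normal(size=(1, n))
```

Minimizing J with this penalty therefore pushes the hidden weight vectors toward the subspace spanned by the feature vectors without shrinking the components inside it, in contrast to plain weight decay.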
6.2.1 DNA Splicing Site Problem. We applied our method to a real data set called “gene1,” which is included in the collection of neural network benchmark problems Proben1 (Prechelt, 1994). This data set comes from the splice-junction problem data set of the UCI repository of machine learning databases. Each pattern is a DNA subsequence consisting of 60 nominal elements, and the purpose of this problem is to classify each DNA subsequence into one of three categories from a window of 60 DNA subsequence elements. Each DNA subsequence element, a four-valued nominal attribute, is encoded by two binary inputs,
so that there exist 120 inputs. There are 3175 patterns in this data set, and they were divided into three sets: a training set (1588 patterns), a data set to determine P (793 patterns), and a test set (792 patterns). By using the set of 793 patterns, two discriminant vectors were obtained. The vectors were transformed to an orthonormal basis {e_k}_{k=1}^{2} using the Gram-Schmidt orthonormalization procedure. The matrix P_r was then defined as E − Σ_{k=1}^{2} e_k e_k^t. For comparison, we also applied the backpropagation algorithm and weight decay to this problem. In weight decay, an error function given by J + γ_w (Σ_i ‖w_i‖² + Σ_{j,i} v_{ji}²) was used, where γ_w is a positive constant. We used a neural network with 120 inputs, a single layer of four hidden units, and three output units; each output unit corresponded to one of the three categories. Each algorithm used the same network structure as the proposed method. The learning rate ε was set to 0.001 for each algorithm. Ten runs were performed using the same training patterns but different initial weight configurations. Figure 2 shows the performance of the networks trained by the proposed method and by weight decay. A glance at Figure 2 reveals that the recognition rates on the test set by the proposed method using DA are higher than those by weight decay. Furthermore, we used PCA to determine P. By using the set of 793 patterns, eigenvalues {λ_k}_{k=1}^{n} and eigenvectors {e_k}_{k=1}^{n} were obtained, with the eigenvalues in nonincreasing order, that is, λ_1 ≥ λ_2 ≥ … ≥ λ_n. The matrix P_r was then defined as E − Σ_{k=1}^{m} e_k e_k^t, where m was chosen to satisfy Σ_{i=1}^{m} λ_i / Σ_{i=1}^{n} λ_i < r. Figure 3 shows the performance of the network trained by the proposed method using PCA. It also shows average recognition rates by the approach that consists of reducing the input dimension using PCA and then training the network by the backpropagation
Figure 2: Performance of the proposed method using DA and weight decay on the DNA splicing site problem. (Left) Average recognition rates by the proposed method using DA. (Right) Average recognition rates by weight decay, where γ_w is the weight decay parameter.
Figure 3: Performance of the proposed method using PCA on the DNA splicing site problem. Circles denote average recognition rates on the test set by the proposed method using PCA. Squares denote average recognition rates on the test set by the approach that consists of reducing the input dimension using PCA and then training the network by BP.

Table 2: Summary of Generalization Performance of Five Techniques on the DNA Splicing Site Problem.

Learning Algorithm                          % Error on the Test Set
Backpropagation                             15.95 ± 1.29%
Weight decay (γ_w = 2)                      12.59 ± 0.76%
Proposed method using PCA (γ = 5, m = 1)    10.87 ± 0.68%
Proposed method using DA (γ = 20)           10.33 ± 0.37%
Backpropagation (after applying DA)         15.61 ± 1.20%
algorithm. We can see from Figure 3 that the proposed method achieves better recognition rates than the approach in which the network is trained after the input dimension is reduced completely using PCA. In particular, we obtained the highest recognition rate with the proposed method when m = 1. Table 2 shows the summary of the generalization performance of each learning algorithm on the DNA splicing site problem. We see from Table 2 that the standard deviation of the proposed method using DA is smaller than that of any other learning algorithm.

6.2.2 Heart Disease Problem. We applied our method to another real data set, heart1, which is also included in Proben1. The purpose of this problem
Linear Constraints on Weight Representation for Generalization
Figure 4: Performance of the proposed method using PCA on the heart disease problem. Circles denote average recognition rates on the test set by the proposed method using PCA. Squares denote average recognition rates on the test set by the approach that consists of reducing the input dimension using PCA and then training the network by BP.
is to predict heart disease. There are 920 patterns in this data set, and they were divided into three sets: a training set (460 patterns), a data set to determine P (230 patterns), and a test set (230 patterns). In this experiment, we did not use DA, since the within-class covariance matrix was not invertible; therefore, we used only PCA. The matrix P_r was obtained by applying PCA to the set of 230 patterns. For comparison, we applied the backpropagation algorithm and weight decay to this problem. We used a neural network with 35 inputs, a single layer of five hidden units, and two output units. The training rate was set at 0.01 for each learning algorithm. The performance of the network trained by the proposed method using PCA is shown in Figure 4, together with average recognition rates by the approach that consists of reducing the input dimension using PCA and then training the network by the backpropagation algorithm. From Figure 4, we see that the proposed method achieves better recognition rates than the approach in which the network is trained after the input dimension is reduced completely using PCA. Moreover, the highest recognition rate was obtained by the proposed method when m = 1. The generalization performance of each learning algorithm on the heart disease problem is summarized in Table 3, which shows that better generalization ability is obtained by the proposed method.
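The fallback from DA to PCA hinges on whether the within-class covariance matrix is invertible. A quick check of the kind implied here might look as follows; the function names and the conditioning threshold are our own assumptions, not part of the authors' method.

```python
import numpy as np

def within_class_covariance(X, y):
    """Pooled within-class covariance: average of (x - mu_c)(x - mu_c)^T
    over all samples, with mu_c the mean of the sample's class c."""
    Sw = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        d = X[y == c] - X[y == c].mean(axis=0)
        Sw += d.T @ d
    return Sw / len(X)

def da_is_applicable(X, y, cond_limit=1e10):
    """DA needs the inverse of S_W; fall back to PCA when S_W is
    singular or numerically ill-conditioned."""
    Sw = within_class_covariance(X, y)
    return (np.linalg.matrix_rank(Sw) == Sw.shape[0]
            and np.linalg.cond(Sw) < cond_limit)
```

A data set with a constant attribute, or with fewer patterns than input dimensions, fails this check, which is exactly the situation that forces the PCA variant here.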
Table 3: Summary of Generalization Performance of Three Techniques on the Heart Disease Problem.

Learning Algorithm                           % Error on the Test Set
Backpropagation                              23.04 ± 1.27%
Weight decay (c_w = 0.1)                     19.65 ± 0.36%
Proposed method using PCA (c = 1, m = 1)     18.87 ± 0.39%
From these experimental results on the DNA splicing site problem and the heart disease problem, we can see that the proposed method is effective and yields better generalization ability even when the linear constraints are determined from knowledge obtained by statistical techniques.

7 Conclusion
In order to improve the generalization ability of neural networks, we took the approach of constructing networks whose outputs are invariant with respect to some transformations of the input patterns. We discussed improving generalization ability in the case where such transformations can be approximated by linear transformations. We showed that these invariance properties can be formulated as linear constraints on the weight representation. We then proposed a learning method that introduces the effective linear constraints into the error function as a penalty term; a special case of our method can be regarded as weight decay. Furthermore, for the ideal case in which the constraints are fully satisfied, we evaluated our method from the viewpoint of the VC dimension. Finally, we presented experimental results. By applying our learning method to the handwritten digit recognition problem, we showed the effectiveness of our method when such transformations of the input patterns are available. In addition, by applying it to the DNA splicing site problem and the heart disease problem, we showed that linear constraints determined using statistical techniques are effective even when it is not certain that the data sets have true invariance properties.

Acknowledgments
We thank the anonymous referees for their valuable comments.

References

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.),
Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Ishii, M., & Kumazawa, I. (2000). Restrictions of weight representation in multilayer networks for generalization. In Proceedings of the 4th International Conference on Computational Intelligence and Neurosciences (in conjunction with the 5th Joint Conference on Information Science (JCIS2000)) (pp. 855–858). Atlantic City, NJ.
Ishikawa, M. (1996). Structural learning with forgetting. Neural Networks, 9(3), 509–521.
LeCun, Y., Denker, J., & Solla, S. A. (1990). Optimal brain damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Leen, T. K. (1995). From data distributions to regularization in invariant learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 223–230). Cambridge, MA: MIT Press.
Plaut, D. C., Nowlan, S. J., & Hinton, G. E. (1986). Experiments on learning by back propagation (Tech. Rep. CMU-CS-86-126). Pittsburgh, PA: Carnegie Mellon University.
Prechelt, L. (1994). Proben1: A set of neural network benchmark problems and benchmarking rules (Tech. Rep. No. 21/94). Karlsruhe: Fakultät für Informatik, Universität Karlsruhe.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Sakurai, A. (1992). n-h-1 networks store no less than n·h+1 examples, but sometimes no more. In Proceedings of IJCNN'92 (pp. 936–941).
Simard, P., Victorri, B., LeCun, Y., & Denker, J. (1992). Tangent prop: A formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 895–903). San Mateo, CA: Morgan Kaufmann.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A.
(1991). Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 875–882). San Mateo, CA: Morgan Kaufmann.

Received March 29, 2000; accepted March 1, 2001.
LETTER

Communicated by Marcus Frean

Feedforward Neural Network Construction Using Cross Validation

Rudy Setiono
[email protected]
School of Computing, National University of Singapore, Singapore 117543

This article presents an algorithm that constructs feedforward neural networks with a single hidden layer for pattern classification. The algorithm starts with a small number of hidden units in the network and adds more hidden units as needed to improve the network's predictive accuracy. To determine when to stop adding new hidden units, the algorithm makes use of a subset of the available training samples for cross validation. New hidden units are added to the network only if they improve the classification accuracy of the network on the training samples and on the cross-validation samples. Extensive experimental results show that the algorithm is effective in obtaining networks with predictive accuracy rates that are better than those obtained by state-of-the-art decision tree methods.

1 Introduction
Neural Computation 13, 2865–2877 (2001) © 2001 Massachusetts Institute of Technology

When applying the neural network approach to solve a pattern classification problem, one is faced with the question of what kind of network topology is most suitable for the problem. To address this question of model selection, numerous algorithms that automatically construct neural networks have appeared in the literature. The algorithms include the cascade correlation algorithm (Fahlman & Lebiere, 1990), the tiling algorithm (Mézard & Nadal, 1989), the self-organizing neural network (Tenorio & Lee, 1990), and the MPyramid-real and MTiling-real algorithms (Parekh, Yang, & Honavar, 2000). For a given problem, these algorithms generally build networks with many layers of processing units. The upstart algorithm (Frean, 1990) builds a network with layers of linear threshold units to learn boolean classification on patterns of binary variables. The generated network, however, can be converted into a network with one hidden layer by placing all the generated units in a new hidden layer and connecting them to a newly created output unit. Networks with a single hidden layer have been shown to be universal function approximators (Hornik, Stinchcombe, & White, 1989). It is also relatively easier to extract symbolic rules from such networks by applying the decompositional approach (Tickle, Andrews, Golea, & Diederich, 1998) than to extract rules
from networks with many layers of hidden units. Among the algorithms that construct networks with a single hidden layer are the DNC (dynamic node creation) method (Ash, 1989), FNNCA (feedforward neural network construction algorithm; Setiono, 1996), and CARVE (constructive algorithm for real-valued examples; Young & Downs, 1998). These algorithms create a feedforward neural network by sequentially adding hidden units to its hidden layer. The initial number of hidden units in the network is usually one or two. The network is trained to minimize its classification errors on the training samples. If the trained network does not achieve the required accuracy rate, one or more hidden units are added to the network, and the network is retrained. The process is repeated until a network that correctly classifies all the input samples or meets some other prespecified stopping criterion has been constructed. The FNNCA and CARVE algorithms produce networks that correctly classify all the training samples, while DNC may terminate without obtaining such networks. For general problems, however, it is usually not desirable to run any network construction algorithm until all the training samples have been correctly classified. Insisting on 100% correct classification on the training samples usually means generating networks with too many parameters. Such networks can be expected to overfit the data and not generalize well. Theoretical results (Baum & Haussler, 1989) and empirical results (Setiono & Liu, 1997; Treadgold & Gedeon, 1999) indicate that smaller networks generalize better than larger ones.

In this article, we present a neural network construction algorithm that makes use of cross-validation or hold-out samples to determine when to stop adding hidden units to the network. Our approach is similar to that taken by Treadgold and Gedeon (1999). New nodes are added to the growing network only if they improve the accuracy of the network on the training and cross-validation samples.
By checking the network accuracy on these samples, the requirement to specify a preset minimum accuracy rate for the growing network is eliminated. The network constructed by the proposed algorithm is a standard feedforward neural network with a single hidden layer. The growing network is trained using the conjugate gradient minimization algorithm. A very important criterion when choosing an optimization algorithm to minimize the errors of the growing network is the convergence speed of the algorithm, especially when large problems with many inputs or outputs are to be solved. The conjugate gradient algorithm converges faster than first-order methods such as the backpropagation method, and its memory storage requirement is much less than that of second-order methods such as Newton's method. While there are numerous network construction algorithms described in the literature, most of them have been tested on only a small number of data sets, and their performance has not been compared against other methods for classification. We test our algorithm on 32 publicly available data sets and compare our results with those obtained by state-of-the-art
methods that generate decision trees. We find that neural networks give statistically better predictive accuracy than the decision trees for many of the test problems. The neural network construction with cross-validation samples (N2C2S) algorithm is described in section 2. The results of our experiments are presented in section 3. Finally, we conclude in section 4.

2 The N2C2S Algorithm
We assume that we have P data samples (x_p, y_p), p = 1, 2, ..., P, where input x_p ∈ R^N and target y_p ∈ [0, 1]^M, N is the dimensionality of the input data, and M is the number of classes. The proposed algorithm requires that this data set be split randomly into two disjoint subsets: the training set T and the cross-validation set C. The samples in T determine the optimal weights of the connections of the growing network, while the samples in C help to determine the topology of the final network. The outline of the network construction algorithm is as follows.

Input: T, a set of training samples (x_p, y_p), p = 1, 2, ..., P1, and C, a set of cross-validation samples (x_p, y_p), p = 1, 2, ..., P2.

Objective: A feedforward neural network with good generalization capability.

Step 1: Let N1 be a network with N input units, M output units, and H hidden units.

Step 2: Initialize the connection weights of N1 randomly, and train this network to minimize an error function. Let the accuracy rates of the trained network on the sets T and C be A_T1 and A_C1, respectively.

Step 3: Let N2 be a network with N input units, M output units, and H + h hidden units.

Step 4: Set the weights of the connections to and from the first H hidden units of N2 to the optimal weights of N1, and set the remaining connection weights of N2 randomly. Train N2 and let its accuracy rates on the sets T and C be A_T2 and A_C2, respectively.

Step 5a: If (A_T2 + A_C2) > (A_T1 + A_C1), then:
  1. Set H := H + h.
  2. Let N1 := N2, A_T1 := A_T2, A_C1 := A_C2.
  3. If H < MaxH, go to step 3.

Step 5b: Else:
  1. Train a new network N3 with H + h hidden units and with all its initial weights assigned randomly, and let its accuracy rates on the sets T and C be A_T3 and A_C3, respectively.
  2. If (A_T3 + A_C3) > (A_T1 + A_C1), then:
     (a) Set H := H + h.
     (b) Let N1 := N3, A_T1 := A_T3, A_C1 := A_C3.
     (c) If H < MaxH, go to step 3.

Step 6: Output N1 as the final constructed network.
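Stripped of the training details, the control flow of steps 1 through 6 reduces to the loop below. Here `train` and `accuracy` are placeholders for the conjugate gradient training and the accuracy evaluation described later; their signatures are our assumption, not part of the published algorithm.

```python
def n2c2s(train, accuracy, T, C, H=2, h=2, max_h=20):
    """Skeleton of the N2C2S construction loop (steps 1-6).
    train(n_hidden, warm_start) -> trained network;
    accuracy(net, samples) -> classification accuracy rate."""
    net1 = train(H, None)                          # steps 1-2
    a1 = accuracy(net1, T) + accuracy(net1, C)
    while H < max_h:
        net2 = train(H + h, net1)                  # steps 3-4: reuse net1's weights
        a2 = accuracy(net2, T) + accuracy(net2, C)
        if a2 > a1:                                # step 5a: keep the larger network
            H, net1, a1 = H + h, net2, a2
            continue
        net3 = train(H + h, None)                  # step 5b: retry from random weights
        a3 = accuracy(net3, T) + accuracy(net3, C)
        if a3 > a1:
            H, net1, a1 = H + h, net3, a3
        else:
            break                                  # neither larger network improved
    return net1, H                                 # step 6
```

The loop terminates either when MaxH is reached or when both the warm-started and the randomly initialized larger networks fail to beat the current one on the combined training and cross-validation accuracy.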
The algorithm adds nodes to the growing network as long as the addition of the nodes improves the accuracy of the network. As a measure of accuracy, we use the total classification accuracy of the network on the training and cross-validation sets. For some problems, it would be reasonable to check the accuracy of the growing network on the cross-validation samples only. However, for small data sets with many classes or an uneven distribution of the classes, we find that using only the relatively few samples available for cross-validation leads to final networks that generalize poorly. The addition of new hidden units usually results in a higher network accuracy on the training samples. However, when there are already sufficient hidden units in the network, this increase is usually accompanied by a decrease in the cross-validation accuracy. The new hidden units are added to the growing network only if they improve the accuracy of the network on both sets of samples or if the increase in the training accuracy is greater than the decrease in the cross-validation accuracy. Two networks with different numbers of hidden units, N1 and N2, are trained, and their accuracy rates are compared so that we do not have to determine a prespecified accuracy level to terminate the network construction algorithm. The difficulty associated with a fixed prespecified accuracy level is obvious. When the required accuracy on the training samples is set too high, the network will have too many hidden units, and it will overfit the data. This usually leads to poor prediction of new unseen samples. On the other hand, if the prespecified accuracy level is set too low, the algorithm would terminate prematurely with a small network that cannot predict well when given new samples. When the accuracy of the growing network on a set of cross-validation samples is checked, we usually do not know the level of accuracy that we can expect.
Hence, the proposed algorithm attempts to overcome this difficulty by comparing the accuracy of two successive networks generated along the way. When h new hidden units are added to the trained network N1 in step 3 to form the larger network N2, the optimal weights of network N1 are used as the initial weights for the connections to and from the first H hidden units of N2. By making use of the optimal weights of the smaller network, we hope to reduce the training time of the new network. A third network N3 with the same number of hidden units as N2 is trained when the accuracy of N2 is not better than the accuracy of the smaller network N1 in step 5b. The accuracy of N2 may not be better than the accuracy of N1 because the conjugate gradient algorithm used to minimize the error function is
trapped at a local minimum point of the weight space. In order to reduce the chances of getting trapped at such a local minimum, the algorithm trains N3, which has been initialized with a completely random set of weights. The construction algorithm terminates only if this new network does not predict better than the smaller network N1 either.

We now explain the error function that is minimized during the network construction process. Let w be the weights of the network connections from the input units to the hidden units and v be the weights of the connections from the hidden units to the output units. The error measure that we have adopted is the standard sum of squared errors E(w, v) augmented with a penalty term h(w, v):

    E(w, v) = Σ_{p=1}^{P1} Σ_{ℓ=1}^{M} (ỹ_pℓ − y_pℓ)² + h(w, v).    (2.1)

The penalty term h(w, v) is defined as follows (Setiono, 1997):

    h(w, v) = ε1 ( Σ_{i=1}^{H} Σ_{j=1}^{N} βw_ij² / (1 + βw_ij²) + Σ_{i=1}^{H} Σ_{ℓ=1}^{M} βv_ℓi² / (1 + βv_ℓi²) )
            + ε2 ( Σ_{i=1}^{H} Σ_{j=1}^{N} w_ij² + Σ_{i=1}^{H} Σ_{ℓ=1}^{M} v_ℓi² ),    (2.2)

where y_pℓ ∈ [0, 1] is the target value for input x_p at output unit ℓ, ỹ_pℓ is the predicted value for input x_p at output unit ℓ, ε1, ε2, and β are positive penalty parameters, w_ij is the weight of the connection from input unit j to hidden unit i, v_ℓi is the weight of the connection from hidden unit i to output unit ℓ, and the hidden unit activation value A_ip for input x_p and the predicted value ỹ_pℓ for this input are computed as follows:

    A_ip = tanh( Σ_{j=1}^{N} w_ij x_jp ),    (2.3)

    ỹ_pℓ = Σ_{i=1}^{H} v_ℓi A_ip.    (2.4)
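Equations 2.1 through 2.4 translate directly into NumPy. The sketch below is ours; the matrix shapes (w is H × N, v is M × H) are our convention, and the default penalty parameters are the values used in the experiments of section 3.

```python
import numpy as np

def forward(w, v, x):
    """Eqs. 2.3-2.4: A_i = tanh(sum_j w_ij x_j), y~_l = sum_i v_li A_i."""
    return v @ np.tanh(w @ x)

def penalty(w, v, eps1=0.5, eps2=0.05, beta=0.1):
    """Eq. 2.2: a bounded term plus plain quadratic weight decay."""
    bounded = (beta * w**2 / (1 + beta * w**2)).sum() \
            + (beta * v**2 / (1 + beta * v**2)).sum()
    return eps1 * bounded + eps2 * ((w**2).sum() + (v**2).sum())

def error(w, v, X, Y):
    """Eq. 2.1: sum of squared errors over all patterns plus the penalty."""
    sse = sum(((forward(w, v, x) - y) ** 2).sum() for x, y in zip(X, Y))
    return sse + penalty(w, v)
```

In the actual algorithm this function (and its gradient) is handed to the conjugate gradient minimizer rather than evaluated in a Python loop, but the quantity being minimized is the same.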
The transfer function that we have used to compute the hidden unit activations of the network is the hyperbolic tangent function tanh(ξ) = (e^ξ − e^{−ξ}) / (e^ξ + e^{−ξ}). The penalty term h(w, v) is added to the error function to increase the chances of constructing the smallest networks with the best generalization ability. Extensive empirical experiments indicate that the addition of a penalty term during network construction improves the generalization
ability of the resulting networks (Treadgold & Gedeon, 1999). Among the different penalty terms employed, the cubic penalty term was found to be the most beneficial for network generalization. Instead of the cubic penalty term, we have opted to incorporate the penalty term h(w, v) that we have used in conjunction with a neural network pruning algorithm (Setiono, 1997). Employing this penalty term enabled us to obtain pruned networks that are smaller than the networks obtained by other network pruning algorithms. A local minimum of the error function E(w, v) can be obtained by applying any nonlinear optimization method, such as the gradient-descent method or a higher-order method. In our implementation, we have used the conjugate gradient algorithm CONMIN implemented by Shanno and Phua (1976, 1980). Their implementation of the conjugate gradient method with line search ensures that after every iteration of the method, the error function value will decrease. This is a property of the method that is not possessed by the standard backpropagation method with a fixed learning rate. While the conjugate gradient method does not converge as fast as the quasi-Newton methods, its memory storage requirement is less. For an unconstrained minimization problem with n variables, the conjugate gradient option of CONMIN requires 7n + 2 double-precision words of working storage, while a variant of the quasi-Newton methods, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, requires n²/2 + 11n/2 words (Shanno & Phua, 1980). As some data sets can have both high-dimensional inputs and many classes, the total number of weights in the growing network easily exceeds one thousand, making the BFGS method prohibitively expensive in terms of memory storage requirement.

3 Experimental Results
Thirty-two classification problems were solved using the proposed algorithm. These are real-world classification problems with mixed discrete and continuous attributes. The data sets for these problems are available from the web site of the Machine Learning Research Group at the Department of Computer Science, University of Waikato.¹ They are also available from the UCI Machine Learning Repository (Blake & Merz, 1998). The size of the data sets and the characteristics of the input attributes of the data are summarized in Table 1. The performance of many new network learning and construction algorithms has been evaluated on only relatively few problems, despite some criticism of this practice (Prechelt, 1996). We have selected the classification problems listed in Table 1 because the data sets are publicly available and the results from experiments using decision tree methods are
¹ Available on-line at http://www.cs.waikato.ac.nz/~ml/weka/index.html.
Table 1: Data Sets Used in the Experiments.

                                Attributes                          Neural Network
Data Set        Size   Missing Values (%)   Continuous   Binary   Nominal   Inputs   Outputs
anneal           898    0.0                   6           14       18         82        5
audiology        226    2.0                   0           61        8         95       24
australian       690    0.6                   6            4        5         45        2
autos            205    1.1                  15            4        6         73        6
balance-scale    625    0.0                   4            0        0          5        3
breast-cancer    286    0.3                   0            3        6         51        2
breast-w         699    0.3                   9            0        0         10        2
german          1000    0.0                   6            3        4         62        2
glass            214    0.0                   9            0        0         10        6
glass (G2)       163    0.0                   9            0        0         10        2
heart-c          302    0.2                   6            3        4         23        2
heart-h          294   20.4                   6            3        4         25        2
heart-statlog    270    0.0                  13            0        0         14        2
hepatitis        155    5.6                   6           13        0         30        2
horse-colic      368   23.8                   7            2       13         62        2
hypothyroid     3772    5.5                   7           20        2         34        4
ionosphere       351    0.0                  33            1        0         35        2
iris             150    0.0                   4            0        0          5        3
kr-vs-kp        3196    0.0                   0           34        2         41        2
labor             57    3.9                   8            3        5         30        2
lymphography     148    0.0                   0            9        6         39        4
pima-indians     768    0.0                   8            0        0          9        2
primary-tumor    339    3.9                   0           14        3         37       21
segment         2310    0.0                  19            0        0         20        7
sick            3772    5.5                   7           20        2         33        2
sonar            208    0.0                  60            0        0         61        2
soybean          683    9.8                   0           16       19        135       19
vehicle          846    0.0                  18            0        0         19        4
vote             435    5.6                   0           16        0         33        2
vowel            990    0.0                  10            2        1         28       11
waveform-noise  5000    0.0                  40            0        0         41        3
zoo              101    0.0                   1           15        0         17        7
also available (Frank, Wang, Inglis, Holmes, & Witten, 1997). Hence, a performance comparison can be made between the neural network approach and the decision tree methods. For each data set, the experimental setting was as follows:

1. Tenfold cross-validation scheme: We split each data set randomly into 10 subsets of equal size. Eight subsets were used for training, one subset for cross-validation, and one for measuring the predictive accuracy of the final constructed network. This procedure was performed 10 times so that each subset was tested once. Test results were averaged over 10 tenfold cross-validation runs. Data splitting was done without sampling stratification.
2. The same set of penalty parameter values in the penalty term (see equation 2.2) was used: ε1 = 0.5, ε2 = 0.05, and β = 0.1.

3. The starting number of hidden units was 2. The numbers of input units and output units are shown in Table 1. The number of input units includes one unit with a constant input value of one to implement the hidden unit bias. The number of output units corresponds to the number of classes in the data. The winner-takes-all strategy for prediction is used.

4. Two hidden units, instead of one, were added at a time to the growing network (step 3 of the algorithm). This is to reduce the training time for problems that require a large number of hidden units. The maximum number of hidden units allowed, MaxH, was 20.

5. During network training, the conjugate gradient algorithm was terminated (i.e., the network was assumed to have been trained) if the relative decrease in the error function value (see equation 2.1) after two consecutive iterations was less than 10⁻⁵ or the maximum number of function evaluations was reached. This number was set to 2500.

6. One network input unit was assigned to each continuous attribute in the data set. Nominal attributes were binary coded. A nominal attribute with D possible values was assigned D network inputs. A binary attribute with no missing values required one network input unit, while a binary attribute with missing values required 2 input units (see item 8 below on how the missing values were handled).

7. Continuous attribute values were scaled to the interval [0, 1], while binary-encoded attribute values were either 0 or 0.2. We found that the 0/0.2 encoding produced better generalization than the usual 0/1 encoding.

8. A missing continuous attribute value was replaced by the average of the nonmissing values. A missing discrete attribute value was assigned the value unknown, and the corresponding components of the input x were set to zero.

The experimental results are presented in Table 2.
In this table, we show the average number of hidden units and the average accuracy rates of the constructed networks on the training data set, the cross-validation data set, and the test set. The average number of hidden units ranged from 2.46 for the labor data set to 19.44 for the soybean data set. Most of the networks for the latter data set contain the maximum of 20 hidden units. It might have been possible to improve the overall predictive accuracy of these networks if we had allowed them to have more than 20 hidden units. This was not done, however, because we attempted to solve all the problems using the same set of values for all the parameters in the algorithm. A general picture of the network accuracy on the training data set, cross-validation data set, and
Table 2: Number of Hidden Units and Accuracy Rates of the Neural Networks Constructed by the N2C2S Algorithm.

Data Set        Hidden Units   Training Accuracy (%)   Cross-Validation Accuracy (%)   Test Accuracy (%)
anneal           6.58 ± 0.29    99.65 ± 0.06            99.45 ± 0.24                    99.41 ± 0.17
audiology       17.18 ± 1.21    95.51 ± 1.14            80.47 ± 1.65                    79.50 ± 1.61
australian       7.22 ± 1.38    92.26 ± 0.71            86.90 ± 0.62                    84.83 ± 0.71
autos            8.38 ± 1.01    97.78 ± 0.57            78.73 ± 2.23                    72.31 ± 1.83
balance-scale   10.12 ± 1.03    93.76 ± 0.51            93.38 ± 0.54                    92.32 ± 0.89
breast-cancer    6.06 ± 0.84    91.86 ± 0.96            74.46 ± 2.24                    67.80 ± 2.17
breast-w         3.40 ± 0.51    97.47 ± 0.06            96.94 ± 0.27                    96.58 ± 0.24
german           9.54 ± 1.43    96.56 ± 1.27            74.00 ± 1.14                    70.10 ± 1.65
glass            9.58 ± 1.41    79.88 ± 1.63            71.72 ± 1.10                    65.55 ± 3.00
glass (G2)       7.06 ± 0.84    89.98 ± 1.10            81.90 ± 3.43                    77.91 ± 2.50
heart-c          5.60 ± 1.13    92.36 ± 0.98            85.60 ± 0.74                    82.47 ± 1.95
heart-h          5.94 ± 1.14    90.17 ± 0.92            84.96 ± 1.20                    81.63 ± 1.60
heart-statlog    5.52 ± 0.99    94.08 ± 1.35            83.44 ± 1.08                    77.56 ± 1.00
hepatitis        5.66 ± 1.16    97.49 ± 1.22            86.75 ± 2.08                    81.94 ± 3.34
horse-colic      4.88 ± 0.38    98.47 ± 0.52            83.20 ± 1.27                    78.97 ± 1.29
hypothyroid     10.02 ± 1.52    97.24 ± 0.11            96.95 ± 0.11                    96.82 ± 0.11
ionosphere       4.20 ± 0.87    99.20 ± 0.15            92.65 ± 1.16                    89.52 ± 2.26
iris             3.34 ± 0.49    98.31 ± 0.14            97.40 ± 0.49                    96.60 ± 0.66
kr-vs-kp         7.20 ± 0.71    99.57 ± 0.11            99.40 ± 0.22                    99.28 ± 0.18
labor            2.46 ± 0.40    99.98 ± 0.07            92.63 ± 2.11                    91.90 ± 4.10
lymphography     6.34 ± 0.84    96.92 ± 0.47            85.22 ± 1.82                    82.97 ± 1.97
pima-indians     5.88 ± 0.99    80.43 ± 0.38            78.49 ± 0.62                    76.04 ± 0.86
primary-tumor   11.38 ± 1.38    62.09 ± 1.41            50.28 ± 0.96                    45.97 ± 1.37
segment         14.68 ± 0.70    95.99 ± 0.26            95.52 ± 0.21                    95.19 ± 0.43
sick             4.80 ± 0.57    98.03 ± 0.03            97.72 ± 0.09                    97.56 ± 0.11
sonar            5.18 ± 0.48    99.92 ± 0.14            82.42 ± 2.44                    76.51 ± 1.56
soybean         19.44 ± 0.70    96.37 ± 0.96            93.28 ± 0.67                    92.97 ± 0.72
vehicle         13.26 ± 1.13    91.15 ± 0.43            86.17 ± 0.82                    84.29 ± 0.79
vote             4.42 ± 0.61    98.55 ± 0.13            96.64 ± 0.48                    96.09 ± 0.48
vowel           17.44 ± 1.38    94.25 ± 1.23            90.56 ± 1.17                    88.86 ± 1.46
wave            10.82 ± 2.61    89.71 ± 0.89            86.63 ± 0.39                    85.66 ± 0.25
zoo             12.12 ± 0.10   100.00 ± 0.00            94.75 ± 1.42                    94.34 ± 1.46

Note: Figures shown are the averages and the standard deviations from 10 tenfold cross-validation runs.
the test set can be seen in the table. The accuracy on the training data set is higher than the accuracy on the cross-validation set, which is in turn higher than the predictive accuracy on the test set. How accurate are the network predictions compared to those from other methods? To answer this question, we reproduce the results obtained by Frank et al. (1997) from their method M5′. M5′ is a reimplementation of the M5 algorithm developed by Quinlan (1992) for learning from data sets with continuous classes. These algorithms generate binary decision trees with linear regression functions at the leaf nodes. Since both algorithms are designed
Table 3: Comparison of the Accuracy Rates of Neural Networks, C5.0, and M5′.

Data Set        Neural Networks   C5.0             M5′
anneal          99.41 ± 0.17      98.7 ± 0.3 •     98.8 ± 0.2 •
audiology       79.50 ± 1.61      76.5 ± 1.4 •     76.7 ± 1.0 •
australian      84.83 ± 0.71      85.3 ± 0.5       85.8 ± 0.9
autos           72.31 ± 1.83      80.0 ± 2.5 ⋄     74.4 ± 1.9
balance-scale   92.32 ± 0.89      77.6 ± 1.0 •     86.4 ± 0.7 •
breast-cancer   67.80 ± 2.17      73.3 ± 1.6 ⋄     69.6 ± 2.3
breast-w        96.58 ± 0.24      94.5 ± 0.3 •     95.3 ± 0.3 •
german          70.10 ± 1.65      71.2 ± 1.0       72.9 ± 0.7 ⋄
glass           65.55 ± 3.00      67.5 ± 2.6       70.5 ± 2.8 ⋄
glass (G2)      77.91 ± 2.50      78.7 ± 2.1       81.8 ± 2.2 ⋄
heart-c         82.47 ± 1.95      76.8 ± 1.4 •     80.9 ± 1.4
heart-h         81.63 ± 1.60      79.8 ± 0.9 •     79.0 ± 0.8 •
heart-statlog   77.56 ± 1.00      78.7 ± 1.4       82.2 ± 1.0 ⋄
hepatitis       81.94 ± 3.34      79.3 ± 1.2       81.9 ± 2.2
horse-colic     78.97 ± 1.29      85.3 ± 0.6 ⋄     84.6 ± 0.7 ⋄
hypothyroid     96.82 ± 0.11      99.5 ± 0.0 ⋄     96.6 ± 0.1 •
ionosphere      89.52 ± 2.26      88.9 ± 1.6       89.7 ± 1.2
iris            96.60 ± 0.66      94.5 ± 0.7 •     94.7 ± 0.7 •
kr-vs-kp        99.28 ± 0.18      99.5 ± 0.1 ⋄     99.4 ± 0.1
labor           91.90 ± 4.10      78.1 ± 4.8 •     79.7 ± 4.6 •
lymphography    82.97 ± 1.97      75.4 ± 2.8 •     79.8 ± 1.4 •
pima-indians    76.04 ± 0.86      74.5 ± 1.2 •     76.2 ± 0.8
primary-tumor   45.97 ± 1.37      41.8 ± 1.3 •     45.1 ± 1.6
segment         95.19 ± 0.43      96.8 ± 0.2 ⋄     97.0 ± 0.2 ⋄
sick            97.56 ± 0.11      98.8 ± 0.1 ⋄     98.3 ± 0.1 ⋄
sonar           76.51 ± 1.56      74.7 ± 2.8       78.5 ± 3.4
soybean         92.97 ± 0.72      91.3 ± 0.5 •     92.5 ± 0.5
vehicle         84.29 ± 0.79      72.9 ± 1.2 •     76.5 ± 1.3 •
vote            96.09 ± 0.48      96.3 ± 0.6       96.2 ± 0.3
vowel           88.86 ± 1.46      79.8 ± 1.3 •     81.7 ± 1.1 •
wave            85.66 ± 0.25      75.4 ± 0.5 •     82.0 ± 0.2 •
zoo             94.34 ± 1.46      91.8 ± 1.1 •     92.1 ± 1.3 •

Notes: A closed circle (•) indicates that the accuracy of the neural networks is significantly higher than that of the other method. A diamond (⋄) indicates that the accuracy of the neural networks is significantly lower.
for prediction of continuous classes, they can be used for classification problems by applying them to approximate the conditional class probability function. For a problem having N classes, N model trees are generated. Each of these trees approximates the conditional probability that a given sample belongs to class C_i, i = 1, 2, ..., N. The class of the input sample is determined by the model tree that gives the highest probability value. Frank et al. (1997) compared their results to those of C5.0, a successor to Quinlan's widely used C4.5 (Quinlan, 1993) decision tree method for classification. Table 3 summarizes the results from the neural networks, M5′, and C5.0. The results of M5′ and C5.0 are the averages and standard deviations
Neural Network Construction
Table 4: Summary of the Results from Neural Networks Compared to Those from C5.0 and M5′.

Neural networks versus:   C5.0   M5′
Wins (●)                    16    13
Ties                         9    12
Losses (◊)                   7     7
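The win/tie/loss entries above are assigned by the test described in the text: a two-tailed t-test at significance level α = .01 on the difference between the two methods' cross-validation averages. A minimal sketch for a single comparison, using the first pair of accuracies from Table 3 (neural networks versus C5.0); the degrees of freedom df = 2n - 2 is an assumption here, as the paper does not state the value it used:

```python
# Two-tailed t-test at alpha = .01 on the difference of two
# cross-validation averages. Example numbers: the first data set's
# accuracies for the neural networks and C5.0 in Table 3.
# Assumption: df = 2n - 2 (the paper does not state the degrees of freedom).
from math import sqrt

from scipy.stats import t as t_dist

n = 10                    # 10 runs of tenfold cross-validation per method
m1, s1 = 99.41, 0.17      # neural networks: mean accuracy, std. deviation
m2, s2 = 98.70, 0.30      # C5.0: mean accuracy, std. deviation

se = sqrt(s1 ** 2 / n + s2 ** 2 / n)   # standard error of the difference
t_stat = (m1 - m2) / se
t_crit = t_dist.ppf(1 - 0.01 / 2, df=2 * n - 2)

if abs(t_stat) > t_crit:
    print("significant:", "neural networks win" if t_stat > 0 else "neural networks lose")
else:
    print("tie: no significant difference")
```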
from 10 runs of tenfold cross-validation experiments. We computed the standard error of the difference between the averages of the neural networks and those of the other method. The t statistic for testing the null hypothesis that the two means are equal was then obtained, and a two-tailed test was conducted. If the null hypothesis was rejected at the significance level of α = .01, we checked which method had the higher average accuracy rate. In Table 3, a neural network win is denoted by a closed circle (●), while a neural network loss is denoted by a diamond (◊). Averages with no significant difference are left unmarked. Table 4 provides the summary of the comparison. For exactly half of the 32 data sets tested, neural networks outperform C5.0. Of the other 16 problems, C5.0 is more accurate than neural networks on 7 data sets; on the remaining 9, there are no significant differences in the accuracy rates of the two methods. Compared to M5′, neural networks are significantly better on 13 data sets and significantly worse on only 7 data sets. There are no significant differences in the predictive accuracy of the two methods on the remaining 12 data sets.

4 Conclusion
We have presented an algorithm for constructing feedforward neural networks using cross-validation samples. This algorithm eliminates the difficulty associated with having to determine the number of hidden units before network training begins. While there are numerous network construction algorithms in the literature, most of them have been tested on only a small number of data sets. We tested and compared the results of our network construction algorithm on 32 publicly available data sets with mixed continuous and discrete attributes. Our results show that, compared to the decision tree methods, neural networks can provide better predictions for many of these data sets. Several parameters need to be set in the network construction algorithm, such as the penalty parameters, the starting number of hidden units, the maximum number of hidden units, and the maximum number of iterations and epochs allowed during training and retraining of the growing network. Fine-tuning the values of these parameters for individual problems is very likely to improve the accuracy of the constructed
Rudy Setiono
networks. With the goal of having a single set of parameter values, we have fixed the parameter values as described in the previous section. They were found after extensive experimentation to find a combination that gives the best overall performance on the 32 test data sets. The results show that with this fixed set of parameter values, the constructed networks can outperform state-of-the-art methods that generate decision trees. Future work to improve the accuracy of the constructed networks includes:

- Checking for possible removal of hidden units from the constructed networks. An algorithm that adds and then removes the hidden units was shown to be effective in reducing the overall training time of backpropagation neural networks (Hirose, Yamashita, & Hijiya, 1991). The effect of such removal on the network's generalization, however, remains to be investigated.
- Incorporation of a pruning algorithm to remove redundant attributes of the data.
- Modification of the algorithm so that it will be able to select different penalty terms or different values for the penalty parameters automatically, depending on the characteristics of the data set.
- Better handling of data samples with missing attribute values.

Acknowledgments
This work was done while I was spending a sabbatical leave at the Computational Intelligence Lab, University of Louisville, Kentucky. I am grateful to Jacek M. Zurada of the Department of Electrical and Computer Engineering, University of Louisville, for providing office space and supercomputer access.

References

Ash, T. (1989). Dynamic node creation in backpropagation networks. Connection Science, 1(4), 365–375.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.
Frank, E., Wang, Y., Inglis, S., Holmes, G., & Witten, I. H. (1997). Using model trees for classification. Machine Learning, 32(1), 63–76.
Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2, 198–209.
Hirose, Y., Yamashita, K., & Hijiya, S. (1991). Back-propagation algorithm which varies the number of hidden units. Neural Networks, 4, 61–66.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Mézard, M., & Nadal, J. P. (1989). Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A, 22(12), 2191–2203.
Parekh, R., Yang, J., & Honavar, V. (2000). Constructive neural-network learning algorithms for pattern classification. IEEE Transactions on Neural Networks, 11(2), 436–451.
Prechelt, L. (1996). A quantitative study of experimental evaluation of neural network learning algorithms. Neural Networks, 9, 457–462.
Quinlan, R. (1992). Learning with continuous classes. In Proceedings of the Australian Joint Conference on Artificial Intelligence (pp. 343–348). Singapore: World Scientific.
Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Setiono, R. (1996). A neural network construction algorithm which maximizes the likelihood function. Connection Science, 7(2), 147–166.
Setiono, R. (1997). A penalty-function approach for pruning feedforward neural networks. Neural Computation, 9(1), 185–204.
Setiono, R., & Liu, H. (1997). Neural network feature selector. IEEE Transactions on Neural Networks, 8(3), 654–662.
Shanno, D. F., & Phua, K. H. (1976). Algorithm 500: Minimization of unconstrained multivariate functions. ACM Transactions on Mathematical Software, 2(1), 87–94.
Shanno, D. F., & Phua, K. H. (1980). Remark on algorithm 500. ACM Transactions on Mathematical Software, 6(4), 618–622.
Tenorio, M. F., & Lee, W. (1990). Self-organizing network for optimum supervised learning. IEEE Transactions on Neural Networks, 1(1), 100–110.
Tickle, A. B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057–1068.
Treadgold, N. K., & Gedeon, T. D. (1999). Exploring constructive cascade networks. IEEE Transactions on Neural Networks, 10(6), 1335–1350.
Young, S., & Downs, T. (1998). CARVE—A constructive algorithm for real-valued examples. IEEE Transactions on Neural Networks, 9(6), 1180–1190.

Received July 26, 2000; accepted March 3, 2001.