NOTE
Communicated by Heiga Zen
How to Pretend That Correlated Variables Are Independent by Using Difference Observations

Christopher K. I. Williams
[email protected]
School of Informatics, University of Edinburgh, Edinburgh EH1 2QL, U.K.
In many areas of data modeling, observations at different locations (e.g., time frames or pixel locations) are augmented by differences of nearby observations (e.g., δ features in speech recognition, Gabor jets in image analysis). These augmented observations are then often modeled as being independent. How can this make sense? We provide two interpretations, showing (1) that the likelihood of data generated from an autoregressive process can be computed in terms of “independent” augmented observations and (2) that the augmented observations can be given a coherent treatment in terms of the products of experts model (Hinton, 1999).
1 Introduction

In automatic speech recognition, it is often the case that hidden Markov models (HMMs) are used on observation vectors that are augmented by difference observations (so-called δ features; see Furui, 1986). Under the HMM, each observation vector is modeled as being conditionally independent given the hidden state. How can this make sense, as close-by differences are clearly not independent?

A similar difficulty arises in image analysis tasks such as texture segmentation (see, e.g., Dunn & Higgins, 1995). Here derivative features obtained from Gabor filters or wavelet analysis, for example, are modeled as being independent at different locations, despite the fact that these features will have been computed sharing some pixels in common.

In this article, we present two solutions to this problem. In section 2, we show that if the data are generated from a vector autoregressive (AR) model, then the likelihood can be expressed in terms of "independent" difference observations. In section 3, we show that the local models at each location can be combined using a product of experts model (Hinton, 1999) to provide a well-defined joint model for the data and that this can be related to AR models. Section 4 discusses how these interpretations are affected if the local models are conditional on a hidden state variable, as is the case for HMMs.
2 An AR Model

Consider a temporal vector autoregressive model,

X_t = \sum_{i=1}^{p} A_i X_{t-i} + N_t,   (2.1)

where the A_i's are square matrices and N_t is independent and identically distributed gaussian noise \sim N(0, \Sigma_N). X_t and N_t have dimension D for all t. To avoid complicated end effects, we will use periodic (wraparound) boundary conditions, so that the subscript t - i should be read mod(t - i, N). Thus, there are N random variables X_0, \ldots, X_{N-1}, which collectively we denote as X, and similarly for N. Then X and N are related by N = TX for an appropriate matrix T. Thus,

P(X) \propto \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} N_t^T \Sigma_N^{-1} N_t \right)   (2.2)
     = \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} \Big( \sum_{i=0}^{p} A_i X_{t-i} \Big)^T \Sigma_N^{-1} \Big( \sum_{i=0}^{p} A_i X_{t-i} \Big) \right),   (2.3)

where we have set A_0 = -I so that N_t = -\sum_{i=0}^{p} A_i X_{t-i}.

Now let Y_t^0, \ldots, Y_t^p be linearly independent linear combinations of X_t, \ldots, X_{t-p}. For example, we could choose Y_t^0 = X_t, Y_t^1 = X_t - X_{t-1}, and so on. As the Y_t^i's are simple linear combinations of X_t, \ldots, X_{t-p}, we have

\sum_{i=0}^{p} A_i X_{t-i} = \sum_{i=0}^{p} B_i Y_t^i,   (2.4)

for some set of matrices B_i. We can now write

P(X) \propto \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} \Big( \sum_{i=0}^{p} B_i Y_t^i \Big)^T \Sigma_N^{-1} \Big( \sum_{i=0}^{p} B_i Y_t^i \Big) \right),   (2.5)

showing that the likelihood of the underlying X process can be expressed in terms of a product of terms involving the difference observations up to order p at each time. Stacking Y_t^0, Y_t^1, \ldots, Y_t^p as the vector Y_t, we have

P(X) \propto \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} Y_t^T M Y_t \right),   (2.6)

where the (i, j) block of the matrix M (between Y_t^i and Y_t^j) has the form B_i^T \Sigma_N^{-1} B_j.
Equation 2.6 almost looks like a product of independent gaussians, but note that M is singular (it has rank D, as it arises from N_t), so the correct normalization factor of the gaussian cannot be obtained from it.

As a simple example, consider the scalar AR(1) process X_t = \alpha X_{t-1} + N_t and set Y_t^0 = X_t, Y_t^1 = X_t - X_{t-1}. Thus,

X_t - \alpha X_{t-1} = (1 - \alpha) X_t + \alpha (X_t - X_{t-1})   (2.7)
                     = (1 - \alpha) Y_t^0 + \alpha Y_t^1.   (2.8)

To obtain the likelihood for the sequence X, the matrix M will have the form

M = \frac{1}{\sigma_n^2} \begin{pmatrix} (1-\alpha)^2 & \alpha(1-\alpha) \\ \alpha(1-\alpha) & \alpha^2 \end{pmatrix},   (2.9)

where \sigma_n^2 = \mathrm{var}(N_t). As expected, M has rank 1 (it is an outer product). Interestingly, the matrix M is not equal to the inverse covariance of the Y_t's derived from the distribution for X. To show this, we first use the result that for the scalar AR(1) process on the circle, the covariance C[j] = \langle X_t X_{t-j} \rangle is given by

C[j] = \frac{\sigma_n^2 (\alpha^{|j|} + \alpha^{|N-j|})}{(1 - \alpha^2)(1 - \alpha^N)}.   (2.10)

Thus,

\mathrm{cov}(Y_t) = \begin{pmatrix} \langle Y_t^0 Y_t^0 \rangle & \langle Y_t^0 Y_t^1 \rangle \\ \langle Y_t^0 Y_t^1 \rangle & \langle Y_t^1 Y_t^1 \rangle \end{pmatrix} = \begin{pmatrix} C[0] & C[0] - C[1] \\ C[0] - C[1] & 2(C[0] - C[1]) \end{pmatrix}.   (2.11)

Inversion of cov(Y_t) shows that it is not equal to M as given in equation 2.9. Notice that the joint distribution of Y_0, \ldots, Y_{N-1} is singular.

If we take an AR process on the X variables, then one can choose linear combinations of the X_t's that are truly independent by carrying out an eigenanalysis. (For the periodic boundary conditions described above and time-invariant coefficients, the eigenbasis would be the Fourier basis.) However, if we allow ourselves an overcomplete basis set, then we have shown that the likelihood of X under the AR process can readily be computed using "independent" densities at each location. Although we have given the derivation above using gaussian noise, in fact the conclusion concerning expressing the likelihood of the X sequence in terms of a product of terms involving the Y_t's is independent of the form of the noise driving the AR process.

It is also possible to extend the AR model described above beyond the temporal one-dimensional chain. For example, Abend, Harley, and Kanal (1965) describe Markov mesh models in two dimensions. A simple example of such a model is a "third-order" Markov mesh, where X_{i,j} depends autoregressively on X_{i,j-1}, X_{i-1,j-1}, and X_{i-1,j}. The same construction in terms of Y variables can be used in this case.
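As a concrete check of the scalar AR(1) identity above, the following short NumPy sketch (not part of the original note; the values of N, α, and σ² are illustrative) verifies numerically that the circular AR(1) quadratic form equals the sum of the "independent" terms Y_t^T M Y_t, and that M has rank 1.

```python
# Numerical check of equations 2.6-2.9 for a scalar AR(1) process on a ring (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
N, alpha, s2 = 8, 0.7, 0.5
x = rng.normal(size=N)                               # any sequence X_0, ..., X_{N-1}

M = np.array([[(1 - alpha) ** 2, alpha * (1 - alpha)],
              [alpha * (1 - alpha), alpha ** 2]]) / s2
print(np.linalg.matrix_rank(M))                      # 1: M is an outer product

# Wraparound is handled by Python's x[-1] == x[N-1].
lhs = sum((x[t] - alpha * x[t - 1]) ** 2 for t in range(N)) / s2
Y = [np.array([x[t], x[t] - x[t - 1]]) for t in range(N)]   # Y_t = (X_t, X_t - X_{t-1})
rhs = sum(y @ M @ y for y in Y)
print(np.isclose(lhs, rhs))                          # True: the two quadratic forms agree
```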
3 Product of Experts Interpretation

At an individual location, we have a model P_t(Y_t) for the augmented vector Y_t. To define a joint distribution on X, we set

P(X) = \frac{1}{Z} \prod_t P_t(Y_t),   (3.1)

where Z is a normalization constant (known in statistical physics as the partition function). This is the product of experts construction (Hinton, 1999). One can also think of this as a Markov random field construction where P(X) \propto \exp(-E(X)) and E(X) = -\sum_t \log P_t(Y_t). If each P_t(Y_t) is gaussian, then P(X) will also be gaussian, and Z = (2\pi)^{N/2} |C|^{1/2}, where C is the covariance matrix of X.

Again we consider a simple example relating to a scalar AR(1) process, so Y_t = (X_t, X_t - X_{t-1})^T. Let

P_t(Y_t) \propto \exp\left( -\frac{1}{2} \{ a_0 X_t^2 + a_1 (X_t - X_{t-1})^2 \} \right),   (3.2)

with a_0, a_1 > 0. Then we obtain the joint distribution,

P(X) \propto \exp\left( -\frac{1}{2} \Big\{ a_0 \sum_t X_t^2 + a_1 \sum_t (X_t - X_{t-1})^2 \Big\} \right).   (3.3)

C^{-1}, the inverse covariance matrix of X, is circulant with entries a_0 + 2a_1 on the diagonal and -a_1 in the bands above and below the diagonal and in the northeast and southwest corners. For the AR(1) process, X_t = \alpha X_{t-1} + N_t with N_t \sim N(0, \beta^{-1}), we obtain corresponding entries of \beta(1 + \alpha^2) on the diagonal and -\beta\alpha off the diagonal. The overall scale of a_0 and a_1 has the same effect as \beta in setting the variance of the process, but r \stackrel{\mathrm{def}}{=} a_0/a_1 = (1-\alpha)^2/\alpha, so for any given \alpha value, there is a corresponding value of r.¹

For the gaussian case with expert t involving interactions between X_t and X_{t-p}, we obtain a quadratic form with the same pattern of banding as in the inverse covariance matrix of an AR(p) process, but as above, for some choices of parameters there may not be a corresponding AR process.

Again this construction can be extended to two (or more) dimensions. For example, in 2D we might consider the variable X_{i,j} and the differences to its four neighbors to the north, south, east, and west to obtain a five-dimensional Y vector. Equation 3.1, with each expert being gaussian, then defines a gaussian Markov random field over the lattice of X variables.
¹ Interestingly, for r \in (-4, 0) there are no corresponding values of \alpha. Note that \alpha = 0 \Rightarrow a_1 = 0.
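The correspondence between the gaussian product of experts and the circular AR(1) precision matrix can be checked numerically. The sketch below (not from the note; the parameter values are arbitrary) builds the circulant precision implied by equation 3.3 and compares it with \beta T^T T for the AR(1) process, using a_0 = \beta(1-\alpha)^2 and a_1 = \beta\alpha.

```python
# Product-of-experts precision versus circular AR(1) precision (illustrative values).
import numpy as np

N, alpha, beta = 6, 0.6, 2.0
a0, a1 = beta * (1 - alpha) ** 2, beta * alpha        # choice that matches the AR(1) process

# Precision of X implied by the experts exp(-0.5 [a0 X_t^2 + a1 (X_t - X_{t-1})^2]).
Q_poe = np.zeros((N, N))
for t in range(N):
    tm1 = (t - 1) % N
    Q_poe[t, t] += a0 + a1
    Q_poe[tm1, tm1] += a1
    Q_poe[t, tm1] -= a1
    Q_poe[tm1, t] -= a1

# Precision of the circular AR(1) process: beta * T^T T, where N = T X.
T = np.eye(N)
for t in range(N):
    T[t, (t - 1) % N] = -alpha
Q_ar = beta * T.T @ T

print(np.allclose(Q_poe, Q_ar))   # True: a0 + 2*a1 = beta*(1 + alpha^2) and -a1 = -beta*alpha
```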
4 Incorporating Hidden State

In speech recognition using HMMs, the Y_t's are modeled as conditionally independent given the discrete hidden variable s_t. We now consider how this affects the interpretations given above.

For interpretation 1, we consider a switching AR(p) process or AR-HMM (see, e.g., Woodland, 1992), so that X_t depends on X_{t-1}, \ldots, X_{t-p} and also s_t. For example, using gaussian noise and setting s_t = k, we have X_t \sim N(\sum_{i=1}^{p} A_i^k X_{t-i}, \Sigma_k). Notice that the AR model parameters now depend on the switching variable. However, we can still write the prediction \sum_{i=1}^{p} A_i^k X_{t-i} as a linear combination of the Y_t^i's, so the likelihood can be written in the form of "independent" contributions from the Y_t's. Note that the usual forward and backward HMM recursions can be carried out for the AR-HMM.

For interpretation 2, we have the individual component densities P_t(Y_t | s_t), and the joint distribution

P(X | s) = \frac{1}{Z(s)} \prod_t P_t(Y_t | s_t),   (4.1)

where s = (s_0, \ldots, s_{N-1}). Notice that the normalization constant in general depends on s, and thus when given X, the computation of P(X | s) depends not only on the component densities but also on Z(s). However, if P_t(Y_t | s_t) is gaussian and has the same covariance structure but different means depending on s_t for all t, then Z would turn out to be independent of s.

While writing this article, I became aware of the work of Tokuda, Zen, and Kitamura (2003), who correctly derive the product of gaussian experts construction conditional on s and note the general dependence of Z(s) on s. They also observe that use of the Viterbi algorithm to find the state sequence s that maximizes P(s) \prod_t P_t(Y_t | s_t) (which is easily done with standard dynamic programming techniques) will not, in general, yield the sequence that maximizes P(s | X), because of the Z(s) term.

Most practical HMM-based speech recognition systems use mixtures of gaussians to model the Y_t's at each frame. The product of experts interpretation readily handles this situation. For an AR model interpretation, the use of a mixture distribution for the Y_t's already suggests a switching AR process with the switching variable hidden.
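The dependence of Z(s) on the state sequence can be made concrete with a small numerical sketch (not from the note; the ring length, expert covariances, and state sequences below are arbitrary). With zero-mean gaussian experts on Y_t = (X_t, X_t - X_{t-1})^T, the product over t is an unnormalized gaussian in X, so Z(s) is a gaussian integral that can be evaluated exactly; two state sequences with different expert covariances give different values.

```python
# Z(s) for a product of gaussian experts on a ring (illustrative sketch).
import numpy as np

N = 6
covs = [np.array([[1.0, 0.3], [0.3, 1.0]]),           # state 0 expert covariance for Y_t
        np.array([[2.0, 0.0], [0.0, 0.5]])]           # state 1 expert covariance for Y_t

D = np.zeros((N, 2, N))                               # D[t] maps X to Y_t = (X_t, X_t - X_{t-1})
for t in range(N):
    D[t, 0, t] = 1.0
    D[t, 1, t] = 1.0
    D[t, 1, (t - 1) % N] = -1.0

def log_Z(s):
    """log Z(s) = log of the integral over X of prod_t N(Y_t(X); 0, C_{s_t})."""
    Q = np.zeros((N, N))                              # precision of the induced gaussian in X
    log_norm = 0.0                                    # sum of the experts' log normalizers
    for t, k in enumerate(s):
        Q += D[t].T @ np.linalg.inv(covs[k]) @ D[t]
        log_norm += -0.5 * np.log(np.linalg.det(2 * np.pi * covs[k]))
    _, logdetQ = np.linalg.slogdet(Q)
    # Gaussian integral: int exp(-0.5 x^T Q x) dx = (2 pi)^{N/2} |Q|^{-1/2}
    return log_norm + 0.5 * N * np.log(2 * np.pi) - 0.5 * logdetQ

print(log_Z([0] * N), log_Z([0, 1] * (N // 2)))       # different values: Z depends on s
```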
5 Discussion

We have described both conditionally specified models (AR processes) and simultaneously specified models (products of experts) to define the joint density² P(X) and relate it to the augmented feature vectors {Y_t}.

² This terminology is derived from Cressie (1993, sec. 6.3).
While this article describes a theoretical framework for understanding why using difference observations makes sense, it would be interesting to examine empirically the question of how well AR and products of experts models characterize the dependencies between time frames or pixel locations.

Acknowledgments

This note was inspired by questions raised by Joe Frankel's Ph.D. thesis. Thanks to John Bridle and Joe Frankel for helpful conversations and comments on earlier drafts, to Joe Frankel for drawing my attention to Tokuda et al. (2003), and to the anonymous referees for their comments, which helped to improve the article.

References

Abend, K., Harley, T. J., & Kanal, L. N. (1965). Classification of binary random patterns. IEEE Transactions on Information Theory, 11(4), 538–544.
Cressie, N. A. C. (1993). Statistics for spatial data. New York: Wiley.
Dunn, D., & Higgins, W. E. (1995). Optimal Gabor filters for texture segmentation. IEEE Transactions on Image Processing, 4(7), 947–964.
Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech and Signal Processing, 34, 52–59.
Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (Vol. 1, pp. 1–6). London: IEE.
Tokuda, K., Zen, H., & Kitamura, T. (2003). Trajectory modelling based on HMMs with the explicit relationship between static and dynamic features. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003). International Speech Communication Association.
Woodland, P. C. (1992). Hidden Markov models using vector linear prediction and discriminative output distributions. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 509–512). Piscataway, NJ: IEEE.

Received February 3, 2004; accepted May 28, 2004.
NOTE
Communicated by Thorsten Joachims
Convergence of the IRWLS Procedure to the Support Vector Machine Solution

Fernando Pérez-Cruz
[email protected]
Gatsby Computational Neuroscience Unit, 17 Queen Square, London WC1N 3AR, U.K., and Department of Signal Theory and Communications, University Carlos III in Madrid, Avda Universidad 30, 28911 Leganés (Madrid), Spain

Carlos Bousoño-Calzón
[email protected]

Antonio Artés-Rodríguez
[email protected]
Department of Signal Theory and Communications, University Carlos III in Madrid, Avda Universidad 30, 28911 Leganés (Madrid), Spain
An iterative reweighted least squares (IRWLS) procedure recently proposed is shown to converge to the support vector machine solution. The convergence to a stationary point is ensured by modifying the original IRWLS procedure.

1 Introduction

Support vector machines (SVMs) are state-of-the-art tools for linear and nonlinear input-output knowledge discovery (Vapnik, 1998; Schölkopf & Smola, 2001). The SVM relies on the minimization of a quadratic problem, which is frequently solved using quadratic programming (QP) (Burges, 1998). The iterative reweighted least square (IRWLS) procedure for solving SVM for classification was introduced in Pérez-Cruz, Navia-Vázquez, Rojo-Álvarez, and Artés-Rodríguez (1999) and Pérez-Cruz, Navia-Vázquez, Alarcón-Diana, and Artés-Rodríguez (2001), and it was used in Pérez-Cruz, Alarcón-Diana, Navia-Vázquez, and Artés-Rodríguez (2000) to construct the fastest SVM solver of the time. It solves a sequence of weighted least-square problems that, unlike other least-square procedures such as Lagrangian SVMs (Mangasarian & Musicant, 2000) or least square SVMs (Suykens & Vandewalle, 1999; Van Gestel et al., 2004), leads to the true SVM solution, as we will show here. However, to prove its convergence to the SVM solution, the IRWLS procedure has to be modified with respect to the formulation that appears in Pérez-Cruz et al. (1999, 2001). The IRWLS has been also proposed for solving regression problems (Pérez-Cruz, Navia-Vázquez, Alarcón-Diana, & Artés-Rodríguez, 2000).
Although we will deal only with the IRWLS for classification, the extension of this proof to regression is straightforward.

The article is organized as follows. We prove the convergence of the IRWLS procedure to the SVM solution in section 2 and summarize the algorithmic implementation of it in section 3. We conclude with some comments in section 4.

2 Proof of Convergence of the IRWLS Algorithm to the SVC Solution

The support vector classifier (SVC) seeks to compute the dependency between a set of patterns x_i \in R^d (i = 1, \ldots, n) and its corresponding labels y_i \in \{\pm 1\}, given a transformation to a feature space \phi(\cdot) (\phi: R^d \to R^H, with d \leq H). The SVC solves

\min_{w, \xi_i, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i(\phi^T(x_i) w + b) \geq 1 - \xi_i, \quad \xi_i \geq 0,

where w and b define the linear classifier in the feature space (nonlinear in the input space, unless \phi(x) = x) and C is the penalty applied over training errors. This problem is equivalent to the following unconstrained problem, in which we need to minimize

L_P(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} L(u_i)   (2.1)

with respect to w and b, where u_i = 1 - y_i(\phi^T(x_i) w + b) and L(u) = \max(u, 0). To prove the convergence of the algorithm, we need L_P(w, b) to be both continuous and differentiable; therefore, we replace L(u) by a smooth approximation,

L(u) = \begin{cases} 0, & u < 0 \\ K u^2 / 2, & 0 \leq u < 1/K \\ u - 1/(2K), & u \geq 1/K, \end{cases}

which tends to \max(u, 0) as K approaches infinity (\lim_{K \to \infty} L(u) = \max(u, 0)).
Being a convex problem, the SVM solution is achieved at w* and b*, which make the gradient vanish,

\nabla L_P(w^*, b^*) = \begin{pmatrix} \nabla_w L_P(w^*, b^*) \\ \nabla_b L_P(w^*, b^*) \end{pmatrix} = \begin{pmatrix} w^* - C \sum_{i=1}^{n} \phi(x_i) y_i \left. \frac{dL(u)}{du} \right|_{u_i^*} \\ -C \sum_{i=1}^{n} y_i \left. \frac{dL(u)}{du} \right|_{u_i^*} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   (2.2)

where u_i^* = 1 - y_i(\phi^T(x_i) w^* + b^*).

Optimization problems are solved using iterative procedures that rely in each iteration on the previous solution (w^k and b^k, in our case) to obtain the following one, until the optimal solution has been reached. To construct the IRWLS procedure, we modify equation 2.1 using a first-order Taylor expansion of L(u) over the previous solution, as is common in other optimization procedures (Nocedal & Wright, 1999). This leads to

L_P'(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} (u_i - u_i^k) \right],

where u_i^k = 1 - y_i(\phi^T(x_i) w^k + b^k); by construction, L_P'(w^k, b^k) = L_P(w^k, b^k) and \nabla L_P'(w^k, b^k) = \nabla L_P(w^k, b^k). Now, we construct a quadratic approximation L_P''(w, b) imposing that L_P''(w^k, b^k) = L_P'(w^k, b^k) and \nabla L_P''(w^k, b^k) = \nabla L_P'(w^k, b^k), leading to

L_P''(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i)^2 - (u_i^k)^2}{2 u_i^k} \right]
            = \frac{1}{2} \|w\|^2 + \frac{1}{2} \sum_{i=1}^{n} a_i (1 - y_i(\phi^T(x_i) w + b))^2 + CT,   (2.3)

where

a_i = \frac{C}{u_i^k} \left. \frac{dL(u)}{du} \right|_{u_i^k} = \begin{cases} 0, & u_i^k < 0 \\ KC, & 0 \leq u_i^k < 1/K \\ C / u_i^k, & u_i^k \geq 1/K \end{cases}

and CT are constant terms that do not depend on w or b. The IRWLS procedure consists in minimizing equation 2.3 and then recomputing a_i with the obtained solution, and continuing until the solution has been reached. We will focus on the algorithmic implementation in the following section; meanwhile, we will demonstrate the following items to prove that the IRWLS procedure converges to the SVM solution:
• The sequence (w^0, b^0), \ldots, (w^k, b^k), \ldots converges to (w^{op}, b^{op}).
• w^{op} = w^* and b^{op} = b^*.

First, we need to prove that the sequence of solutions converges to a limiting point in solution space (w^{op}, b^{op}). Then we need to assess that this limit point corresponds with the SVM solution in equation 2.2.

Line search algorithms, for advancing toward the optimum, look in the functional being minimized for a descending direction p^k and modify the previous solution z^k by an amount \eta^k to obtain the following one, z^{k+1} = z^k + \eta^k p^k. The Wolfe conditions (Nocedal & Wright, 1999) ensure that line search methods make sufficient progress in each iteration, so the limit point is reached with the required precision:

L_P(z^k + \eta^k p^k) \leq L_P(z^k) + c_1 \nabla L_P(z^k)^T p^k   (2.4)
\nabla L_P(z^k + \eta^k p^k)^T p^k \geq c_2 \nabla L_P(z^k)^T p^k   (2.5)

for 0 < c_1 < c_2 < 1. The Wolfe conditions can be applied to the IRWLS procedure because we can describe it as a line search method, with z^k = [(w^k)^T \; b^k]^T and p^k = [(w^s - w^k)^T \; (b^s - b^k)]^T, where w^s and b^s represent the minimum of the weighted least-squares problem in equation 2.3.

To prove the first Wolfe condition, also known as the strictly decreasing property, we will first show that L_P(z^k) > L_P(z^k + \eta^k p^k) = L_P(z^{k+1}). We know that L_P(w^k, b^k) = L_P''(w^k, b^k) and, w^s and b^s being the minimum of equation 2.3, L_P''(w^k, b^k) \geq L_P''(w^s, b^s); equality will hold only if w^s = w^k and b^s = b^k, due to the fact that we are solving a least-squares problem. Consequently, L_P''(w^k, b^k) \geq L_P''(w^{k+1}, b^{k+1}) for all \eta^k \in (0, 1], because (w^{k+1}, b^{k+1}) is a convex combination of (w^k, b^k) and (w^s, b^s) and L_P''(w, b) is a convex functional, and equality will hold only if w^{k+1} = w^k = w^s and b^{k+1} = b^k = b^s.

Now we will set \eta^k to enforce that L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}), to guarantee that L_P(w^k, b^k) = L_P''(w^k, b^k) > L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}). To show that L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}), it is sufficient to prove that

L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i^{k+1})^2 - (u_i^k)^2}{2 u_i^k} \geq L(u_i^{k+1}) \quad \forall i = 1, \ldots, n.

For u_i^k \geq 0, L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i)^2 - (u_i^k)^2}{2 u_i^k} is tangent to L(u_i) at u_i = u_i^k, its minimum is attained at u_i = 0, and its minimal value is greater than or equal to zero. Therefore, in this case, L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i)^2 - (u_i^k)^2}{2 u_i^k} \geq L(u_i) for any u_i \in R. We show an example for u_i^k = 1 in Figure 1.

For u_i^k < 0, we need to ensure that L(u_i^{k+1}) \leq 0, which can be obtained only for u_i^{k+1} \leq 0. As (w^{k+1}, b^{k+1}) is a convex combination of (w^k, b^k) and (w^s, b^s), u_i^{k+1} can be greater than zero only if u_i^s > 0. For the samples whose u_i^k < 0 and u_i^s > 0, we will need to set \eta_i^k \leq \frac{u_i^k}{u_i^k - u_i^s} to ensure that u_i^{k+1} \leq 0, and it can be easily checked that 0 < \eta_i^k < 1.
Figure 1: The dash-dotted line represents the actual SVM loss function L(u). The dashed line represents the approximation to the loss function used in equation 2.3 when u_i^k = 1, and the solid line represents this approximation when u_i^k < 0.
Then if we set

\eta^k = \min_{i \in S} \frac{u_i^k}{u_i^k - u_i^s},   (2.6)

where S = \{ i \,|\, u_i^k < 0 \;\&\; u_i^s > 0 \}, we will ensure that L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}). In the case S = \emptyset, we set w^{k+1} = w^s and b^{k+1} = b^s (i.e., \eta^k = 1), which proves that L_P(z^k + \eta^k p^k) < L_P(z^k). Now we can set c_1 \in (0, c_1^*] to fulfill equation 2.4, where

c_1^* = \frac{L_P(z^k + \eta^k p^k) - L_P(z^k)}{\nabla L_P(z^k)^T p^k}

is greater than zero because \nabla L_P(z^k)^T p^k < 0; otherwise, p^k would not be a descending direction.

Before proving the second Wolfe condition for the IRWLS, let us rewrite the first-order approximation L_P'(w, b) as follows:

L_P'(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} y_i [\phi^T(x_i)(w^k - w) + (b^k - b)] \right],

and let us define

\tilde{L}_P'(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^{k+1}) + \left. \frac{dL(u)}{du} \right|_{u_i^{k+1}} y_i [\phi^T(x_i)(w^{k+1} - w) + (b^{k+1} - b)] \right],
which is equivalent to L_P'(w, b) but defined over the actual solution instead. L_P(w, b) being convex, it can be readily seen that L_P(w, b) \geq L_P'(w, b) and L_P(w, b) \geq \tilde{L}_P'(w, b) for all w \in R^H and b \in R. As (w^{k+1}, b^{k+1}) is a convex combination of (w^k, b^k) and (w^s, b^s), we can rewrite p^k = [(w^s - w^k)^T \; (b^s - b^k)]^T = [(w^{k+1} - w^k)^T \; (b^{k+1} - b^k)]^T / \eta^k, leading in the left-hand side of equation 2.5 to

\eta^k (\nabla L_P(z^{k+1})^T p^k)
  = (w^{k+1})^T (w^{k+1} - w^k) - C \sum_{i=1}^{n} \left. \frac{dL(u)}{du} \right|_{u_i^{k+1}} y_i [\phi^T(x_i)(w^{k+1} - w^k) + (b^{k+1} - b^k)]
  = \|w^{k+1}\|^2 - (w^{k+1})^T w^k + \frac{1}{2} \|w^k\|^2 + C \sum_{i=1}^{n} L(u_i^{k+1}) - \tilde{L}_P'(w^k, b^k)
  = \frac{1}{2} \|w^{k+1} - w^k\|^2 + L_P(w^{k+1}, b^{k+1}) - \tilde{L}_P'(w^k, b^k).

We now repeat the same algebraic transformations over the right-hand side of equation 2.5, leading to
\eta^k (\nabla L_P(z^k)^T p^k)
  = (w^k)^T (w^{k+1} - w^k) - C \sum_{i=1}^{n} \left. \frac{dL(u)}{du} \right|_{u_i^k} y_i [\phi^T(x_i)(w^{k+1} - w^k) + (b^{k+1} - b^k)]
  = (w^{k+1})^T w^k - \|w^k\|^2 - \frac{1}{2} \|w^{k+1}\|^2 - C \sum_{i=1}^{n} L(u_i^k) + L_P'(w^{k+1}, b^{k+1})
  = -\frac{1}{2} \|w^{k+1} - w^k\|^2 - L_P(w^k, b^k) + L_P'(w^{k+1}, b^{k+1}).

We now show that

\frac{\frac{1}{2}\|w^{k+1} - w^k\|^2 + L_P(w^{k+1}, b^{k+1}) - \tilde{L}_P'(w^k, b^k)}{\eta^k} > \frac{-\frac{1}{2}\|w^{k+1} - w^k\|^2 - L_P(w^k, b^k) + L_P'(w^{k+1}, b^{k+1})}{\eta^k},

which is equivalent to

\|w^{k+1} - w^k\|^2 + [L_P(w^{k+1}, b^{k+1}) - L_P'(w^{k+1}, b^{k+1})] + [L_P(w^k, b^k) - \tilde{L}_P'(w^k, b^k)] > 0,

because \eta^k \in (0, 1]. The terms L_P(w^{k+1}, b^{k+1}) - L_P'(w^{k+1}, b^{k+1}) and L_P(w^k, b^k) - \tilde{L}_P'(w^k, b^k) are equal to or greater than zero because the loss function is convex. Moreover, \|w^{k+1} - w^k\|^2 \geq 0, and it is zero only if w^{k+1} = w^k. Therefore, if we are not at the solution, \nabla L_P(z^{k+1})^T p^k > \nabla L_P(z^k)^T p^k. Now, we can set c_2 \in [c_2^*, 1) to fulfill equation 2.5,¹ where c_2^* = \frac{\nabla L_P(z^{k+1})^T p^k}{\nabla L_P(z^k)^T p^k} is less than one because \nabla L_P(z^k)^T p^k < 0; otherwise, p^k would not be a descending direction.

We now need to prove that the proposed algorithm stops when the gradient of L_P(w, b) vanishes. The Zoutendijk condition (Nocedal & Wright, 1999) tells us that if L_P(w, b) is bounded below and is Lipschitz continuous,² and the optimization procedure fulfills the Wolfe conditions, then \|\nabla L_P(w^k, b^k)\|^2 \cos^2 \theta_k \to 0 as k \to \infty, where \cos \theta_k = -\frac{\nabla L_P(w^k, b^k)^T p^k}{\|\nabla L_P(w^k, b^k)\| \, \|p^k\|}. If we prove that \theta_k does not tend to \pi/2 as k \to \infty, we would have proven that the gradient of L_P(w, b) vanishes and that the proposed algorithm converges to a minimum.

¹ If c_2^* < 0, the minimum value c_2 can take is c_1, and if c_1^* > 1, the highest value c_1 can take is c_2, but this does not affect the given proof.
² L_P(w, b) is equal to or greater than zero, and it is Lipschitz continuous, because we made it differentiable.

Finally, we would need to prove that the achieved solution corresponds to the SVM solution, which we will first prove. The minimum of equation 2.3 is obtained by solving the following linear system:

\begin{pmatrix} w - \sum_{i=1}^{n} \phi(x_i) y_i a_i (1 - y_i(\phi^T(x_i) w + b)) \\ -\sum_{i=1}^{n} y_i a_i (1 - y_i(\phi^T(x_i) w + b)) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.   (2.7)
The IRWLS procedure stops when w^s = w^k and b^s = b^k; if we replace them in equation 2.7, we are led to

\begin{pmatrix} w^s - \sum_{i=1}^{n} \phi(x_i) y_i \frac{C}{u_i^k} \left. \frac{dL(u)}{du} \right|_{u_i^k} (1 - y_i(\phi^T(x_i) w^s + b^s)) \\ -\sum_{i=1}^{n} y_i \frac{C}{u_i^k} \left. \frac{dL(u)}{du} \right|_{u_i^k} (1 - y_i(\phi^T(x_i) w^s + b^s)) \end{pmatrix} = \begin{pmatrix} w^s - C \sum_{i=1}^{n} \phi(x_i) y_i \left. \frac{dL(u)}{du} \right|_{u_i^s} \\ -C \sum_{i=1}^{n} y_i \left. \frac{dL(u)}{du} \right|_{u_i^s} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   (2.8)

which is equal to equation 2.2. Consequently, the IRWLS algorithm stops when it has reached the SVM solution.

To prove the sufficient condition, we need to show that if w^k = w^* and b^k = b^*, the IRWLS has stopped. Suppose it has not; then we can find w^s \neq w^k and b^s \neq b^k such that L_P(w^k, b^k) > L_P(w^s, b^s), and the strictly decreasing property will lead to L_P(w^*, b^*) > L_P(w^s, b^s), which is a contradiction because w^* and b^* give the minimum of L_P(w, b). We have just proven that if the IRWLS has stopped, we will be at the SVM solution, and if we are at the SVM solution, the IRWLS has stopped.

Finally, we do not need to prove that \theta_k does not tend to \pi/2, because we have just shown that the algorithm stops iff we are at the SVM solution and, consequently, this is the point at which the gradient of L_P(w, b) vanishes. That was what we still needed to prove, ending the proof of convergence.

3 Iterative Reweighted Least Squares for Support Vector Classifiers

The IRWLS procedure, when introduced in Pérez-Cruz et al. (1999), did not consider the modification to ensure convergence presented in the previous section (i.e., \eta^k < 1 in some iterations). We will now describe the algorithmic implementation of the procedure. But before presenting the algorithm, let us rewrite equation 2.7 in matrix form,

\begin{pmatrix} \Phi^T D_a \Phi + I & \Phi^T a \\ a^T \Phi & a^T 1 \end{pmatrix} \begin{pmatrix} w \\ b \end{pmatrix} = \begin{pmatrix} \Phi^T D_a y \\ a^T y \end{pmatrix},   (3.1)

where \Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]^T, y = [y_1, \ldots, y_n]^T, a = [a_1, \ldots, a_n]^T, (D_a)_{ij} = a_i \delta_{ij} (\forall i, j = 1, \ldots, n), I is the identity matrix, and 1 is a column vector of n ones. This system can be solved using kernels, as well as the regular SVM, by imposing that w = \sum_i \phi(x_i) y_i \alpha_i and \sum_i \alpha_i y_i = 0. These conditions can be obtained from the regular SVM solution (KKT conditions; see Schölkopf & Smola, 2001, for further details).
Also, they can be derived from equation 2.2, in which the \alpha_i have replaced the derivative of L(u_i). The system in equation 3.1 becomes

\begin{pmatrix} H + D_a^{-1} & 1 \\ y^T & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ b \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix},   (3.2)

where (H)_{ij} = k(x_i, x_j) = \phi^T(x_i)\phi(x_j) and k(\cdot, \cdot) is the kernel of the nonlinear transformation \phi(\cdot) (Schölkopf & Smola, 2001). The steps to derive equation 3.2 from equation 3.1 can be found in Pérez-Cruz et al. (2001). The IRWLS can be summarized in the following steps:

1. Initialization: set k = 0, \alpha^0 = 0, b^0 = 0, and u_i^0 = 1 \; \forall i = 1, \ldots, n.
2. Solve equation 3.2 to obtain \alpha^s and b^s.
3. Compute u_i^s. Construct S = \{ i \,|\, u_i^k < 0 \;\&\; u_i^s > 0 \}. If S = \emptyset, set \alpha^{k+1} = \alpha^s and b^{k+1} = b^s and go to step 5.
4. Compute \eta^k, using equation 2.6, and \alpha^{k+1} and b^{k+1}. If L(\alpha^{k+1}, b^{k+1}) > L(\alpha^s, b^s), set \alpha^{k+1} = \alpha^s and b^{k+1} = b^s.
5. Set k = k + 1 and go to step 2 until convergence.

The modification in the third step helps to further decrease the SVM functional. The value of \eta^k in equation 2.6 is a sufficient condition but not a necessary one, and in some cases, \alpha^s and b^s can produce a further decrease in L(\alpha, b) than using \alpha^{k+1} and b^{k+1}. It is worth pointing out that the solution achieved in the first step coincides with the least-square support vector machine solution (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2003), as D_a is the identity matrix multiplied by C. In a way, we can say that the starting point of the IRWLS procedure is the LS-SVM solution.
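To make the loop concrete, here is a minimal NumPy sketch of the steps above for the special case of a linear classifier, \phi(x) = x, working directly with the weighted least-squares problem of equation 2.3 in (w, b) rather than with the kernel system of equation 3.2. It is an illustrative reading of the procedure, not the authors' MATLAB implementation; the toy data, C, and K are arbitrary.

```python
# IRWLS for a linear SVC (illustrative sketch of steps 1-5 above).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]
n, d = X.shape
C, K = 10.0, 1e6

def L_smooth(u):                         # smoothed loss L(u)
    return np.where(u < 0, 0.0, np.where(u < 1 / K, 0.5 * K * u ** 2, u - 0.5 / K))

def LP(w, b):                            # primal functional, equation 2.1
    return 0.5 * w @ w + C * np.sum(L_smooth(1 - y * (X @ w + b)))

def weights(u):                          # a_i of equation 2.3
    a = np.where(u < 1 / K, K * C, C / np.maximum(u, 1 / K))
    return np.where(u < 0, 0.0, a)

w, b = np.zeros(d), 0.0                  # step 1 (u_i^0 = 1, so all a_i = C)
for k in range(100):
    u = 1 - y * (X @ w + b)
    a = weights(u)
    # Step 2: minimize 0.5||w||^2 + 0.5 sum_i a_i (1 - y_i(x_i^T w + b))^2 over (w, b).
    G = np.hstack([X * y[:, None], y[:, None]])          # rows (y_i x_i, y_i)
    P = np.diag(np.r_[np.ones(d), 1e-12])                # identity on w; tiny ridge on b for safety
    theta = np.linalg.solve(G.T @ (a[:, None] * G) + P, G.T @ a)
    ws, bs = theta[:d], theta[d]
    # Steps 3-4: line search factor eta^k of equation 2.6, then keep the better point.
    us = 1 - y * (X @ ws + bs)
    S = (u < 0) & (us > 0)
    eta = np.min(u[S] / (u[S] - us[S])) if S.any() else 1.0
    w_new, b_new = w + eta * (ws - w), b + eta * (bs - b)
    if LP(w_new, b_new) > LP(ws, bs):
        w_new, b_new = ws, bs
    if np.allclose(np.r_[w_new, b_new], np.r_[w, b], atol=1e-10):
        break                                            # step 5: stop at convergence
    w, b = w_new, b_new

print(k, LP(w, b), w, b)
```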
intermediate iteration uki < 0, the algorithm will recover it. The IRWLS changes several constraints from active to inactive in one iteration, while most QP algorithms stop when a constraint changes. We illustrate this property with a simple example in Figure 2. Third, we point out that the value of ηk is 1 in most iterations; only seldom does a sample change from uki < 0 to usi > 0 and L(wk+1 , bk+1 ) ≤ L(ws , bs ), which explains why the IRWLS was working correctly without this modification. The role of K can be analyzed from the proof and implementation perspectives. A finite value of K allows demonstrating that the IRWLS procedure converges to the SVM solution. If K was infinite, the functional would not be differentiable, and although equation 2.2 and 2.8 will be equal, that would not mean that wop (bop ) is equal to w∗ (b∗ ). From the implementation viewpoint, we are adding at least 1/K to the diagonal of H; therefore, if H is nearly singular (as it is for most problems of interest—even for infinite VC dimension kernels), a finite value of K would avoid numerical instabilities. We usually fix K between 104 and 1010 , depending on the machine precision,
Figure 2: (a–e) The intermediate solution of the IRWLS procedure for a simple problem, respectively, for iterations 1, 2, 3, 8, and 18 (final). (f) The value of L_P for every iteration. The squares represent the negative-class samples and the circles the positive-class samples. The solid circles and squares are those samples whose u_i^{k+1} > 0. The solid line is the classification boundary, and the dashed lines represent the ±1 margins. It can be seen that in the first step, almost half of the samples change from u_i^0 > 0 to u_i^1 < 0, significantly advancing toward the optimum and reducing the complexity of subsequent iterations. The solution in iteration 8 is almost equal to the solution in iteration 18; in these intermediate iterations, the algorithm is only fine-tuning the values of \alpha_i.
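The numerical role of K discussed above can be seen in a two-line experiment. The sketch below (not from the paper; the RBF kernel, data, and K = 10^6 are arbitrary choices) compares the condition number of a nearly singular kernel matrix H with that of H + I/K.

```python
# Effect of adding 1/K to the diagonal of a nearly singular kernel matrix (illustrative).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
H = np.exp(-sq / 2.0)                        # RBF kernel matrix, close to singular
K = 1e6
print(np.linalg.cond(H))                     # very large condition number
print(np.linalg.cond(H + np.eye(200) / K))   # much better conditioned
```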
A software package in MATLAB can be downloaded from our web page (http://www.gatsby.ucl.ac.uk/∼fernando).

4.1 Extending the IRWLS Procedure. We have proven the convergence of the IRWLS procedure to the standard SVM solution. This procedure is sufficiently general to be applied to other loss functions. It can be directly applied to any convex, continuous, and differentiable loss function. As we rely on the derivative of the loss to demonstrate that the limiting point of the IRWLS procedure is the solution to the actual functional, we need the first-order Taylor expansion to be a lower bound on the loss function to show the sufficient decreasing property. To complete the IRWLS procedure, one needs to come up with a quadratic approximation, which has to be at least locally an upper bound to the loss function to ensure the strictly decreasing property. This upper bound has to take the same value and derivative that the loss function does at the actual solution to ensure that the sequence of solutions converges to the solution of our functional.

The question that readily arises is whether it can be employed for nonconvex loss functions. First, if it is used at all, it would lead only to a local minimum; another method would be needed to assess the quality of the obtained solution. If the loss function is not convex, the sufficient decreasing property would not hold for every possible solution found by the second step of the IRWLS procedure, and further analysis would be necessary, taking into account the shape of the nonconvex function, to demonstrate the convergence of the algorithm.

Acknowledgments

We kindly thank Chih-Jen Lin for his useful comments and for pointing out the weakness of our previous proofs.

References

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining, 2(2), 121–167.
Mangasarian, O. L., & Musicant, D. R. (2000). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Pérez-Cruz, F., Alarcón-Diana, P. L., Navia-Vázquez, A., & Artés-Rodríguez, A. (2000). Fast training of support vector classifiers. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Pérez-Cruz, F., Navia-Vázquez, A., Alarcón-Diana, P. L., & Artés-Rodríguez, A. (2000). An IRWLS procedure for SVR. In Proceedings of the EUSIPCO'00. Tampere, Finland.
Pérez-Cruz, F., Navia-Vázquez, A., Alarcón-Diana, P. L., & Artés-Rodríguez, A. (2001). SVC-based equalizer for burst TDMA transmissions. Signal Processing, 81(8), 1681–1693.
Pérez-Cruz, F., Navia-Vázquez, A., Rojo-Álvarez, J. L., & Artés-Rodríguez, A. (1999). A new training algorithm for support vector machines. In Proceedings of the Fifth Bayona Workshop on Emerging Technologies in Telecommunications (pp. 116–120). Baiona, Spain.
Schölkopf, B., & Smola, A. (2001). Learning with kernels. Cambridge, MA: MIT Press.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2003). Least squares support vector machines. Singapore: World Scientific.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Van Gestel, T., Suykens, J. A. K., Baesens, B., Vanthienen, S., Dedene, G., De Moor, B., & Vandewalle, J. (2004). Benchmarking least squares support vector machines classifiers. Machine Learning, 54(1), 5–32.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received January 7, 2004; accepted May 25, 2004.
LETTER
Communicated by Bruno Olshausen
Efficient Coding of Time-Relative Structure Using Spikes

Evan Smith
[email protected]
Department of Psychology, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Michael S. Lewicki
[email protected]
Department of Computer Science, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Nonstationary acoustic features provide essential cues for many auditory tasks, including sound localization, auditory stream analysis, and speech recognition. These features can best be characterized relative to a precise point in time, such as the onset of a sound or the beginning of a harmonic periodicity. Extracting these types of features is a difficult problem. Part of the difficulty is that with standard block-based signal analysis methods, the representation is sensitive to the arbitrary alignment of the blocks with respect to the signal. Convolutional techniques such as shift-invariant transformations can reduce this sensitivity, but these do not yield a code that is efficient, that is, one that forms a nonredundant representation of the underlying structure. Here, we develop a non-block-based method for signal representation that is both time relative and efficient. Signals are represented using a linear superposition of time-shiftable kernel functions, each with an associated magnitude and temporal position. Signal decomposition in this method is a nonlinear process that consists of optimizing the kernel function scaling coefficients and temporal positions to form an efficient, shift-invariant representation. We demonstrate the properties of this representation for the purpose of characterizing structure in various types of nonstationary acoustic signals. The computational problem investigated here has direct relevance to the neural coding at the auditory nerve and the more general issue of how to encode complex, time-varying signals with a population of spiking neurons.

1 Introduction

Nonstationary and time-relative acoustic structures such as transients, timing relations among acoustic events, and harmonic periodicities provide essential cues for many types of auditory processing. In sound localization,
human subjects can reliably detect interaural time differences as small as 10 µs, which corresponds to a binaural sound source shift of about 1 degree (Blauert, 1997). In comparison, the sampling interval for an audio CD sampled at 44.1 kHz is 22.7 microseconds. Auditory grouping cues, such as common onset and offset, harmonic comodulation, and sound source location, all rely on accurate representation of timing and periodicity (Slaney & Lyon, 1993). Time-relative structure is also crucial for the recognition of consonants and many types of transient, nonstationary sounds. Neurophysiological research in the auditory brainstem of mammals has found cells capable of conveying precise phase information up to 4 kHz or of tracking the quickly varying envelope of a high-frequency sound (Oertel, 1999).

The importance of these acoustic cues has long been recognized, but extracting them from natural signals still poses many challenges because the problem is fundamentally ill posed. In natural acoustic environments, with multiple sound sources and background noises, acoustic events are not directly observable and must be inferred using numerous ambiguous cues. Another reason for the difficulty in obtaining these cues is that most approaches to signal representation are block based; the signal is processed piecewise in a series of discrete blocks. Transients and nonstationary periodicities in the signal can be temporally smeared across blocks. Large changes in the representation of an acoustic event can occur depending on the arbitrary alignment of the processing blocks with events in the signal. Signal analysis techniques such as windowing or the choice of the transform can reduce these effects, but it would be preferable if the representation was insensitive to signal shifts.

Shift invariance alone, however, is not a sufficient constraint on designing a general sound processing algorithm. Another important constraint is coding efficiency or, equivalently, the ability of the representation to capture underlying structure in the signal. A desirable code should reduce the information rate from the raw signal so that the underlying structures are more directly observable. Signal processing algorithms can be viewed as a method for progressively reducing the information rate until one is left with only the information of interest. We can make a distinction between the observable information rate, or the rate of the observable variables, and the intrinsic information rate, or the rate of the underlying structure of interest. In speech, the observable information rate of the waveform samples is about 50,000 bits per second, but the intrinsic rate of the underlying words is only around 200 bits per second (Rabiner & Levinson, 1981). Information reduction can be achieved by either selecting only the desired information (and discarding everything else) or removing redundancy, such as the temporal correlations between samples. This reduces the observable information rate while preserving the intrinsic information.

In this letter, we investigate algorithms for fitting an efficient, shift-invariant representation to natural sound signals. The outline of the letter is as follows. The next section describes the motivations behind this approach
and illustrates some of the shortcomings of current methods. After defining the model for signal representation, we present different algorithms for signal decomposition and contrast their complexity. Next, we illustrate the properties of the representation on various types of speech sounds. We then present a measure of coding efficiency and compare these algorithms to traditional methods for signal representation. Finally, we discuss the relevance of the computational issues discussed here to spike coding and signal representation at the auditory nerve.

2 Representing Nonstationary Acoustic Structure

Encoding the acoustic signal is the first step in any algorithm for performing an auditory task. There are numerous approaches to this problem, which differ in both their computational complexity and in what aspects of signal structure are extracted. Ultimately, the choice about what the representation encodes depends on the tasks that need to be performed. In the ideal case, the encoding process extracts only that information necessary to perform the task and suppresses noise or unrelated information. A generalist approach, like that taken by most mammalian auditory systems, requires a representation that is efficient for a wide range of signals. As natural sounds contain both relatively stationary harmonic structure (e.g., animal vocalizations) as well as nonstationary transient structure (e.g., crunching leaves and twigs), this generalist approach requires a code capable of efficiently representing these disparate sound classes (Lewicki, 2002a). Here we seek an auditory representation that is useful for a variety of different tasks.

2.1 Block-Based Representations. Most approaches to signal representation are block based, in which signal processing takes place on a series of overlapping, discrete blocks. This not only obscures transients and periodicities in the signal, but can also have the effect that for nonstationary signals, small time shifts can produce large changes in the representation, depending on whether and where a particular acoustic event falls within the block. Figure 1 illustrates the sensitivity of block-based representation to small shifts in speech signals. The upper panel shows a short speech waveform sectioned into blocks using two sequences of Hamming windows (solid and dashed curves). Each window spans approximately 30 msecs (512 samples), and successive blocks (A1, A2, and so on) are shifted by 10 msecs. The B blocks are offset from the A blocks by an amount indicated by the dot-dash vertical lines (∼5 msecs), representing the arbitrary alignment of the signal with respect to the two block sequences. The lower panel shows spectral representations for the three corresponding blocks (solid for the A blocks, dashed for the B blocks). The jagged upper curves show the power spectra for each windowed waveform. The smooth lower curves (offset by −20 dB) show the spectrum of the optimal filter derived by linear predictive coding.
[Figure 1: Block-based analysis of a speech waveform. Upper panel: the signal with the two shifted sequences of Hamming windows (A and B blocks). Lower panel: signal level in dB versus frequency for the corresponding blocks.]
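The block-alignment sensitivity described above is easy to reproduce. The following NumPy sketch (not the authors' code; the signal, sampling rate, and offsets are arbitrary) compares the Hamming-windowed power spectra of two 512-sample blocks whose start times differ by about 5 ms when the block contains a transient.

```python
# Sensitivity of a block-based power spectrum to a small shift of the analysis window.
import numpy as np

fs, blk = 16000, 512
t = np.arange(2 * blk) / fs
signal = np.sin(2 * np.pi * 300 * t)            # a stationary tone
signal[600:620] += 2.0 * np.hanning(20)         # plus a brief transient event

def block_spectrum(x, start):
    frame = x[start:start + blk] * np.hamming(blk)
    return 10 * np.log10(np.abs(np.fft.rfft(frame)) ** 2 + 1e-12)

spec_a = block_spectrum(signal, 350)            # block A
spec_b = block_spectrum(signal, 430)            # block B, shifted by 80 samples (~5 ms)
print(np.max(np.abs(spec_a - spec_b)))          # the spectra differ although the shift is small
```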