Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5792
Osamu Watanabe Thomas Zeugmann (Eds.)
Stochastic Algorithms: Foundations and Applications 5th International Symposium, SAGA 2009 Sapporo, Japan, October 26-28, 2009 Proceedings
Volume Editors
Osamu Watanabe
Tokyo Institute of Technology, Department of Mathematical and Computing Sciences
W8-25, Tokyo 152-8552, Japan
E-mail: [email protected]
Thomas Zeugmann
Hokkaido University, Division of Computer Science
N-14, W-9, Sapporo 060-0814, Japan
E-mail: [email protected]
Library of Congress Control Number: 2009935906
CR Subject Classification (1998): F.2, F.1.2, G.1.2, G.1.6, G.2, G.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-04943-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04943-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12773497 06/3180 543210
Preface
The 5th Symposium on Stochastic Algorithms, Foundations and Applications (SAGA 2009) took place during October 26–28, 2009, at Hokkaido University, Sapporo (Japan). The symposium was organized by the Division of Computer Science, Graduate School of Computer Science and Technology, Hokkaido University. It offered the opportunity to present original research on the design and analysis of randomized algorithms, random combinatorial structures, implementation, experimental evaluation, and real-world application of stochastic algorithms/heuristics. In particular, the focus of the SAGA symposia series is on investigating the power of randomization in algorithms, and on the theory of stochastic processes, especially within realistic scenarios and applications. Thus, the scope of the symposium ranges from the study of theoretical fundamentals of randomized computation to experimental investigations of algorithms/heuristics and related stochastic processes. The SAGA symposium series is a biennial meeting. Previous SAGA symposia took place in Berlin, Germany (2001, LNCS vol. 2264), Hatfield, UK (2003, LNCS vol. 2827), Moscow, Russia (2005, LNCS vol. 3777), and Zürich, Switzerland (2007, LNCS vol. 4665). This year 22 submissions were received, and the Program Committee selected 15 submissions for presentation. All papers were evaluated by at least three members of the Program Committee, partly with the assistance of subreferees. The present volume contains the texts of the 15 papers presented at SAGA 2009, divided into groups of papers on learning, graphs, testing, optimization, and caching, as well as on stochastic algorithms in bioinformatics.
The volume also contains the paper and abstract of the invited talks, respectively:
– Werner Römisch (Humboldt University, Berlin, Germany): Scenario Reduction Techniques in Stochastic Programming
– Taisuke Sato (Tokyo Institute of Technology, Tokyo, Japan): Statistical Learning of Probabilistic BDDs
We would like to thank the many people and institutions who contributed to the success of the symposium. Thanks to the authors of the papers for their submissions, and to the invited speakers, Werner Römisch and Taisuke Sato, for presenting exciting overviews of important recent research developments. We are very grateful to the sponsors of the symposium. SAGA 2009 was supported in part by two MEXT Global COE Programs: Center for Next-Generation Information Technology based on Knowledge Discovery and Knowledge Federation, at the Graduate School of Information Science and Technology, Hokkaido University, and Computationism as a Foundation of the Sciences, at Tokyo Institute of Technology.
We are grateful to the members of the Program Committee for SAGA 2009. Their hard work in reviewing and discussing the papers made sure that we had an interesting and strong program. We also thank the subreferees assisting the Program Committee. Special thanks go to Local Chair Shin-ichi Minato (Hokkaido University, Sapporo, Japan) for the local organization of the symposium. Last but not least, we thank Springer for the support provided by Springer’s Online Conference Service (OCS), and for their excellent support in preparing and publishing this volume of the Lecture Notes in Computer Science series. October 2009
Osamu Watanabe Thomas Zeugmann
Organization
Conference Chair Thomas Zeugmann
Hokkaido University, Sapporo, Japan
Program Committee
Andreas Albrecht          Queen's University Belfast, UK
Amihood Amir              Bar-Ilan University, Israel
Artur Czumaj              University of Warwick, UK
Josep Díaz                Universitat Politècnica de Catalunya, Barcelona, Spain
Walter Gutjahr            University of Vienna, Austria
Juraj Hromkovič           ETH Zürich, Switzerland
Mitsunori Ogihara         University of Miami, USA
Pekka Orponen             Helsinki University of Technology TKK, Finland
Kunsoo Park               Seoul National University, Korea
Francesca Shearer         Queen's University Belfast, UK
Paul Spirakis             University of Patras, Greece
Kathleen Steinhöfel       King's College London, UK
Masashi Sugiyama          Tokyo Institute of Technology, Japan
Osamu Watanabe (Chair)    Tokyo Institute of Technology, Japan
Masafumi Yamashita        Kyushu University, Fukuoka, Japan
Thomas Zeugmann           Hokkaido University, Sapporo, Japan
Local Arrangements Shin-ichi Minato
Hokkaido University, Sapporo, Japan
Subreferees
Marta Arias
Mituhiro Fukuda
Alex Fukunaga
Masaharu Taniguchi
Sponsoring Institutions MEXT Global COE Program Center for Next-Generation Information Technology based on Knowledge Discovery and Knowledge Federation at the Graduate School of Information Science and Technology, Hokkaido University. MEXT Global COE Program Computationism as a Foundation of the Sciences at Tokyo Institute of Technology.
Table of Contents
Invited Papers

Scenario Reduction Techniques in Stochastic Programming (Werner Römisch) . . . . . 1
Statistical Learning of Probabilistic BDDs (Taisuke Sato) . . . . . 15

Regular Contributions

Learning

Learning Volatility of Discrete Time Series Using Prediction with Expert Advice (Vladimir V. V’yugin) . . . . . 16
Prediction of Long-Range Dependent Time Series Data with Performance Guarantee (Mikhail Dashevskiy and Zhiyuan Luo) . . . . . 31
Bipartite Graph Representation of Multiple Decision Table Classifiers (Kazuya Haraguchi, Seok-Hee Hong, and Hiroshi Nagamochi) . . . . . 46
Bounds for Multistage Stochastic Programs Using Supervised Learning Strategies (Boris Defourny, Damien Ernst, and Louis Wehenkel) . . . . . 61
On Evolvability: The Swapping Algorithm, Product Distributions, and Covariance (Dimitrios I. Diochnos and György Turán) . . . . . 74

Graphs

A Generic Algorithm for Approximately Solving Stochastic Graph Optimization Problems (Ei Ando, Hirotaka Ono, and Masafumi Yamashita) . . . . . 89
How to Design a Linear Cover Time Random Walk on a Finite Graph (Yoshiaki Nonaka, Hirotaka Ono, Kunihiko Sadakane, and Masafumi Yamashita) . . . . . 104
Propagation Connectivity of Random Hypergraphs (Robert Berke and Mikael Onsjö) . . . . . 117
Graph Embedding through Random Walk for Shortest Paths Problems (Yakir Berchenko and Mina Teicher) . . . . . 127

Testing, Optimization, and Caching

Relational Properties Expressible with One Universal Quantifier Are Testable (Charles Jordan and Thomas Zeugmann) . . . . . 141
Theoretical Analysis of Local Search in Software Testing (Andrea Arcuri) . . . . . 156
Firefly Algorithms for Multimodal Optimization (Xin-She Yang) . . . . . 169
Economical Caching with Stochastic Prices (Matthias Englert, Berthold Vöcking, and Melanie Winkler) . . . . . 179

Stochastic Algorithms in Bioinformatics

Markov Modelling of Mitochondrial BAK Activation Kinetics during Apoptosis (C. Grills, D.A. Fennell, and S.F.C. Shearer) . . . . . 191
Stochastic Dynamics of Logistic Tumor Growth (S.F.C. Shearer, S. Sahoo, and A. Sahoo) . . . . . 206

Author Index . . . . . 221
Scenario Reduction Techniques in Stochastic Programming
Werner Römisch
Humboldt-University Berlin, Department of Mathematics, Rudower Chaussee 25, 12489 Berlin, Germany
Abstract. Stochastic programming problems appear as mathematical models for optimization problems under stochastic uncertainty. Most computational approaches for solving such models are based on approximating the underlying probability distribution by a probability measure with finite support. Since the computational complexity for solving stochastic programs gets worse when increasing the number of atoms (or scenarios), it is sometimes necessary to reduce their number. Techniques for scenario reduction often require fast heuristics for solving combinatorial subproblems. Available techniques are reviewed and open problems are discussed.
1 Introduction
Many stochastic programming problems may be reformulated in the form

  min { E(f_0(x, ξ)) = ∫_{R^s} f_0(x, ξ) P(dξ) : x ∈ X },   (1)

where X denotes a closed subset of R^m, the function f_0 maps from R^m × R^s to the extended real numbers R̄ = R ∪ {−∞, +∞}, E denotes expectation with respect to P, and P is a probability distribution on R^s.
For example, models of the form (1) appear as follows in economic applications. Let ξ = {ξ_t}_{t=1}^{T} denote a discrete-time stochastic process of d-dimensional random vectors ξ_t at each t ∈ {1, . . . , T} and assume that decisions x_t have to be made such that the total costs appearing in an economic process are minimal. Such an optimization model may often be formulated as

  min { E( Σ_{t=1}^{T} f_t(x_t, ξ_t) ) : x_t ∈ X_t, Σ_{τ=0}^{t−1} A_{tτ}(ξ_t) x_{t−τ} = h_t(ξ_t), t = 1, . . . , T }.   (2)

Typically, the sets X_t are polyhedral, but they may also contain integrality conditions. In addition, the decision vector (x_1, . . . , x_T) has to satisfy a dynamic constraint (i.e., x_t depends on the former decisions) and certain balancing conditions. The matrices A_{tτ}(ξ_t), τ = 0, . . . , t − 1 (e.g., containing technical parameters), and the right-hand sides h_t(ξ_t) (e.g., demands) are (partially) random. The functions f_t describe the costs at time t and may also be (partially) random

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 1–14, 2009.
© Springer-Verlag Berlin Heidelberg 2009
(e.g., due to uncertain market prices). The objective in (2) consists in minimizing the expected total costs. The time t = 1 represents the present; hence, we assume that ξ_1 is deterministic and require that the decision x_1 is deterministic, too. The latter condition is modeled by the constraint x_1 = E(x_1). Then we may reformulate the stochastic program (2) as the optimization model

  min { f_1(x_1, ξ_1) + E( Φ(x_1, ξ̂) ) : x_1 = E(x_1), x_1 ∈ X_1, A_{10} x_1 = h_1(ξ_1) }   (3)

for the decision x_1 at time t = 1. With ξ̂ := (ξ_2, . . . , ξ_T) denoting the uncertain future input process, the function Φ is defined by

  Φ(x_1, ξ̂) := inf E( Σ_{t=2}^{T} f_t(x_t, ξ_t) : x_t ∈ X_t, Σ_{τ=0}^{t−1} A_{tτ} x_{t−τ} = h_t(ξ_t), t = 2, . . . , T ).   (4)

A solution of (3) minimizes the first period costs and also the minimal expected future costs. It is called a first-stage solution, while a (stochastic) solution (x_2, . . . , x_T) of (4) is a second-stage solution. Consequently, the model (3) is named a two-stage stochastic program. We note that any first-stage solution depends on the probability distribution of the stochastic process ξ.
An often more realistic condition is to require that the decision x_t at time t depends only on the available information (ξ_1, . . . , ξ_t) and that the information flow evolves over time. This requirement is modeled by the constraints

  x_t = E(x_t | ξ_1, ξ_2, . . . , ξ_t)   (t = 1, . . . , T),   (5)

which have to be incorporated into the constraint sets of (2) and (4), respectively. The expression E(· | ξ_1, ξ_2, . . . , ξ_t) on the right-hand side of (5) is the conditional expectation with respect to the random vector (ξ_1, ξ_2, . . . , ξ_t) (and is assumed to be well defined). The constraints (5) are often called nonanticipativity constraints, and (2) including (5) is a multi-stage stochastic program. We note that the constraint for t = 1 in (5) coincides with the former condition x_1 = E(x_1) in (2).
If we set x = x_1 and

  X := { x ∈ X_1 : x = E(x), A_{10} x = h_1(ξ_1) },
  f_0(x, ξ) := f_1(x, ξ_1) + Φ(x, ξ̂)  if x ∈ X and Φ(x, ξ̂) is finite,  and  f_0(x, ξ) := +∞  otherwise,

the optimization model (2) is of the form (1). Suitable assumptions on ξ together with duality arguments often imply Φ(x, ξ̂) > −∞. In that case, finiteness of Φ(x, ξ̂) is guaranteed if a feasible decision (x_2, . . . , x_T) exists for the given x_1.
Most of the approaches for solving (1) numerically are based on replacing the probability distribution P by a discrete distribution with finite support {ξ^1, . . . , ξ^N}, where the atom or scenario ξ^i appears with probability p_i > 0, i = 1, . . . , N, and it holds Σ_{i=1}^{N} p_i = 1. The approximate stochastic program associated with (1) then is of the form
  min { Σ_{i=1}^{N} p_i f_0(x, ξ^i) : x ∈ X }.   (6)
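To make (6) concrete, the following Python sketch approximates a toy instance of (1): a newsvendor-style cost f_0 is averaged over N sampled demand scenarios with equal probabilities p_i = 1/N, and the resulting sample-average objective is minimized over a grid of decisions. All parameters and the demand distribution are invented for the illustration and do not come from the paper.

```python
import random

# hypothetical newsvendor-style cost f0(x, xi): pay c per unit ordered,
# earn r per unit sold, so the cost is c*x - r*min(x, xi)
def f0(x, xi, c=1.0, r=2.5):
    return c * x - r * min(x, xi)

random.seed(0)
N = 1000
scenarios = [random.uniform(50, 150) for _ in range(N)]  # demand scenarios xi^i
p = [1.0 / N] * N                                        # probabilities p_i

def expected_cost(x):
    # the objective of (6): sum_i p_i * f0(x, xi^i)
    return sum(pi * f0(x, xi) for pi, xi in zip(p, scenarios))

# crude search over a grid of first-stage decisions x in X = {0, 1, ..., 200}
best_x = min(range(0, 201), key=expected_cost)
print(best_x, round(expected_cost(best_x), 2))
```

For this cost structure the minimizer should sit near the (r − c)/r quantile of the demand distribution, illustrating how the quality of the discrete approximation directly determines the quality of the first-stage decision.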
In Section 2 we discuss how the discrete distribution should approximate the original probability distribution P. It turns out that properties of the integrand f_0(x, ·) as a function of ξ together with properties of P characterize the kind of approximation needed in stochastic programming. When looking at (3), (4) we recall that evaluations of f_0 at a pair (x, ξ) may be expensive. This leads us to one of the main numerical challenges in stochastic programming: a good approximation of P requires a large N, but solving (6) in a reasonable running time prefers a small(er) N. Hence, when solving applied stochastic programming models, one might start with a large(r) N, but running time requirements might force a (much) smaller number n of scenarios. Should ξ^{n+1}, . . . , ξ^N just be thrown away? In this paper, we argue that one should take advantage of the information contained in {ξ^1, . . . , ξ^N} and look for a better approximation of P based on n scenarios out of {ξ^1, . . . , ξ^N} by scenario reduction. The concept of scenario reduction and recent results in this area are reviewed in Section 3. In this paper, we restrict our attention to two-stage stochastic programs. We take a look at the underlying theory, review available algorithms for linear and mixed-integer two-stage stochastic programs, and point out which heuristic algorithms are important in this context and where we see possibilities for improvements. Multi-stage models and the tree structure of scenarios (due to the constraint (5)) require a (slightly) different theoretical basis and will not be considered in this paper, although scenario reduction techniques may be very helpful in this respect (see [8, 9]). Instead we refer to the literature, e.g., [2, 3, 13, 14, 15, 16, 18, 19, 25], to our recent work [8, 9], and to excellent sources of material on theory and applications of stochastic programs in [24, 26].
2 Stability and Distances of Probability Distributions
Solving stochastic programs computationally requires replacing the original probability distribution P by a probability measure Q having a finite number of scenarios and associated probabilities. Of course, Q has to be determined in such a way that infima and solutions of (1) do not change much if P is replaced by Q. In the following, we use the notations

  v(P) := inf { ∫_{R^s} f_0(x, ξ) P(dξ) : x ∈ X },
  S(P) := { x ∈ X : ∫_{R^s} f_0(x, ξ) P(dξ) = v(P) }
for the infimum and solution set, respectively, of (1) with respect to P . If the set X is compact, the following estimates are valid
  |v(P) − v(Q)| ≤ sup_{x∈X} | ∫_{R^s} f_0(x, ξ) (P − Q)(dξ) |,   (7)

  sup_{y∈S(Q)} d(y, S(P)) ≤ ψ_P^{−1}( sup_{x∈X} | ∫_{R^s} f_0(x, ξ) (P − Q)(dξ) | ).   (8)

In (8), d(y, S(P)) denotes the distance of y ∈ S(Q) to S(P). The mapping ψ_P^{−1} denotes the inverse of the growth function ψ_P of the objective in a neighborhood of the solution set S(P), i.e.,

  ψ_P(t) := inf { ∫_{R^s} f_0(x, ξ) P(dξ) − v(P) : d(x, S(P)) ≥ t, x ∈ X }.
The growth function ψ_P is monotonically increasing on [0, +∞) and it holds ψ_P(0) = 0. The estimates (7) and (8) elucidate that the distance

  d(P, Q) := sup_{x∈X} | ∫_{R^s} f_0(x, ξ) P(dξ) − ∫_{R^s} f_0(x, ξ) Q(dξ) |   (9)
of the two probability distributions P and Q becomes important. But this distance is difficult to evaluate numerically, since the function f_0 is often very involved (e.g., an infimum function of an optimization model as in the example in Section 1). However, in several important model instances of functions f_0 it is possible to derive estimates of d(P, Q) by other distances of probability distributions that are easier to evaluate. We mention the following possibilities:
(1) The first one is the classical Koksma-Hlawka inequality (see [17, Section 2.2] and [5]) if the functions f_0(x, ·) are of bounded variation in the sense of Hardy and Krause (uniformly with respect to x ∈ X) on the support [0, 1]^s of both probability distributions P and Q. The relevant distance of probability measures is the so-called ∗-discrepancy and is defined in (14).
(2) A second idea is to determine a set F of functions from a closed subset Ξ of R^s containing the support of both probability distributions P and Q to R such that C f_0(x, ·) ∈ F for some constant C > 0 not depending on x. Then it holds

  d(P, Q) ≤ (1/C) d_F(P, Q),   where   d_F(P, Q) := sup_{f∈F} | ∫_Ξ f(ξ) P(dξ) − ∫_Ξ f(ξ) Q(dξ) |.

Of course, the set F should be selected such that the distance d_F(P, Q) is easier to evaluate than d(P, Q). For two-stage stochastic programs that are (i) linear or (ii) mixed-integer linear, the following sets F of functions are relevant (cf. [21, 23, 22]):

(i) F_r(Ξ) = { f : Ξ → R : f(ξ) − f(ξ̃) ≤ c_r(ξ, ξ̃), ∀ξ, ξ̃ ∈ Ξ },
(ii) F_{r,B}(Ξ) = { f·1_B : f ∈ F_r(Ξ), |f(ξ)| ≤ max{1, |ξ|^r} ∀ξ ∈ Ξ, B ∈ B },

where r ∈ N, B denotes a set of polyhedral subsets of Ξ, |·| is a norm on R^s, and 1_B is the characteristic function of the set B, i.e., 1_B(ξ) = 1 for ξ ∈ B and 1_B(ξ) = 0 for ξ ∈ Ξ \ B. Furthermore, the function c_r is defined by

  c_r(ξ, ξ̃) := max{1, |ξ|^{r−1}, |ξ̃|^{r−1}} |ξ − ξ̃|   (ξ, ξ̃ ∈ Ξ).

We note that the choice of r and of B depends on the structure of the stochastic program. While the second-stage infima are locally Lipschitz continuous as functions of ξ in case of linear two-stage models and belong to F_r(Ξ) for some r ∈ N, they are discontinuous on boundaries of certain polyhedral sets for mixed-integer linear models. Hence, the set F_{r,B}(Ξ) is relevant in the latter case. An important special case of the class B is the set B_rect of all rectangular sets in R^s, i.e.,

  B_rect = { I_1 × · · · × I_s : ∅ ≠ I_j is a closed interval in R, j = 1, . . . , s },   (10)

which is relevant for the stability of pure integer second-stage models. In case (i), the so-called Fortet-Mourier metrics of order r,
  ζ_r(P, Q) = sup_{f∈F_r(Ξ)} | ∫_Ξ f(ξ) P(dξ) − ∫_Ξ f(ξ) Q(dξ) |,   (11)

appear as special instances of d_F, and in case (ii) the metrics

  ζ_{r,B}(P, Q) = sup_{f∈F_r(Ξ), B∈B} | ∫_B f(ξ) P(dξ) − ∫_B f(ξ) Q(dξ) |.   (12)
Since a sequence of probability measures on R^s converges with respect to ζ_{r,B} if and only if it converges with respect to ζ_r and with respect to the B-discrepancy

  α_B(P, Q) := sup_{B∈B} |P(B) − Q(B)|,

respectively (cf. [12]), one may consider the 'mixed' distance

  d_λ(P, Q) = λ α_B(P, Q) + (1 − λ) ζ_r(P, Q)   (13)

for some λ ∈ (0, 1] instead of the more complicated metric ζ_{r,B}. Two specific discrepancies are of interest in this paper. The first one is the rectangular discrepancy α_{B_rect}, which is needed in Section 3.2. The second one is the ∗-discrepancy

  α_∗(P, Q) = sup_{x∈[0,1]^s} |P([0, x)) − Q([0, x))|,   (14)

where P and Q are probability measures on [0, 1]^s and [0, x) = ×_{i=1}^{s} [0, x_i). As is known from [11], discrepancies are (much) more difficult to handle in scenario reduction compared to Fortet-Mourier metrics (see also Section 3).
If both P and Q are discrete probability distributions (with finite support), the Fortet-Mourier distance ζ_r(P, Q) may be computed as the optimal value of certain linear programs (see also [7]). We show in Section 3.2 that the distance d_λ(P, Q) may be computed as well by linear programming (see also [10, 12]). If the set Ξ is compact, the Fortet-Mourier metric ζ_r admits a dual representation as a Monge-Kantorovich transportation functional. In particular, it holds for all probability measures on Ξ with finite r-th moment (see [20, Section 4.1]):

  ζ_r(P, Q) = inf { ∫_{Ξ×Ξ} ĉ_r(ξ, ξ̃) η(dξ, dξ̃) : η ∈ M(P, Q) },   (15)

where M(P, Q) denotes the set of all probability measures η on Ξ × Ξ whose marginal distributions on Ξ are just P and Q, respectively. Furthermore, the function ĉ_r is the reduced cost function associated with the cost c_r and is defined by

  ĉ_r(ξ, ξ̃) := inf { Σ_{j=1}^{n+1} c_r(z_{j−1}, z_j) : z_0 = ξ, z_{n+1} = ξ̃, z_j ∈ Ξ, n ∈ N }.   (16)

The function ĉ_r is a metric on Ξ and it holds ĉ_r ≤ c_r (cf. [20, Chapter 4.1.3]).
3 Optimal Scenario Reduction
Let P be a probability distribution on R^s with finite support consisting of N scenarios ξ^i and their probabilities p_i, i ∈ I := {1, . . . , N}. The basic idea of optimal scenario reduction consists in determining a probability distribution Q_n which is the best approximation of P with respect to a given distance d of probability measures and whose support consists of a subset of {ξ^1, . . . , ξ^N} with n < N elements. This means

  d(P, Q_n) = inf { d(P, Q) : Q(R^s) = 1, supp(Q) ⊂ supp(P), |supp(Q)| = n }.   (17)

An equivalent formulation of (17) may be obtained as follows: Let Q_J denote a probability measure on R^s with supp(Q_J) = {ξ^i : i ∈ {1, . . . , N} \ J} for some index set J ⊂ {1, . . . , N}, and let q_i, i ∈ {1, . . . , N} \ J, be the probability of the scenario indexed by i. Then the minimization problem

  min { d(P, Q_J) : J ⊂ I, |J| = N − n, q_i ≥ 0, i ∈ I \ J, Σ_{i∈I\J} q_i = 1 }   (18)

determines some index set J* and q_i* ∈ [0, 1] such that the probability measure with scenarios ξ^i and probabilities q_i* for i ∈ {1, . . . , N} \ J* solves (17). The second formulation (18) of the optimal scenario reduction problem has the advantage of providing a decomposition into an inner and an outer minimization problem, namely,

  min_J { inf_q { d(P, Q_J) : q_i ≥ 0, i ∈ I \ J, Σ_{i∈I\J} q_i = 1 } : J ⊂ I, |J| = N − n }.   (19)

As suggested in [4], the distance d has to be selected such that the stochastic program (1) behaves stably with respect to d (cf. Section 2).
3.1 Two-Stage Stochastic Linear Programs
Here, the distance d is just the Fortet-Mourier metric ζ_r for some r ∈ N which depends on the structure of the underlying stochastic programming model. For a discussion of the choice of r we refer to [21, 23]. Since the distance ζ_r(P, Q_J) has a dual representation as a mass transportation problem for any index set J (see (15)), the inner minimization problem in (19) can be solved explicitly. Namely, it holds

  D_J := min { ζ_r(P, Q_J) : q_i ≥ 0, i ∉ J, Σ_{i∉J} q_i = 1 } = Σ_{j∈J} p_j min_{i∉J} ĉ_r(ξ^i, ξ^j),   (20)

where the reduced costs ĉ_r have the simplified representation

  ĉ_r(ξ^i, ξ^j) := min { Σ_{k=1}^{n−1} c_r(ξ^{l_k}, ξ^{l_{k+1}}) : n ∈ N, l_k ∈ I, l_1 = i, l_n = j }   (21)

as a shortest path problem, compared with the general form (16). The optimal probabilities q_i*, i ∉ J, are given by the redistribution rule

  q_i* = p_i + Σ_{j∈J, i(j)=i} p_j,   where   i(j) ∈ arg min_{i∉J} ĉ_r(ξ^i, ξ^j), j ∈ J.   (22)

The redistribution rule consists in adding the probability of a deleted scenario j ∈ J to the probability of a scenario that is nearest to ξ^j with respect to the distance ĉ_r on R^s. The outer minimization problem is of the form

  min { D_J = Σ_{j∈J} p_j min_{i∉J} ĉ_r(ξ^i, ξ^j) : J ⊂ I, |J| = N − n }   (23)
and represents a combinatorial optimization problem of n-median type. It is NP-hard, which suggests the use of heuristic or approximation algorithms. We also refer to [1], where a branch-and-cut-and-price algorithm is developed and tested on large-scale instances. In our earlier work [6, 7] we proposed two simple heuristic algorithms: the forward (selection) and the backward (reduction) heuristic. To give a short description of the two algorithms, let

  c_{ij} := ĉ_r(ξ^i, ξ^j)   (i, j = 1, . . . , N).
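For a finite scenario set, the quantities in (20)-(22) can be evaluated directly. The Python sketch below (one-dimensional toy data, invented for the illustration) builds the cost matrix for c_r, turns it into the reduced costs (21) by a Floyd-Warshall shortest-path pass, and then computes D_J and the redistributed probabilities q_i* for a fixed index set J:

```python
# toy 1-D scenarios xi^i with probabilities p_i; J is the index set to delete
xi = [0.0, 1.0, 2.0, 5.0, 6.0]
p = [0.2, 0.2, 0.2, 0.2, 0.2]
N = len(xi)
r = 2

def c_r(a, b):
    # cost function c_r from Section 2 (s = 1, absolute value as norm)
    return max(1.0, abs(a) ** (r - 1), abs(b) ** (r - 1)) * abs(a - b)

# reduced costs (21): shortest paths through the scenario set (Floyd-Warshall)
c = [[c_r(xi[i], xi[j]) for j in range(N)] for i in range(N)]
for k in range(N):
    for i in range(N):
        for j in range(N):
            c[i][j] = min(c[i][j], c[i][k] + c[k][j])

J = {3, 4}
keep = [i for i in range(N) if i not in J]

# optimal value of the inner problem, formula (20)
D_J = sum(p[j] * min(c[i][j] for i in keep) for j in J)

# redistribution rule (22): each deleted scenario's mass goes to its nearest survivor
q = {i: p[i] for i in keep}
for j in J:
    q[min(keep, key=lambda i: c[i][j])] += p[j]

print(round(D_J, 6), {i: round(w, 6) for i, w in q.items()})
```

On this data both deleted scenarios are closest, in reduced cost, to the kept scenario with index 2, which therefore collects their probability mass; note also that the shortest-path pass genuinely shrinks some costs (e.g., the path 0 → 1 → 2 is cheaper than the direct cost c_r(ξ^1, ξ^3)).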
The basic idea of the forward algorithm originates from the simple structure of (23) for the special case n = 1. It is of the form

  min_{u∈{1,...,N}} Σ_{j=1, j≠u}^{N} p_j c_{uj}.

If the minimum is attained at u*, the index set J = {1, . . . , N} \ {u*} solves (23). The scenario ξ^{u*} is taken as the first element of supp(Q). Then the separable
structure of D_J is exploited to determine the second element of supp(Q) while the first element is fixed. The process is continued until n elements of supp(Q) are selected.

Forward algorithm for scenario reduction
Step 0: J^[0] := {1, . . . , N}.
Step k: u_k ∈ arg min_{u∈J^[k−1]} Σ_{j∈J^[k−1]\{u}} p_j min_{i∉J^[k−1]\{u}} c_{ij},
        J^[k] := J^[k−1] \ {u_k}.
Step n+1: Redistribution with J := J^[n] via (22).
Fig. 1. Illustration of selecting the first, second and third scenario out of N = 5
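The forward heuristic admits a compact implementation. In the Python sketch below the scenario data are invented and the plain absolute difference stands in for the reduced cost ĉ_r; the function returns the selected index set and the redistributed probabilities from rule (22):

```python
# sketch of the forward selection heuristic of [6, 7] (toy data, plain
# absolute-value costs standing in for the reduced costs c^_r)
def forward_selection(cost, p, n):
    N = len(p)
    kept = []                                   # selected scenarios, complement of J
    for _ in range(n):
        def score(u):
            sel = kept + [u]
            # weighted distance of the still-deleted scenarios to the selection
            return sum(p[j] * min(cost[i][j] for i in sel)
                       for j in range(N) if j not in sel)
        u_k = min((u for u in range(N) if u not in kept), key=score)
        kept.append(u_k)
    # redistribution rule (22): deleted mass moves to the nearest kept scenario
    q = {i: p[i] for i in kept}
    for j in range(N):
        if j not in kept:
            q[min(kept, key=lambda i: cost[i][j])] += p[j]
    return kept, q

xi = [0.0, 1.0, 3.0, 7.0, 9.0]
p = [0.1, 0.3, 0.2, 0.25, 0.15]
cost = [[abs(a - b) for b in xi] for a in xi]
kept, q = forward_selection(cost, p, n=2)
print(sorted(kept), {i: round(w, 3) for i, w in q.items()})
```

Each of the n steps scans all remaining candidates, which gives the polynomial running time mentioned below.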
The idea of the backward algorithm is based on the second special case of (23) for n = N − 1. It is of the form

  min_{l∈{1,...,N}} p_l min_{i≠l} c_{il}.

If the minimum is attained at l*, the index set J = {l*} solves (23) in case n = N − 1. After fixing the remaining index set I \ {l*}, a second scenario is reduced, etc. The process is continued until N − n scenarios are reduced.

Backward algorithm for scenario reduction
Step 0: J^[0] := ∅.
Step k: l_k ∈ arg min_{l∉J^[k−1]} Σ_{j∈J^[k−1]∪{l}} p_j min_{i∉J^[k−1]∪{l}} c_{ij},
        J^[k] := J^[k−1] ∪ {l_k}.
Step N−n+1: Redistribution with J := J^[N−n] via (22).

It is shown in [6] that both heuristics are polynomial-time algorithms and that the forward heuristic is recommendable at least if n ≤ N/4.

3.2 Two-Stage Stochastic Mixed-Integer Programs
The relevant probability metric is now ζr,B , where r ∈ N and B is a set of polyhedral subsets of a (polyhedral) set Ξ that contains the support of P . It was
Fig. 2. Illustration of the original scenario set with N = 5 and of reducing the first and the second scenario
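The backward heuristic can be sketched analogously (again with invented toy data and absolute differences standing in for ĉ_r); note that on the same data the two heuristics need not keep the same scenarios:

```python
# sketch of the backward reduction heuristic of [6, 7] (toy data, plain
# absolute-value costs standing in for the reduced costs c^_r)
def backward_reduction(cost, p, n):
    N = len(p)
    J = set()                                   # indices of deleted scenarios
    while len(J) < N - n:
        def score(l):
            Jl = J | {l}
            # cost of deleting J together with candidate l
            return sum(p[j] * min(cost[i][j] for i in range(N) if i not in Jl)
                       for j in Jl)
        l_k = min((l for l in range(N) if l not in J), key=score)
        J.add(l_k)
    kept = [i for i in range(N) if i not in J]
    # redistribution rule (22)
    q = {i: p[i] for i in kept}
    for j in J:
        q[min(kept, key=lambda i: cost[i][j])] += p[j]
    return kept, q

xi = [0.0, 1.0, 3.0, 7.0, 9.0]
p = [0.1, 0.3, 0.2, 0.25, 0.15]
cost = [[abs(a - b) for b in xi] for a in xi]
kept, q = backward_reduction(cost, p, n=2)
print(kept, {i: round(w, 3) for i, w in q.items()})
```

On this data the backward pass first removes the cheapest single scenarios and ends up keeping indices 1 and 3, whereas forward selection on the same data keeps 2 and 3 — both are heuristics for the NP-hard problem (23), not exact solvers.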
Fig. 3. Left: N = 729 equally distributed scenarios in form of a regular ternary tree. Right: Reduced scenario set with n = 20 and line width proportional to scenario probabilities obtained by the forward heuristic with r = 2.
mentioned in Section 2 that both r and B depend on the structure of the particular stochastic program. Since the complexity of scenario reduction algorithms increases if the sets in B get more involved, we follow here [12] and consider only the set B_rect (see (10)) and the distance

  d_λ(P, Q) := λ α_{B_rect}(P, Q) + (1 − λ) ζ_r(P, Q)   (24)

for some λ ∈ (0, 1]. As in Section 3.1, we are interested in computing

  D_J := min { d_λ(P, Q_J) : q_i ≥ 0, i ∉ J, Σ_{i∉J} q_i = 1 }.   (25)
In the following, we show that D_J can be computed as the optimal value of a linear program (but a closed formula for D_J as in Section 3.1 is not available in general). To this end, we assume without loss of generality that J = {n + 1, . . . , N}, i.e., supp(Q_J) = {ξ^1, . . . , ξ^n} for some 1 ≤ n < N, consider the system of index sets

  I_{B_rect} := { I(B) := {i ∈ {1, . . . , N} : ξ^i ∈ B} : B ∈ B_rect },

and obtain the following representation of the rectangular discrepancy:
  α_{B_rect}(P, Q_J) = sup_{B∈B_rect} |P(B) − Q_J(B)| = max_{I∈I_{B_rect}} | Σ_{i∈I} p_i − Σ_{j∈I∩{1,...,n}} q_j |   (26)

  = min { t_α : −Σ_{j∈I∩{1,...,n}} q_j ≤ t_α − Σ_{i∈I} p_i,  Σ_{j∈I∩{1,...,n}} q_j ≤ t_α + Σ_{i∈I} p_i,  I ∈ I_{B_rect} }.   (27)

Since the set I_{B_rect} may be too large to solve the linear program (27) numerically, we consider the system of reduced index sets I*_{B_rect} := { I(B) ∩ {1, . . . , n} : B ∈ B_rect } and the quantities

  γ̄^{I*} := max { Σ_{i∈I} p_i : I ∈ I_{B_rect}, I ∩ {1, . . . , n} = I* },
  γ_{I*} := min { Σ_{i∈I} p_i : I ∈ I_{B_rect}, I ∩ {1, . . . , n} = I* }

for every I* ∈ I*_{B_rect}. Since any such index set I* corresponds to some left-hand side of the inequalities in (27), γ̄^{I*} and γ_{I*} correspond to the smallest right-hand sides in (27). Hence, the rectangular discrepancy may be rewritten as

  α_{B_rect}(P, Q_J) = min { t_α : −Σ_{j∈I*} q_j ≤ t_α − γ̄^{I*},  Σ_{j∈I*} q_j ≤ t_α + γ_{I*},  I* ∈ I*_{B_rect} }.   (28)

Since the number of elements of I*_{B_rect} is at most 2^n (compared to 2^N for I_{B_rect}), passing from (27) to (28) indeed drastically reduces the maximum number of inequalities and may make the linear program (28) numerically tractable.
Due to (15), the Fortet-Mourier distance ζ_r(P, Q_J) allows the representation as the linear program

  ζ_r(P, Q_J) = inf { Σ_{i=1}^{N} Σ_{j=1}^{n} η_{ij} ĉ_r(ξ^i, ξ^j) : η_{ij} ≥ 0, Σ_{i=1}^{N} η_{ij} = q_j, j = 1, . . . , n, Σ_{j=1}^{n} η_{ij} = p_i, i = 1, . . . , N },

where the reduced cost function ĉ_r is given by (21). Hence, extending the representation (28) of α_{B_rect}, we obtain the following linear program for determining D_J and the probabilities q_j, j = 1, . . . , n, of the discrete reduced distribution Q_J:

  D_J = min { λ t_α + (1 − λ) t_ζ :
        t_α, t_ζ ≥ 0,  q_j ≥ 0,  Σ_{j=1}^{n} q_j = 1,
        η_{ij} ≥ 0,  i = 1, . . . , N,  j = 1, . . . , n,
        t_ζ ≥ Σ_{i=1}^{N} Σ_{j=1}^{n} ĉ_r(ξ^i, ξ^j) η_{ij},
        Σ_{j=1}^{n} η_{ij} = p_i,  i = 1, . . . , N,
        Σ_{i=1}^{N} η_{ij} = q_j,  j = 1, . . . , n,
        −Σ_{j∈I*} q_j ≤ t_α − γ̄^{I*},  I* ∈ I*_{B_rect},
        Σ_{j∈I*} q_j ≤ t_α + γ_{I*},  I* ∈ I*_{B_rect} }.   (29)
Scenario Reduction Techniques in Stochastic Programming
Fig. 4. Non-supporting rectangle (left) and supporting rectangle (right). The dots represent the remaining scenarios ξ 1 , . . . , ξ n for s = 2.
While the linear program (29) can be solved efficiently by available software, determining the index set I*_{B_rect} and the coefficients γ̄^{I*}, γ_{I*} is more intricate. It is shown in [10, Section 3] that the parameters I*_{B_rect} and γ̄^{I*}, γ_{I*} can be determined by studying the set R of supporting rectangles. A rectangle B in B_rect is called supporting if each of its facets contains an element of {ξ^1, …, ξ^n} in its relative interior (see also Fig. 4). Based on R, the following representations are valid according to [10, Prop. 1 and 2]:

I*_{B_rect} = ∪_{B∈R} { I* ⊆ {1,…,n} : ∪_{j∈I*} {ξ^j} = {ξ^1,…,ξ^n} ∩ int B },
γ̄^{I*} = max { P(int B) : B ∈ R, ∪_{j∈I*} {ξ^j} = {ξ^1,…,ξ^n} ∩ int B },
γ_{I*} = Σ_{i∈I} p_i, where I := { i ∈ {1,…,N} : min_{j∈I*} ξ^j_l ≤ ξ^i_l ≤ max_{j∈I*} ξ^j_l, l = 1,…,s },

for every I* ∈ I*_{B_rect}. Here, int B denotes the interior of the set B. An algorithm is developed in [12] that recursively constructs l-dimensional supporting rectangles for l = 1,…,s. Computational experiments show that its running time grows linearly with N, but depends on n and s via the expression s \binom{n+1}{2}^s. Hence, while N may be large, only moderately sized values of n and s seem to be realistic.

Since an algorithm for computing D_J is now available, we finally look at determining a scenario index set J ⊂ I = {1,…,N} with cardinality |J| = N − n such that D_J is minimal, i.e., at solving the combinatorial optimization problem

min { D_J : J ⊂ I, |J| = N − n },   (30)

which is known as the n-median problem and is NP-hard. One possibility is to reformulate (30) as a mixed-integer linear program and to solve it by standard software. Since, however, approximate solutions of (30) are sufficient, heuristic algorithms like forward selection are of interest, where u_k is determined in the kth step such that it solves the minimization problem
min { D_{J^{[k−1]} \ {u}} : u ∈ J^{[k−1]} },

where J^{[0]} = {1,…,N}, J^{[k]} := J^{[k−1]} \ {u_k} (k = 1,…,n), and J^{[n]} := {1,…,N} \ {u_1,…,u_n} serves as the approximate solution to (30). Recalling that the complexity of evaluating D_{J^{[k−1]} \ {u}} for some u ∈ J^{[k−1]} is proportional to
Fig. 5. N = 1000 samples ξ^i from the uniform distribution on [0,1]^2 and n = 25 points ξ^{u_k}, k = 1,…,n, obtained via the first 25 elements z_k, k = 1,…,n, of the Halton sequence (in the bases 2 and 3) (see [17, p. 29]). The probabilities q_k of ξ^{u_k}, k = 1,…,n, are computed for the distance d_λ with λ = 1 (gray balls) and λ = 0.9 (black circles) by solving (29). The diameters of the circles are proportional to the probabilities q_k, k = 1,…,25.
s \binom{k+1}{2}^s shows that even the forward selection algorithm is expensive. Hence, heuristics for solving (30) that require only a small number of D_J evaluations become important. For example, if P is a probability distribution on [0,1]^s with independent marginal distributions P_j, j = 1,…,s, such a heuristic can be based on Quasi-Monte Carlo methods (cf. [17]). The latter provide sequences of equidistributed points in [0,1]^s that approximate the uniform distribution on the unit cube [0,1]^s. Now, let n Quasi-Monte Carlo points z^k = (z^k_1, …, z^k_s) ∈ [0,1]^s, k = 1,…,n, be given. Then we determine

y^k := ( F_1^{−1}(z^k_1), …, F_s^{−1}(z^k_s) )   (k = 1,…,n),

where F_j is the (one-dimensional) distribution function of P_j, i.e.,
F_j(z) = P_j((−∞, z]) = Σ_{i=1, ξ^i_j ≤ z}^{N} p_i   (z ∈ R)
and F_j^{−1}(t) := inf { z ∈ R : F_j(z) ≥ t } (t ∈ [0,1]) its inverse (j = 1,…,s). Finally, we determine u_k as a solution of

min_{u ∈ J^{[k−1]}} | ξ^u − y^k |

and set again J^{[k]} := J^{[k−1]} \ {u_k} for k = 1,…,n, where J^{[0]} = {1,…,N}. Figure 5 illustrates the results of such a Quasi-Monte Carlo based heuristic, and Figure 6 shows the discrepancy α_{B_rect} for different n and the running times of the Quasi-Monte Carlo based heuristic.
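A minimal sketch of this Quasi-Monte Carlo based heuristic, assuming Halton points as the z^k, the generalized inverse of the empirical marginal distribution functions as F_j^{−1}, and Euclidean nearest-remaining-scenario selection (all function names here are illustrative, not from the cited software):

```python
import math

def halton(k, base):
    # k-th element (k >= 1) of the van der Corput sequence in the given base
    f, r = 1.0, 0.0
    while k > 0:
        f /= base
        r += f * (k % base)
        k //= base
    return r

def inv_cdf(t, values, probs):
    # generalized inverse F^{-1}(t) = inf{z : F(z) >= t} of a discrete marginal
    order = sorted(range(len(values)), key=lambda i: values[i])
    acc = 0.0
    for i in order:
        acc += probs[i]
        if acc >= t - 1e-12:
            return values[i]
    return values[order[-1]]

def qmc_select(xi, p, n):
    # xi: list of N scenarios (each a list of s coordinates), p: probabilities
    s = len(xi[0])
    bases = [2, 3, 5, 7, 11, 13][:s]          # first s primes for the Halton sequence
    remaining = set(range(len(xi)))
    selected = []
    for k in range(1, n + 1):
        z = [halton(k, b) for b in bases]
        y = [inv_cdf(z[l], [x[l] for x in xi], p) for l in range(s)]
        u = min(remaining, key=lambda i: math.dist(xi[i], y))   # nearest remaining scenario
        selected.append(u)
        remaining.discard(u)
    return selected

sel = qmc_select([[i / 10.0, ((i * 3) % 10) / 10.0] for i in range(10)], [0.1] * 10, 3)
```

The probabilities q_j of the selected scenarios would then be obtained by solving (29) for J = {1,…,N} \ sel.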
Fig. 6. Distance α_{B_rect} between P (with N = 1000) and equidistributed QMC points (dashed), and QMC points whose probabilities are adjusted according to (29) (bold), together with the running times of the QMC-based heuristic (in seconds)
Acknowledgements. The work of the author is supported by the DFG Research Center Matheon “Mathematics for key technologies” in Berlin.
References

[1] Avella, P., Sassano, A., Vasil'ev, I.: Computational study of large scale p-median problems. Math. Program. 109, 89–114 (2007)
[2] Casey, M., Sen, S.: The scenario generation algorithm for multistage stochastic linear programming. Math. Oper. Res. 30, 615–631 (2005)
[3] Dupačová, J., Consigli, G., Wallace, S.W.: Scenarios for multistage stochastic programs. Ann. Oper. Res. 100, 25–53 (2000)
[4] Dupačová, J., Gröwe-Kuska, N., Römisch, W.: Scenario reduction in stochastic programming: An approach using probability metrics. Math. Program. 95, 493–511 (2003)
[5] Götz, M.: Discrepancy and the error in integration. Monatsh. Math. 136, 99–121 (2002)
[6] Heitsch, H., Römisch, W.: Scenario reduction algorithms in stochastic programming. Comp. Optim. Appl. 24, 187–206 (2003)
[7] Heitsch, H., Römisch, W.: A note on scenario reduction for two-stage stochastic programs. Oper. Res. Lett. 35, 731–736 (2007)
[8] Heitsch, H., Römisch, W.: Scenario tree modeling for multistage stochastic programs. Math. Program. 118, 371–406 (2009)
[9] Heitsch, H., Römisch, W.: Scenario tree reduction for multistage stochastic programs. Comp. Manag. Science 6, 117–133 (2009)
[10] Henrion, R., Küchler, C., Römisch, W.: Discrepancy distances and scenario reduction in two-stage stochastic mixed-integer programming. J. Ind. Manag. Optim. 4, 363–384 (2008)
[11] Henrion, R., Küchler, C., Römisch, W.: Scenario reduction in stochastic programming with respect to discrepancy distances. Comp. Optim. Appl. 43, 67–93 (2009)
[12] Henrion, R., Küchler, C., Römisch, W.: A scenario reduction heuristic for two-stage stochastic mixed-integer programming (in preparation)
[13] Hochreiter, R., Pflug, G.C.: Financial scenario generation for stochastic multistage decision processes as facility location problems. Ann. Oper. Res. 152, 257–272 (2007)
[14] Høyland, K., Kaut, M., Wallace, S.W.: A heuristic for moment-matching scenario generation. Comp. Optim. Appl. 24, 169–185 (2003)
[15] Kuhn, D.: Generalized bounds for convex multistage stochastic programs. LNEMS, vol. 548. Springer, Berlin (2005)
[16] Kuhn, D.: Aggregation and discretization in multistage stochastic programming. Math. Program. 113, 61–94 (2008)
[17] Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 63. SIAM, Philadelphia (1992)
[18] Pagès, G., Pham, H., Printems, J.: Optimal quantization methods and applications to numerical problems in finance. In: Rachev, S.T. (ed.) Handbook of Computational and Numerical Methods in Finance, pp. 253–297. Birkhäuser, Boston (2004)
[19] Pennanen, T.: Epi-convergent discretizations of multistage stochastic programs via integration quadratures. Math. Program. 116, 461–479 (2009)
[20] Rachev, S.T., Rüschendorf, L.: Mass Transportation Problems, vol. I, II. Springer, Berlin (1998)
[21] Römisch, W.: Stability of stochastic programming problems. In: Ruszczyński, A., Shapiro, A. (eds.) Stochastic Programming. Handbooks in Operations Research and Management Science, vol. 10, pp. 483–554. Elsevier, Amsterdam (2003)
[22] Römisch, W., Vigerske, S.: Quantitative stability of fully random mixed-integer two-stage stochastic programs. Optim. Lett. 2, 377–388 (2008)
[23] Römisch, W., Wets, R.J.-B.: Stability of ε-approximate solutions to convex stochastic programs. SIAM J. Optim. 18, 961–979 (2007)
[24] Ruszczyński, A., Shapiro, A. (eds.): Stochastic Programming. Handbooks in Operations Research and Management Science, vol. 10. Elsevier, Amsterdam (2003)
[25] Shapiro, A.: Inference of statistical bounds for multistage stochastic programming problems. Math. Meth. Oper. Res. 58, 57–68 (2003)
[26] Wallace, S.W., Ziemba, W.T. (eds.): Applications of Stochastic Programming. MPS-SIAM Series in Optimization (2005)
Statistical Learning of Probabilistic BDDs

Taisuke Sato

Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology
[email protected]
Abstract. BDDs (binary decision diagrams) are a graphical data structure that compactly represents Boolean functions and has been used extensively in VLSI CAD. In this talk, I will describe how BDDs are applied to machine learning, or more concretely, how they are used in probability computation and statistical parameter learning for probabilistic modeling. The basic idea is that of PPC (propositionalized probabilistic computation), which considers "X = x", a random variable X taking a value x, as a probabilistic proposition. A complex (discrete) probabilistic model such as a Bayesian network is representable by way of PPC as a probabilistic Boolean formula, and thus by a probabilistic BDD. We developed an efficient sum-product probability computation algorithm on BDDs and also derived a parameter learning algorithm, called the BDD-EM algorithm, for maximum likelihood estimation in statistics. We applied the BDD-EM algorithm to the problem of finding a most probable hypothesis about a metabolic pathway in systems biology.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, p. 15, 2009. c Springer-Verlag Berlin Heidelberg 2009
Learning Volatility of Discrete Time Series Using Prediction with Expert Advice

Vladimir V. V'yugin

Institute for Information Transmission Problems, Russian Academy of Sciences, Bol'shoi Karetnyi per. 19, Moscow GSP-4, 127994, Russia
[email protected]
Abstract. In this paper the method of prediction with expert advice is applied to learning the volatility of discrete time series. We construct arbitrage strategies (or experts) which gain when the "micro" and "macro" volatilities of a time series differ. For merging different expert strategies into a strategy of the learner, we use a modification of the Kalai and Vempala algorithm of following the perturbed leader, where the weights depend on the current gains of the experts. We consider the case when the experts' one-step gains can be unbounded. A new notion of the volume of a game, v_t, is introduced. We show that our algorithm has optimal performance in the case when the one-step increments Δv_t = v_t − v_{t−1} of the volume satisfy Δv_t = o(v_t) as t → ∞.
1 Introduction
In this paper we construct arbitrage strategies which gain when the "micro" and "macro" volatilities of a time series differ. For merging different expert strategies into one strategy of the learner, we use methods of the theory of prediction with expert advice. Using results of Cheredito [2] in mathematical finance, Vovk [14] considered two strategies which gain when the prices of a stock follow a fractional Brownian motion: the first strategy shows a large return when the volatility of the time series is high, the second one shows a large return in the opposite case. The main peculiarity of these arbitrage strategies is that their one-step gains and losses can be unrestrictedly large. Also, there is no appropriate type of loss function for these strategies, and we are forced to consider general gains and losses. We construct a learning algorithm merging these strategies online into one strategy of the learner, protected from these unbounded fluctuations as much as possible.

Prediction with expert advice, as considered in this paper, proceeds as follows. A "pool" of expert strategies (or simply, experts) i = 1,…,N is given. We are asked to perform sequential actions at times t = 1, 2,…,T. At each time step t, the experts i = 1,…,N receive the results of their actions in the form of gains or losses s^i_t. In what follows we refer to the s^i_t as gains. At the beginning of step t, the learner, observing the cumulative gains s^i_{1:t−1} = s^i_1 + … + s^i_{t−1} of all experts i = 1,…,N, assigns non-negative weights w^i_t (summing
to 1) to each expert i and suffers a gain equal to the weighted sum of the experts' gains, s̃_t = Σ_{i=1}^N w^i_t s^i_t. The cumulative gain of the learner on the first T steps equals s̃_{1:T} = Σ_{t=1}^T s̃_t.

This can be interpreted in probabilistic terms as follows. At each time step t, the learner chooses to follow an expert i according to the internal distribution P{I_t = i} = w^i_t, i = 1,…,N; at the end of step t the learner receives the same gain s^{I_t}_t as the I_t-th expert, and the learner's cumulative gain is s_{1:t} = s_{1:t−1} + s^{I_t}_t. Let E(s_t) denote the learner's expected one-step gain under this randomization; it coincides with the weighted sum s̃_t of the experts' gains. The cumulative expected gain of our learning algorithm on the first T steps equals E(s_{1:T}) = Σ_{t=1}^T E(s_t). Evidently, s̃_{1:T} = E(s_{1:T}) for all T.

The goal of the learner's algorithm is to perform, in terms of s̃_{1:T} = E(s_{1:T}), almost as well as the best expert in hindsight in the long run. In the traditional framework of prediction with expert advice, it is supposed that the one-step gains of the experts are bounded, for example, 0 ≤ s^i_t ≤ 1 for all i and t. We allow the gains to be unbounded. We also consider the case when the notions of loss and gain functions are not used.

In this paper we use the method of following the perturbed leader. This method was discovered by Hannan [5]. Kalai and Vempala [7] rediscovered this method and published a simple proof of the main result of Hannan; they called algorithms of this type FPL (Following the Perturbed Leader). Hutter and Poland [6] presented further developments of the FPL algorithm for a countable class of experts, arbitrary weights and an adaptive learning rate. The FPL algorithm is usually considered for bounded one-step losses: 0 ≤ s^i_t ≤ 1 for all i and t. Similar results can be achieved by other aggregating strategies, like the Weighted Majority (WM) algorithm of Littlestone and Warmuth [8] or the Hedge algorithm of Freund and Schapire [4].
The FPL algorithm has the same performance as the WM-type algorithms up to a factor of √2. A major advantage of the FPL algorithm is that its analysis remains easy for an adaptive learning rate, in contrast to WM-derivatives (see the remark in [6]).

Most papers on prediction with expert advice either consider bounded losses or assume the existence of a specific loss function. The setting allowing unbounded one-step losses (or gains) does not have wide coverage in the literature; we can only refer the reader to [1], [10], where polynomial bounds on one-step losses were considered.

In this paper, we present a modification of the Kalai and Vempala [7] algorithm of following the perturbed leader (FPL) for the case of unrestrictedly large one-step expert gains s^i_t ∈ (−∞, +∞) not bounded in advance. This algorithm uses adaptive weights depending on the past cumulative gains of the experts. We introduce the new notions of the volume of a game, v_t = Σ_{j=1}^t max_i |s^i_j|, and the scaled fluctuation of the game, fluc(t) = Δv_t/v_t, where Δv_t = v_t − v_{t−1} for t ≥ 1.

In Section 2 we present a game with two zero-sum experts which gain or lose when the micro and macro volatilities of stock prices differ. These gains and
losses cannot be bounded in advance. The notion of the volume of a game is natural from this financial point of view. In Section 3 we consider a more general problem: a game with N experts, where the experts suffer unbounded gains s^i_t ∈ (−∞, +∞). We present a probabilistic learning algorithm merging the experts' decisions and prove that this algorithm performs well under very broad assumptions. In Section 4 we discuss a derandomized version of this algorithm for the case of two zero-sum experts with arbitrary one-step gains from Section 2.

We show in Theorem 1 (Section 3) that if fluc(t) ≤ γ(t) for all t, where γ(t) is a non-increasing computable real function such that 0 < γ(t) ≤ 1 for all t, then the algorithm of following the perturbed leader with adaptive weights constructed in Section 3 has the performance

E(s_{1:T}) ≥ max_{i=1,…,N} s^i_{1:T} − 2 √(6(1 + ln N)) Σ_{t=1}^T (γ(t))^{1/2} Δv_t.

If γ(t) → 0 as t → ∞, this algorithm is asymptotically consistent in a modified sense:

lim inf_{T→∞} (1/v_T) E( s_{1:T} − max_{i=1,…,N} s^i_{1:T} ) ≥ 0,   (1)

where s_{1:T} is the total gain of our algorithm on steps ≤ T. Proposition 1 of Section 3 shows that if the condition Δv_t = o(v_t) is violated, the cumulative gain of any probabilistic prediction algorithm can be much less than the gain of some expert of the pool.
2 Arbitrage Strategies

We consider a time series as a series of stock prices; each expert from the pool will then represent a method of playing on the stock market. Let K, M and T be positive integers and let the time interval [0, KT] be divided into a large number KM of subintervals. Let also S = S(t) be a function representing the stock price at time t. Define the discrete time series of stock prices

S_0 = S(0), S_1 = S(T/(KM)), S_2 = S(2T/(KM)), …, S_{KM} = S(T).   (2)

In this paper, volatility is an informal notion. We say that the sum

Σ_{i=0}^{K−1} (S_{(i+1)T} − S_{iT})^2

represents the macro volatility, and the sum

Σ_{i=0}^{KT−1} (ΔS_i)^2,
where ΔS_i = S_{i+1} − S_i, i = 1,…,KT, represents the micro volatility of the time series (2). In this paper, for simplicity, we consider the case K = 1.

Informally speaking, the first strategy will show a large return if (S_T − S_0)^2 ≫ Σ_{i=0}^{T−1} (ΔS_i)^2; the second one will show a large return when (S_T − S_0)^2 ≪ Σ_{i=0}^{T−1} (ΔS_i)^2. There is an uncertainty domain for these strategies, i.e., the case when neither ≫ nor ≪ holds.¹

We consider the game between an investor and the market. The investor can use long and short selling. At the beginning of time step t, the investor purchases C_t shares of the stock at the price S_{t−1} each. At the end of the trading period the market discloses the price S_{t+1} of the stock, and the investor incurs the income or loss s_t = C_t ΔS_t at period t. We have the following equality:
(S_T − S_0)^2 = ( Σ_{t=0}^{T−1} ΔS_t )^2 = Σ_{t=0}^{T−1} 2(S_t − S_0) ΔS_t + Σ_{t=0}^{T−1} (ΔS_t)^2.   (3)
The equality (3) leads to two strategies for the investor, represented by two experts. At the beginning of step t, Experts 1 and 2 hold the numbers of shares

C^1_t = 2C(S_t − S_0),   (4)
C^2_t = −C^1_t,   (5)

where C is an arbitrary positive constant. At step t these strategies earn the incomes s^1_t = 2C(S_t − S_0)ΔS_t and s^2_t = −s^1_t. By (3), the strategy (4) earns on the first T steps of the game the income

s^1_{1:T} = Σ_{t=1}^T s^1_t = C( (S_T − S_0)^2 − Σ_{t=0}^{T−1} (ΔS_t)^2 ).
The strategy (5) earns on the first T steps the income s^2_{1:T} = −s^1_{1:T}. The number of shares C^1_t in strategy (4), or C^2_t = −C^1_t in strategy (5), can be positive or negative. The one-step gains s^1_t and s^2_t = −s^1_t are unbounded and can be positive or negative: s^i_t ∈ (−∞, +∞).¹

¹ The idea of these strategies is based on the paper of Cheredito [2] (see also Rogers [11], Delbaen and Schachermayer [3]), who constructed arbitrage strategies for a financial market that consists of a money market account and a stock whose price follows a fractional Brownian motion with drift, or an exponential fractional Brownian motion with drift. Vovk [14] reformulated these strategies for discrete time. We use these strategies to define a mixed strategy which gains when the macro and micro volatilities of a time series differ. There is no uncertainty domain for continuous time.
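The expert incomes and the identity (3) behind them can be checked numerically; the price path below is arbitrary illustrative data:

```python
# Expert 1 earns s_t^1 = 2C (S_t - S_0) dS_t (strategy (4)); expert 2 is the
# zero-sum side.  We also verify identity (3) on a concrete path.
S = [100.0, 101.5, 99.0, 102.0, 103.5, 101.0, 104.0]   # S_0, ..., S_T with T = 6
C = 1.0
T = len(S) - 1
dS = [S[t + 1] - S[t] for t in range(T)]

s1 = [2 * C * (S[t] - S[0]) * dS[t] for t in range(T)]  # expert 1, one-step incomes
s2 = [-x for x in s1]                                   # expert 2

# identity (3): (S_T - S_0)^2 = sum 2(S_t - S_0) dS_t + sum (dS_t)^2
lhs = (S[T] - S[0]) ** 2
rhs = sum(2 * (S[t] - S[0]) * dS[t] for t in range(T)) + sum(d * d for d in dS)
assert abs(lhs - rhs) < 1e-9

# hence the cumulative income of strategy (4) with C = 1:
total1 = sum(s1)
assert abs(total1 - ((S[T] - S[0]) ** 2 - sum(d * d for d in dS))) < 1e-9
```

This makes the sign structure explicit: expert 1 profits when the squared drift (S_T − S_0)^2 dominates the micro volatility Σ(ΔS_t)^2, and expert 2 profits in the opposite case.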
We analyze this game in the follow-the-leader framework. We introduce a learner that can choose between the two strategies (4) and (5). The following simple example with zero-sum experts shows that, even in the case where the one-step gains of the experts are bounded, the learner can perform much worse than each expert: let the current gains of the two experts on steps t = 0, 1,…,6 be s^1_{0,1,2,3,4,5,6} = (1/2, −1, 1, −1, 1, −1, 1) and s^2_{0,1,2,3,4,5,6} = (0, 1, −1, 1, −1, 1, −1), with initial cumulative gains s^1_{1:0} = s^2_{1:0} = 0. The "follow the leader" algorithm always chooses the wrong prediction; its income is s_{1:6} = −5.5. We solve this problem in Section 3 using randomization of the experts' cumulative gains.
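The failure of deterministic follow-the-leader on this example is easy to reproduce (ties broken in favor of expert 1):

```python
# "Follow the leader": at each step pick the expert with the largest cumulative
# gain so far; on this adversarial pair it always picks the losing expert.
gains = {1: [0.5, -1, 1, -1, 1, -1, 1],
         2: [0.0, 1, -1, 1, -1, 1, -1]}
cum = {1: 0.0, 2: 0.0}
income = 0.0
for t in range(7):
    leader = 1 if cum[1] >= cum[2] else 2   # break ties toward expert 1
    income += gains[leader][t]
    for i in (1, 2):
        cum[i] += gains[i][t]
print(income)   # the learner's total gain s_{1:6}, i.e. -5.5
```

Each expert ends at cumulative gain 0.5 or 0, while the leader-follower loses 1 on every step after the first.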
3 The Follow Perturbed Leader Algorithm with Adaptive Weights

We consider a game of prediction with expert advice. At each step t of the game, all N experts receive arbitrary one-step gains s^i_t ∈ (−∞, +∞), i = 1,…,N, and the cumulative gain of the ith expert after step t equals s^i_{1:t} = s^i_{1:t−1} + s^i_t. A probabilistic learning algorithm of choosing an expert presents, for any i, the probabilities P{I_t = i} of following the ith expert given the cumulative gains s^i_{1:t−1} of the experts i = 1,…,N in hindsight.

Probabilistic algorithm of choosing an expert.
FOR t = 1,…,T
  Given the past cumulative gains s^i_{1:t−1} of the experts, choose an expert i, where i = 1,…,N, with probability P{I_t = i}.
  Receive the one-step gain s^i_t of the chosen expert and suffer the one-step gain s_t = s^i_t of the master algorithm.
ENDFOR

The performance of this probabilistic algorithm is measured by its expected regret

E( max_{i=1,…,N} s^i_{1:T} − s_{1:T} ),

where the random variable s_{1:T} is the cumulative gain of the master algorithm, s^i_{1:T}, i = 1,…,N, are the cumulative gains of the expert algorithms, and E is the mathematical expectation.

In this section we explore the asymptotic consistency of probabilistic learning algorithms in the case of unbounded one-step gains. A probabilistic algorithm is called asymptotically consistent if

lim inf_{T→∞} (1/T) E( s_{1:T} − max_{i=1,…,N} s^i_{1:T} ) ≥ 0.   (6)

Notice that when 0 ≤ s^i_t ≤ 1, all expert algorithms have total gain ≤ T on the first T steps. This is not true in the unbounded case, and there is no reason to divide the expected regret in (6) by T. We modify the definition (6) of the normalized expected regret as follows. Define the volume of a game at step t:

v_t = Σ_{j=1}^t max_i |s^i_j|.

Evidently, v_{t−1} ≤ v_t for all t ≥ 1. Put v_0 = 0. A probabilistic learning algorithm is called asymptotically consistent (in the modified sense) in a game with N experts if

lim inf_{T→∞} (1/v_T) E( s_{1:T} − max_{i=1,…,N} s^i_{1:T} ) ≥ 0.   (7)

A game is called non-degenerate if v_t → ∞ as t → ∞. Denote Δv_t = v_t − v_{t−1} for t ≥ 1. The number

fluc(t) = Δv_t / v_t = max_i |s^i_t| / v_t   (8)

is called the scaled fluctuation of the game at step t (put 0/0 = 0). By definition, 0 ≤ fluc(t) ≤ 1 for all t ≥ 1.

The following simple proposition shows that no probabilistic learning algorithm can be asymptotically consistent in a game whose scaled fluctuation stays close to 1. For simplicity, we consider the case of two experts.

Proposition 1. For any probabilistic algorithm of choosing an expert and for any ε > 0 there exist two experts such that

fluc(t) ≥ 1 − ε,   (1/v_t) E( max_{i=1,2} s^i_{1:t} − s_{1:t} ) ≥ (1/2)(1 − ε)

for all t, where s_{1:t} is the cumulative gain of this algorithm.

Proof. Given a probabilistic algorithm of choosing an expert and an ε with 0 < ε < 1, define recursively the one-step gains s^1_t and s^2_t of experts 1 and 2 at each step t = 1, 2, … as follows. By s^1_{1:t−1} and s^2_{1:t−1} denote the cumulative gains of these experts incurred at steps ≤ t − 1; let v_{t−1} be the corresponding volume. Define v_0 = 1 and M_t = 4v_{t−1}/ε for t ≥ 1. For t ≥ 1, define s^1_t = 0 and s^2_t = M_t if P{I_t = 1} > 1/2, and define s^1_t = M_t and s^2_t = 0 otherwise. For t ≥ 1, let v_t be the volume of the game, s_t the one-step gain of the master algorithm, and s_{1:t} its cumulative gain. By definition, for all t ≥ 1,

E(s_t) = s^1_t P{I_t = 1} + s^2_t P{I_t = 2} ≤ (1/2) M_t.

Evidently, E(s_{1:1}) = E(s_1) and E(s_{1:t}) = E(s_{1:t−1}) + E(s_t) for all t ≥ 2. Then E(s_{1:t}) ≤ (1/2)(1 + ε/2) M_t for all t ≥ 1. Evidently, v_t = v_{t−1} + M_t = M_t (1 + ε/4) and M_t ≤ max_i s^i_{1:t} for all t ≥ 1. Therefore, the normalized expected regret of the master algorithm is bounded from below by

(1/v_t) E( max_i s^i_{1:t} − s_{1:t} ) ≥ (1/2)(1 − ε/2) M_t / ( M_t (1 + ε/4) ) ≥ (1/2)(1 − ε)

for all t.

Let ξ^1,…,ξ^N be a sequence of i.i.d. random variables distributed according to the exponential law with density p(x) = exp{−x}. Let γ(t) be a non-increasing computable real function such that 0 < γ(t) ≤ 1 for all t; for example, γ(t) = t^{−δ}, where δ > 0 and t ≥ 1. Define

α_t = (1/2) ( 1 − ln((1 + ln N)/6) / ln γ(t) )   (9)

and

μ_t = (γ(t))^{α_t} = √(6/(1 + ln N)) (γ(t))^{1/2}   (10)

for all t.² We consider an FPL algorithm with a variable learning rate

ε_t = 1/(μ_t v_{t−1}),   (11)

where μ_t is defined by (10) and the volume v_{t−1} depends on the experts' actions on steps < t. By definition, v_t ≥ v_{t−1} and μ_t ≤ μ_{t−1} for t = 1, 2,…. Also, if γ(t) → 0 as t → ∞ then μ_t → 0 as t → ∞. We suppose without loss of generality that s^i_0 = v_0 = 0 for all i and ε_0 = ∞. The FPL algorithm is defined as follows:

FPL algorithm.
FOR t = 1,…,T
  Define I_t = argmax_i { s^i_{1:t−1} + (1/ε_t) ξ^i }, where ε_t is defined by (11) and i = 1, 2,…,N.³
  Receive the one-step gains s^i_t of the experts i = 1,…,N, and receive the one-step gain s^{I_t}_t of the FPL algorithm.
ENDFOR

Let s_{1:T} = Σ_{t=1}^T s^{I_t}_t be the cumulative gain of the FPL algorithm.

The following theorem gives a bound on the regret of the FPL algorithm. It also shows that if the game is non-degenerate and Δv_t = o(v_t) as t → ∞ with an algorithmic bound, then the FPL algorithm with variable parameters μ_t is asymptotically consistent.

² The choice of the optimal value of α_t will be explained later; it is obtained by minimizing the corresponding member of the sum (40) below. The definition (9) is invalid for γ(t) = 1; in what follows, for γ(t) = 1, we use the values (γ(t))^{α_t} and (γ(t))^{1−α_t} defined by (10).
³ If the maximum is achieved for more than one i, choose the minimal such i.
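A direct sketch of the FPL update with the adaptive rate ε_t = 1/(μ_t v_{t−1}) of (11); the gain sequences and the choice of γ are arbitrary illustrative data (in particular, γ here is not verified against the requirement fluc(t) ≤ γ(t)):

```python
import math
import random

def fpl(gains, gamma):
    """gains[t-1][i]: one-step gain of expert i at step t; gamma(t): the bound
    on fluc(t).  Follows the perturbed leader with eps_t = 1/(mu_t * v_{t-1})."""
    rng = random.Random(0)
    N = len(gains[0])
    mu_c = math.sqrt(6.0 / (1.0 + math.log(N)))        # constant in (10)
    cum = [0.0] * N                                     # cumulative expert gains
    v = 0.0                                             # volume v_{t-1}, v_0 = 0
    total = 0.0                                         # FPL cumulative gain
    for t, step in enumerate(gains, start=1):
        mu = mu_c * math.sqrt(gamma(t))                 # mu_t = (gamma(t))^{alpha_t}
        inv_eps = mu * v                                # 1/eps_t; eps_1 = infinity
        xi = [rng.expovariate(1.0) for _ in range(N)]   # i.i.d. exponential noise
        I = max(range(N), key=lambda i: cum[i] + inv_eps * xi[i])
        total += step[I]
        for i in range(N):
            cum[i] += step[i]
        v += max(abs(x) for x in step)                  # update volume to v_t
    return total, cum

demo = [[0.5, 0.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]]
total, cum = fpl(demo, lambda t: 1.0 / t)
```

Unlike the deterministic follow-the-leader of Section 2, the choice of I_t is randomized by the perturbations μ_t v_{t−1} ξ^i, which is what the regret analysis below exploits.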
Theorem 1. Let γ(t) be a non-increasing computable real function such that 0 < γ(t) ≤ 1 and

fluc(t) ≤ γ(t)   (12)

for all t. Then the expected gain of the FPL algorithm with variable learning rate (11) satisfies

E(s_{1:T}) ≥ max_i s^i_{1:T} − 2 √(6(1 + ln N)) Σ_{t=1}^T (γ(t))^{1/2} Δv_t   (13)

for all T. If the game is non-degenerate and γ(t) → 0 as t → ∞, this algorithm is asymptotically consistent:

lim inf_{T→∞} (1/v_T) E( s_{1:T} − max_{i=1,…,N} s^i_{1:T} ) ≥ 0.   (14)
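For intuition on the bound (13), consider the bounded-step case Δv_t ≡ 1 (so v_T = T) with fluc(t) = 1/t and γ(t) = 1/t: the normalized penalty (1/v_T) Σ_{t≤T} (γ(t))^{1/2} Δv_t behaves like 2/√T and vanishes, which is exactly what (14) requires. A quick numerical check of this illustrative case:

```python
def normalized_penalty(T):
    # (1/v_T) * sum_{t=1}^T (gamma(t))^{1/2} * dv_t  with dv_t = 1, gamma(t) = 1/t
    return sum(t ** -0.5 for t in range(1, T + 1)) / T

small, large = normalized_penalty(100), normalized_penalty(10_000)
assert large < small            # the penalty shrinks as T grows
assert large < 0.03             # roughly 2 / sqrt(10_000)
```

The constant 2√(6(1 + ln N)) in (13) multiplies this vanishing term and does not affect consistency.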
Proof. The analysis of optimality of the FPL algorithm is based on an intermediate predictor, IFPL (Infeasible FPL).

IFPL algorithm.
FOR t = 1,…,T
  Define

  ε'_t = 1/(μ_t v_t),   (15)

  where v_t is the volume of the game at step t and μ_t is defined by (10). Also, define J_t = argmax_i { s^i_{1:t} + (1/ε'_t) ξ^i }.
  Receive the one-step gain s^{J_t}_t of the IFPL algorithm.
ENDFOR

The IFPL algorithm predicts under the knowledge of ε'_t and s^i_{1:t}, i = 1,…,N, which may not be available at the beginning of step t. Using the unknown values of ε'_t and s^i_{1:t} is the main distinctive feature of our version of IFPL. For any t we have

I_t = argmax_i { s^i_{1:t−1} + (1/ε_t) ξ^i },
J_t = argmax_i { s^i_{1:t} + (1/ε'_t) ξ^i } = argmax_i { s^i_{1:t−1} + s^i_t + (1/ε'_t) ξ^i }.

The expected one-step and cumulative gains of the FPL and IFPL algorithms at step t are denoted

l_t = E(s^{I_t}_t) and r_t = E(s^{J_t}_t),  l_{1:T} = Σ_{t=1}^T l_t and r_{1:T} = Σ_{t=1}^T r_t,
respectively, where s^{I_t}_t is the one-step gain of the FPL algorithm at step t, s^{J_t}_t is the one-step gain of the IFPL algorithm, and E denotes the mathematical expectation.

Lemma 1. The total expected gains of the FPL and IFPL algorithms satisfy the inequality

l_{1:T} ≥ r_{1:T} − 6 Σ_{t=1}^T (γ(t))^{1−α_t} Δv_t   (16)

for all T.

Proof. For any j = 1,…,N and fixed reals c_1,…,c_N define

m_j = max_{i≠j} { s^i_{1:t−1} + (1/ε_t) c_i },
m'_j = max_{i≠j} { s^i_{1:t} + (1/ε'_t) c_i } = max_{i≠j} { s^i_{1:t−1} + s^i_t + (1/ε'_t) c_i }.

Let m_j = s^{j_1}_{1:t−1} + (1/ε_t) c_{j_1} and m'_j = s^{j_2}_{1:t−1} + s^{j_2}_t + (1/ε'_t) c_{j_2}. By the definition of m'_j, and since j_1 ≠ j, we have

m'_j = s^{j_2}_{1:t−1} + s^{j_2}_t + (1/ε'_t) c_{j_2} ≥ s^{j_1}_{1:t−1} + (1/ε'_t) c_{j_1} = s^{j_1}_{1:t−1} + (1/ε_t) c_{j_1} + (1/ε'_t − 1/ε_t) c_{j_1} = m_j + (1/ε'_t − 1/ε_t) c_{j_1}.   (17)
We compare the conditional probabilities P{I_t = j | ξ^i = c_i, i ≠ j} and P{J_t = j | ξ^i = c_i, i ≠ j}; in the following chain all probabilities are conditional on ξ^i = c_i, i ≠ j. The following chain of equalities and inequalities is valid:

P{I_t = j} = P{ s^j_{1:t−1} + (1/ε_t) ξ^j ≥ m_j }   (18)
= P{ ξ^j ≥ ε_t (m_j − s^j_{1:t−1}) }   (19)
= P{ ξ^j ≥ ε'_t (m_j − s^j_{1:t−1}) + (ε_t − ε'_t)(m_j − s^j_{1:t−1}) }   (20)
= P{ ξ^j ≥ ε'_t (m_j − s^j_{1:t−1}) + (ε_t − ε'_t)( s^{j_1}_{1:t−1} − s^j_{1:t−1} + (1/ε_t) c_{j_1} ) }
= exp{ −(ε_t − ε'_t)(s^{j_1}_{1:t−1} − s^j_{1:t−1}) } · P{ ξ^j ≥ ε'_t (m_j − s^j_{1:t−1}) + (ε_t − ε'_t)(1/ε_t) c_{j_1} }   (21)
≥ exp{ −(ε_t − ε'_t)(s^{j_1}_{1:t−1} − s^j_{1:t−1}) } · P{ ξ^j ≥ ε'_t ( m'_j − s^j_{1:t} + s^j_t − (1/ε'_t − 1/ε_t) c_{j_1} ) + (ε_t − ε'_t)(1/ε_t) c_{j_1} }
= exp{ −(ε_t − ε'_t)(s^{j_1}_{1:t−1} − s^j_{1:t−1}) − ε'_t s^j_t } · P{ ξ^j ≥ ε'_t (m'_j − s^j_{1:t}) }   (22)
= exp{ −( 1/(μ_t v_{t−1}) − 1/(μ_t v_t) )(s^{j_1}_{1:t−1} − s^j_{1:t−1}) − s^j_t/(μ_t v_t) } · P{ ξ^j > (1/(μ_t v_t))(m'_j − s^j_{1:t}) }   (23)
≥ exp{ −(Δv_t/(μ_t v_t)) · (s^{j_1}_{1:t−1} − s^j_{1:t−1})/v_{t−1} − Δv_t/(μ_t v_t) } · P{ ξ^j > (1/(μ_t v_t))(m'_j − s^j_{1:t}) }   (24)
= exp{ −(Δv_t/(μ_t v_t)) ( 1 + (s^{j_1}_{1:t−1} − s^j_{1:t−1})/v_{t−1} ) } · P{ J_t = j }.   (25)
Here we have used twice, in (18)-(19) and in (21)-(22), the equality P{ξ > a + b} = e^{−b} P{ξ > a}, valid for any random variable ξ distributed according to the exponential law. The inequality between (20) and (21) follows from (17) and ε_t ≥ ε'_t for all t. We have used in (24) the equality v_t − v_{t−1} = max_i |s^i_t|. The expression in the exponent of (25) is bounded:

| s^{j_1}_{1:t−1} − s^j_{1:t−1} | / v_{t−1} ≤ 2,   (26)

since |s^i_{1:t−1}| / v_{t−1} ≤ 1 for all t and i. Therefore, we obtain

P{I_t = j | ξ^i = c_i, i ≠ j} ≥ exp{ −3 Δv_t/(μ_t v_t) } P{J_t = j | ξ^i = c_i, i ≠ j}   (27)
≥ exp{ −3 (γ(t))^{1−α_t} } P{J_t = j | ξ^i = c_i, i ≠ j}.   (28)

Since the inequality (28) holds for all c_i, it also holds unconditionally:

P{I_t = j} ≥ exp{ −3 (γ(t))^{1−α_t} } P{J_t = j}   (29)

for all t = 1, 2, … and j = 1,…,N. Since s^j_t + Δv_t ≥ 0 for all j and t, we obtain from (29)

l_t + Δv_t = E(s^{I_t}_t + Δv_t) = Σ_{j=1}^N (s^j_t + Δv_t) P(I_t = j)
≥ exp{−3(γ(t))^{1−α_t}} Σ_{j=1}^N (s^j_t + Δv_t) P(J_t = j)
= exp{−3(γ(t))^{1−α_t}} ( E(s^{J_t}_t) + Δv_t )
= exp{−3(γ(t))^{1−α_t}} (r_t + Δv_t)
≥ (1 − 3(γ(t))^{1−α_t})(r_t + Δv_t)
= r_t + Δv_t − 3(γ(t))^{1−α_t}(r_t + Δv_t)
≥ r_t + Δv_t − 6(γ(t))^{1−α_t} Δv_t.   (30)

In the last line of (30) we have used the inequality |r_t| ≤ Δv_t for all t and the inequality exp{−3r} ≥ 1 − 3r for all r. Subtracting Δv_t from both sides of (30) and summing over t = 1,…,T, we obtain

l_{1:T} ≥ r_{1:T} − 6 Σ_{t=1}^T (γ(t))^{1−α_t} Δv_t
for all T . Lemma 1 is proved. The following lemma gives a lower bound for the gain of the IFPL algorithm. Lemma 2. The expected cumulative gain of the IFPL algorithm with the learning rate (15) is bounded by r1:T ≥ max si1:T − (1 + ln N ) i
T
(γ(t))αt Δvt
(31)
t=1
for all T . Proof. The proof is along the line of the proof from Hutter and Poland [6] with an exception that now the sequence t is not monotonic. Let in this proof, st = (s1t , . . . sN t ) be a vector of one-step gains and s1:t = 1 (s1:t , . . . sN ) be a vector of cumulative gains of the experts algorithms. Also, 1:t let ξ = (ξ 1 , . . . ξ N ) be a vector whose coordinates are i.i.d according to the exponential law random variables. Recall that t = 1/(μt vt ) and v0 = 0, 0 = ∞. Define s˜1:t = s1:t + 1 ξ for t = 1, 2, . . .. Consider the one-step gains s˜t = t for the moment. For any vector s and a unit vector d denote st + ξ 1 − 1 t
t−1
M (s) = argmaxd∈D {d · s}, where D = {(0, . . . 1), . . . (1, . . . 0)} is the set of N unit vectors of dimension N and “·” is the inner product of two vectors. We first show that T
M (˜ s1:t ) · s˜t ≥ M (˜ s1:T ) · s˜1:T .
(32)
t=1
For T = 1 this is obvious. For the induction step from T − 1 to T we need to show that s1:T ) · s˜1:T − M (˜ s1:T −1 ) · s˜1:T −1 . M (˜ s1:T ) · s˜T ≥ M (˜
Learning Volatility of Discrete Time Series
27
This follows from s̃_{1:T} = s̃_{1:T−1} + s̃_T and

M(s̃_{1:T}) · s̃_{1:T−1} ≤ M(s̃_{1:T−1}) · s̃_{1:T−1}.

We rewrite (32) as follows:

Σ_{t=1}^{T} M(s̃_{1:t}) · s_t ≥ M(s̃_{1:T}) · s̃_{1:T} − Σ_{t=1}^{T} M(s̃_{1:t}) · ξ (1/ε_t − 1/ε_{t−1}).   (33)
By the definition of M we have

M(s̃_{1:T}) · s̃_{1:T} ≥ M(s_{1:T}) · (s_{1:T} + ξ/ε_T) = max_{d ∈ D} {d · s_{1:T}} + M(s_{1:T}) · ξ/ε_T.   (34)
We have

Σ_{t=1}^{T} (1/ε_t − 1/ε_{t−1}) M(s̃_{1:t}) · ξ = Σ_{t=1}^{T} (μ_t v_t − μ_{t−1} v_{t−1}) M(s̃_{1:t}) · ξ.   (35)
We will use the inequality

0 ≤ E(M(s̃_{1:t}) · ξ) ≤ E(M(ξ) · ξ) = E(max_i ξ^i) ≤ 1 + ln N.   (36)
The proof of this inequality uses an idea of Lemma 1 from [6]. For the exponentially distributed random variables ξ^i, i = 1, ..., N, we have

P{max_i ξ^i ≥ a} = P{∃i (ξ^i ≥ a)} ≤ Σ_{i=1}^{N} P{ξ^i ≥ a} = N exp{−a}.   (37)

Since for any non-negative random variable η we have E(η) = ∫_0^∞ P{η ≥ y} dy, by (37),

E(max_i ξ^i − ln N) = ∫_0^∞ P{max_i ξ^i − ln N ≥ y} dy ≤ ∫_0^∞ N exp{−y − ln N} dy = 1.

Therefore, E(max_i ξ^i) ≤ 1 + ln N.
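As a quick numerical check of the bound E(max_i ξ^i) ≤ 1 + ln N (an illustration of ours, not part of the proof), one can estimate the expectation by simulation:

```python
import math
import random

def mean_max_exponential(n_experts, n_trials=20000, seed=0):
    """Monte Carlo estimate of E(max_i xi^i) for i.i.d. Exp(1) variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.expovariate(1.0) for _ in range(n_experts))
    return total / n_trials

N = 10
estimate = mean_max_exponential(N)
bound = 1.0 + math.log(N)   # the bound from (36)
# The exact value is the harmonic number H_N, which lies below 1 + ln N.
print(estimate, bound)
```

For N = 10 the estimate comes out close to the harmonic number H_10 ≈ 2.93, comfortably below 1 + ln 10 ≈ 3.30.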
By (36) the expectation of (35) has the upper bound

Σ_{t=1}^{T} E(M(s̃_{1:t}) · ξ)(μ_t v_t − μ_{t−1} v_{t−1}) ≤ (1 + ln N) Σ_{t=1}^{T} μ_t Δv_t.

Here we have used the inequality μ_t ≤ μ_{t−1} for all t. Since E(ξ^i) = 1 for all i, the expectation of the last term in (34) is equal to

E(M(s_{1:T}) · ξ/ε_T) = 1/ε_T = μ_T v_T.   (38)
Combining the bounds (33)-(35) and (38), we obtain

r_{1:T} = E(Σ_{t=1}^{T} M(s̃_{1:t}) · s_t)
  ≥ max_i s^i_{1:T} + μ_T v_T − (1 + ln N) Σ_{t=1}^{T} μ_t Δv_t
  ≥ max_i s^i_{1:T} − (1 + ln N) Σ_{t=1}^{T} μ_t Δv_t.   (39)
Lemma 2 is proved.

We now finish the proof of the theorem. The inequality (16) of Lemma 1 and the inequality (31) of Lemma 2 imply

E(s_{1:T}) ≥ max_i s^i_{1:T} − Σ_{t=1}^{T} (6(γ(t))^{1−α_t} + (1 + ln N)(γ(t))^{α_t}) Δv_t   (40)
for all T. The optimal value (9) of α_t can easily be obtained by minimizing each term of the sum (40) in α_t. In this case μ_t is equal to (10), and (40) is equivalent to (13). Let γ(T) → 0 as T → ∞ and let the game be non-degenerate, i.e., v_T → ∞ as T → ∞. We have Σ_{t=1}^{T} Δv_t = v_T for all T. Then by the Toeplitz lemma [13]

(1/v_T) · 2√(6(1 + ln N)) · Σ_{t=1}^{T} (γ(t))^{1/2} Δv_t → 0

as T → ∞. This limit and the inequality (13) imply the expected asymptotic consistency (14). The theorem is proved.

In [1] and [10] polynomial bounds on one-step losses were considered. We also present a bound of this type.

Corollary 1. Suppose that max_i |s^i_t| ≤ t^a for all t and lim inf_{t→∞} v_t / t^{a+δ} > 0, where i = 1, ..., N and a, δ are positive real numbers. Then

E(s_{1:T}) ≥ max_i s^i_{1:T} − O(√(1 + ln N)) · T^{1−δ/2+a}

as T → ∞, where γ(t) = t^{−δ} and μ_t is defined by (10).
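The Follow the Perturbed Leader rule that these bounds analyze can be sketched in a few lines. The snippet below is an illustrative simplification of ours, not the paper's exact IFPL algorithm: the learning rates `epsilons` are supplied by the caller rather than computed from the volume v_t, and the function name is hypothetical.

```python
import random

def fpl_choices(gains, epsilons, seed=0):
    """Follow the Perturbed Leader: gains[t][i] is expert i's gain at step t,
    epsilons[t] is the learning rate used at step t.  Cumulative gains are
    perturbed by i.i.d. exponential noise scaled by 1/epsilon_t, as in the
    FPL analysis; the learner follows the perturbed leader at each step."""
    rng = random.Random(seed)
    n = len(gains[0])
    cumulative = [0.0] * n
    choices = []
    for t, step_gains in enumerate(gains):
        noise = [rng.expovariate(1.0) for _ in range(n)]
        perturbed = [cumulative[i] + noise[i] / epsilons[t] for i in range(n)]
        choices.append(max(range(n), key=lambda i: perturbed[i]))
        for i in range(n):
            cumulative[i] += step_gains[i]
    return choices

# Two experts; expert 1 always gains more, so with a large epsilon
# (small perturbation) the learner soon settles on expert 1.
print(fpl_choices([[0.0, 1.0]] * 50, [10.0] * 50))
```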
4
Zero-Sum Game
In this section we derandomize the zero-sum game from Section 2. We interpret the expected one-step gain E(s_t) as the weighted average of the one-step gains of the experts' strategies. In more detail, at each step t, Learner divides his investment in proportion to the probabilities of the expert strategies (4) and (5) computed by the FPL algorithm, and receives the gain

G_t = 2C(S_t − S_0)(P{I_t = 1} − P{I_t = 2}) ΔS_t

at any step t, where C is an arbitrary positive constant; G_{1:T} = Σ_{t=1}^{T} G_t = E(s_{1:T}) is the Learner's cumulative gain.

If the game satisfies |s^1_t| / Σ_{i=1}^{t} |s^1_i| ≤ γ(t) for all t, then by (13) we have the lower bound

G_{1:T} ≥ |s^1_{1:T}| − 8 Σ_{t=1}^{T} (γ(t))^{1/2} |s^1_t|

for all T. Assume that γ(t) = μ for all t. Then G_{1:T} ≥ |s^1_{1:T}| − 8 μ^{1/2} v_T for all T.
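The cumulative gain G_{1:T} is straightforward to accumulate. The sketch below uses invented prices and probabilities, and takes ΔS_t = S_{t+1} − S_t, which is our assumption about the increment convention:

```python
def cumulative_gain(prices, probs1, probs2, C=1.0):
    """Accumulate G_t = 2C(S_t - S_0)(P{I_t=1} - P{I_t=2}) * dS_t over all
    steps, with dS_t = S_{t+1} - S_t.  prices has length T+1; probs1 and
    probs2 are the FPL probabilities of the two expert strategies."""
    s0 = prices[0]
    total = 0.0
    for t in range(len(prices) - 1):
        ds = prices[t + 1] - prices[t]
        total += 2.0 * C * (prices[t] - s0) * (probs1[t] - probs2[t]) * ds
    return total

# Example: the learner always follows strategy 1 on a rising price path.
print(cumulative_gain([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```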
5
Conclusion
In this paper we study two different problems: the first is how to use the fractional Brownian motion of prices to gain on a financial market; the second consists in extending methods of the theory of prediction with expert advice to the case when the experts' one-step gains are unbounded. Though these problems look independent, the first serves as a motivating example for studying averaging strategies in the case of unbounded gains and losses. The FPL algorithm with variable learning rates is simple to implement, and it brings satisfactory experimental results when prices follow fractional Brownian motion. There are some open problems for further research. How can one construct a defensive strategy for Learner in the sense of Shafer and Vovk's book [12]? This means that Learner, starting with some initial capital, never goes into debt and achieves a gain when the macro and micro volatilities differ. A general problem is to develop other strategies which achieve a gain when prices follow fractional Brownian motion. In the theory of prediction with expert advice, it is useful to analyze the performance of the well-known algorithms (like WM) for the case of unbounded gains and losses in terms of the volume of a game. There is a gap between Proposition 1 and Theorem 1, since we assume in the theorem that the game satisfies fluc(t) ≤ γ(t) → 0, where γ(t) is computable. Does there exist an asymptotically consistent learning algorithm in the case where fluc(t) → 0 as t → ∞ with no computable upper bound?
Acknowledgments. This research was partially supported by the Russian Foundation for Basic Research, grants 09-07-00180-a and 09-01-00709-a. The author thanks Volodya Vovk and Yuri Kalnishkan for useful comments.
References

[1] Allenberg, C., Auer, P., Györfi, L., Ottucsák, G.: Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In: Balcázar, J.L., Long, P.M., Stephan, F. (eds.) ALT 2006. LNCS (LNAI), vol. 4264, pp. 229–243. Springer, Heidelberg (2006)
[2] Cheridito, P.: Arbitrage in fractional Brownian motion models. Finance and Stochastics 7, 533–553 (2003)
[3] Delbaen, F., Schachermayer, W.: A general version of the fundamental theorem of asset pricing. Mathematische Annalen 300, 463–520 (1994)
[4] Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
[5] Hannan, J.: Approximation to Bayes risk in repeated plays. In: Dresher, M., Tucker, A.W., Wolfe, P. (eds.) Contributions to the Theory of Games 3, pp. 97–139. Princeton University Press, Princeton (1957)
[6] Hutter, M., Poland, J.: Prediction with expert advice by following the perturbed leader for general weights. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp. 279–293. Springer, Heidelberg (2004)
[7] Kalai, A., Vempala, S.: Efficient algorithms for online decisions. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 26–40. Springer, Heidelberg (2003); extended version in Journal of Computer and System Sciences 71, 291–307 (2005)
[8] Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information and Computation 108, 212–261 (1994)
[9] Lugosi, G., Cesa-Bianchi, N.: Prediction, Learning and Games. Cambridge University Press, New York (2006)
[10] Poland, J., Hutter, M.: Defensive universal learning with experts for general weight. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 356–370. Springer, Heidelberg (2005)
[11] Rogers, C.: Arbitrage with fractional Brownian motion. Mathematical Finance 7, 95–105 (1997)
[12] Shafer, G., Vovk, V.: Probability and Finance: It's Only a Game! Wiley, New York (2001)
[13] Shiryaev, A.N.: Probability. Springer, Berlin (1980)
[14] Vovk, V.: A game-theoretic explanation of the √dt effect. Working paper #5 (2003), http://www.probabilityandfinance.com
Prediction of Long-Range Dependent Time Series Data with Performance Guarantee

Mikhail Dashevskiy and Zhiyuan Luo

Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
{mikhail,zhiyuan}@cs.rhul.ac.uk
Abstract. Modelling and predicting long-range dependent time series data can find important and practical applications in many areas such as telecommunications and finance. In this paper, we consider Fractional Autoregressive Integrated Moving Average (FARIMA) processes which provide a unified approach to characterising both short-range and long-range dependence. We compare two linear prediction methods for predicting observations of FARIMA processes, namely the Innovations Algorithm and Kalman Filter, from the computational complexity and prediction performance point of view. We also study the problem of Prediction with Expert Advice for FARIMA and propose a simple but effective way to improve the prediction performance. Alongside the main experts (FARIMA models) we propose to use some naive methods (such as Least-Squares Regression) in order to improve the performance of the system. Experiments on publicly available datasets show that this construction can lead to great improvements of the prediction system. We also compare our approach with a traditional method of model selection for the FARIMA model, namely Akaike Information Criterion.
1 Introduction

Many real-life applications such as those related to telecommunication network traffic and stock markets have the property of long-range dependence, i.e. the sample autocovariance function is a slowly varying function ([14]). Modelling and predicting such processes can find practical and important applications in many areas such as telecommunications, signal processing, finance, econometrics and climate studies. For example, predicting network traffic demand is a fundamental objective of network resource management algorithms. A few models have been proposed for representing long-range dependent processes. One such model, Fractional Autoregressive Integrated Moving Average (FARIMA), has proved useful in the analysis of time series with long-range dependency and has been used in various applications [5, 16, 22]. In this paper, we focus on FARIMA models for time series data and investigate various ways to improve the quality of prediction for such series. Firstly, we address the problem of model selection and discuss ways of selecting the best possible model (or of finding alternative ways of using such a model without having to choose only one set of initial parameters). In [4] it was shown that

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 31–45, 2009.
© Springer-Verlag Berlin Heidelberg 2009
the Aggregating Algorithm (a method of Prediction with Expert Advice) could be effectively used to avoid model selection and provide a performance guarantee. Further improvements of the method are discussed and presented. Secondly, we consider the problem of making predictions for a FARIMA process. Two algorithms, namely the Innovations Algorithm described in [3] and the Kalman Filter ([9, 10]), are investigated. Both algorithms have been used extensively, but the problem of choosing between them is still open. Experiments are carried out to compare these algorithms in terms of prediction performance and computational efficiency. Finally, we consider an approach for dealing with the practical model selection problem by using the Aggregating Algorithm. One of the advantages of the proposed method is a performance guarantee through a theoretical bound on the total loss of the algorithm used. The rest of the paper is organised as follows. Section 2 briefly describes the FARIMA model, and Subsection 2.1 is devoted to the Akaike Information Criterion (AIC), which is often used as a model selection tool. Section 3 describes the Innovations Algorithm and the Kalman Filter, two algorithms which can be used for prediction of linear processes. Section 4 presents a short summary of the Aggregating Algorithm, a method which can be applied to guarantee the performance of prediction for long-range dependent processes. Section 5 discusses the experimental setup and the obtained results. Finally, Section 6 concludes the paper.
2 Fractional Auto-Regressive Integrated Moving Average (FARIMA)

Suppose we have observations x_1, x_2, ...; x_i ∈ R for all i, and we are interested in finding a sequence of functions f_n : x_1, ..., x_n → x_{n+1} such that its outputs are close (in some sense) to the values observed in the future. This is the problem of predicting a discrete time series. A time series process {X_t, t ∈ Z} is called weakly stationary if it has a finite mean and the covariance depends only on the difference between the parameters.

Definition 1. The autocovariance function of the process {X_t} is γ(s, t) = E[(X_s − μ)(X_t − μ)], where μ = E X_t.

Since for weakly stationary processes the covariance depends only on the difference between the two parameters, we can write γ(s, t) = E[(X_s − μ)(X_t − μ)] = E[(X_0 − μ)(X_{t−s} − μ)] = γ(t − s). In this paper we are most interested in long-range dependent time series processes.

Definition 2. A time series process {X_t, t ∈ Z} is called long-range dependent if Σ_{h=−∞}^{∞} |γ(h)| = ∞.

Another definition of long-range dependence states that if a process's autocovariance function can be written in the form γ(h) ∼ h^{2d−1} l_1(h), where l_1(h) is a slowly varying function, then this process is long-range dependent. In this definition d reflects the order of long-range dependency. In this section we describe the FARIMA model, which was introduced in [6, 7]. The model has been applied to many applications where long-range dependent processes
had to be predicted. These applications include prediction of network traffic demand, electricity demand and financial data. The FARIMA model is a generalization of the Auto-Regressive Moving Average (ARMA) model, which is itself a generalization of the Moving Average and Autoregressive models. The ARMA model was designed to model and predict processes with short-range dependence in linear form. Previous research ([20, 22]) showed that the FARIMA model can model network traffic demand well, and we assume that this model can be successful in predicting other time series processes with long-range dependence. A FARIMA process {X_t} can be defined by

φ(B) X_t = θ(B) (1 − B)^{−d} ε_t,

where B is the backshift operator defined as B^k X_t = X_{t−k}; φ(B) = 1 + φ_1 B + ... + φ_p B^p and θ(B) = 1 + θ_1 B + ... + θ_q B^q are the autoregressive and moving average operators respectively; φ(B) and θ(B) have no common roots; and (1 − B)^{−d} is a fractional differentiating operator defined by the binomial expansion

(1 − B)^{−d} = Σ_{j=0}^{∞} η_j B^j = η(B),  where η_j = Γ(j + d) / (Γ(j + 1) Γ(d))

for d < 1/2, d ≠ 0, −1, −2, ..., where Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt is the Gamma function (see [14, 3]) and ε_t is a white noise sequence with finite variance. The process {X_t} is called a FARIMA(p, d, q) process.

To fit a FARIMA model to some data it is essential to start with parameter estimation. The parameter estimation problem for FARIMA involves estimation of three parameters: d, φ(B) and θ(B). Of these three parameters, d describes the order of long-range dependency, while φ(B) and θ(B) describe the short-range dependency. Knowing parameter d, we can convert a FARIMA process into an ARMA process and the other way round. There exist different techniques to estimate the parameters associated with a FARIMA process. In [4] it was shown that it is not necessary to estimate the model orders p and q to use the FARIMA model. An important characteristic of a time series is the Hurst parameter H, which is related to parameter d via the equation H = d + 1/2. The following procedure is used to estimate parameter d and the polynomial coefficients of φ(B) and θ(B) with fixed parameters p and q:
1. Estimate the Hurst parameter using Dispersional analysis (also known as the Aggregated Variance method); see [17, 2] for more details. Parameter d can be found using d = H − 1/2.
2. Convert the data from a FARIMA process into an ARMA process using differentiation with the estimated parameter d.
3. Apply the Innovations Algorithm ([3]) to fit a Moving Average model to the data.
4. Estimate the coefficients of φ(B) and θ(B) ([3]).

As the model orders p and q increase, the computational complexity increases dramatically. This is one of the major drawbacks of the FARIMA model. In this paper, we will only consider FARIMA models with small values of p and q. This is a standard approach to using FARIMA models, since usually the most successful models have parameters p and q not greater than 2.
2.1 Akaike Information Criterion

When the problem of model selection arises, there is a standard technique which helps to choose a good model. Model selection among fractionally integrated auto-regressive moving average (FARIMA) models is commonly carried out with the likelihood-ratio test. The Akaike Information Criterion is a widely used method for model identification ([14]) and has become a standard tool in time series modelling. The criterion calculates a value (AIC) for each suggested model and chooses the model with the lowest AIC. AIC trades off the goodness-of-fit against the number of parameters required in the model. A large number of parameters needs to be penalised in order to avoid overfitting, i.e. a situation where the model fits the available data too well, due to particularities of those data, and therefore predicts poorly. For the FARIMA model the criterion calculates the value

AIC = N ln(σ̂²_N) + 2(p + q + 1),

where N is the size of the training set, p and q are the parameters of the FARIMA model and σ̂²_N is an approximation of the log-likelihood function. In this paper we used the Whittle approximation ([14]) to this function, which states

σ̂²_N = (1/(2π)) Σ_{j=1}^{(N−1)/2} I_N(λ_j) / f(λ_j, θ, φ, d),

where I_N(λ_j) denotes the periodogram and f(λ_j, θ, φ, d) is the spectral density of the FARIMA process with the parameter vector (θ, φ, d) at the Fourier frequency λ_j = 2πj/N. The periodogram of a series y_j is

I(λ) = (1/(2πN)) |Σ_{j=1}^{N} y_j e^{iλj}|²

and the spectral density of the FARIMA model is

f(λ) = (σ²/(2π)) (2 sin(λ/2))^{−2d} |θ(e^{−iλ})|² / |φ(e^{−iλ})|²,

where φ(·) and θ(·) are the autoregressive and moving average polynomials of the FARIMA model, d is a parameter of the FARIMA model (as in FARIMA(p, d, q)) and σ² is the variance of the white noise. The model with the smallest AIC over the set of considered models is chosen as the preferred model. The AIC value is, however, not a direct measure of predictive power. A good study of FARIMA model identification is given in [1]. As will be shown in Section 5, AIC does not usually identify the best model, and there exist alternative methods of using FARIMA models.
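Once σ̂²_N has been computed for each candidate model, applying the criterion is a one-liner; the helper below is a sketch of ours, with invented variance values in the example:

```python
import math

def aic_farima(n_train, sigma2_hat, p, q):
    """AIC = N ln(sigma^2_N) + 2(p + q + 1) for a FARIMA(p, d, q) model."""
    return n_train * math.log(sigma2_hat) + 2 * (p + q + 1)

def select_by_aic(n_train, candidates):
    """candidates: dict mapping (p, q) -> estimated sigma^2_N.
    Returns the (p, q) pair with the smallest AIC."""
    return min(candidates, key=lambda pq: aic_farima(n_train, candidates[pq], *pq))

# Hypothetical residual variances: the (3, 3) model fits slightly better but
# pays a larger parameter penalty, so (1, 1) wins.
candidates = {(0, 1): 1.10, (1, 1): 1.00, (3, 3): 0.99}
best = select_by_aic(200, candidates)
```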
3 Prediction for ARMA Processes

As mentioned in Section 2, the problem of prediction for FARIMA processes can be transformed into the problem of prediction for ARMA processes by splitting a FARIMA model into its "fractional differentiated" and "ARMA" parts. Fractional differentiation of the data can be carried out using the following (see [21]):

W_t = Σ_{i=0}^{t} [(−1)^i Γ(1 + d) / (Γ(1 + i) Γ(1 + d − i))] X_i.

We can transform the process back into a FARIMA process using the following:

X_t = Σ_{i=0}^{t} [(−1)^i Γ(1 − d) / (Γ(1 + i) Γ(1 − d − i))] W_i.
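The weights (−1)^i Γ(1 + d)/(Γ(1 + i) Γ(1 + d − i)) are, up to sign, the binomial coefficients binom(d, i) and can be computed by a stable recursion instead of evaluating Γ directly. The sketch below (our own rendering) applies them in the standard convolutional form W_t = Σ_i w_i X_{t−i}, which is our reading of the differencing filter:

```python
def frac_diff_weights(d, n):
    """Weights w_i = (-1)^i * Gamma(1+d) / (Gamma(1+i) * Gamma(1+d-i)),
    i.e. (-1)^i * binom(d, i), via the recursion w_i = -w_{i-1}(d-i+1)/i."""
    w = [1.0]
    for i in range(1, n):
        w.append(-w[-1] * (d - i + 1) / i)
    return w

def frac_diff(x, d):
    """Apply the fractional differencing filter W_t = sum_i w_i * x_{t-i}."""
    w = frac_diff_weights(d, len(x))
    return [sum(w[i] * x[t - i] for i in range(t + 1)) for t in range(len(x))]

# With d = 1 the filter reduces to the ordinary first difference:
# weights (1, -1, 0, ...), so frac_diff(x, 1)[t] = x[t] - x[t-1].
```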
The converted process W_t can be fitted with an ARMA model, and different linear prediction algorithms have been developed. More formally, suppose we have some time series data X_t and want to use ARMA models to predict future observations of the series. We can follow these steps:

1. Transform the data into an ARMA process W_t.
2. Make a prediction for the transformed process W_t.
3. Calculate the prediction for the X_t process using the transformed model.

In the rest of this section we describe two algorithms which can be used for recursive one-step-ahead prediction of ARMA processes, namely the Innovations Algorithm and the Kalman Filter. The former can also be used for modelling ARMA processes and fitting the ARMA model to data.

3.1 Innovations Algorithm

One of the algorithms used to fit an ARMA process to data is called the Innovations Algorithm. We first start by fitting a Moving Average (MA) model with a high order to the data. When such a model is fitted, we can find the coefficients φ and θ of an ARMA model which fits the data. From the ARMA model we can move on to FARIMA models. The algorithms provided in this section are described more fully in [3]. Suppose we have an ARMA process W_t. To make predictions of future observations we can follow these steps:

1. Fit an MA model of a high order to the data using the Innovations Algorithm.
2. Use the MA model to find the coefficients of the polynomials φ and θ and the white noise variance σ² of an ARMA model which fits the data well.
3. Use the polynomials φ and θ to predict the future value of W_t.

Let us now consider each step separately. Consider the process generated by

W_t − φ_1 W_{t−1} − ... − φ_p W_{t−p} = Z_t + θ_1 Z_{t−1} + ... + θ_q Z_{t−q},  Z_t ∼ WN(0, σ²).   (1)
If we assume that the data series X_t is generated by the ARMA(p, q) model with known parameters p and q and use properties of ARMA models, we can represent the data series W_t as

W_t = Σ_{j=0}^{∞} ψ_j Z_{t−j},
where Z_j is a white noise sequence and the coefficients ψ_j satisfy

ψ_0 = 1,  ψ_j = θ_j + Σ_{i=1}^{min(j,p)} φ_i ψ_{j−i}.

To fit a model Σ_{j=0}^{∞} ψ_j Z_{t−j} to the data W_t with zero mean we can use Algorithm 1:

Algorithm 1. Innovations Algorithm for the MA(r) model
Require: Parameter r
  k(i, j) = E(W_i W_j), 1 ≤ i, j ≤ r + 1
  v_0 = k(1, 1)
  for n = 1, 2, ..., r do
    for l = 0, 1, ..., n − 1 do
      θ_{n,n−l} = v_l^{−1} (k(n + 1, l + 1) − Σ_{j=0}^{l−1} θ_{l,l−j} θ_{n,n−j} v_j)
    end for
    v_n = k(n + 1, n + 1) − Σ_{j=0}^{n−1} θ_{n,n−j}² v_j
  end for
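Algorithm 1 translates directly into code. The rendering below is our own sketch, with the covariance function kappa(i, j) = E(W_i W_j) supplied by the caller using 1-based indices:

```python
def innovations(kappa, r):
    """Innovations Algorithm (Algorithm 1).  kappa(i, j) = E(W_i W_j) with
    1-based indices, r is the MA order.  Returns (theta, v) where
    theta[(n, k)] are the innovation coefficients and v[n] are the one-step
    mean squared prediction errors."""
    v = [kappa(1, 1)]
    theta = {}
    for n in range(1, r + 1):
        for l in range(0, n):
            acc = kappa(n + 1, l + 1)
            for j in range(0, l):
                acc -= theta[(l, l - j)] * theta[(n, n - j)] * v[j]
            theta[(n, n - l)] = acc / v[l]
        v.append(kappa(n + 1, n + 1)
                 - sum(theta[(n, n - j)] ** 2 * v[j] for j in range(n)))
    return theta, v

# Example: MA(1) process W_t = Z_t + 0.5 Z_{t-1} with sigma^2 = 1, so
# gamma(0) = 1.25, gamma(1) = 0.5; theta_{n,1} should approach 0.5.
theta, v = innovations(lambda i, j: {0: 1.25, 1: 0.5}.get(abs(i - j), 0.0), 2)
```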
By taking the order r of the Moving Average model high enough (in our experiments we used r = max(p, q) + p − 1, which proved to give a good fit) we can estimate the coefficients of the polynomial φ by solving

( θ_{r,q+1} )   ( θ_{r,q}      θ_{r,q−1}    ···  θ_{r,q+1−p} )   ( φ_1 )
( θ_{r,q+2} ) = ( θ_{r,q+1}    θ_{r,q}      ···  θ_{r,q+2−p} ) × ( φ_2 )
(     ⋮     )   (     ⋮             ⋮        ⋱        ⋮      )   (  ⋮  )
( θ_{r,q+p} )   ( θ_{r,q+p−1}  θ_{r,q+p−2}  ···  θ_{r,q}     )   ( φ_p )

The coefficients θ_j of the polynomial of the ARMA process represented by equation (1) can be estimated using

θ_j = θ_{r,j} − Σ_{i=1}^{min(j,p)} φ_i θ_{r,j−i},  j = 1, 2, ..., q,

and the white noise variance in equation (1) is estimated using σ² = v_r, where v_r is calculated by the algorithm. Once the polynomials φ and θ and the variance σ² of equation (1) are known, Algorithm 2 can be used to make a one-step-ahead prediction Ŵ_{n+1} of W_{n+1}.
Algorithm 2. Recursive one-step-ahead prediction of an ARMA(p, q) process
Require: Parameters N (the total size of "training set + test set"), p and q (the orders of the model)
  m = max(p, q)
  v_0 = E(W_1²)
  Ŵ_1 = 0
  for n = 1, 2, ..., N do
    γ(n) = E(W_1 W_n)
    for 1 ≤ i, j ≤ n + 1 do
      k(i, j) = σ^{−2} γ(i − j), if 1 ≤ i, j ≤ m
              = σ^{−2} [γ(i − j) − Σ_{r=1}^{p} φ_r γ(r − |i − j|)], if min(i, j) ≤ m < max(i, j) ≤ 2m
              = Σ_{r=0}^{q} θ_r θ_{r+|i−j|}, if min(i, j) > m
              = 0, otherwise
    end for
    for l = 0, 1, ..., n − 1 do
      θ_{n,n−l} = v_l^{−1} (k(n + 1, l + 1) − Σ_{j=0}^{l−1} θ_{l,l−j} θ_{n,n−j} v_j)
    end for
    v_n = k(n + 1, n + 1) − Σ_{j=0}^{n−1} θ_{n,n−j}² v_j
    if n < m then
      Ŵ_{n+1} = Σ_{j=1}^{n} θ_{n,j} (W_{n+1−j} − Ŵ_{n+1−j})
    else
      Ŵ_{n+1} = φ_1 W_n + ... + φ_p W_{n+1−p} + Σ_{j=1}^{n} θ_{n,j} (W_{n+1−j} − Ŵ_{n+1−j})
    end if
  end for
3.2 Kalman Filter

The Kalman Filter is a technique for smoothing, filtering and prediction of linear time series data. In this paper we will use the Kalman Filter only to predict future observations of a time series for a given model. To apply the Kalman Filter, the model needs to be represented in state-space form, which can be described by the discrete-time equations

y_t = Z_t α_t + G_t ε_t,
α_{t+1} = T_t α_t + H_t ε_t,   (2)

where y_t ∈ R is the observation sequence, α_t is the state vector of size s × 1 for all times t, the system matrices Z_t, G_t, T_t and H_t are non-random matrices of sizes 1 × s, 1 × l, s × s and s × l respectively, ε_t ∼ (0, σ²I) is the errors vector of size l × 1, α_1 ∼ (a_1, σ²P_1), ε_t and α_1 are mutually uncorrelated, a_1 and P_1 are the parameters of the prior distribution of the state vector, and σ² is the variance of the errors. The state-space model and the Kalman Filter were first introduced by Kalman and Bucy (see [9, 10]). To represent the ARMA(p, q) model as a linear state-space system we follow [8] and take s = max(p, q + 1), l = 1 and
T_t = T =
( φ_1      1  0  ···  0 )
( φ_2      0  1  ···  0 )
(  ⋮       ⋮  ⋮   ⋱   ⋮ )
( φ_{s−1}  0  0  ···  1 )
( φ_s      0  0  ···  0 ),

where φ_i = 0 for i = p + 1, ..., s, and Z_t = Z = (1, 0, ..., 0), G_t = G = 0, H_t = H = (1, θ_1, ..., θ_q, 0, ..., 0)′. In general, observations y_t are converted into innovations v_t. Let us assign α_1 = (y_0, 0, ..., 0)^T, where y_0 is the last observation from the training set, and P_1 = I, the identity matrix. At each step t = 1, ..., n the following five steps are performed to update the model:

1. v_t = y_t − Z_t α_t
2. F_t = Z_t P_t Z_t′ + G_t G_t′
3. K_t = (T_t P_t Z_t′ + H_t G_t′) F_t^{−1}
4. α_{t+1} = T_t α_t + K_t v_t
5. P_{t+1} = T_t P_t (T_t − K_t Z_t)′ + H_t (H_t − K_t G_t)′
Intuitively, α_t is a state estimate, F_t is the innovations' covariance, K_t is the optimal Kalman gain and P_t is a covariance estimate. In our particular case we can formulate the aforementioned sequence of steps in the form of an algorithm (see Algorithm 3), since the matrices Z, G, T and H are constant in time and G = 0:

Algorithm 3. Kalman Filter for an ARMA process
  P_1 = I
  T, Z and H are as given above
  for t = 1, 2, ... do
    v_t = y_t − Z α_t        {These 5 steps update the model}
    F_t = Z P_t Z′
    K_t = (T P_t Z′) F_t^{−1}
    α_{t+1} = T α_t + K_t v_t
    P_{t+1} = T P_t (T − K_t Z)′ + H H′
    ŷ_{t+1} = Z α_{t+1}      {Give a prediction}
  end for
Algorithm 3 outputs predictions for future values of yt . It is interesting to note that the number of steps required to calculate predictions at each round of the algorithm does not grow with the number of observations being processed, as opposed to the number of steps required by Algorithm 2. Experiments will be carried out to compare empirically the computational complexity of the two algorithms on publicly available datasets.
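Algorithm 3 can be rendered in pure Python as follows. This is our own sketch (list-based matrices; s = max(p, q + 1), α_1 = (y_0, 0, ..., 0)^T and P_1 = I as in the text), not the authors' Matlab implementation:

```python
def kalman_arma_predict(y, phi, theta, y0=0.0):
    """One-step-ahead predictions for an ARMA(p, q) process via Algorithm 3.
    Returns the list of predictions yhat_{t+1}, one per observation in y."""
    p, q = len(phi), len(theta)
    s = max(p, q + 1)
    f = list(phi) + [0.0] * (s - p)            # phi_i = 0 for i > p
    H = [1.0] + list(theta) + [0.0] * (s - q - 1)
    # T has (phi_1 .. phi_s) in the first column, shifted identity elsewhere
    T = [[0.0] * s for _ in range(s)]
    for i in range(s):
        T[i][0] = f[i]
        if i + 1 < s:
            T[i][i + 1] = 1.0
    alpha = [y0] + [0.0] * (s - 1)             # alpha_1 = (y_0, 0, ..., 0)^T
    P = [[1.0 if i == j else 0.0 for j in range(s)] for i in range(s)]
    preds = []
    for obs in y:
        v = obs - alpha[0]                     # v_t = y_t - Z alpha_t
        F = P[0][0]                            # F_t = Z P_t Z'
        K = [sum(T[i][k] * P[k][0] for k in range(s)) / F for i in range(s)]
        alpha = [sum(T[i][k] * alpha[k] for k in range(s)) + K[i] * v
                 for i in range(s)]            # alpha_{t+1} = T alpha + K v
        A = [[T[i][k] - (K[i] if k == 0 else 0.0) for k in range(s)]
             for i in range(s)]                # A = T - K Z
        TP = [[sum(T[i][k] * P[k][m] for k in range(s)) for m in range(s)]
              for i in range(s)]
        P = [[sum(TP[i][m] * A[j][m] for m in range(s)) + H[i] * H[j]
              for j in range(s)] for i in range(s)]   # P = T P A' + H H'
        preds.append(alpha[0])                 # yhat_{t+1} = Z alpha_{t+1}
    return preds
```

For a pure AR(1) model with phi_1 = 0.5 the state collapses to a scalar and each prediction reduces to 0.5 times the latest observation, as expected.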
4 Aggregating Algorithm

To predict future observations of a time series, one typically has to select a model. If it is possible to make the assumption that FARIMA models, which can efficiently represent
time series data which exhibit both long-range and short-range dependence, can fit the data well, it is essential to choose three parameters p, q and d. It is still an open question on how to choose the best set of parameters given the data. In [4] it was proposed to use several models with different sets of parameters and to mix their outcomes according to previous respective performances in order to avoid making the choice between different sets of parameters. The aim here is not to find the best model predicting time series data, but to find an efficient way to use several models so that we can have a guaranteed high performance. The idea of using several prediction algorithms in order to achieve high performance has been studied extensively and is called the problem of Prediction with Expert Advice. The problem can be described in the form of a game. In the game, there are experts which give individual predictions of the possible outcome and it is unknown which expert can perform well in advance. At each round of the game, experts give their predictions based on the past observations and we can use the experts’ predictions in order to make our prediction. Once our prediction is made, the outcome of the event is known. As the result, the losses suffered by each expert and our algorithm are calculated. Our aim is to find a strategy which can guarantee that the total loss (i.e. the cumulative loss after all rounds of the game) is not much worse than the total loss of the best expert during the game. It is useful to describe the game of prediction with expert advice more formally. Consider a game between three players, Nature, Experts, and Learner. Let Σ be a set called signal space, Ω be a set called sample space, Γ be a set called the decision space, and Θ be a measurable space called the parameter space. A signal is an element of the signal space, an outcome is an element of the sample space and a decision is an element of the decision space. 
The function that measures the performance of Experts and Learner is called the loss function, λ : Ω × Γ → [0, ∞], and is also part of the game. In the perfect-information game, at each round T = 1, 2, ...:

1. Nature outputs a signal x_T ∈ Σ.
2. Experts make measurable predictions ξ_T : Θ → Γ, where ξ_T(θ) is the prediction corresponding to the parameter θ ∈ Θ.
3. Learner makes his own prediction γ_T ∈ Γ.
4. Nature chooses an outcome ω_T ∈ Ω.
5. Experts' losses are calculated: L_T(θ) = Σ_{t=1}^{T} λ(ω_t, ξ_t(θ)).
6. Learner's loss is calculated: L_T = Σ_{t=1}^{T} λ(ω_t, γ_t).

The aim of Learner is to minimize the regret term, i.e. the difference between the total loss of the algorithm and the total loss of the best expert. Many strategies have been developed for Learner to make his prediction γ_T, such as following the perturbed leader, gradient descent and the weighted majority algorithm. In this paper we consider the case Ω ⊂ R, Γ ⊂ R and apply one of the methods solving the problem, namely the Aggregating Algorithm (AA). This algorithm's main advantage consists in a theorem (see [18]) stating that its performance cannot be improved, i.e. it has guaranteed optimal performance. When the experts are fixed, the only step of the game of prediction with expert advice which remains unclear is step 3. At each step T the Aggregating Algorithm re-computes the weights of the experts (represented by their probability distribution) and mixes all
experts' predictions according to this distribution. Thus the AA gets a mixture of the experts' predictions (a generalized prediction), and then it finds the best (in some sense) prediction from Θ. There are two parameters in the AA: the learning rate η > 0 (β is defined to be e^{−η}) and P_0, a probability distribution on the set Θ of experts. Intuitively, P_0 is the prior distribution, which specifies the initial weights assigned to the experts. More formally, at each step T the algorithm performs the following actions to make the prediction at step 3 of the above game:

3.1 Updating the experts' weights according to the previous losses:

P_{T−1}(dθ) = β^{λ(ω_{T−1}, ξ_{T−1}(θ))} P_{T−2}(dθ),  θ ∈ Θ.

3.2 Mixing the predictions according to their weights:

g_T(ω) = log_β ∫_Θ β^{λ(ω, ξ_T(θ))} P*_{T−1}(dθ),   (3)
where P*_{T−1}(dθ) is the normalised probability distribution, i.e. P*_{T−1}(dθ) = P_{T−1}(dθ) / P_{T−1}(Θ); if P_{T−1}(Θ) = 0, the AA is allowed to choose any prediction. In this step we get g_T(ω), which is a function Ω → R.

3.3 Looking for an appropriate prediction from Γ: γ_T(g_T).

An important equation which describes how to apply the AA to our problem was proved in [18] (Lemma 2). It states that in the case of the square-loss function, i.e. λ(ω, γ) = (ω − γ)², with η ≤ 1/(2Y²) (where Y is a bound for |y_t|),

γ_T = (g_T(−Y) − g_T(Y)) / (4Y)   (4)

is a prediction from Γ. Formula (4) gives us the rule for making a prediction at step T. In this paper we use FARIMA models as experts, and ξ_T(θ) corresponds to the prediction at step T given by a specific FARIMA model denoted by θ. As we use only a finite number of experts, the integration in formula (3) becomes summation. Preprocessing of the datasets can ensure |Y| = 1. In this paper we consider square loss, which is a standard measure of the performance of prediction algorithms. Since the number of experts is finite in our case, we can use the fact ([19]) that for each round T of the game and each expert θ the cumulative loss of the AA, L_T(AA), is bounded as L_T(AA) ≤ L_T(θ) + ln(n), where n is the total number of experts.
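For a finite set of experts and the square loss, the AA with the substitution rule (4) can be sketched as below. This is an illustrative implementation of ours (log-space weights are an implementation choice), assuming |y_t| ≤ Y:

```python
import math

def aggregating_algorithm(expert_preds, outcomes, Y=1.0):
    """Aggregating Algorithm for the square loss with finitely many experts.
    expert_preds[t][i] is expert i's prediction at step t, outcomes[t] is the
    revealed outcome.  Uses eta = 1/(2Y^2), beta = e^{-eta} and the
    substitution rule gamma_T = (g_T(-Y) - g_T(Y)) / (4Y)."""
    eta = 1.0 / (2.0 * Y * Y)
    n = len(expert_preds[0])
    log_w = [0.0] * n                      # log-weights; uniform prior P0
    gammas = []
    for preds, omega in zip(expert_preds, outcomes):
        def log_sum(om):
            # log of sum_i w_i * beta^{(om - xi_i)^2}, computed stably
            vals = [log_w[i] - eta * (om - preds[i]) ** 2 for i in range(n)]
            m = max(vals)
            return m + math.log(sum(math.exp(v - m) for v in vals))
        # g(om) = -log_sum(om)/eta up to an additive constant that cancels
        gammas.append((log_sum(Y) - log_sum(-Y)) / (4.0 * Y * eta))
        for i in range(n):                 # weight update: w_i *= beta^loss
            log_w[i] -= eta * (omega - preds[i]) ** 2
    return gammas
```

With a single expert the mixture degenerates and γ_T equals that expert's prediction, a useful sanity check on the substitution rule.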
5 Experiments and Results

In this section we use FARIMA models with different parameters and Least-Squares Regression to make predictions on a time series, and apply the Aggregating Algorithm in various settings to several datasets. Our aim is to find an efficient and fast algorithm with high performance and to compare it with existing methods. All the algorithms were implemented in Matlab and the experiments were run on a laptop with an Intel Core Duo 2.40GHz and 4.0 GB RAM. We ran our experiments on publicly available datasets related to network traffic. Dataset Z contains an hour's worth of all wide-area traffic between the Lawrence Berkeley Laboratory and the rest of the world and was made publicly available by Vern Paxson (see [15]). Dataset M was collected at Bellcore in August of 1989 and contains one "normal" hour's traffic demand (see [12]). We transformed both datasets in such a way that each observation is the number of packets in the corresponding network during a fixed time interval, one second. The datasets Z and M are then each divided into two sub-datasets of equal size: Z1 (d = 0.34) and Z2 (d = 0.24), and M1 (d = 0.32) and M2 (d = 0.38), respectively. Dataset DEC-PKT (d = 0.33) contains an hour's worth of all wide-area traffic between Digital Equipment Corporation and the rest of the world. Dataset LBL-PKT-4 (d = 0.29) consists of observations of another hour of traffic between the Lawrence Berkeley Laboratory and the rest of the world. On each of these datasets the following operations were performed: subtraction of the mean value and division by the maximum absolute value.

Table 1. Running times (in seconds) for the two algorithms

Alg.  stats  M1       M2       Z1       Z2       DEC-PKT  LBL-PKT-4
KF    mean   1.37     1.37     1.38     1.37     1.39     1.36
KF    std    0.05     0.03     0.03     0.03     0.04     0.02
IA    mean   1156.68  1181.57  1136.84  1137.30  1140.42  1136.08
IA    std    19.81    37.53    8.69     1.71     28.71    4.35
Table 1 shows the measured running times for all 15 FARIMA models (with d estimated using Dispersional Analysis and 0 ≤ p, q ≤ 3, p² + q² ≠ 0) based on the Kalman Filter (KF) and the Innovations Algorithm (IA) as the underlying prediction algorithm for the ARMA model. The times are given in seconds. We repeated our experiments on 5 sub-datasets of 300 examples (200 training examples and 100 test examples) drawn from each original dataset, and then averaged the time over the sub-datasets. The performance of the two algorithms was quite similar in all experiments. The mean value of the relative difference between the error achieved by the best model (according to the cumulative errors) of the Innovations Algorithm and the best model of the Kalman Filter is 3.92%, the median is 0.20% and the standard deviation is 0.10. We also applied the runs test (see [11]) to check that the Kalman Filter and the Innovations Algorithm are not performance-comparable (i.e. that we cannot predict which algorithm will have the lower error rate for its respective best model). To apply the test we computed a series of 1s and 0s, where 1 corresponded to the case where the best model output by the Kalman Filter performed better and 0 to the case where the best model output by the Innovations Algorithm performed better. We then checked the hypothesis that the series is random. With 99% confidence, neither of the two algorithms outperforms the other. Therefore, since it works much faster, we will use the Kalman Filter in our further experiments.
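The randomness check described above can be sketched as follows (an assumed implementation of the standard Wald-Wolfowitz runs test with the normal approximation; the paper cites [11] for the test, and the names here are illustrative):

```python
import math

def runs_test_z(bits):
    """Wald-Wolfowitz runs test, normal approximation.

    bits: 0/1 series, e.g. 1 when the Kalman Filter's best model beat
    the Innovations Algorithm's best model on a sub-dataset, 0 otherwise.
    Returns the z-statistic; |z| < 2.58 means randomness is not rejected
    at the 99% confidence level.
    """
    n1, n2 = bits.count(1), bits.count(0)
    n = n1 + n2
    # a "run" is a maximal block of equal symbols
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mu = 2.0 * n1 * n2 / n + 1
    var = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - mu) / math.sqrt(var)
```

A strictly alternating series (too many runs) or a fully blocked series (too few runs) both produce |z| > 2.58 and are rejected as non-random.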
M. Dashevskiy and Z. Luo

Table 2. Cumulative square errors for different algorithms and datasets

Algorithm          M1      M2      Z1      Z2    DEC-PKT  LBL-PKT-4
FARIMA(0,1)        1.50   26.41    6.28    0.13    0.45     1.51
FARIMA(0,2)        1.21   26.32    6.23    0.13    0.45     1.53
FARIMA(0,3)        1.21   26.40    6.12    0.15    0.47     1.52
FARIMA(1,0)        1.16   26.46    6.21    0.14    0.45     1.50
FARIMA(1,1)        1.20   26.18    6.20    0.14    0.44     1.53
FARIMA(1,2)        1.21   26.51    7.18    4.24    1.39     8.38
FARIMA(1,3)        0.94   44.50    4.97    0.16    0.44     1.51
FARIMA(2,0)        1.21   26.30    6.17    0.14    0.45     1.52
FARIMA(2,1)       36.59   55.12    4.95    0.28    1.54     1.62
FARIMA(2,2)        1.60   46.97    4.76   21.60   68.72    59.13
FARIMA(2,3)       22.47   98.80   54.65    0.17   88.66     1.51
FARIMA(3,0)        1.21   26.30    6.05    0.14    0.46     1.49
FARIMA(3,1)       97.85   35.45   91.51    0.21   92.74     7.90
FARIMA(3,2)      100.39   33.07  105.17    0.19   90.52    53.11
FARIMA(3,3)      101.84   33.52  104.73    0.20   99.94    29.57
LSR                2.16    3.51    4.56    0.25    1.62     1.46
AA-15              1.14   22.74    5.46    0.13    0.41     1.47
AA-16              1.06    4.63    5.05    0.11    0.30     1.37
FARIMA-AIC         1.16   26.46    6.21    0.16    0.45     1.50
AA-AIC-5           1.24   25.57    6.16    0.15    0.43     1.56
Th. bound: AA-5    2.76   27.79    7.65    1.75    2.05     3.10
Th. bound: AA-15   3.65   28.89    7.47    2.84    3.15     4.20
Th. bound: AA-16   3.72    6.28    7.33    2.91    3.22     4.23
Table 2 summarises the cumulative square errors for the different algorithms. Firstly, the FARIMA model with different sets of parameters (p, q) and Least-Squares Regression (LSR) are considered. Secondly, the results for the Aggregating Algorithm with 15 FARIMA models as its experts (AA-15) and the Aggregating Algorithm with 15 FARIMA models and the LSR algorithm as its experts (AA-16) are given. Then the FARIMA model selected by the Akaike Information Criterion (FARIMA-AIC) and the Aggregating Algorithm with the 5 FARIMA models with the lowest AIC values as its experts (AA-AIC-5) are considered. In addition, Table 2 gives the theoretical bounds on the cumulative square loss for AA-AIC-5 (AA-5), AA-15 and AA-16. For each dataset we used only the first 300 examples, where 200 examples were used for training and the remaining 100 as test examples. From Table 2 it can be seen that the best FARIMA model differs from dataset to dataset (i.e. we cannot choose one set (p, q) such that the FARIMA model with these parameters is better than the other models on all datasets), even though the data is of a similar nature. In the case of dataset M2, all FARIMA models perform much worse than a simple regression method, Least-Squares Regression. The Aggregating Algorithm is a method which can avoid model selection and provide a guaranteed output. As can be seen, the theoretical bound for the AA holds, and adding a simple prediction algorithm as an expert improves the performance of the prediction system in most cases. It can also be seen that AA-15 and AA-16 perform not much worse than the best model (expert), even though the
differences between the total losses of the best experts and the worst experts are large. For example, the total loss of the best model (FARIMA(2,2)) for dataset Z1 is 4.76, while the worst model (FARIMA(3,2)) suffered a total loss of 105.17; AA-15 achieved 5.46. A traditional way of dealing with the problem of model selection is to use the Akaike Information Criterion (AIC). As can be seen from Table 2, AIC failed to identify the best FARIMA model in our experiments. A natural way of combining the AA and AIC is to use AIC to choose "promising" experts, i.e. experts that according to AIC should perform well. In our experiments we used AIC to determine the five most promising experts out of 15, but in most cases (5 out of 6) the results were worse than those obtained by the AA based on all experts. A conclusion which can be drawn from these results is that AIC might not be able to select the best possible model. In [4] it was shown that methods of Prediction with Expert Advice can be successfully used to mix FARIMA models in order to achieve a guaranteed performance. We have shown that adding a naive method can improve the performance even further and that the constructed prediction system outperforms traditional systems which employ AIC for model selection. Figure 1 shows the differences between the cumulative errors of 12 experts (the rest of the experts performed much worse and therefore their results are not included in the
Fig. 1. The differences between the cumulative errors of 12 best experts and the AA based on these experts (dataset Z1). The thick line shows the theoretical bound for this value.
graph) and the AA based on all 16 experts for dataset Z1. The AA guarantees that this difference is bounded by the logarithm of the number of experts (when the number of experts is finite). The thick line shows the theoretical bound for this value. As can be seen from Figure 1, the bound holds.
6 Conclusions

In this paper, we studied how to model and predict long-range dependent time series with FARIMA models and investigated various ways to improve prediction performance. We compared two widely used linear prediction algorithms, the Innovations Algorithm and the Kalman Filter, in terms of prediction performance and computational complexity. Empirical results have shown that the Kalman Filter can achieve good performance while maintaining high efficiency due to its lightweight mechanism for model update. We have shown that by using the Aggregating Algorithm we can guarantee the quality of predictions and outperform the model selected by the Akaike Information Criterion. The experiments have also shown that by adding a naive method such as Least-Squares Regression we can improve the performance of the prediction system and deal with the cases where the FARIMA model does not perform well. The problems considered in this paper suggest several directions for further research. Since we ran the experiments on network traffic demand time series, we need to validate the results for other types of applications. We also plan to improve the overall performance of the prediction system by assigning different weights to the experts before running the algorithm. This can be done, for example, by using different information criteria, such as AIC. Another approach is to look at how model selection for other models, such as the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model, can be avoided by aggregating several models with different parameters. It may turn out that this approach achieves better performance than the methods currently used.
Acknowledgements

This work has been supported by the EPSRC through grant EP/E000053/1 "Machine Learning for Resource Management in Next-Generation Optical Networks" and by the Royal Society International Joint Project "Explosives Trace Detection with an Odour Capture Hybrid Sensor System".
References

[1] Bisaglia, L.: Model selection for long-memory models. Quaderni di Statistica 4 (2002)
[2] Blok, H.J.: On The Nature Of The Stock Market: Simulations And Experiments. PhD Thesis, University of British Columbia, Canada (2000)
[3] Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer, Heidelberg (1991)
[4] Dashevskiy, M., Luo, Z.: Guaranteed Network Traffic Demand Prediction Using FARIMA Models. In: Fyfe, C., Kim, D., Lee, S.-Y., Yin, H. (eds.) IDEAL 2008. LNCS, vol. 5326, pp. 274–281. Springer, Heidelberg (2008)
[5] Dethe, C.G., Wakde, D.G.: On the prediction of packet process in network traffic using FARIMA time-series model. J. of Indian Inst. of Science 84, 31–39 (2004)
[6] Granger, C.W.J., Joyeux, R.: An introduction to long-memory time series models and fractional differencing. J. of Time Series Analysis 1, 15–29 (1980)
[7] Hosking, J.R.M.: Fractional differencing. Biometrika 68, 165–176 (1981)
[8] de Jong, P., Penzer, J.: The ARMA model in state space form. Statistics & Probability Letters 70(1), 119–125 (2004)
[9] Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82, 34–45 (1960)
[10] Kalman, R.E., Bucy, R.S.: New results in linear filtering and prediction theory. Journal of Basic Engineering 83, 95–108 (1961)
[11] Laurencelle, L., Dupuis, F.-A.: Statistical Tables, Explained and Applied. World Scientific, Singapore (2002)
[12] Leland, W.E., Taqqu, M.S., Willinger, W., Wilson, D.V.: On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. on Networking 2, 1–15 (1994)
[13] de Lima, A.B., de Amazonas, J.R.A.: State-space Modeling of Long-Range Dependent Teletraffic. In: Mason, L.G., Drwiega, T., Yan, J. (eds.) ITC 2007. LNCS, vol. 4516, pp. 260–271. Springer, Heidelberg (2007)
[14] Palma, W.: Long-Memory Time Series: Theory and Methods. Wiley Series in Probability and Statistics (2007)
[15] Paxson, V.: Fast Approximation of Self-Similar Network Traffic. Technical report LBL-36750/UC-405 (1995)
[16] Shu, Y., Jin, Z., Zhang, L., Wang, L., Yang, O.W.W.: Traffic prediction using FARIMA models. In: IEEE International Conf. on Communication, vol. 2, pp. 891–895 (1999)
[17] Taqqu, M.S., Teverovsky, V., Willinger, W.: Estimators for long-range dependence: An empirical study. Fractals 3, 785–788 (1995)
[18] Vovk, V.: Competitive On-line Statistics. Int. Stat. Review 69, 213–248 (2001)
[19] Vovk, V.: Prediction with expert advice for the Brier game (2008), http://arxiv.org/abs/0710.0485
[20] Willinger, W., Paxson, V., Riedi, R.H., Taqqu, M.S.: Long-range dependence and data network traffic. In: Theory and Applications of Long-Range Dependence, pp. 373–407 (2003)
[21] Xue, F.: Modeling and Predicting Long-range Dependent Traffic with FARIMA Processes. In: Proc. of 1999 International Symposium on Communication (1999)
[22] Xue, F., Liu, J., Zhang, L., Yang, O.W.W.: Traffic Modelling Based on FARIMA Models. In: Proc. IEEE Canadian Conference on Electrical and Computer Eng. (1999)
Bipartite Graph Representation of Multiple Decision Table Classifiers

Kazuya Haraguchi¹, Seok-Hee Hong², and Hiroshi Nagamochi³

¹ Faculty of Science and Engineering, Ishinomaki Senshu University, Japan
[email protected]
² School of Information Technologies, University of Sydney, Australia
[email protected]
³ Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Japan
[email protected]
Abstract. In this paper, we consider the two-class classification problem, a significant issue in machine learning. From a given set of positive and negative samples, the problem asks to construct a classifier that predicts the classes of future samples with high accuracy. For this problem, we have studied a new visual classifier in our previous works, which is constructed as follows: We first create several decision tables and extract a bipartite graph structure (called an SE-graph) between the given set of samples and the set of created decision tables. We then draw the bipartite graph as a two-layered drawing by using an edge crossing minimization technique, and the resulting drawing acts as a visual classifier. We first describe our background and philosophy on such a visual classifier, and then consider improving its classification accuracy. We demonstrate the effectiveness of our methodology by computational studies using benchmark data sets, where the new classifier outperforms our older version, and is competitive even with such standard classifiers as C4.5 or LibSVM.
1 Introduction

1.1 Classification Problem
Our goal is to establish the theory of a new visual classifier for a mathematical learning problem called classification, which has been a significant research issue from classical statistics to modern research fields in learning theory and data analysis [1, 2]. For a positive integer m, let [m] = {1, 2, . . . , m}. A sample s is an n-dimensional vector with entries on n domains (or n attributes) D1, . . . , Dn, and s belongs to either the positive (+) class or the negative (−) class. We assume that Dj (∀j ∈ [n]) is a finite set of at least two elements. In classification, we are given a training set S = S+ ∪ S− of samples, where S+ and S− are the sets of samples s ∈ S belonging to the positive and negative classes, respectively. Table 1
This work is partially supported by Grant-in-Aid for Young Scientists (Start-up, 20800045) from Japan Society for the Promotion of Science (JSPS).
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 46–60, 2009. c Springer-Verlag Berlin Heidelberg 2009
Table 1. A training set S = S+ ∪ S− with S+ = {s1, s2, s3, s4} and S− = {s5, s6, s7} over three attributes with D1 = {yes, no}, D2 = {high, med}, and D3 = {high, med, low}

                    Att. 1 (headache)  Att. 2 (temperature)  Att. 3 (blood pressure)
S+ (malignant)  s1  yes                high                  high
                s2  yes                med                   med
                s3  yes                high                  high
                s4  no                 high                  med
S− (benign)     s5  no                 med                   high
                s6  yes                high                  med
                s7  no                 med                   low
shows an example of a training set S, where a sample corresponds to a patient with some disease, an attribute represents the result of an examination, and the class represents the diagnosis. A classifier is a function from the sample space S = D1 × · · · × Dn to the classes {+, −}. The aim of classification is to construct a classifier that predicts the class of a future sample s ∈ S (the sample space) with high accuracy, where s is possibly unseen, i.e., s ∉ S (the training set).

1.2 Background on Visual Classifier
Visualization plays an important role as an effective analysis tool for huge and complex data sets in many application domains such as financial markets, computer networks, biology and sociology. Graph drawing has been extensively studied over the last twenty years due to its popular application for visualization in VLSI layout, computer networks, software engineering, social networks and bioinformatics [3]. To draw graphs automatically and nicely, we need to give a mathematical definition of aesthetic criteria (e.g., the number of edge crossings, symmetry, or bends) for 2D and 3D drawings. In our companion paper [4, 5], we proposed a new mathematical measurement of occlusion for 2.5D drawing [6] of pairwise trees (where one dimension of the 2.5 dimensions is used for an essentially different purpose from the others), namely the sum of the numbers of edge crossings of the bipartite graphs in the projected images. We then presented algorithms for the problem of finding a 2.5D drawing that minimizes the occlusion. As a case study, we observed pairwise trees which represent samples and entries, supposing that the minimum occlusion drawing supports visual analysis of classification or clustering. Given a well-visualized image, one may expect that a hidden mathematical structure has been revealed. We are also inspired by the fact that algorithms for reducing edge crossings in graph drawings have recently been used in data analysis tasks such as the rank aggregation problem [7, 8]. We hypothesize that good visualization (e.g., visual objects with low visual complexity) itself can discover
essential or hidden structure of data without relying on data analysis techniques, which can lead to a novel learning technique. Based on our hypothesis, we designed a prototype visual classifier using visualization in our preliminary research [9, 10]. The classifier was constructed from a training set S of samples and a family P = {P1, . . . , PN} of N partitions of the sample space S (i.e., each Pj ∈ P is a set Pj = {Q1, . . . , QK} of K subspaces of S that satisfies Q1 ∪ · · · ∪ QK = S and Qk ∩ Qk′ = ∅ for k ≠ k′) as follows: We first represented the relationship between the samples of the training set S and the subspaces of P as a bipartite graph. We visualized the bipartite graph using a two-layered drawing [11], as shown in Fig. 1(a). Next, starting with the family P = {D1, . . . , Dn} of attribute domains, we tried to reduce the number of edge crossings by changing the drawing (i.e., the ordering of nodes on each side) based on local search, and to change the structure of the bipartite graph by replacing two partitions with their product as a new partition, until a termination criterion was satisfied (see Fig. 1(b)). The resulting drawing was used as a visual classifier in the following way: Given a future sample s ∈ S, we placed s on the sample side of the drawing so that the position of s was the average of the positions of its adjacent nodes on the entry side, and then judged s as positive (negative) if the nearest neighbor of s was a positive (negative) sample.
Fig. 1. The prototype visual classifier [9, 10] for MONKS-1 data set
Table 2. Decision tables T1 = (A1, ℓ1), T2 = (A2, ℓ2) and T3 = (A3, ℓ3) with attribute sets A1 = {1, 2}, A2 = {1, 3} and A3 = {3}

T1 = (A1, ℓ1)         T2 = (A2, ℓ2)         T3 = (A3, ℓ3)
v ∈ D1 × D2   ℓ1(v)   v ∈ D1 × D3   ℓ2(v)   v ∈ D3   ℓ3(v)
yes, high     +       yes, high     +       high     +
yes, med      +       yes, med      +       med      +
no, high      +       yes, low      +       low      −
no, med       −       no, high      −
                      no, med       +
                      no, low       −
Although we demonstrated the effectiveness of the preliminary version of our visual classifier by empirical studies, the mechanism or reason why such a classifier could work successfully was not analyzed. To clarify the reason, we recently investigated the relationship between the above bipartite graph representation of data and the majority vote classifier (MV) with multiple decision tables [12]. Among the classifiers developed so far, the decision table is simple and comprehensible. Table 2 shows three decision tables for the training set in Table 1. Formally, a decision table T = (A, ℓ) is defined by a subset A ⊆ [n] of the n attributes and a label function ℓ. The label function assigns either (+) or (−) to each entry of A, which corresponds to a row in the table. For a future sample, a decision table estimates its class as the label of the matched entry. For example, (no, high, high) is classified as (+) by T1, (−) by T2 and (+) by T3. Assume that a set T = {T1, . . . , TN} of N decision tables is given. We note that one can use T as a single classifier by applying an ensemble technique such as majority vote (MV). MV classifies a future sample as the majority class among the N outputs given by the N decision tables in T. For example, MV with T = {T1, T2, T3} in Table 2 classifies (no, high, high) as (+). In [12], we considered a generalization of MV (GMV), and found that a GMV classifier can be represented as a drawing of a bipartite graph called the sample-entry graph (SE-graph), where the number of edge crossings is to be minimized. We proposed a scheme to decide the parameters of GMV for a "fixed" decision table set T, and demonstrated its effectiveness by experiments using randomly generated T. However, it was left open how to obtain a "good" decision table set.

1.3 New Contribution
In this paper, following our recent work [12], we consider GMV with a given set T = {T1, . . . , TN} of decision tables and represent it as an SE-graph. In an SE-graph, one node set corresponds to the samples in S, the other to the rows (i.e., entries) of the decision tables in T, and a sample s and an entry v are joined by an edge if and only if s matches v. See Fig. 2(a) for an example of an SE-graph drawn as a two-layered drawing. The layout of an SE-graph
Fig. 2. SE-graph for GLASS data set
is defined by permutations on the samples and the entries. We perform edge crossing minimization on a drawing of the SE-graph, and we divide the sample nodes into positive and negative sides by a suitably chosen threshold θ, based on which a future sample will be judged as positive or negative (see the θ dividing the sample nodes in Fig. 2(b)). These are the major differences from our preliminary classifier [9, 10]. The new contribution of this paper is summarized as follows. (i) We assert that edge crossing minimization of the SE-graph "stretches" the training set to a total order, through observation on the layout. This opens a new direction for future research on visual classifiers. (ii) To address the open problem in [12], we present a new simple procedure that collects a subset of "good" decision tables from a given larger set. Also, we decide the parameters of GMV in a more detailed way. (iii) The performance of a classifier is usually evaluated by its error rate on a test set of samples. Our new version of the visual classifier (SEC-II) outperforms the
previous version (SEC-I) [12], where the improvement is 25% on average. Surprisingly, it is competitive with the decision tree constructed by C4.5 [13] and the support vector machine (SVM) constructed by the LibSVM package [14] of WEKA [15], and even outperforms them on some benchmark instances. The paper is organized as follows. In Sect. 2, we prepare terminologies, mainly on decision tables and the SE-graph. In Sect. 3, we discuss the meaning of edge crossing minimization on the SE-graph (for (i)) and describe the construction algorithm of our classifier, along with the idea of its improvement via crossing minimization (for (ii)). We then present computational results in Sect. 4 (for (iii)), followed by concluding remarks in Sect. 5.
2 Preliminaries

2.1 Decision Table
Let A = {j1, . . . , jq} ⊆ [n] denote a subset of the n attributes. Formally, we call an element v ∈ Dj1 × · · · × Djq of the Cartesian product of the domains an entry of A. For s ∈ S, let s|A = (sj1, . . . , sjq) denote the projection of s onto the attribute set A. For a decision table T = (A, ℓ), we define the label function ℓ by DTM (Decision Table Majority), which was introduced in [16]. For an entry v of A, let m+_A(v) (resp., m−_A(v)) denote the number of positive (resp., negative) samples matching v:

m+_A(v) = |{s+ ∈ S+ | s+|A = v}|,   m−_A(v) = |{s− ∈ S− | s−|A = v}|.   (1)

Then the value ℓ(v) ∈ {+, −} is defined according to the distribution of positive/negative samples matching v:

ℓ(v) = + if m+_A(v) > m−_A(v);  − if m+_A(v) < m−_A(v);  the majority class in S otherwise.   (2)

The empirical error rate of a classifier is its error rate on the training set. The empirical error rate of T = (A, ℓ) is thus computed by

(1/|S|) Σ_{v is an entry of A} min{m+_A(v), m−_A(v)}.   (3)

Recall the given set T = {T1, . . . , TN} of N decision tables. For any decision table Tk = (Ak, ℓk) (k ∈ [N]), let Ak = {j1, . . . , jq} ⊆ [n]. We denote the set of all entries of Ak by Dk = Dj1 × · · · × Djq, and the union of all Dk's by D = D1 ∪ · · · ∪ DN.
We define the generalized majority vote (GMV) with T [12]. GMV is defined by a tuple (c1, . . . , cN, θ) of N functions ck : Dk → [−1, 1] (k ∈ [N]) and a threshold θ ∈ [−1, 1]. The function ck assigns a real value in [−1, 1] to each entry of decision table Tk and is called a class level. Given a future sample s′ ∈ S, a GMV classifier first computes the barycenter b(s′) of s′ by

b(s′) = (1/N) Σ_{k=1}^{N} ck(s′|Ak),   (4)

and estimates the class of s′ as + if b(s′) > θ and as − otherwise. One can readily see that MV is a special case of GMV as follows: for each decision table Tk ∈ T and each of its entries v ∈ Dk, set ck(v) = 1 (resp., −1) if ℓk(v) = + (resp., −), and θ = 0.
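Equation (4) and the MV special case can be sketched as follows (illustrative Python; the decision tables encoded below are those of Table 2, and breaking the tie b(s′) = θ toward the negative class is our assumption):

```python
def gmv_classify(sample, tables, levels, theta=0.0):
    """Generalized majority vote, Eq. (4): average the class levels of
    the entries matched by `sample` and compare the barycenter with theta.

    tables: list of (attribute indices, label dict) decision tables
    levels: list of dicts mapping each entry to a class level in [-1, 1]
    """
    b = sum(lv[tuple(sample[j] for j in attrs)]
            for (attrs, _), lv in zip(tables, levels)) / len(tables)
    return ("+" if b > theta else "-"), b

def mv_levels(tables):
    """MV as the special case of GMV: class level +1/-1 and theta = 0."""
    return [{v: (1.0 if lab == "+" else -1.0) for v, lab in label.items()}
            for _, label in tables]

# Table 2's decision tables as (attribute indices, label dict) pairs
T1 = ((0, 1), {("yes", "high"): "+", ("yes", "med"): "+",
               ("no", "high"): "+", ("no", "med"): "-"})
T2 = ((0, 2), {("yes", "high"): "+", ("yes", "med"): "+",
               ("yes", "low"): "+", ("no", "high"): "-",
               ("no", "med"): "+", ("no", "low"): "-"})
T3 = ((2,), {("high",): "+", ("med",): "+", ("low",): "-"})
TABLES = [T1, T2, T3]
```

On the sample (no, high, high) the matched entries vote (+), (−), (+), so the barycenter is 1/3 and MV outputs (+), as in the text.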
2.2 Sample-Entry Graph (SE-Graph)
We define the sample-entry graph (SE-graph) [12], a bipartite graph G = (S, D, E) defined as follows:
– Each sample s ∈ S is called a sample node, and each entry v ∈ D = D1 ∪ · · · ∪ DN is called an entry node.
– For each k ∈ [N], a sample node s and an entry node v ∈ Dk are joined by an edge (s, v) ∈ S × Dk if and only if s matches the entry v, i.e., s|Ak = v.
Thus the edge set is given by E = E1 ∪ E2 ∪ · · · ∪ EN such that Ek = {(s, v) ∈ S × Dk | s|Ak = v}. For each k ∈ [N], let Gk = (S, Dk, Ek) denote the subgraph of G induced by S and Dk. We visualize the SE-graph in 3D space as shown in Fig. 3: We assume a cylinder standing on the x, y-plane. Suppose that its base has radius of unit length and that its height is sufficiently large. We then take N + 1 lines orthogonal to the x, y-plane so that one of them penetrates the center of the base and the others lie along the side of the cylinder. Finally, we place the training set S on the central line and the entry sets D1, . . . , DN on the N side lines, respectively. Formally, we define a layout of the SE-graph G = (S, D, E) by (σ, Π), where σ : S → [|S|] is an ordering on the samples in S, and Π = {π1, . . . , πN} is a set of orderings on the entries, πk : Dk → [|Dk|] (∀k ∈ [N]). We are not concerned here with how to assign the sets D1, . . . , DN to the N lines on the side of the cylinder; the assignment can be chosen arbitrarily. Observe that each subgraph Gk appears as a two-layered drawing on a 2D plane. We say that two edges (s, v), (s′, v′) ∈ Ek cross if and only if (σ(s) > σ(s′) and πk(v) < πk(v′)) or (σ(s) < σ(s′) and πk(v) > πk(v′)). We denote by χ(Gk; σ, πk) the number of edge crossings of Gk when the samples in S are ordered by σ and the entries in Dk are ordered by πk. We denote by χ(G; σ, Π)
Fig. 3. The SE-graph G = (S, D, E) for the training set S in Table 1 and the set T = {T1, T2, T3} of decision tables in Table 2
the number of edge crossings of G in the layout (σ, Π), defined as the sum of the edge crossings over all the N induced subgraphs:

χ(G; σ, Π) = Σ_{k=1}^{N} χ(Gk; σ, πk) = Σ_{k=1}^{N} |{(s, s′) ∈ S × S | σ(s) < σ(s′), πk(s|Ak) > πk(s′|Ak)}|.   (5)

For a given G = (S, D, E), we refer to the problem of finding a layout (σ, Π) that minimizes (5) as two-sided edge crossing minimization on the SE-graph (2CMSE). Furthermore, if Π is fixed, we call the problem one-sided edge crossing minimization on the SE-graph (1CMSE) for fixed Π. We can regard 2CMSE (and 1CMSE) as a special case of the crossing minimization problem on a two-layered bipartite graph in the 2D plane: We place the samples on one layer and the entries on the other layer, and for the latter, we restrict ourselves to the permutations where the entries from the same decision table are arranged consecutively. Under this restriction, one can easily see that the total number of edge crossings equals the sum of the edge crossings over all subgraphs G1, . . . , GN plus (|S| choose 2) · (N choose 2), where the latter term counts the crossings between edges from different edge sets Ek, Ek′ (k, k′ ∈ [N], k ≠ k′) and is a constant for given S and T. Therefore, edge crossing minimization on G drawn as a two-layered drawing in the 2D plane is equivalent to minimization of the sum of the edge crossings over all subgraphs G1, . . . , GN, i.e., χ(G; σ, Π).
The two-sided crossing minimization problem (2CM) on a two-layered bipartite graph is NP-hard [17], and this remains true even if the permutation of one node set is fixed, i.e., for the one-sided crossing minimization problem (1CM) [18]. However, several approximation algorithms are available for 1CM. For example, Eades and Wormald [18] proposed the median method, which produces a 3-approximate solution. The barycenter heuristic by Sugiyama et al. [11] is an O(√m)-approximation algorithm (where m denotes the number of nodes) [18]. Currently, the best known approximation algorithm is given by Nagamochi [19], which delivers a drawing with an approximation factor of 1.4664, based on a random key method.
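The barycenter heuristic mentioned above admits a very short sketch (illustrative Python; node names are ours): each free-layer node is placed at the average position of its neighbors on the fixed layer.

```python
def barycenter_order(free_adj, fixed_pos):
    """One-sided crossing minimization by the barycenter heuristic [11]:
    sort the free-layer nodes by the average position of their neighbors
    on the fixed layer (isolated nodes keep barycenter 0).

    free_adj:  dict free-layer node -> list of fixed-layer neighbors
    fixed_pos: dict fixed-layer node -> position
    Returns the free-layer nodes in their new order.
    """
    def bary(node):
        nbrs = free_adj[node]
        return sum(fixed_pos[v] for v in nbrs) / len(nbrs) if nbrs else 0.0
    return sorted(free_adj, key=bary)
```

In the classifier of Sect. 3, the free layer is the sample set and the fixed layer is the entry set, so the barycenter of a sample coincides with Eq. (4).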
2.3 Partially Ordered Set (Poset)
In the next section, we deal with a partial order on samples which is determined by a set of orderings Π = {π1, . . . , πN} on the entries. For any set P of elements, we call a binary relation ⪯ a partial order if it satisfies reflexivity, antisymmetry and transitivity, and we call (P, ⪯) a partially ordered set (poset). For any p, q ∈ P, we say that p and q are comparable if at least one of p ⪯ q and q ⪯ p holds, and that they are incomparable if neither p ⪯ q nor q ⪯ p holds. We call (P, ⪯) a chain (resp., an antichain) if any p, q ∈ P (p ≠ q) are comparable (resp., incomparable). For example, for a positive integer m, the poset ([m], ≤) is a chain, where [m] = {1, 2, . . . , m} is a set of integers and ≤ denotes the usual inequality on numbers.
3 Construction of Classifier via Crossing Minimization

3.1 Poset and Crossing Minimization on SE-Graph
As mentioned in the literature (e.g., [20, 21]), crossing minimization may improve the readability of the SE-graph. In this subsection, we describe our observations on crossing minimization of the SE-graph. Observe that a set Π = {π1, . . . , πN} of orderings on the entries defines a partial order on the samples as follows: For any two samples s, s′, we write s ⪯Π s′ if and only if πk(s|Ak) ≤ πk(s′|Ak) for all k. The binary relation ⪯Π is clearly a partial order on the samples. Whether the poset (S, ⪯Π) is close to a chain or an antichain depends on the choice of Π. We prefer (S, ⪯Π) to be close to a chain rather than an antichain, since we can then expect meaningful analysis by making use of the comparability defined on the samples. In particular, it would be interesting if positive and negative samples were separated clearly in the chain. Such a chain may contain essential information for discriminating positive and negative samples. We then evaluate the poset (S, ⪯Π) by whether there exists a "good" bijection from (S, ⪯Π) to the chain ([|S|], ≤), i.e., an ordering σ : S → [|S|] on the samples. The goodness of σ should take into account how well σ preserves the comparability of the samples in the poset (S, ⪯Π). It should be penalized for a conflict such as:
σ(s) < σ(s′) but πk(s|Ak) > πk(s′|Ak). Then the following quantity can serve as an evaluator of the poset (S, ⪯Π):

min_σ Σ_{k=1}^{N} |{(s, s′) ∈ S × S | σ(s) < σ(s′), πk(s|Ak) > πk(s′|Ak)}| = min_σ χ(G; σ, Π).   (6)
Compare (5) and (6). The problem of computing (6) is exactly 1CMSE for fixed Π. We then claim that 2CMSE finds the poset (S, ⪯Π) that is closest (in our sense) to a chain among all choices of Π.
3.2 Construction Algorithm of SE-Graph Classifier
Recall GMV (c1 , . . . , cN , θ) defined in Sect. 2.1. We now compute the layout (σ, Π) of SE-graph along with GMV. Consider the 1CMSE that asks to decide the optimum ordering σ on the samples, where the orderings π1 , . . . , πN on the entries are fixed as a nondecreasing order of class level. To be more precise, by denoting Dk = {v1 , v2 , . . . , vK } (∀k ∈ [N ]) and ck (v1 ) ≤ ck (v2 ) ≤ · · · ≤ ck (vK ), we assign πk (v1 ) = 1, πk (v2 ) = 2, · · · , πk (vK ) = K. To solve the problem, let us adopt barycenter heuristic for 1CM [11] here. Barycenter heuristic permutes the samples in S in a nondecreasing order of barycenter (see (4)). Computing its layout by barycenter heuristic, we assert that the SE-graph appears as a good visualization of GMV in the following sense: (i) The number of edge crossings is (approximately) minimized. Indeed, barycenter heuristic has been recognized as an effective approximation algorithm in practice; e.g., [22]. (ii) The resulting drawing can enable some meaningful analysis on GMV. For example, the computed string of samples (which we call the sample string) is split into two substrings according to whether barycenter is larger than θ or not. The samples in the former (resp., latter) substring are estimated as positive (resp., negative) by GMV. We call a GMV visualized as above an SE-graph based classifier (SEC ). Construction of SEC consists of four steps, and we describe their details and computational complexity as follows. Step 1 : Generation of Decision Table Set. In our previous work [12], we did not consider how to generate T , and conducted the experiments for randomly generated T . In this paper, we consider collecting “good” decision tables from a given larger set of decision tables. More specifically, we decide T = {T1 , . . . , TN } as follows: We first generate N ≥ N decision tables by choosing the attribute sets at random. 
From the generated N′ decision tables, we take T as the set of the N decision tables having the smallest empirical error rates (see (3)).
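As an illustration, the selection in Step 1 might be sketched as follows; the sample representation (attribute dictionaries with ±1 labels) and the function names are our own assumptions, not taken from the paper:

```python
import random
from collections import Counter, defaultdict

def build_table(samples, attrs):
    """Build a decision table on attribute set `attrs`: project every sample
    onto `attrs` and label each distinct entry by its majority class.
    `samples` is a list of (attribute_dict, label) pairs, label in {+1, -1}."""
    counts = defaultdict(Counter)
    for x, y in samples:
        counts[tuple(x[a] for a in attrs)][y] += 1
    labels = {e: (1 if c[1] >= c[-1] else -1) for e, c in counts.items()}
    return attrs, labels

def empirical_error(table, samples):
    """Fraction of samples whose entry label disagrees with their class."""
    attrs, labels = table
    wrong = sum(1 for x, y in samples
                if labels.get(tuple(x[a] for a in attrs), 1) != y)
    return wrong / len(samples)

def select_tables(samples, attributes, n_generate, n_keep, max_attrs=4, seed=0):
    """Generate n_generate tables on random attribute sets (|A_k| <= max_attrs),
    then keep the n_keep tables with the smallest empirical error rate."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_generate):
        k = rng.randint(1, min(max_attrs, len(attributes)))
        tables.append(build_table(samples, rng.sample(attributes, k)))
    tables.sort(key=lambda t: empirical_error(t, samples))
    return tables[:n_keep]
```

Sorting the generated tables by empirical error and truncating mirrors the N′-then-N selection described above.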
K. Haraguchi, S.-H. Hong, and H. Nagamochi
To generate a decision table T_k = (A_k, ℓ_k), it takes O(n) time to choose an attribute set A_k and O(|D_k| + n|S|) time to compute m^+_{A_k}(v) and m^-_{A_k}(v) (see (1)) for all entries v, from which the label ℓ_k(v) is immediate by (2). Then O(∑_{k=1}^{N′} |D_k| + nN′|S|) time is required to obtain the N′ decision tables. Further, it takes O(N′ log N′) time to select the N decision tables having the smallest empirical error rates, which is dominated by O(∑_{k=1}^{N′} |D_k| + nN′|S|).

Step 2: Determination of Class Levels. We prefer a sample string in which positive and negative samples are separated well, rather than a patchy string. Recall that the barycenter of a sample is the average of the class levels of the matched entries. Hence we should make the class level c_k(v) ∈ [−1, 1] large (resp., small) if the positive (resp., negative) class is dominant in the entry v. In this paper, we adopt the following definition of c_k(v), which was proposed by Stanfill and Waltz [23]:

  c_k(v) = 0                                                        if m^+_{A_k}(v) = m^-_{A_k}(v) = 0,
  c_k(v) = 2 m^+_{A_k}(v) / (m^+_{A_k}(v) + m^-_{A_k}(v)) − 1       otherwise.                        (7)
It requires O(∑_{k=1}^{N} |D_k|) time to compute all class levels.

In (7), if the denominator is fixed, c_k(v) is linear with respect to m^+_{A_k}(v). In our previous work [12], we defined c_k(v) based on a statistical test, so that c_k(v) behaves like a threshold function or a sigmoid function rather than a linear function. For example, for two entries v, v′ ∈ D_k, suppose that (m^+_{A_k}(v), m^-_{A_k}(v)) = (7, 3) and (m^+_{A_k}(v′), m^-_{A_k}(v′)) = (10, 0). Then definition (7) gives 0.4 to v and 1 to v′, while the definition of [12] gives almost 1 to both entries. Definition (7) thus discriminates the entries in more detail.

Step 3: Crossing Minimization. Since we adopt the barycenter heuristic to solve 1CMSE, we need to compute the barycenters of the samples, which requires O(nN|S|) time.

Step 4: Determination of Threshold. We choose the threshold θ ∈ [−1, 1] so that the empirical error rate of the resulting SEC is minimized. Thus it is given by the following equality (introduced in [12]), and can be computed in O(|S|) time once the samples have been sorted by barycenter (which requires O(|S| log |S|) time):

  θ = arg min_{θ′∈[−1,1]} (1/|S|) ( |{s^+ ∈ S^+ : b(s^+) ≤ θ′}| + |{s^− ∈ S^− : b(s^−) > θ′}| ).   (8)
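Steps 2 and 4 can be sketched as a direct transcription of (7) and (8); the data layout below (lists of match counts, barycenters, and ±1 labels) is a hypothetical encoding of ours:

```python
def class_level(m_plus, m_minus):
    # Eq. (7): 0 for unmatched entries, otherwise a linear score in [-1, 1].
    if m_plus == 0 and m_minus == 0:
        return 0.0
    return 2.0 * m_plus / (m_plus + m_minus) - 1.0

def barycenter(matched_levels):
    # Average class level of the entries matched by one sample (see (4)).
    return sum(matched_levels) / len(matched_levels)

def best_threshold(barycenters, labels):
    """Eq. (8): minimize the empirical error over thresholds in [-1, 1].
    A sample is classified positive iff its barycenter exceeds theta, so
    positives with b <= theta and negatives with b > theta count as errors.
    It suffices to test theta = -1 and the sorted barycenters themselves."""
    pairs = sorted(zip(barycenters, labels))
    candidates = [-1.0] + [b for b, _ in pairs]
    best_t, best_err = -1.0, float('inf')
    for t in candidates:
        err = sum(1 for b, y in pairs
                  if (y > 0 and b <= t) or (y < 0 and b > t))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err / len(pairs)
```

On the example above, `class_level(7, 3)` gives 0.4 and `class_level(10, 0)` gives 1, matching the discussion of (7).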
Computational Complexity. Overall, our SEC construction algorithm requires O(∑_{k=1}^{N′} |D_k| + nN′|S|) time, which is due to the generation of decision tables (Step 1). To use the SEC, i.e., to classify a future sample s′, we need O(nN) time to check the matched entries, compute its barycenter b(s′), and decide whether b(s′) ≤ θ or b(s′) > θ.
Bipartite Graph Representation of Multiple Decision Table Classifiers
4 Experimental Results
In this section, we compare a new version of SEC (denoted SEC-II) with other classifiers in terms of prediction error rate on future samples. For comparison, we take its former version (denoted SEC-I) [12], decision trees constructed by C4.5 [13], and support vector machines (SVM) constructed with the LibSVM package [14] of WEKA [15]. We use 13 benchmark data sets from the UCI Machine Learning Repository [24]. Among these, 7 data sets (named CHESS, MONKS-1, MONKS-2, MONKS-3, MUSHROOM, TICTACTOE and VOTING) are discrete, i.e., the domain of each attribute is a set of discrete nominals, and thus can be used in our formulation without any conversion. The other 6 data sets (named BCW, GLASS, HABERMAN, HEART, HEPATITIS and IONO) are not discrete; some of their attributes have numerical values. We transform each non-discrete data set into a binary one by the algorithm considered in our previous work [25]. For estimation of the prediction error rate, we construct a classifier from a training set and then evaluate its error rate on a test set. Among the 13 data sets, MONKS-1,2,3 and IONO have their own training and test sets, and thus we use them in the experiments. For the other data sets, we estimate the prediction error rate by 10-fold cross validation, a well-known experimental methodology for splitting a data set into training/test sets [26]. For SEC's, we take the average of error rates over 100 different T's, because SEC performance may depend heavily on the choice of T.

Figure 2 shows the construction of SEC-II for the GLASS data set. In Fig. 2 (b), we can see that the entries are sorted by class level for each decision table, and that the samples are sorted by barycenter. A future sample is classified according to whether it falls on the left side (−) or the right side (+) of the threshold θ. Table 3 shows the error rates of SEC-I/II, C4.5 and LibSVM.
For SEC's, we generate N′ = 100 decision tables at random, with the restriction that the cardinality of an attribute set is at most 4. Furthermore, for SEC-II, we select the N = 10 decision tables that achieve the smallest empirical error rates. The standard deviations arising from different T's are also shown. We set all parameters of C4.5 and LibSVM to their default values, using the RBF kernel for LibSVM. SEC-II outperforms SEC-I in almost all data sets, with an improvement of 25% on average; SEC-II wins over SEC-I in 8 out of 13 data sets by a margin exceeding the standard deviation. SEC-II is also competitive with C4.5 and LibSVM. Remarkably, SEC-II wins over C4.5 (resp., LibSVM) in MONKS-1 and BCW (resp., MONKS-1 and HABERMAN) by a margin exceeding the standard deviation, and in 7 (resp., 6) data sets, the error rate of C4.5 (resp., LibSVM) is still within one standard deviation of the SEC-II average.

In the field of machine learning, it is broadly recognized that processing attributes (e.g., pruning or weighting attributes) plays a significant role in constructing effective classifiers [1, 16, 13]. Our SEC-II constructs decision tables on randomly selected attribute sets and collects some of them simply by
Table 3. Error rates (%) of SEC-II, in comparison with SEC-I [12], C4.5 and LibSVM

Data        Samples   SEC-II          SEC-I           C4.5    LibSVM
CHESS       3196      14.79 ± 4.39    21.06 ± 3.04     0.41     5.88
MONKS-1      124       7.14 ± 6.48    21.25 ± 2.42    24.30    18.28
MONKS-2      169      34.08 ± 1.76    27.02 ± 1.72    35.00    32.40
MONKS-3      122       4.75 ± 1.40     9.55 ± 1.80     2.80     3.70
MUSHROOM    8124       0.08 ± 0.12     1.51 ± 0.68     0.00     0.00
TICTACTOE    958      17.64 ± 1.99    22.56 ± 1.34    15.46    12.52
VOTING       435       5.41 ± 0.74    10.46 ± 1.08     4.34     4.36
BCW          699       3.83 ± 0.39     3.74 ± 0.24     5.15     3.57
GLASS        214       6.17 ± 0.56     7.48 ± 0.44     6.51     6.49
HABERMAN     306      25.82 ± 0.89    28.06 ± 0.77    26.15    27.11
HEART        270      18.00 ± 1.63    17.65 ± 1.04    18.13    15.55
HEPATITIS    155      21.68 ± 1.35    22.63 ± 1.27    20.61    20.62
IONO         200       7.48 ± 3.93     9.46 ± 1.34     4.59     6.62
empirical error rate, while C4.5 and LibSVM use smarter and more sophisticated strategies for processing attributes. Based on this, we consider that SEC is an effective methodology that can be improved further by detailed inspection.

Let us discuss the computation time. For the MUSHROOM data set, the largest one, our implementation takes about 20 seconds to construct an SEC-I/II. (Our PC carries a 2.4 GHz CPU.) Most of the computation time is devoted to generating decision tables, which agrees with the analysis in Sect. 3.2. We note that our preliminary classifier [9, 10] takes 120 seconds even for the IONO data set, a much smaller one, which indicates that our methodology makes progress in scalability. Among the other classifiers, C4.5 takes less than 1 second and LibSVM takes 5 to 10 seconds. It is left as future work to develop a method to generate a better and smaller set of decision tables.
5 Concluding Remarks
In this paper, we have described our background and philosophy on a new visual classifier. We have considered a classifier named SEC, a visualization of GMV, and its improvement over [12]. From the viewpoint of machine learning, GMV(c_1, ..., c_N, θ) is regarded as a hyperplane classifier that transforms a discrete sample s ∈ S into a real vector x ∈ [−1, 1]^N, and that classifies s according to which side of the hyperplane x_1 + ··· + x_N = θ the vector x lies on. This type of classifier has already been studied (e.g., [27]). However, it is our achievement to interpret GMV as a bipartite graph (the SE-graph), showing a new application of crossing minimization methods from graph drawing.

For future work, we should make use of the structure of the sample poset, which was discussed in Sect. 3.1. In machine learning, many previous methodologies convert unordered discrete samples into numerical ones, while we embed
them in a poset. We conjecture that our idea of a sample poset may give a new direction for processing discrete data sets. We also need to establish an algorithm to generate a better and smaller decision table set, to extend the formulation to multi-class data sets, etc.
References

[1] Friedman, J.H.: Recent advances in predictive (machine) learning. Journal of Classification 23, 175–197 (2006)
[2] Kettenring, J.R.: The practice of cluster analysis. Journal of Classification 23, 3–30 (2006)
[3] Battista, G.D., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, Englewood Cliffs (1999)
[4] Haraguchi, K., Hong, S., Nagamochi, H.: Visual analysis of hierarchical data using 2.5D drawing with minimum occlusion. In: IEEE PacificVis 2008 (2008) (poster session)
[5] Haraguchi, K., Hong, S., Nagamochi, H.: Visual analysis of hierarchical data using 2.5D drawing with minimum occlusion. Technical Report 2009-010, Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Japan (2009)
[6] Ware, C.: Designing with a 2 1/2D attitude. Information Design Journal 10(3), 171–182 (2001)
[7] Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the web. In: WWW10, pp. 613–622. ACM, New York (2001)
[8] Biedl, T.C., Brandenburg, F.J., Deng, X.: Crossings and permutations. In: Healy, P., Nikolov, N.S. (eds.) GD 2005. LNCS, vol. 3843, pp. 1–12. Springer, Heidelberg (2006)
[9] Haraguchi, K., Hong, S., Nagamochi, H.: Classification by ordering data samples. RIMS Kokyuroku 1644, 20–34 (2009)
[10] Haraguchi, K., Hong, S., Nagamochi, H.: Classification via visualization of sample-feature bipartite graphs. Technical Report 2009-011, Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Japan (2009)
[11] Sugiyama, K., Tagawa, S., Toda, M.: Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics SMC-11(2), 109–125 (1981)
[12] Haraguchi, K., Hong, S., Nagamochi, H.: Visualization can improve multiple decision table classifiers. In: Proc. of MDAI (to appear, 2009)
[13] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
[14] Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
[15] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005), http://www.cs.waikato.ac.nz/ml/weka/
[16] Kohavi, R.: The power of decision tables. In: Lavrač, N., Wrobel, S. (eds.) ECML 1995. LNCS, vol. 912, pp. 174–189. Springer, Heidelberg (1995)
[17] Garey, M.R., Johnson, D.S.: Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods 4, 312–316 (1983)
[18] Eades, P., Wormald, N.C.: Edge crossings in drawings of bipartite graphs. Algorithmica 11, 379–403 (1994)
[19] Nagamochi, H.: An improved bound on the one-sided minimum crossing number in two-layered drawings. Discrete and Computational Geometry 33(4), 569–591 (2005)
[20] Purchase, H.: Which aesthetic has the greatest effect on human understanding? In: DiBattista, G. (ed.) GD 1997. LNCS, vol. 1353, pp. 248–261. Springer, Heidelberg (1997)
[21] Huang, W., Hong, S., Eades, P.: Layout effects on sociogram perception. In: Healy, P., Nikolov, N.S. (eds.) GD 2005. LNCS, vol. 3843, pp. 262–273. Springer, Heidelberg (2006)
[22] Jünger, M., Mutzel, P.: 2-layer straightline crossing minimization: Performance of exact and heuristic algorithms. Journal of Graph Algorithms and Applications 1(1), 1–25 (1997)
[23] Stanfill, C., Waltz, D.L.: Toward memory-based reasoning. Commun. ACM 29(12), 1213–1228 (1986)
[24] Asuncion, A., Newman, D.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
[25] Haraguchi, K., Nagamochi, H.: Extension of ICF classifiers to real world data sets. In: Okuno, H.G., Ali, M. (eds.) IEA/AIE 2007. LNCS (LNAI), vol. 4570, pp. 776–785. Springer, Heidelberg (2007)
[26] Weiss, S.M., Kulikowski, C.A.: Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, San Francisco (1991)
[27] Grabczewski, K., Jankowski, N.: Transformations of symbolic data for continuous data oriented models. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 359–366. Springer, Heidelberg (2003)
Bounds for Multistage Stochastic Programs Using Supervised Learning Strategies

Boris Defourny, Damien Ernst, and Louis Wehenkel

University of Liège, Systems and Modeling, B28, B-4000 Liège, Belgium
[email protected]
http://www.montefiore.ulg.ac.be/~bdf/
Abstract. We propose a generic method for quickly obtaining good upper bounds on the minimal value of a multistage stochastic program. The method is based on the simulation of a feasible decision policy, synthesized by a strategy relying on any scenario tree approximation from stochastic programming and on supervised learning techniques from machine learning.
1 Context
Let Ω denote a measurable space equipped with a sigma algebra B of subsets of Ω, defined as follows. For t = 0, 1, ..., T−1, we let ξ_t be a random variable valued in a subset of a Euclidean space Ξ_t, and let B_t denote the sigma algebra generated by the collection of random variables ξ_{[0:t−1]} := {ξ_0, ..., ξ_{t−1}}, with B_0 = {∅, Ω} corresponding to the trivial sigma algebra. Then we set B_{T−1} = B. Note that B_0 ⊂ B_1 ⊂ ··· ⊂ B_{T−1} form a filtration; without loss of generality, we can assume that the inclusions are proper — that is, ξ_t cannot be reduced to a function of ξ_{[0:t−1]}.

Let π_t : Ξ → U_t denote a B_t-measurable mapping from the product space Ξ = Ξ_0 × ··· × Ξ_{T−1} to a Euclidean space U_t, and let Π_t denote the class of such mappings. Of course Π_0 is a class of real-valued constant functions. We equip the measurable space (Ω, B) with the probability measure P and consider the following optimization program, which is a multistage stochastic program put in abstract form:

  S:   min_{π∈Π} E{f(ξ, π(ξ))}   subject to π_t(ξ) ∈ U_t(ξ) almost surely.   (1)
Here ξ denotes the random vector [ξ_0 ... ξ_{T−1}] valued in Ξ, and π is the mapping from Ξ to the product space U = U_0 × ··· × U_{T−1} defined by π(ξ) = [π_0(ξ) ... π_{T−1}(ξ)] with π_t ∈ Π_t. We call such a π an implementable policy and let Π denote the class of implementable policies. The function f : Ξ × U → R ∪ {±∞} is B-measurable and such that f(ξ, ·) is convex for each ξ. It can be interpreted as a multi-period cost function. The sets U_t(ξ) are defined as B_t-measurable closed convex subsets of U_t. A set U_t(ξ) may implicitly depend on π_0(ξ), ..., π_{t−1}(ξ) viewed as B_t-measurable

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 61–73, 2009.
© Springer-Verlag Berlin Heidelberg 2009
random variables; in that context we let U_0 be nonempty, and U_t(ξ) be nonempty whenever π_τ(ξ) ∈ U_τ(ξ) for each τ < t — these conditions correspond to a relatively complete recourse assumption.

The interpretation of the measurability restrictions put on the class Π is that the sequence {u_t} := {π_t(ξ)} depends on information on ξ gradually revealed. Note the impact of these restrictions: should the mappings π_t be B-measurable, an optimal solution would simply be given by u = π(ξ) with u = [u_0 ... u_{T−1}] a real vector minimizing f(ξ, u) subject to the constraints u_t ∈ U_t(ξ) and optimized separately for each ξ.

A usual approximation for estimating the optimal value of programs of the form (1) consists in replacing the probability space (Ω, B, P) by a simpler one, so as to replace the expectation by a finite sum and make the optimization program tractable. One then ends up with a finite number of possible realizations ξ^{(k)} of ξ — called scenarios — and a probability measure P′ induced by probability masses p^{(k)} = P′(ξ = ξ^{(k)}) > 0 satisfying ∑_{k=1}^{n} p^{(k)} = 1. The program (1) then becomes S′:
  min_{π∈Π} ∑_{k=1}^{n} p^{(k)} f(ξ^{(k)}, π(ξ^{(k)}))   subject to π_t(ξ^{(k)}) ∈ U_t(ξ^{(k)}) ∀k.   (2)
It has a finite number of optimization variables u_t^{(k)} := π_t(ξ^{(k)}). Remark that the random variables ξ_t have actually been replaced by other random variables ξ′_t, and the sigma subalgebras B_t replaced by those B′_t generated by the collection ξ′_{[0:t−1]} = {ξ′_0, ..., ξ′_{t−1}}. In particular, B is now identified with B′_{T−1}, and the class Π_t is such that π_t is B′_t-measurable.

This paper is interested in the following question. We want to generalize a solution to (2) — that is, values of π_t(ξ) on a finite subset of points of Ξ — to a good candidate solution π̂ for (1), of course in such a way that π̂ ∈ Π and π̂_t(ξ) ∈ U_t(ξ) (implementability and feasibility). Moreover, we want π̂(ξ) to be easy to evaluate (low complexity), so that, considering a test sample TS of m independent and identically distributed (i.i.d.) new scenarios ξ^{(j)}, we can efficiently compute an unbiased estimator of E{f(ξ, π̂(ξ))} by
  R_{TS}(π̂) = (1/m) ∑_{j=1}^{m} f(ξ^{(j)}, π̂(ξ^{(j)})),   (3)
which provides, up to the standard error of (3), and given that π̂ is feasible but possibly suboptimal, an upper bound on the optimal value of (1).

The rest of the paper is organized as follows. Section 2 motivates the question. Section 3 describes properties of the data extracted from approximate programs (2). Section 4 casts the generalization problem within a standard supervised learning framework and discusses possible adaptations to the present context. Section 5 proposes learning algorithms. Section 6 points to related work and Section 7 concludes.
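The estimator (3) is a plain Monte Carlo average; a minimal sketch follows, where `simulate_scenario`, `policy`, and `cost` are placeholder callables of ours standing in for the scenario distribution, the feasible policy π̂, and the cost function f:

```python
import math
import random

def upper_bound_estimate(simulate_scenario, policy, cost, m, seed=0):
    """Monte Carlo estimate (3) of E{f(xi, pi_hat(xi))} over m i.i.d. test
    scenarios, together with its standard error. Since pi_hat is feasible
    but possibly suboptimal, the mean is (up to the standard error) an
    upper bound on the optimal value of (1)."""
    rng = random.Random(seed)
    costs = []
    for _ in range(m):
        xi = simulate_scenario(rng)
        costs.append(cost(xi, policy(xi)))
    mean = sum(costs) / m
    var = sum((c - mean) ** 2 for c in costs) / (m - 1)
    return mean, math.sqrt(var / m)
```

The returned standard error quantifies the "up to the standard error" caveat attached to the bound.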
2 Motivation
Coming back to the transition from (1) to (2), we observe that there exist, in the stochastic programming literature, algorithms A(Θ), with Θ any algorithm-dependent parameters, that take as input (Ω, B, P), possibly under the form of the conditional distributions of ξ_t given ξ_{[0:t−1]}, and return an approximation (Ω′, B′, P′) under the form of n scenario-probability pairs. These pairs are generally arranged as a scenario tree [1, 2, 3, 4, 5, 6, 7, 8, 9], from which a version of (2) can be formulated and solved. But this approach raises several concerns, in particular regarding good choices for the parameters Θ.

– It has been shown in [10], for a particular randomized algorithm A_1(Θ_1) relying on Monte Carlo sampling — the Sample Average Approximation (SAA) [5] — that guaranteeing with probability 1 − α that an optimal value of (2) is ε-close to that of (1) requires a number of scenarios n that grows exponentially with T. Thus in practice, one might have to use unreliable approximations built with a small fraction of this number n. What happens then is not clear. It has also been shown in [11] that it is not possible to obtain an upper bound on the optimal value of (1) using statistical estimates based on A_1(Θ_1) when T > 2.
– It has been shown in [8] that there exists a particular derandomized algorithm A_2(Θ_2), relying on Quasi-Monte Carlo sampling and valid for a certain class of models for ξ, such that the optimal value and solution set of (2) converge to those of (1) asymptotically. What happens in the non-asymptotic case is not clear.
– It has been shown in [9] that there exist two polynomial-time deterministic heuristics A_3(Θ_3) — the backward and forward tree constructions — that approximate the solution to a combinatorial problem, ideal in a certain sense, which is NP-hard. However, the choice of Θ_3 is left largely open.
There actually exists a generic method for computing upper bounds on the value of (1) using any particular algorithm A(Θ). This method has been used by authors studying the value of stochastic programming models [12] or the performance of particular algorithms A(Θ) [13, 14]. It relies on an estimate of the form (3) with π̂ replaced by a certain policy π̃ defined implicitly. Its computational requirements, however, turn out to be extreme. To see this, we describe the procedure used to compute the sequence {π̃_t(ξ^{(j)})} corresponding to a particular scenario ξ^{(j)}.

Assuming that the solution to the A(Θ)-approximation (2) of a problem S of the form (1) is available, in particular u_0^{(1)} ∈ Π_0 ∩ U_0, one begins by setting π̃_0(ξ^{(j)}) = u_0^{(1)}. Then, one considers a new problem, say S(ξ_0^{(j)}), derived from S by replacing the random variable ξ_0 by its realization ξ_0^{(j)} — and possibly taking account of π̃_0(ξ^{(j)}) for simplifying the formulation. This makes the sigma algebra B_1 in (1) trivial and Π_1 a class of constant functions. One uses again the algorithm A(Θ), possibly with an updated Θ, to obtain an approximate version (2) of S(ξ_0^{(j)}). One computes its solution, extracts u_1^{(1)} ∈ Π_1 ∩ U_1 from
it, and sets π̃_1(ξ^{(j)}) = u_1^{(1)}. The procedure is repeated for each t = 2, ..., T−1, on the sequence of problems S(ξ_{[0:t−1]}^{(j)}) formed by using the growing collection
of observations ξ_{[0:t−1]}^{(j)} extracted from ξ^{(j)}.

It is not hard to see that the resulting sequence π̃_t(ξ^{(j)}), forming π̃(ξ^{(j)}), meets the implementability and feasibility conditions. But repeating the full evaluation process on a sufficient number m of scenarios is particularly cumbersome. The bottleneck may come from the complexity of the programs to solve, from the running time of the algorithm A(Θ) itself, or, if A(Θ) is a randomized algorithm, from the new source of variance it adds to the Monte Carlo estimate (3).

Now an upper bound on the value of (1) computed quickly and reflecting the quality of A(Θ) would allow a more systematic exploration of the space of parameters Θ. In the case of a randomized algorithm such as A_1(Θ_1), it may allow one to rank candidate solutions obtained through several runs of the algorithm.
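The shrinking-horizon procedure above can be sketched as a simple loop; `solve_approximation` is a placeholder of ours for re-running A(Θ) on the conditioned program S(ξ_{[0:t−1]}) and extracting its first-stage decision, which in practice is where all the computational cost lives:

```python
def shrinking_horizon_policy(scenario, solve_approximation, T):
    """Compute the sequence {pi_tilde_t(xi^(j))} for one test scenario by
    re-solving an A(Theta)-approximation at every stage, conditioning on
    the observed prefix xi_[0:t-1]. `solve_approximation(prefix, decisions)`
    is a problem-specific placeholder returning the first-stage decision of
    the conditioned program, possibly using earlier decisions to simplify
    the formulation."""
    decisions = []
    for t in range(T):
        prefix = scenario[:t]          # observations revealed so far
        u_t = solve_approximation(prefix, decisions)
        decisions.append(u_t)
    return decisions
```

The returned sequence is implementable (each u_t depends only on the prefix) and, if the solver respects the constraints, feasible; the loop makes explicit why the method requires one full solve per stage and per test scenario.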
3 Datasets
We consider a problem S and the approximate version (2) obtained through an algorithm A(Θ). Let (p^{(i)}, ξ^{(i)}, u^{(i)}) denote a probability-scenario-decision triplet formed by extracting from an optimal solution to (2) the sequence u^{(i)} = [u_0^{(i)} ... u_{T−1}^{(i)}] that corresponds to the scenario ξ^{(i)} = [ξ_0^{(i)} ... ξ_{T−1}^{(i)}] of probability p^{(i)}. There are n such triplets; let us collect them into a dataset D_n.

Definition 1 (Regular datasets). In this paper, a dataset is said to be S-regular, or simply regular, if the decisions u^{(i)} of its triplets (p^{(i)}, ξ^{(i)}, u^{(i)}) are jointly optimal with respect to a program (2) over the complete set of probability-scenario pairs (p^{(i)}, ξ^{(i)}).

Regular datasets have two immediate but important properties.

Proposition 1 (Properties of regular datasets)
1. For any pair i, j of triplets (p^{(i)}, ξ^{(i)}, u^{(i)}) and (p^{(j)}, ξ^{(j)}, u^{(j)}) in D_n, the non-anticipativity property holds: for t = 0, u_t^{(i)} = u_t^{(j)}, and for 1 ≤ t ≤ T−1, if ξ_{[0:t−1]}^{(i)} = ξ_{[0:t−1]}^{(j)} then u_t^{(i)} = u_t^{(j)}.
2. For any triplet (p^{(i)}, ξ^{(i)}, u^{(i)}), the feasibility property holds: for all t ≥ 0, u_t^{(i)} ∈ U_t(ξ^{(i)}).

The interpretation of these properties may be better perceived through the proof.

Proof. Property 1 results from the measurability restriction put on the classes Π_t. Since the sigma algebras B_t are generated by the random variables ξ_{[0:t−1]}, the decisions u_t = π_t(ξ) are B_t-measurable if and only if there exists a measurable map g_t : Ξ_0 × ··· × Ξ_{t−1} → U_t such that u_t = g_t(ξ_0, ..., ξ_{t−1}) in any event (Theorem 20.1 in [15]). The latter set of conditions is enforced in (2) either by imposing explicit equality constraints u_t^{(i)} = u_t^{(j)} whenever ξ_{[0:t−1]}^{(i)} = ξ_{[0:t−1]}^{(j)}, or by merging those equal optimization variables into a single one. Property 2 is directly enforced by the constraint u_t^{(i)} := π_t(ξ^{(i)}) ∈ U_t(ξ^{(i)}) in (2).
Definition 2 (NAF property). In this paper, a dataset is said to have the NAF property if its triplets mutually satisfy the non-anticipativity properties, and if each triplet satisfies the feasibility properties relative to S.

Remark 1. Let D_{n+1} be a regular dataset with triplets having the same probability p^{(i)} = (n+1)^{−1}. Let D_n be the dataset obtained by deleting one triplet from D_{n+1} and setting p^{(i)} = n^{−1} for each remaining triplet. Then D_n still has the NAF property but is not necessarily regular.

Any dataset with the NAF property induces a set of compatible policies, in the following sense.

Proposition 2 (Compatible policies). Let D_n be a dataset of triplets (p^{(k)}, ξ^{(k)}, u^{(k)}) with the NAF property. Let P(S) denote the set of implementable and feasible policies for S. We assume that P(S) is nonempty. We define P(D_n) as the set of policies π ∈ Π such that π(ξ^{(k)}) = u^{(k)}, 1 ≤ k ≤ n. Then it holds that P(S) ∩ P(D_n) is nonempty. In particular, each policy π in P(S) ∩ P(D_n) satisfies π_0(ξ) = u_0^{(1)}, and π_t(ξ) = u_t^{(i)} whenever ξ_{[0:t−1]} = ξ_{[0:t−1]}^{(i)} for some 1 ≤ i ≤ n and 0 < t < T−1. These policies are said to be compatible with the dataset.

Proof. We recall that π ∈ P(S) iff π ∈ Π and π_t(ξ) ∈ U_t(ξ) almost surely. If there exists a policy π in P(S) ∩ P(D_n), it is clear, by the feasibility property of the triplets in D_n, that π_t(ξ^{(k)}) = u_t^{(k)} is in U_t(ξ^{(k)}) for each k, t. The properties stated in the proposition then follow from the B_t-measurability of π_t. It remains to show that, given the value of the mapping π over the points Ξ^{D_n} := {ξ^{(1)}, ..., ξ^{(n)}}, it is still possible to specify π(ξ) for any ξ ∈ Ξ \ Ξ^{D_n} such that π_t(ξ) ∈ U_t(ξ). But this follows from the relatively complete recourse assumption: for any scenario ξ such that ξ_{[0:t−1]} = ξ_{[0:t−1]}^{(i)} for some i, t, setting
π_τ(ξ) = u_τ^{(i)} for each τ ≤ t cannot make the sets U_{t+1}(ξ), ..., U_{T−1}(ξ) empty.
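Definition 2 can be checked directly on a finite dataset. A minimal sketch, with a triplet encoding and a `feasible` membership test that are our own placeholders for U_t(ξ):

```python
def has_naf_property(triplets, feasible, T):
    """Check the NAF property on a list of (p, xi, u) triplets, where xi and
    u are sequences of length T and `feasible(xi, t, u_t)` is a
    problem-specific test for u_t in U_t(xi) (a placeholder here)."""
    # Non-anticipativity: triplets with equal histories xi_[0:t-1]
    # must carry equal decisions u_t.
    for t in range(T):
        seen = {}
        for _, xi, u in triplets:
            key = tuple(xi[:t])
            if key in seen and seen[key] != u[t]:
                return False
            seen[key] = u[t]
    # Feasibility of every decision in every triplet.
    return all(feasible(xi, t, u[t])
               for _, xi, u in triplets for t in range(T))
```

The first loop is exactly the equality-constraint view used in the proof of Proposition 1: decisions at stage t are grouped by the prefix ξ_{[0:t−1]} and must coincide within each group.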
4 Generalization Strategies
Proposition 2 suggests viewing each pair (ξ^{(k)}, u^{(k)}) of a dataset D_n with the NAF property as being generated by a single implementable and feasible policy π ∈ P(D_n). One can thus try to predict the output π(ξ) of that policy on any new scenario, on the basis of the triplets of D_n. However, the problem is not well-posed if we cannot discriminate among the policies compatible with the dataset. There exist two classical approaches to further specify the problem:
1. Choose a class of models to which the search for π can be restricted;
2. Put a prior probability over every possible function for π, compute posterior probabilities using the observations provided by the dataset, and then make a prediction for the policy.
It turns out that the two approaches raise similar issues regarding their adaptation to the context of this paper; we pursue the discussion using the first one.
Definition 3 (Training sets). Starting from the triplets of a dataset D_n, for each t > 0, let S_t be a set made of the pairs (x^{(k)}, y^{(k)}) with x^{(k)} := (ξ_0^{(k)}, ..., ξ_{t−1}^{(k)}) and y^{(k)} := u_t^{(k)}.

Definition 4 (Supervised Learning — Hypothesis Class View). Consider a training set S_t of m pairs (x^{(k)}, y^{(k)}) ∈ X_t × Y_t. Assume that these pairs are i.i.d. samples drawn from a fixed but unknown distribution P_t over X_t × Y_t. Consider a hypothesis class H_t of mappings h_t : X_t → Y_t. The goal of a supervised learning algorithm L : [X_t × Y_t]^m → H_t is to select a hypothesis h_t ∈ H_t such that h_t(x) approximates y in the best possible way on new samples (x, y) drawn from the distribution P_t.

In many approaches to supervised learning, the selection of the hypothesis h_t is done by first specifying a loss function ℓ : X × Y × Y → R_+ for penalizing the error between the prediction h_t(x) and the target y, and a complexity measure C : H → R_+ for penalizing the complexity of the hypothesis [16, 17]. Defining the empirical risk of a hypothesis h_t ∈ H_t as

  R_{ℓ,S_t}(h_t) = m^{−1} ∑_{k=1}^{m} ℓ(x^{(k)}, y^{(k)}, h_t(x^{(k)})),
one selects the hypothesis h_t by minimizing a weighted sum of the empirical risk and the complexity measure (we use λ ≥ 0):

  h_t^*(λ) = arg min_{h_t∈H_t} R_{ℓ,S_t}(h_t) + λ C(h_t).   (4)
Choosing the right value for λ is essentially a model selection issue. One seeks to choose λ such that h_t^*(λ) would also minimize the generalization error

  R_ℓ(h_t) = E_{(x,y)∼P_t} {ℓ(x, y, h_t(x))}.   (5)
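As a toy instance of (4) and (5), consider one-dimensional ridge regression, where ℓ is the squared loss and C(h) the squared weight, with λ ranked on a held-out validation set. The closed-form minimizer below is the standard one; none of it is specific to the paper:

```python
def fit_ridge_1d(train, lam):
    # Minimizer of sum_k (y_k - w*x_k)^2 + lam * w^2, i.e. criterion (4)
    # with squared loss and C(h) = w^2; `train` is a list of (x, y) pairs.
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, y in train)
    return sxy / (sxx + lam)

def select_lambda(train, valid, lambdas):
    """Model selection: fit h*(lambda) on the training set for each lambda,
    then rank the hypotheses by their empirical risk on the validation set,
    a proxy for the generalization error (5)."""
    def risk(w, data):
        return sum((y - w * x) ** 2 for x, y in data) / len(data)
    scored = [(risk(fit_ridge_1d(train, lam), valid), lam) for lam in lambdas]
    return min(scored)[1]
```

This is the train/validation split described next; with a third, untouched test set one would finally estimate the generalization error of the retained hypothesis.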
Ideally, the dataset is large, and we can partition the samples into a training set, a validation set, and a test set. Hypotheses h_t^*(λ) are optimized on the training set, and then ranked on the basis of their risk on the validation set. The best hypothesis h_t^*(λ^*) is retained and its generalization error is estimated on the test set. Otherwise, when the dataset is small, there exist computationally intensive techniques to estimate the generalization error by working on several random partitions of the dataset [18, 19].

To actually produce an upper bound in (3), the prediction of the output of π(ξ) must be implementable and feasible. While it is not hard to decompose the learning of π into a sequence of learning tasks for π_0, π_1(ξ_0), ..., π_{T−1}(ξ_{[0:T−2]}), more care is needed for enforcing the feasibility of a prediction. Let Ξ_{[0:t]} be a shorthand for Ξ_0 × ··· × Ξ_t and U_{[0:t]} be a shorthand for U_0 × ··· × U_t.

Definition 5 (Feasibility mappings). We call M_t : Ξ_{[0:t−1]} × U_{[0:t]} → U_t a U_t-feasibility mapping if its range is always contained in the feasible set U_t(ξ) whenever U_t(ξ) is nonempty.
Example 1 (Projections). Let ‖·‖ denote a strictly convex norm on U_t. For the sake of concreteness, let U_t(ξ) be represented as {u_t ∈ U_t : g_k(ξ_0, ..., ξ_{t−1}, u_0, ..., u_t) ≤ 0, 1 ≤ k ≤ m} with g_k jointly convex in u_0, ..., u_t. The mapping M_t defined by

  M_t(ξ_{[0:t−1]}, u_{[0:t]}) = arg min_{z∈U_t} ‖z − u_t‖   subject to g_k(ξ_0, ..., ξ_{t−1}, u_0, ..., u_{t−1}, z) ≤ 0, 1 ≤ k ≤ m
is a U_t-feasibility mapping.

Proposition 3 (Representation of implementable and feasible policies). Let M_t be feasibility mappings for the sets U_t, t = 0, ..., T−1. Let h_0 denote a constant, and let h_t : Ξ_{[0:t−1]} → U_t be some mappings, t = 1, ..., T−1. Then the mapping π̂ : Ξ → U defined by

  π̂(ξ) = [π̂_0, ..., π̂_{T−1}],   π̂_0 = M_0(h_0),   π̂_t = M_t(ξ_0, ..., ξ_{t−1}, π̂_0, π̂_1, ..., π̂_{t−1}, h_t(ξ_0, ..., ξ_{t−1}))

corresponds to an implementable and feasible policy π̂ for S. It can be used in (3). The computational complexity of π̂_t depends on the complexity of evaluating h_t and then M_t.

The mappings h_t of Prop. 3 can be defined as the best hypothesis of a standard supervised learning algorithm for the learning set S_t. The model selection can be performed in two ways (referred to in the sequel as M1 and M2):
1. We follow the classical approach of supervised learning outlined above, and keep some triplets of the dataset D_n apart for testing purposes;
2. Instead of using the loss function of the supervised learning algorithm in (5), we look for good choices of the parameters λ of the learning problems by directly computing (3) for each candidate set of parameters. (It is also possible to perform the model selection on (3) with a large validation set of new scenarios, and then recompute (3) on a large test set of new scenarios.)

Remark 2. It is possible to combine the two model selection approaches in a same policy π̂. More generally, it is also possible to form π̂ by combining learned models h_t of π̂_t on t = 0, 1, ..., t_0, and implementing on t = t_0 + 1, ..., T−1 the models π̃_t of Section 2 based on standard stochastic programming. The two classes of models are complementary: as t increases, the learning problems for h_t become harder (larger input space, fewer samples), but the optimization problems S(ξ_{[0:t−1]}) for π̃_t become easier.
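Proposition 3 can be sketched for the special case where each U_t(ξ) is an interval, so that the projection of Example 1 reduces to clipping; the `bounds` callable stands in for the constraint system g_k ≤ 0 and is our own placeholder:

```python
def make_interval_projection(bounds):
    """A simple U_t-feasibility mapping (Example 1) when U_t(xi) is an
    interval; `bounds(history, past_u)` returns (lo, hi) and may depend on
    the observed history xi_[0:t-1] and on the earlier decisions."""
    def M_t(history, past_u, u_t):
        lo, hi = bounds(history, past_u)
        return min(max(u_t, lo), hi)   # Euclidean projection onto [lo, hi]
    return M_t

def make_policy(h, M, T):
    """Compose the implementable, feasible policy of Proposition 3 from
    learned hypotheses h[t] (h[0] callable on the empty history) and
    feasibility mappings M[t]."""
    def policy(xi):
        u = []
        for t in range(T):
            history = xi[:t]           # only information revealed so far
            u.append(M[t](history, list(u), h[t](history)))
        return u
    return policy
```

Because each stage reads only `xi[:t]`, the composed policy is implementable by construction, and the projection enforces feasibility whatever the hypotheses predict.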
Up to now, we have viewed the generalization problem as that of selecting a good model π that explains the generation of a regular dataset Dn and yields good predictions on new scenarios ξ, in terms of a loss function from supervised learning (if we know the desired output) or in terms of the objective function of the problem (1).
68
B. Defourny, D. Ernst, and L. Wehenkel
Nevertheless, there is no guarantee that a policy π compatible with a dataset Dn having the NAF property is optimal for the original program S. We could then interpret u(k) as a noisy observation of the output π∗(ξ(k)) of an optimal policy π∗ for S. In this respect, we could still use the supervised learning approach of Prop. 3. We could even legitimately try to work with learning sets St formed from a dataset without the NAF property, for example a dataset obtained by merging the solutions to several approximations (2). We can also view the generalization problem differently. Proper algorithms A(θ) come with the guarantee that the solution set of (2) converges to the solution set of (1). Thus, we could consider two regular datasets Dn and Dq, with n < q, and learn hypotheses ht from Dn that generalize well to the examples from Dq. We will refer to this generalization approach as M3.
5 Learning Algorithms
In this section, we develop algorithms for learning the hypotheses ht ∈ Ht of Prop. 3. We start by noting that if ut ∈ Ut is vector-valued with components [ut]1, . . . , [ut]d, the hypothesis ht must be vector-valued as well. Of course:

Proposition 4. Any standard regression algorithm suitable for one-dimensional outputs can be extended to multi-dimensional outputs.

Proof. A training set St = {(x(i), y(i))}1≤i≤n can be split into d training sets [St]j = {(x(i), [y(i)]j)}1≤i≤n, j = 1, . . . , d, from which hypotheses [ht]1, . . . , [ht]d can be learned. The components [ht]j can then be concatenated into a single vector-valued hypothesis ht.

The following algorithm implements the approach M2 of Section 4.

Algorithm 1 (Joint selection approach). Let S0 be an approximation of the form (2) to a problem (1). Let St denote the training sets built from a solution to S0. Assume that the model selection in (4) is parameterized by a single parameter λ common to every component j of ht and every index t. Assume that the feasibility mappings Mt are fixed a priori. Then, for each λ in some finite set Λ,
1. Learn [ht]j(λ) using training sets [St]j and the λ-dependent criterion (4).
2. From [ht]j(λ) form a policy π̂(λ) (Proposition 3).
3. Evaluate the score of π̂(λ) with (3) on m scenarios. Let v(λ) be that score.
Select λ∗ = arg min_{λ∈Λ} v(λ) and return the hypotheses h0(λ∗), . . . , hT−1(λ∗).

Remark 3. For datasets with the NAF property, h0(λ∗) = M0(h0(λ∗)) = u0^(1).
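The splitting construction in the proof of Proposition 4 can be sketched directly; the least-squares base learner below is our own choice for illustration, not prescribed by the text:

```python
import numpy as np

def fit_vector_valued(X, Y):
    """Split the training set column-wise (one scalar regression per output
    component, as in Prop. 4) and concatenate the learned components into
    a single vector-valued hypothesis."""
    # One least-squares weight vector per column of Y.
    weights = [np.linalg.lstsq(X, Y[:, j], rcond=None)[0]
               for j in range(Y.shape[1])]
    W = np.column_stack(weights)
    return lambda x: np.asarray(x) @ W  # vector-valued hypothesis h_t
```

Any scalar-output regression algorithm could replace the least-squares fit; the point is only that the d components are learned independently and concatenated.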
The purpose of the assumptions of Algorithm 1 is to reduce the complexity of the search in the parameter space Λ. To see this, let cA (Θ0 ) denote the expected time for forming the approximation S0 to S using algorithm A(Θ0 ), and let cS (0) denote the expected time for solving S0 . Let |Λ| denote the cardinality of Λ, cL (t) the expected running time of the learning algorithm on a dataset St , and cE (t) the expected running time of the combined computation of Mt and ht .
Bounds for Multistage Stochastic Programs
69
Proposition 5. Algorithm 1 runs in expected time

|Λ| · Σ_{t=1}^{T−1} cL(t) + |Λ| · m · Σ_{t=1}^{T−1} cE(t) = |Λ| · Σ_{t=1}^{T−1} [cL(t) + m · cE(t)],
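The outer loop of Algorithm 1 can be sketched as a plain grid search; the learner, policy construction, and scoring are caller-supplied functions (hypothetical names, since the paper leaves them abstract):

```python
def joint_selection(lambdas, learn, build_policy, score):
    """Algorithm 1: for each lambda, learn all hypotheses, form the policy
    (Prop. 3), score it on m scenarios, and keep the best lambda."""
    best_lam, best_v, best_models = None, float("inf"), None
    for lam in lambdas:
        models = learn(lam)            # step 1: [h_t]_j(lambda) for all t, j
        policy = build_policy(models)  # step 2: policy pi-hat(lambda)
        v = score(policy)              # step 3: estimate of (3) on m scenarios
        if v < best_v:
            best_lam, best_v, best_models = lam, v, models
    return best_lam, best_models
```

Each iteration of the loop is independent of the others, which is why the |Λ| factor in Prop. 5 can be divided across N parallel processes.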
starting from data obtained in expected time cA(Θ0) + cS(0).

In general cL(t) and cE(t) grow with the dimensions of Ξ[0:t−1] and Ut, and with the cardinality of the dataset St. These dependences, and the ratio between cL(t) and cE(t), depend largely on the algorithms associated with the hypothesis spaces Ht and the feasibility mappings Mt. The value of λ ∈ Λ may also influence cL(t) and cE(t), but we neglect this dependence in a first approximation. Clearly Algorithm 1 can be run on N parallel processes, so as to replace |Λ| in Prop. 5 by |Λ|/N.

Relaxing in Algorithm 1 the assumption of a single λ, and considering parameters λt ∈ Λt proper to each t, could improve the quality of the learned policy π̂, but would also imply an expected running time proportional to Π_{t=1}^{T−1} |Λt| instead of |Λ| in Prop. 5, if one attempts a systematic search over all possible values of [λ1 . . . λT−1] in Λ1 × · · · × ΛT−1. In that context it might be better to switch to another model selection strategy (compare to Remark 2 and M3):

Algorithm 2 (Sequential selection approach). Let S0, . . . , ST−1 be T approximations of the form (2) to a problem (1), with S0 being the only solved one.
1. From the solution to S0, extract h0 = u0^(1) and the training set S1. Set t = 1.
2. Assume that the model selection for ht is parameterized by a single λt common to every component [ht]j of ht. For each value of λt in a finite set Λt, learn from the training set St the hypothesis ht(λt).
3. Consider the optimal values v(λt) of programs St(λt) derived from St as follows. For each τ = 1, . . . , t and each realization ξ_{[0:τ−1]}^(k) in St, compute successively, using û0^(k) := h0,

û_τ^(k) = M_τ(ξ0^(k), . . . , ξ_{τ−1}^(k), û0^(k), . . . , û_{τ−1}^(k), h_τ(ξ0^(k), . . . , ξ_{τ−1}^(k))),

where hτ depends on λt when τ = t. Then define St(λt) as the program St to which is added the whole set of equality constraints u_τ^(k) = û_τ^(k).
4. Select λt∗ = arg min_{λt∈Λt} v(λt) and set ht = ht(λt∗).
5. If t < T − 1, extract from the solution to St(λt∗) the training set St+1, set t to t + 1, and go to step 2. Otherwise, return the hypotheses h0(λ0∗), . . . , hT−1(λT−1∗).

Given π̂0, . . . , π̂t−1, the impact of π̂t on the generalization abilities of π̂ is evaluated on an approximation St to S distinct from those on which π̂1, . . . , π̂t−1 have been determined. The program St(λt) acts as a proxy for a test, on an independent set of scenarios, of a hybrid policy π† such that (i) π†_τ = π̂_τ for
τ ≤ t, with π̂t depending on λt, and (ii) π†_τ = π∗_τ for τ > t, with π∗ an optimal policy for S. Optimizing St(λt) yields the decisions of a policy for ut+1, . . . , uT−1 perfectly adapted to the scenarios of St (a source of bias with respect to π∗) and to the possibly suboptimal decisions u0, . . . , ut imposed on these scenarios by π̂0, . . . , π̂t. Whether v(λt) is lower or higher than the value of π† on a large test sample cannot be predicted, but the ranking among the values v(λt) should be fairly representative of the relative merits of the various models for π̂t.

The complexity of Algorithm 2 is expressed as follows. Let cA(Θj) denote the expected time for forming the approximation Sj to S using algorithm A(Θj). Let m(t, τ) denote the number of distinct realizations ξ_{[0:τ−1]}^(k) of ξ_{[0:τ−1]} in St. Let cS(t) denote the expected running time for solving a program St(λt).
Proposition 6. Algorithm 2 runs in expected time

Σ_{t=1}^{T−1} |Λt| · [ cL(t) + Σ_{τ=1}^{t} m(t, τ) · cE(τ) + cS(t) ],

starting from data obtained in expected time Σ_{j=0}^{T−1} cA(Θj) + cS(0).
The term cS(t) is usually large but decreases with t, since step 3 of Algorithm 2 sets decisions up to time t and thus reduces St(λt) to m(t, t) problems over disjoint subsets of the optimization variables u_{t+1}^(k), . . . , u_{T−1}^(k) of St. The values m(t, τ) are typically far smaller than m in Prop. 5. The term cS(t) is thus essentially the price paid for estimating the generalization abilities of a policy π̂ that leaves π̂t+1, . . . , π̂T−1 still unspecified.

For the sake of comparison, we express the expected running time of the generic upper bounding technique of Section 2. Let cS(t) denote the expected time for solving the approximation to the program S(ξ_{[0:t−1]}^(j)) of Section 2 obtained through A(Θt). Let cA(Θt) denote the expected running time of the algorithm A(Θt). Note that here Θt for t > 0 is such that there is in S(ξ_{[0:t−1]}^(j)) a single realization of ξ_{[0:t−1]}, corresponding to the actual observation ξ_{[0:t−1]}^(j) of ξ_{[0:t−1]}.

Proposition 7. The bounding scheme of Section 2 runs in expected time

m · Σ_{t=1}^{T−1} [cA(Θt) + cS(t)],

starting from data obtained in expected time cA(Θ0) + cS(0). Of course, using N parallel processes allows one to replace m by m/N.
6 Related Work
The need for generalizing decisions on a subset of scenarios to an admissible policy is well recognized in the stochastic programming literature [11]. Some authors
addressing this question propose to assign to a new scenario the decisions associated with the nearest scenario of the approximate solution [20, 21], thus essentially reducing the generalization problem to that of defining a priori a measurable similarity metric in the scenario space (we note that [21] also proposes several variants for a projection step restoring the feasibility of the decisions). Put in the perspective of the present framework, this amounts to adopting, without model selection, the nearest neighbor approach to regression [22], arguably one of the most unstable prediction algorithms [17]. Our work differs in that it is concerned with the complexity of exploiting the induced policies, seeks to identify the nature of the generalization problem, and makes the connection with model selection issues well studied in statistics [23, 24, 25, 26] and statistical learning [16, 27]. As a result, guidance is provided on how to improve over heuristic methods, and on which tools to use [28, 29, 30, 31, 32, 33]. The model selection techniques developed in this paper are specific to the stochastic programming context.

There is a large body of work applying supervised learning techniques to Markov Decision Processes (MDPs) [34] and Reinforcement Learning (RL) problems [35, 36]. In these problems, one seeks to maximize an expected sum of rewards by finding a policy that maps states to decisions. The state space is often large, but the decision space is often reduced to a finite number of possible actions. Supervised learning is used to represent the value function of dynamic programming, or to represent the policy directly [37, 38, 39, 40, 41]. We observe, however, that the hypothesis spaces relevant for representing policies adapted to multistage stochastic programs differ from those used in the context of Markov Decision Processes, and that the datasets are obtained in a very different way.
7 Conclusion
This paper has proposed to compute bounds on the optimal value of a multistage stochastic program by generalizing to an admissible policy the solution to a scenario-tree approximation of the program. Such bounds could be helpful to select the parameters of the algorithm that generates the scenario tree, an important aspect in practical applications. We have aimed to adapt to this task the concepts and model selection techniques from machine learning and statistical learning theory. We have outlined the special structure of the data and of the output of admissible policies. The results established in this paper have also shed light on the type of feasibility constraints with which our approach can be expected to be sound. Learning algorithms for the policies have been proposed, offering various quality-complexity tradeoffs.
Acknowledgments This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors. Damien Ernst is a Research Associate of the Belgian FNRS of which he acknowledges the financial support.
References

[1] Frauendorfer, K.: Barycentric scenario trees in convex multistage stochastic programming. Mathematical Programming 75, 277–294 (1996)
[2] Dempster, M.: Sequential importance sampling algorithms for dynamic stochastic programming. Annals of Operations Research 84, 153–184 (1998)
[3] Dupačová, J., Consigli, G., Wallace, S.: Scenarios for multistage stochastic programs. Annals of Operations Research 100, 25–53 (2000)
[4] Høyland, K., Wallace, S.: Generating scenario trees for multistage decision problems. Management Science 47(2), 295–307 (2001)
[5] Shapiro, A.: Monte Carlo sampling methods. In: Ruszczyński, A., Shapiro, A. (eds.) Stochastic Programming. Handbooks in Operations Research and Management Science, vol. 10, pp. 353–425. Elsevier, Amsterdam (2003)
[6] Casey, M., Sen, S.: The scenario generation algorithm for multistage stochastic linear programming. Mathematics of Operations Research 30, 615–631 (2005)
[7] Hochreiter, R., Pflug, G.: Financial scenario generation for stochastic multistage decision processes as facility location problems. Annals of Operations Research 152, 257–272 (2007)
[8] Pennanen, T.: Epi-convergent discretizations of multistage stochastic programs via integration quadratures. Mathematical Programming 116, 461–479 (2009)
[9] Heitsch, H., Römisch, W.: Scenario tree modeling for multistage stochastic programs. Mathematical Programming 118(2), 371–406 (2009)
[10] Shapiro, A.: On complexity of multistage stochastic programs. Operations Research Letters 34(1), 1–8 (2006)
[11] Shapiro, A.: Inference of statistical bounds for multistage stochastic programming problems. Mathematical Methods of Operations Research 58(1), 57–68 (2003)
[12] Golub, B., Holmer, M., McKendall, R., Pohlman, L., Zenios, S.: A stochastic programming model for money management. European Journal of Operational Research 85, 282–296 (1995)
[13] Kouwenberg, R.: Scenario generation and stochastic programming models for asset liability management. European Journal of Operational Research 134, 279–292 (2001)
[14] Hilli, P., Pennanen, T.: Numerical study of discretizations of multistage stochastic programs. Kybernetika 44, 185–204 (2008)
[15] Billingsley, P.: Probability and Measure, 3rd edn. Wiley, Chichester (1995)
[16] Vapnik, V.: Statistical Learning Theory. Wiley, Chichester (1998)
[17] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, Heidelberg (2009)
[18] Wahba, G., Golub, G., Heath, M.: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, 215–223 (1979)
[19] Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman and Hall, London (1993)
[20] Thénié, J., Vial, J.P.: Step decision rules for multistage stochastic programming: A heuristic approach. Automatica 44, 1569–1584 (2008)
[21] Küchler, C., Vigerske, S.: Numerical evaluation of approximation methods in stochastic programming (2008) (submitted)
[22] Cover, T.: Estimation by the nearest neighbor rule. IEEE Transactions on Information Theory 14, 50–55 (1968)
[23] Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the Second International Symposium on Information Theory, pp. 267–281 (1973)
[24] Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6, 461–464 (1978)
[25] Rissanen, J.: Stochastic complexity and modeling. Annals of Statistics 14, 1080–1100 (1986)
[26] James, G., Radchenko, P., Lv, J.: DASSO: connections between the Dantzig selector and Lasso. Journal of the Royal Statistical Society: Series B 71, 127–142 (2009)
[27] Chapelle, O., Vapnik, V., Bengio, Y.: Model selection for small sample regression. Machine Learning 48, 315–333 (2002)
[28] Huber, P.: Projection pursuit. Annals of Statistics 13, 435–475 (1985)
[29] Buja, A., Hastie, T., Tibshirani, R.: Linear smoothers and additive models. Annals of Statistics 17, 453–510 (1989)
[30] Friedman, J.: Multivariate adaptive regression splines (with discussion). Annals of Statistics 19, 1–141 (1991)
[31] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7, 219–269 (1995)
[32] Williams, C., Rasmussen, C.: Gaussian processes for regression. In: Advances in Neural Information Processing Systems 8 (NIPS 1995), pp. 514–520 (1996)
[33] Smola, A., Schölkopf, B., Müller, K.R.: The connection between regularization operators and support vector kernels. Neural Networks 11, 637–649 (1998)
[34] Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Chichester (1994)
[35] Bertsekas, D., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
[36] Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
[37] Bagnell, D., Kakade, S., Ng, A., Schneider, J.: Policy search by dynamic programming. In: Advances in Neural Information Processing Systems 16 (NIPS 2003), pp. 831–838 (2004)
[38] Lagoudakis, M., Parr, R.: Reinforcement learning as classification: leveraging modern classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 424–431 (2003)
[39] Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, 503–556 (2005)
[40] Langford, J., Zadrozny, B.: Relating reinforcement learning performance to classification performance. In: Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), pp. 473–480 (2005)
[41] Fern, A., Yoon, S., Givan, R.: Approximate policy iteration with a policy language bias: solving relational Markov Decision Processes. Journal of Artificial Intelligence Research 25, 85–118 (2006)
On Evolvability: The Swapping Algorithm, Product Distributions, and Covariance

Dimitrios I. Diochnos¹ and György Turán¹,²

¹ Dept. of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago IL 60607, USA
² Research Group on Artificial Intelligence of the Hungarian Academy of Sciences, University of Szeged
[email protected],
[email protected]
Abstract. Valiant recently introduced a learning theoretic framework for evolution, and showed that his swapping algorithm evolves monotone conjunctions efficiently over the uniform distribution. We continue the study of the swapping algorithm for monotone conjunctions. A modified presentation is given for the uniform distribution, which leads to a characterization of best approximations, a simplified analysis and improved complexity bounds. It is shown that for product distributions a similar characterization does not hold, and there may be local optima of the fitness function. However, the characterization holds if the correlation fitness function is replaced by covariance. Evolvability results are given for product distributions using the covariance fitness function, assuming either arbitrary tolerances, or a non-degeneracy condition for the distribution and a size bound on the target. Keywords: learning, evolution.
1 Introduction
A model combining evolution and learning was introduced recently by Valiant [14]. It assumes that some functionality is evolving over time. The process of evolution is modelled by updating the representation of the current hypothesis, based on its performance for training examples. Performance is measured by the correlation of the hypothesis and the target. Updating is done using a randomized local search in a neighborhood of the current representation. The objective is to evolve a hypothesis with a close to optimal performance. As a paradigmatic example, Valiant [14] showed that monotone conjunctions of Boolean variables with the uniform probability distribution over the training examples are evolvable. Monotone conjunctions are a basic concept class for learning theory, which has been studied from several different aspects [7, 8, 10]. Valiant’s algorithm, which is referred to as the swapping algorithm in this paper, considers mutations obtained by swapping a variable for another one, and adding and deleting a variable¹, and chooses randomly among beneficial mutations (or among neutral ones if there are no beneficial mutations).

¹ These mutations may be viewed as swapping a variable with the constant 1.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 74–88, 2009. © Springer-Verlag Berlin Heidelberg 2009
The Swapping Algorithm, Product Distributions, and Covariance
75
Valiant also established a connection between the model and learning with statistical queries (Kearns [7], see also [8]), and studied different versions such as evolution with and without initialization. Valiant noted that concept learning problems have been studied before in the framework of genetic and evolutionary algorithms (e.g., Ros [11]), but his model is different (more restrictive) in its emphasis on fitness functions which depend on the training examples only through their performance, and not on the training instances themselves. His model excludes, e.g., looking at which bits are on or off in the training examples. Feldman [2, 3] gave general results on the model and its different variants, focusing on the relationship to statistical queries. He showed that statistical query algorithms of a certain type can be translated into evolution algorithms. The translation, as noted by Feldman, does not lead to the most efficient or natural² evolution algorithms in general. This is the case with monotone conjunctions: even though their evolvability follows from Feldman’s result, it is still of interest to find simple and efficient evolution procedures for this class. Michael [9] showed that decision lists are evolvable under the uniform distribution using the Fourier representation. In general, exploring the performance of simple evolution algorithms is an interesting direction of research, hopefully leading to new design and analysis techniques for efficient evolution algorithms. The swapping algorithm, in particular, appears to be a basic evolutionary procedure (mutating features in and out of the current hypothesis) and it exhibits interesting behavior. Thus its performance over distributions other than uniform deserves more detailed study. In this paper we continue the study of the swapping algorithm for evolving monotone conjunctions.
A modified presentation of the algorithm for the uniform distribution is given, leading to a simplified analysis and an improved complexity bound (Theorem 2). We give a simple characterization of best approximations by short hypotheses, which is implicit in the analysis of the algorithm. We then consider the swapping algorithm for product distributions. Product distributions generalize the uniform distribution, and they are studied in learning theory in the context of extending learnability results from the uniform distribution, usually under non-degeneracy conditions (see, e.g., [4, 5, 6, 12]). We show that the characterization of best approximations does not hold for product distributions in general, and that the fitness function may have local optima. It is shown that the picture changes if we replace the correlation fitness function with covariance. (Using fitness functions other than correlation has also been considered by Feldman [3] and Michael [9]; the fitness functions discussed in those papers are different from covariance.) In this case there is a characterization of best approximations similar to the uniform distribution with correlation. This leads to two positive results for the evolvability of monotone conjunctions under product distributions. Theorem 4 shows that in the unbounded-precision model of evolution, the swapping algorithm, using covariance as the fitness function, is an efficient
² Of course, we do not use the term ‘natural’ here to suggest any actual connection with evolutionary processes in nature.
76
D.I. Diochnos and G. Turán
algorithm for monotone conjunctions over arbitrary product distributions. Thus this result applies to a very simply defined (though clearly not realistic) evolution model, and analyzes a very simple and natural evolution algorithm (swaps using covariance) over a whole class of distributions (product distributions without any restrictions). Therefore, it may be of interest as an initial example motivating further models and algorithms, such as the introduction of short and long hypotheses in order to work with polynomial sample size estimates of performances. Theorem 5 shows that the swapping algorithm works if the target is short and the underlying product distribution is μ-nondegenerate. The paper is structured as follows. Section 2 has an informal description of the swapping algorithm. As we are focusing on a single algorithm and its variants, we do not need to define the evolution model in general. The description of the swapping algorithm and some additional material given in Section 3 contain the details of the model that are necessary for the rest of the paper. Section 4 contains the analysis of the swapping algorithm for the case of uniform distribution. The performance of the swapping algorithm for product distributions is discussed in Section 5. In Section 6 we turn to the swapping algorithm using covariance as fitness function. Finally, Section 7 contains some further remarks and open problems.
2 An Informal Description of the Swapping Algorithm
Given a set of Boolean variables x1, . . . , xn, we assume that there is an unknown target c, a monotone conjunction of some of these variables. The possible hypotheses h are of the same class. The truth values true and false are represented by 1 and −1. The performance of a hypothesis h is

PerfUn(h, c) = (1/2^n) · Σ_{x∈{0,1}^n} h(x) · c(x),   (1)

called the correlation of h and c. Here Un denotes the uniform distribution over {0,1}^n. The evolution process starts with an initial hypothesis h0, and produces a sequence of hypotheses using a random walk type procedure on the set of monotone conjunctions. Each hypothesis h is assigned a fitness value, called the performance of h. The walk is performed by picking randomly a hypothesis h' from the neighborhood of the current hypothesis h which seems to be more fit (beneficial) compared to h, or is about as fit (neutral) as h. Details are given in Section 3. Some care is needed in the specification of the probability distribution over beneficial and neutral hypotheses. Moreover, there is a distinction between short and long conjunctions, and the neighborhoods they induce. Valiant uses a threshold value q = O(log(n/ε)) for this distinction. Section 4.2 has details. Valiant showed that if this algorithm runs for O(n log(n/ε)) stages, and evaluates performances using total sample size O((n/ε)^6) and different tolerances for short, resp. long conjunctions, then with probability at least 1 − ε it finds a hypothesis h with PerfUn(h, c) ≥ 1 − ε.
3 Preliminaries
The neighborhood N of a conjunction h is the set of conjunctions that arise by adding a variable, removing a variable, or swapping a variable with another one, plus the conjunction itself³. The conjunctions that arise by adding a variable form the neighborhood N+, the conjunctions that arise by dropping a variable form the neighborhood N−, and the conjunctions that arise by swapping a variable form the neighborhood N+−. In other words we have N = N− ∪ N+ ∪ N+− ∪ {h}. As an example, let our current hypothesis be h = x1 ∧ x2, and n = 3. Then, N− = {x1, x2}, N+ = {x1 ∧ x2 ∧ x3}, and N+− = {x3 ∧ x2, x1 ∧ x3}. Note that |N| = O(n²) in general.

Similarity between two conjunctions h and c in an underlying distribution Dn is measured by the performance function⁴ PerfDn(h, c), which is evaluated approximately, by drawing a random sample S and computing (1/|S|) · Σ_{x∈S} h(x) · c(x). The goal of the evolution process is to evolve a hypothesis h such that

Pr [PerfDn(h, c) < PerfDn(c, c) − ε] < δ.   (2)

The accuracy parameter ε and the confidence δ are treated as one in [14]. Given a target c, we split the neighborhood into three parts by the increase in performance that they offer. There are beneficial, neutral, and deleterious mutations. In particular, for a given neighborhood N and real constant t (tolerance) we are interested in the sets

Bene = N ∩ {h' | PerfDn(h', c) ≥ PerfDn(h, c) + t}   (3)
Neut = (N ∩ {h' | PerfDn(h', c) ≥ PerfDn(h, c) − t}) \ Bene.

A mutation is deleterious if it is neither beneficial nor neutral. The size (or length) |h| of a conjunction h is the number of variables it contains. Given a target conjunction c and a size q, we will be interested in the best size-q approximation of c.

Definition 1 (Best q-Approximation). A hypothesis h is called a best q-approximation of c if |h| ≤ q and ∀h' ≠ h, |h'| ≤ q : PerfDn(h', c) ≤ PerfDn(h, c).

Note that the best approximation is not necessarily unique. In this paper the following performance functions are considered; the first one is used in [14] and the second one is the covariance of h and c⁵:

PerfDn(h, c) = Σ_{x∈{0,1}^n} h(x) c(x) Dn(x) = E[h · c] = 1 − 2 · Pr[h ≠ c]   (4)

Cov[h, c] = PerfDn(h, c) − E[h] · E[c].   (5)

³ As h will be clear from the context, we write N instead of N(h).
⁴ See the end of this section for the specific performance functions considered in this paper. For simplicity, we keep the notation Perf for a specific performance function.
⁵ A related performance function, not considered here, is the correlation coefficient.
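The neighborhood structure can be generated directly from the definitions. Representing a monotone conjunction as the set of its variable indices (a representation we choose for illustration), the example h = x1 ∧ x2 with n = 3 is reproduced below:

```python
def neighborhood(h, n):
    """N = N- (drop one variable) | N+ (add one) | N+- (swap one) | {h}.
    A conjunction over x1..xn is encoded as the frozenset of its indices."""
    h = frozenset(h)
    absent = [w for w in range(1, n + 1) if w not in h]
    n_minus = {h - {v} for v in h}
    n_plus = {h | {w} for w in absent}
    n_swap = {(h - {v}) | {w} for v in h for w in absent}
    return n_minus | n_plus | n_swap | {h}
```

For h = {1, 2} and n = 3 this yields exactly the six conjunctions of the example; in general the swap part dominates and gives |N| = O(n²).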
4 Monotone Conjunctions under the Uniform Distribution
Given a target conjunction c and a hypothesis conjunction h, the performance of h with respect to c can be found by counting truth assignments. Let

h = ∧_{i=1}^{m} x_i ∧ ∧_{ℓ=1}^{r} y_ℓ   and   c = ∧_{i=1}^{m} x_i ∧ ∧_{k=1}^{u} w_k.   (6)
Thus the x’s are mutual variables, the y’s are redundant variables in h, and the w’s are undiscovered, or missing, variables in c. Variables in the target c are called good, and variables not in the target c are called bad. The probability of the error region is (2^r + 2^u − 2) · 2^{−m−r−u} and so

PerfUn(h, c) = 1 − 2^{1−m−u} − 2^{1−m−r} + 2^{2−m−r−u}.   (7)
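Formula (7) can be checked against a brute-force evaluation of (1) over all 2^n assignments (a sanity-check sketch; the set-of-indices encoding of conjunctions is our own):

```python
from itertools import product

def perf_un(h, c, n):
    """Correlation (1) by exhaustive enumeration; h and c are sets of
    variable indices, and truth values are mapped to +1/-1."""
    total = 0
    for bits in product((0, 1), repeat=n):
        hv = 1 if all(bits[i - 1] for i in h) else -1
        cv = 1 if all(bits[i - 1] for i in c) else -1
        total += hv * cv
    return total / 2 ** n

def perf_closed_form(m, r, u):
    """Formula (7): m mutual, r redundant, u undiscovered variables."""
    return 1 - 2 ** (1 - m - u) - 2 ** (1 - m - r) + 2 ** (2 - m - r - u)
```

For instance, h = x1 ∧ x2 ∧ x3 and c = x1 ∧ x2 ∧ x4 ∧ x5 over n = 5 variables give m = 2, r = 1, u = 2, and both computations return 0.75.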
For a fixed threshold value q, a conjunction h is short (resp., long) if |h| ≤ q (resp., |h| > q). The following lemma and its corollary show that if the target conjunction is long then every long hypothesis has good performance, as both the target and the hypothesis are false on most instances.

Lemma 1 (Performance Lower Bound). If |h| ≥ q and |c| ≥ q + 1 then PerfUn(h, c) > 1 − 3 · 2^{−q}.

Corollary 1. Let q ≥ lg(3/ε). If |h| ≥ q and |c| ≥ q + 1, then PerfUn(h, c) > 1 − ε.

4.1 Properties of the Local Search Procedure
Local search, when switching to h' from h, is guided by the quantity

Δ = PerfUn(h', c) − PerfUn(h, c).   (8)

We analyze Δ using (7). The analysis is summarized in Figure 1, where the node good represents good variables and the node bad represents bad variables. Note that Δ depends only on the type of mutation performed and on the values of the parameters m, u and r; in fact, as the analysis shows, it depends on the size of the hypothesis |h| = m + r and on the number u of undiscovered variables.

Comparing h' ∈ N+ with h. We introduce a variable z in the hypothesis h. If z is good, Δ = 2^{−|h|} > 0. If z is bad, Δ = 2^{−|h|}(1 − 2^{1−u}).

Comparing h' ∈ N− with h. We remove a variable z from the hypothesis h. If z is good, Δ = −2^{1−|h|} < 0. If z is bad, Δ = −2^{1−|h|}(1 − 2^{1−u}).

Comparing h' ∈ N+− with h. Replacing a good with a bad variable gives Δ = −2^{1−|h|−u}. Replacing a good with a good, or a bad with a bad variable gives Δ = 0. Replacing a bad with a good variable gives Δ = 2^{2−|h|−u}.
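The Δ values above follow from (7) by direct subtraction; the small sketch below (our own check, not part of the paper) confirms them for a sample configuration, using the (m, r, u) bookkeeping of each mutation type:

```python
def perf(m, r, u):
    # Formula (7).
    return 1 - 2 ** (1 - m - u) - 2 ** (1 - m - r) + 2 ** (2 - m - r - u)

def deltas(m, r, u):
    """Delta for each mutation type, computed from (7); here |h| = m + r."""
    return {
        "add_good":  perf(m + 1, r, u - 1) - perf(m, r, u),  # = 2**-|h|
        "add_bad":   perf(m, r + 1, u) - perf(m, r, u),      # = 2**-|h| * (1 - 2**(1-u))
        "drop_good": perf(m - 1, r, u + 1) - perf(m, r, u),  # = -2**(1-|h|)
        "swap_bad_for_good": perf(m + 1, r - 1, u - 1) - perf(m, r, u),  # = 2**(2-|h|-u)
    }
```

Since every quantity is a sum of powers of two, the comparisons are exact in floating point.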
[Figure 1: mutation diagrams on the nodes good and bad, for the cases (a) u ≥ 2, (b) u = 1, (c) u = 0.]
Fig. 1. Arrows pointing towards the nodes indicate additions of variables and arrows pointing away from the nodes indicate removals of variables. Note that this is consistent with arrows indicating the swapping of variables. Thick solid lines indicate Δ > 0, simple lines indicate Δ = 0, and dashed lines indicate Δ < 0. Usually Figure 1a applies. When only one good variable is missing we have the case shown in Figure 1b. Once all good variables are discovered, Figure 1c applies; hence two arrows disappear. Note that an arrow with Δ > 0 may correspond to a beneficial or neutral mutation, depending on the value of the tolerance t.
Correlation produces a perhaps unexpected phenomenon already in the case of the uniform distribution: adding a bad variable can result in Δ being positive, 0 or negative, depending on the number of undiscovered variables.

We now turn to characterizing the best bounded-size approximations of concepts, implicit in the analysis of the swapping algorithm. The existence of such characterizations seems to be related to efficient evolvability, and so it may be of interest to formulate it explicitly. Such a characterization does not hold for product distributions in general, as noted in the next section. However, as shown in Section 6, the analogous characterization does hold for every product distribution if the fitness function is changed from correlation to covariance.

Theorem 1 (Structure of Best Approximations). The best q-approximation of a target c is c itself if |c| ≤ q, or any hypothesis formed by q good variables if |c| > q.

Proof. The claim follows directly from the definitions if |c| ≤ q. Let |c| > q, and let h be a hypothesis consisting of q good variables. Then both deleting a variable and swapping a good variable for a bad one decrease performance. Thus h cannot be improved among hypotheses of size at most q. If h has fewer than q variables, then it can be improved by adding a good variable. If h has q variables but contains a bad variable, then its performance can be improved by swapping a bad variable for a good one. Hence every hypothesis other than the ones described in the theorem can be improved among hypotheses of size at most q.

4.2
Evolving Monotone Conjunctions under the Uniform Distribution
The core of the algorithm for evolving monotone conjunctions outlined in Section 2 is the Mutator function, presented in Algorithm 1. The role of Mutator is, given a current hypothesis h, to produce a new hypothesis h' which has better performance than h if Bene is nonempty, or else a hypothesis
D.I. Diochnos and G. Turán
Algorithm 1. The Mutator Function under the Uniform Distribution
Input: q ∈ N*, samples s_{M,1}, samples s_{M,2}, a hypothesis h, a target c.
Output: a new hypothesis

if |h| > 0 then Generate N− else N− ← ∅;
if |h| < q then Generate N+ else N+ ← ∅;
if |h| ≤ q then Generate N+− else N+− ← ∅;
vb ← GetPerformance(h);
Initialize Bene, Neutral to ∅;
if |h| ≤ q then t ← 2^{−2q} else t ← 2^{1−q};       /* set tolerance */
for x ∈ N+, N−, N+− do
    SetWeight(x, h, N+, N−, N+−);
    if |x| ≤ q then SetPerformance(x, c, s_{M,1});  /* s_{M,1} examples */
    else SetPerformance(x, c, s_{M,2});             /* s_{M,2} examples */
    vx ← GetPerformance(x);
    if vx ≥ vb + t then Bene ← Bene ∪ {x};
    else if vx ≥ vb − t then Neutral ← Neutral ∪ {x};
SetWeight(h, h, N+, N−, N+−);
Neutral ← Neutral ∪ {h};
if Bene ≠ ∅ then return RandomSelect(Bene);
else return RandomSelect(Neutral);
h' with about the same performance as h, in which case h' arises from h by a neutral mutation. Hence, during the evolution, a sequence of g generations gives rise to g calls to Mutator. We pass slightly different parameters to Mutator from those defined in [14], to avoid ambiguity. Hence, Mutator receives as input q, the maximum allowed size of the approximation; s_{M,1}, the sample size used for all the empirical estimates of the performance of conjunctions of size up to q; s_{M,2}, the sample size used for conjunctions of length greater than q; and the current hypothesis h. We view conjunctions as objects that have two extra attributes, their weight and the value of their performance. GetPerformance returns the value of the performance, previously assigned by SetPerformance. The performance of the initial hypothesis has been determined by a similar call to the SetPerformance function with the appropriate sample size. Weights are assigned via SetWeight: SetWeight assigns the same weight to all members of {h} ∪ N− ∪ N+ so that they add up to 1/2, and the same weight to all members of N+− so that they add up to 1/2. Finally, RandomSelect computes the sum W of the weights of the conjunctions in the set that it receives as argument, and returns a hypothesis h' from that set with probability w_{h'}/W, where w_{h'} is the weight of h'. Note that the neighborhoods and the tolerances are different for short and long hypotheses, where a hypothesis h is short if |h| ≤ q, and

q = ⌈lg(3/ε)⌉.    (9)
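For intuition, here is a compact executable sketch of one Mutator generation. Two simplifications are ours and not part of the algorithm as defined in [14]: the empirical estimates from the samples s_{M,1}, s_{M,2} are replaced by exact performance values (so SetPerformance/GetPerformance collapse into a direct computation), and the selection inside the winning set is uniform rather than weighted:

```python
import random

def perf_un(h, c):
    """Exact PerfUn(h, c) for monotone conjunctions given as index sets."""
    m, r, u = len(h & c), len(h - c), len(c - h)
    M, R, U = 2.0 ** -m, 2.0 ** -r, 2.0 ** -u
    return 1 - 2 * M * (R + U - 2 * R * U)

def mutator(h, c, n, q, rng):
    """One generation of the swapping algorithm (exact-performance sketch)."""
    everything = frozenset(range(n))
    n_minus = [h - {x} for x in h]                       # empty if |h| = 0
    n_plus = [h | {x} for x in everything - h] if len(h) < q else []
    n_swap = ([(h - {y}) | {x} for y in h for x in everything - h]
              if len(h) <= q else [])
    t = 2.0 ** (-2 * q) if len(h) <= q else 2.0 ** (1 - q)
    vb = perf_un(h, c)
    bene, neutral = [], [h]                              # h itself is neutral
    for x in n_plus + n_minus + n_swap:
        vx = perf_un(x, c)
        if vx >= vb + t:
            bene.append(x)
        elif vx >= vb - t:
            neutral.append(x)
    # the real algorithm weights {h} ∪ N− ∪ N+ and N+− to total 1/2 each;
    # a uniform choice within the winning set is enough for this sketch
    return rng.choice(bene) if bene else rng.choice(neutral)

if __name__ == "__main__":
    rng = random.Random(0)
    target = frozenset({0, 1})         # a short target
    h = frozenset()
    for _ in range(12):                # Theorem 2: ~2q beneficial steps suffice
        h = mutator(h, target, n=6, q=3, rng=rng)
    assert h == target
    print("evolved to the target:", sorted(h))
```

Since beneficial mutations strictly raise the performance and the target is the only neutral neighbor of itself, the loop reaches the target within about 2q steps and then stays there, regardless of the random choices.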
Theorem 2. For every target conjunction c and every initial hypothesis h0, after O((q + |h0|) ln(1/δ)) iterations, each iteration evaluating the performance of O(nq) hypotheses, and each performance being evaluated using sample size O((1/ε)^4 (ln n + ln(1/δ) + ln(1/ε))) per iteration, equation (2) is satisfied.

Proof. The analysis depends on the size of the target and the initial hypothesis.

Short Initial Hypothesis and Short Target. Note first that for any hypothesis h and any target c such that |h| = m + r ≤ q and |c| = m + u ≤ q, the nonzero values of the quantity Δ satisfy |Δ| ≥ 2^{1−m−r−u} ≥ 2^{1−(m+r)−(m+u)} = 2^{1−(|h|+|c|)} ≥ 2^{1−2q}. The tolerance for short hypotheses is t = (1/2) · 2^{1−2q} = 2^{−2q}. Hence, as long as the estimate of the performance is within t of its exact value, beneficial mutations are identified as beneficial. Therefore it is sufficient to analyze the signs of Δ along the arrows in Figure 1. Note that deleting a good variable is always deleterious, and so u is non-increasing. If there are at least two undiscovered variables (i.e., u ≥ 2, corresponding to Figure 1a), then beneficial mutations can only add or swap variables. Each swap increases the number of good variables, and so after |c| − 1 swaps there is at most one undiscovered variable. Hence, as long as u ≥ 2, there can be at most q − |h0| additions and at most |c| − 1 swaps. If there is one undiscovered variable (i.e., u = 1, corresponding to Figure 1b), then, in one step, the first beneficial mutation brings this variable into the hypothesis, and all variables become discovered. If all variables are discovered (i.e., u = 0, corresponding to Figure 1c), then beneficial mutations are those which delete bad variables from the current hypothesis. After we get to the target, there are no beneficial mutations, and the only neutral mutation is the target itself, hence there is no change. Thus the number of steps in this last phase is at most q − |c|.
Summing up the above, the total number of steps is at most 2q − |h0| ≤ 2q.

Short Initial Hypothesis and Long Target. As long as |h| < q, we have u ≥ 2, corresponding to Figure 1a. Therefore adding any variable is beneficial. Note that replacing a bad variable by a good one may or may not be beneficial, depending on the size of c. The same analysis as above implies that after at most 2q beneficial mutations we reach a hypothesis of size q. If |c| ≥ q + 2, then u ≥ 2 continues to hold, and so all beneficial or neutral mutations keep the hypothesis size at q. Moreover, by Corollary 1, all those hypotheses have performance at least 1 − ε. If |c| = q + 1, then after reaching level q there is one undiscovered variable, corresponding to Figure 1b. Swaps of bad variables for good ones are beneficial. Combining these mutations with the ones needed to reach level q, we can bound the total number of steps until reaching a hypothesis of q good variables by 2q (using the same argument as above). After that, there are only neutral mutations swapping a good variable with another good one, and again all those hypotheses have performance at least 1 − ε.
As a summary, if we start from a short hypothesis and all the empirical tests perform as expected, then we are always at a good hypothesis after 2q iterations. This is not the case when we start from a long hypothesis.

Long Initial Hypothesis. For long hypotheses the neighborhood consists of the hypotheses obtained by deleting a literal, and the hypothesis itself. We set the tolerance in such a way that every hypothesis in the neighborhood is neutral. This guarantees that with high probability we arrive, in O(|h0| ln(1/δ)) iterations, at a hypothesis of size at most q, and from then on we can apply the analysis of the previous two cases. The model assumes that staying at a hypothesis is always a neutral mutation, hence it is possible to end up in a hypothesis of size bigger than q. Computing sample sizes is done by standard Chernoff bound arguments, and the details are omitted.
5
Monotone Conjunctions under Product Distributions Using Correlation
A product distribution over {0, 1}^n is specified by the probabilities p = (p1, ..., pn), where pi is the probability of setting the variable xi to 1. The probability of a truth assignment (a1, ..., an) ∈ {0, 1}^n is ∏_{i=1}^{n} pi^{ai} · (1 − pi)^{1−ai}. For the uniform distribution Un the probabilities are p1 = ... = pn = 1/2. We write Pn to denote a fixed product distribution, omitting p for simplicity.

Let us consider a target c and a hypothesis h as in (6). Let index(z) be the function that returns the set of indices of the variables occurring in a conjunction z. We define the sets

M = index(h) ∩ index(c),  R = index(h) \ M,  and  U = index(c) \ M.

We can now define the products

M = ∏_{i∈M} pi,  R = ∏_{ℓ∈R} pℓ,  and  U = ∏_{k∈U} pk.

Finally, set |M| = m, |R| = r, and |U| = u. Then (7) generalizes to

PerfPn(h, c) = 1 − 2M(R + U − 2RU).    (10)

We impose some conditions on the pi's in the product distribution.

Definition 2 (Nondegenerate Product Distribution). A product distribution Pn given by p = (p1, ..., pn) is μ-nondegenerate if
– min{pz, 1 − pz} ≥ μ for every variable z;
– the difference of any two members of the multiset {p1, 1 − p1, ..., pn, 1 − pn} is zero, or has absolute value at least μ.

The following lemma and its corollary are analogous to Lemma 1 and Corollary 1.

Lemma 2 (Performance Lower Bound). Let h be a hypothesis such that |h| ≥ q − 1 and c a target such that |c| ≥ q + 1. Then PerfPn(h, c) > 1 − 6.2 · e^{−μq}.

Corollary 2. Let q ≥ (1/μ) ln(6.2/ε). If |h| ≥ q − 1 and |c| ≥ q + 1, then PerfPn(h, c) > 1 − ε.
[Figure 2 diagrams: (a) U < 1/2, (b) U = 1/2, (c) U > 1/2; in case (a), two arrows are marked with a ?]
Fig. 2. The style and the directions of arrows have the same interpretation as in Figure 1. The middle layer represents variables that have the same probability of being satisfied under the distribution, i.e., pgood = pbad. A node x one level above another node y indicates a higher probability of satisfying the variable x, i.e., px > py. Here we distinguish the three basic cases for U; two arrows in the first case carry a ? to indicate that Δ cannot be determined by simply distinguishing cases for U.
5.1
Properties of the Local Search Procedure
We want to generalize the results of Section 4.1 by looking at the quantity

Δ = PerfPn(h', c) − PerfPn(h, c),    (11)

which corresponds to (8). We use (10) for the different types of mutations. The signs of Δ depend on the ordering of the probabilities pi. A variable xi is smaller (resp., larger) than a variable xj if pi < pj (resp., pi > pj). If pi = pj then xi and xj are equivalent. Analyzing Δ, we draw the different cases in Figure 2. However, when U < 1/2, two arrows cannot be determined. These cases refer to mutations where we replace a bad variable with a larger good one, or a good variable with a smaller bad one. Both mutations depend on the distribution; the latter has Δ = −2MR(p_in/p_out − 1 + 2U(1 − p_in)), where out is the good variable and in is the smaller bad variable. One application of this equation is that the Structure Theorem 1 does not hold under product distributions. The other application is the construction of local optima; an example follows.

Example 1. Let Pn be a distribution such that p1 = p2 = 1/3, and the rest of the n − 2 variables are satisfied with probability 1/10. Set the target c = x1 ∧ x2. A hypothesis h formed by q bad variables has performance PerfPn(h, c) = 1 − 2 Pr[error region] < 1 − 2/9 = 7/9. Note that, for the nonzero values of Δ, it holds that |Δ| ≥ 2μ^{q+2}. Hence, by setting the tolerance t = μ^{q+2}, and the accuracy of the empirical tests on conjunctions of size at most q equal to the tolerance t = μ^{q+2}, all the arrows in the diagrams can be determined precisely. Starting from h0 = ∅, there are sequences of beneficial mutations in which the algorithm inserts a bad variable in each step, e.g., h0 = ∅, h1 = x3, ..., hq = x3 ∧ ... ∧ x_{q+2}. This is a local optimum, since swapping a bad variable
with a good one yields Δ < 0. Note that μ = 1/10, q = ⌈10 ln(62)⌉ = 42, and for ε = 1/10 the algorithm is stuck in a hypothesis with PerfPn(hq, c) < 1 − ε.

Under the setup of the example above, the algorithm inserts q bad variables in the first q steps with probability Γ = ∏_{r=0}^{q−1} (1 − 2/(n − r)) = (n − q)(n − q − 1)/(n(n − 1)). Requiring n ≥ 2q/δ, we have Γ ≥ 1 − δ. Hence, starting from the empty hypothesis, the algorithm will fail for any ε < 2/10, with probability 0.9, if we set n ≥ 840.

5.2
Special Cases
Although for arbitrary targets and arbitrary product distributions we cannot guarantee that the algorithm will produce a hypothesis h such that (2) is satisfied, we can, however, pinpoint some cases where the algorithm succeeds with the correct setup. These cases are targets of size at most 1 or greater than q, and heavy targets, i.e., targets that are satisfied with probability at least 1/2.
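A quick numerical check of Example 1 (a sketch with a smaller q than in the text, for speed) illustrates the failure mode described above: every prefix of bad insertions is a chain of beneficial mutations under the correlation fitness (10), yet the resulting hypothesis is a local optimum with performance below 7/9:

```python
def perf(p, h, c):
    """Perf_Pn(h, c) = 1 - 2M(R + U - 2RU), eq. (10); h, c are index sets."""
    M = R = U = 1.0
    for i in h & c: M *= p[i]
    for i in h - c: R *= p[i]
    for i in c - h: U *= p[i]
    return 1 - 2 * M * (R + U - 2 * R * U)

if __name__ == "__main__":
    q, n = 5, 12                       # the text uses q = 42 and n >= 840
    p = [1 / 3, 1 / 3] + [1 / 10] * (n - 2)
    c = {0, 1}                         # target x1 AND x2 (0-based indices)
    # inserting one bad variable at a time strictly increases performance
    prev = perf(p, set(), c)
    for k in range(2, 2 + q):
        cur = perf(p, set(range(2, k + 1)), c)
        assert cur > prev
        prev = cur
    h = set(range(2, 2 + q))           # q bad variables
    # ... yet h is a local optimum: every bad-for-good swap lowers performance
    for out in h:
        for good in c:
            assert perf(p, (h - {out}) | {good}, c) < perf(p, h, c)
    assert perf(p, h, c) < 7 / 9
    print("local optimum confirmed; Perf =", perf(p, h, c))
```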
6
Covariance as a Fitness Function
The discussion in the previous section shows that there are problems with extending the analysis of the swapping algorithm from the uniform distribution to product distributions. In this section we explore the possibility of handling product distributions with a different fitness function, covariance, given by (5). Using the same notation as in (6), and with M, R, and U representing the sets of indices as in the previous section, the first term is given by (10). Furthermore,

E[h] = −1 + 2 · ∏_{i∈M} pi · ∏_{ℓ∈R} pℓ = −1 + 2MR,
E[c] = −1 + 2 · ∏_{i∈M} pi · ∏_{k∈U} pk = −1 + 2MU.    (12)

Thus from (10) and (12) we get

Cov[h, c] = 4MRU(1 − M).    (13)

We use (13) to examine the difference Δ = Cov[h', c] − Cov[h, c].
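As with (10), equation (13) can be validated by brute force over {0, 1}^n; a sketch with ±1-valued conjunctions:

```python
import itertools
import random

def cov_formula(p, h, c):
    """Cov[h, c] = 4MRU(1 - M), eq. (13); h, c are index sets."""
    M = R = U = 1.0
    for i in h & c: M *= p[i]
    for i in h - c: R *= p[i]
    for i in c - h: U *= p[i]
    return 4 * M * R * U * (1 - M)

def cov_bruteforce(p, h, c):
    """E[hc] - E[h]E[c] for +/-1-valued conjunctions over the product dist."""
    ehc = eh = ec = 0.0
    for a in itertools.product([0, 1], repeat=len(p)):
        pr = 1.0
        for pi, ai in zip(p, a):
            pr *= pi if ai else 1 - pi
        hv = 1 if all(a[i] for i in h) else -1
        cv = 1 if all(a[i] for i in c) else -1
        ehc += pr * hv * cv
        eh += pr * hv
        ec += pr * cv
    return ehc - eh * ec

if __name__ == "__main__":
    rng = random.Random(2)
    for _ in range(20):
        p = [rng.uniform(0.05, 0.95) for _ in range(6)]
        h = set(rng.sample(range(6), rng.randint(0, 4)))
        c = set(rng.sample(range(6), rng.randint(1, 4)))
        assert abs(cov_formula(p, h, c) - cov_bruteforce(p, h, c)) < 1e-9
    print("eq. (13) verified on random instances")
```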
Comparing h' ∈ N+ with h. We introduce a literal z in the hypothesis h. If z is good, then Δ = 4M²RU(1 − pz) > 0. If z is bad, then Δ = (pz − 1) Cov[h, c] ≤ 0, with equality if m = 0, i.e., M = 1.

Comparing h' ∈ N− with h. We remove a literal z from the hypothesis h. If z is good, then Δ = −4M²RU(1/pz − 1) < 0. If z is bad, then Δ = (1/pz − 1) Cov[h, c] ≥ 0, with equality if m = 0, i.e., M = 1.

Comparing h' ∈ N+− with h. We swap a literal out with a literal in. If out is good and in is good, then Δ = 4M²RU(1 − p_in/p_out); if p_out ≤ p_in then Δ ≤ 0, with Δ = 0 iff p_out = p_in, and if p_out > p_in then Δ > 0. If out is good and in is bad, then Δ = 4MRU · ((p_in − 1) + M · (1 − p_in/p_out)).
[Figure 3 diagrams: (a) M = 1, (b) M < 1]
Fig. 3. The style and the directions of arrows have the same interpretation as in the previous figures. Similarly, the hierarchy of nodes on levels has the same interpretation. Some arrows are missing in the left picture since there are no good variables in the hypothesis; i.e. M = 1.
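The sign pattern of Figure 3b (the case M < 1) can be confirmed numerically from (13) alone; the following sketch checks the six add/delete/swap cases on random nondegenerate product distributions:

```python
import random

def cov(p, h, c):
    """Cov[h, c] = 4MRU(1 - M), eq. (13); h, c are index sets."""
    M = R = U = 1.0
    for i in h & c: M *= p[i]
    for i in h - c: R *= p[i]
    for i in c - h: U *= p[i]
    return 4 * M * R * U * (1 - M)

if __name__ == "__main__":
    rng = random.Random(3)
    n = 8
    for _ in range(200):
        p = [rng.uniform(0.05, 0.95) for _ in range(n)]
        c = set(rng.sample(range(n), 4))
        bads = sorted(set(range(n)) - c)
        # hypothesis with one mutual (good) variable, so M < 1 (Figure 3b)
        h = {rng.choice(sorted(c))} | set(rng.sample(bads, 2))
        base = cov(p, h, c)
        good_out, bad_out = next(iter(h & c)), next(iter(h - c))
        good_in = next(iter(c - h))
        bad_in = next(iter(set(bads) - h))
        assert cov(p, h | {good_in}, c) > base                 # add good
        assert cov(p, h - {bad_out}, c) > base                 # delete bad
        assert cov(p, (h - {bad_out}) | {good_in}, c) > base   # bad -> good
        assert cov(p, h | {bad_in}, c) < base                  # add bad
        assert cov(p, h - {good_out}, c) < base                # delete good
        assert cov(p, (h - {good_out}) | {bad_in}, c) < base   # good -> bad
    print("sign pattern of Figure 3(b) confirmed on random instances")
```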
We now examine the quantity κ = (p_in − 1) + M · (1 − p_in/p_out):
– p_out ≤ p_in: Then (1 − p_in/p_out) ≤ 0 and (p_in − 1) < 0. Therefore Δ < 0.
– p_out > p_in: Note that M ≤ p_out < 1. Hence κ < p_in − 1 + 1 − p_in/p_out = p_in(1 − 1/p_out) < 0.

If out is bad and in is bad, then Δ = (p_in/p_out − 1) · Cov[h, c], and Cov[h, c] ≥ 0:
– p_out ≤ p_in: In this case Δ ≥ 0, and Δ = 0 when m = 0 or p_out = p_in.
– p_out > p_in: In this case Δ ≤ 0, and Δ = 0 when m = 0.

If out is bad and in is good, then Δ = 4MRU(1/p_out − 1 + M(1 − p_in/p_out)). We examine the quantity κ = 1/p_out − 1 + M(1 − p_in/p_out):
– p_out < p_in: Note that M ≤ 1. Hence κ > 1/p_out − 1 + 1 − p_in/p_out = (1 − p_in)/p_out > 0.
– p_out ≥ p_in: In this case p_in/p_out ≤ 1, so κ > 0 and hence Δ > 0.

The effects of the different mutations are summarized in Figure 3. Compared to Figure 2, the picture is remarkably simple and uniform, and can be summarized as follows. If there are some mutual (i.e., good) variables in the hypothesis, then
– Δ > 0 if a good variable is added, a bad variable is deleted, a bad variable is replaced by a good one, a good variable is replaced by a smaller good one, or a bad variable is replaced by a larger bad one;
– Δ < 0 if a good variable is deleted, a bad variable is added, a good variable is replaced by a bad one, a good variable is replaced by a larger good one, or a bad variable is replaced by a smaller bad one;
– Δ = 0 if a good variable is replaced by an equivalent good one, or a bad variable is replaced by an equivalent bad one.

If there are no mutual variables in the hypothesis, then
– Δ > 0 if a good variable is added, or a good variable replaces a bad one;
– Δ = 0 if a bad variable is added, deleted, or replaced by a bad one.

Note that the beneficiality or neutrality of a mutation is not determined by these observations alone; it also depends on the tolerances. Nevertheless, these properties are
sufficient for an analogue of Theorem 1 on the structure of best approximations to hold for product distributions and the covariance fitness function.

Theorem 3 (Structure of Best Approximations). The best q-approximation of a target c with |c| ≥ 1 is c itself if |c| ≤ q, or, if |c| > q, any hypothesis formed by q good variables such that the product ∏_{i=1}^{q} pi is minimized.

As mentioned earlier, the existence of characterizations of best approximations is related to evolvability. This relationship is now illustrated for product distributions and the covariance fitness function. First we introduce an idealized version of evolution algorithms, where beneficiality depends on the precise value of the performance function.

Definition 3 (Unbounded-Precision Evolution Algorithm). An evolution algorithm is unbounded-precision if, instead of (3), it uses

Bene = N ∩ {h' | PerfDn(h', c) > PerfDn(h, c)},
Neut = N ∩ {h' | PerfDn(h', c) = PerfDn(h, c)},    (14)

or, equivalently, arbitrary tolerance to determine which hypotheses are beneficial, neutral or deleterious. All other parts of the definition are unchanged.

Consider the following unbounded-precision evolution algorithm: starting from an arbitrary initial hypothesis, apply beneficial mutations as long as possible. The beneficial mutations can add a good variable, delete a bad variable, replace a bad variable by a good one, replace a good variable by a smaller good one, or replace a bad variable by a larger bad one. The amortized analysis argument of Theorem 5 in the next section shows that the number of steps is O(n²). Hence the following result holds.

Theorem 4. The swapping algorithm using the covariance fitness function is an efficient evolution algorithm for monotone conjunctions over product distributions.

6.1
Evolving Short Monotone Conjunctions under µ-Nondegenerate Product Distributions
The problem with applying the unbounded-precision algorithm in the bounded-precision model is that the presence of the factor U in Δ may make the performance differences superpolynomially small. If we assume that the product distribution is nondegenerate and the target is short, then this cannot happen, and an analysis similar to that of Theorem 2 shows that we indeed get an efficient evolution algorithm. In Section 7 we give some remarks on possibilities for handling long targets. We set

q = O((1/μ) ln(1/ε)).
Theorem 5. Let Pn be a μ-nondegenerate product distribution. The swapping algorithm, using the covariance fitness function, evolves non-empty short (1 ≤ |c| ≤ q) monotone conjunctions starting from an initial hypothesis h0 in O(nq + |h0| ln(1/δ)) iterations, each iteration examining the performance of O(nq) hypotheses, and each performance being evaluated using sample size O((1/μ)^4 · (1/ε)^{(4/μ) ln(1/μ)} · (ln n + ln(1/δ) + ln(1/μ) + ln(1/ε))).

Proof. The analysis is similar to that of Theorem 2.

Short Initial Hypothesis. Again, we are interested in the nonzero values of Δ so that, given representative samples, we can characterize precisely all the mutations. It holds that |Δ| ≥ 4μ^{2q+2}. Therefore, we set the tolerance t = 2μ^{2q+2}, and require accuracy t = 2μ^{2q+2} for the empirical estimates. Without loss of generality, we are in the state of Figure 3b; otherwise, in one step a good variable appears in the hypothesis, and we move from Figure 3a to Figure 3b. Throughout the evolution we implicitly maintain two arrays: one with the good variables and one with the bad variables of the hypothesis. The array of bad variables shrinks, while the array of good variables expands. Due to mutations that swap good with good variables, and bad with bad variables, each entry of those arrays can take at most n different values, since the entries of the bad array increase in value while the entries of the good array decrease in value; i.e., we have O(nq) such mutations overall. Mutations that change the parameters m, r increase the number of good variables or decrease the number of bad variables, hence we can have at most O(q) such mutations.

Long Initial Hypothesis. The arguments are similar to those in Theorem 2, and in O(|h0| ln(1/δ)) stages the algorithm forms a short hypothesis of size at most q. Then we apply the analysis above.
7
Further Remarks
We are currently working on completing the results of Section 6, by handling the case of long targets. The problem with long targets is that the term U appearing in Δ can be small. This may lead to the increase in fitness being smaller than the tolerance. The following observation may provide a starting point. It may happen that there are no beneficial mutations, but all mutations are neutral. In this case the evolution process is similar to a random walk on the set of short hypotheses. However, most short hypotheses have maximal size. Therefore, after sufficiently many steps in the random walk, the hypothesis will have maximal size with high probability. Such hypotheses have good performance, and so the evolution process leads to a good hypothesis with high probability. The evolvability of monotone conjunctions for more general classes of distributions, using the swapping algorithm with the covariance fitness function, or
using some other simple evolutionary mechanism, is an important open problem [14]. There is a similar swapping-type learning algorithm for decision lists, where a single step exchanges two tests in the list [13, 1]. Can such an algorithm be used in the evolution model? A positive answer could give an alternative to Michael’s Fourier-based approach [9]. In summary, it appears that from the perspective of learning theory, one of the remarkable features of Valiant’s new model of evolvability is that it puts basic, well-understood learning problems in a new light and poses new questions about their learnability. One of these new questions is the performance of basic, simple evolution mechanisms, like the swapping algorithm for monotone conjunctions. The results of this paper suggest that the analysis of these mechanisms may be an interesting challenge.
References

[1] Castro, J., Balcázar, J.L.: Simple PAC Learning of Simple Decision Lists. In: Zeugmann, T., Shinohara, T., Jantke, K.P. (eds.) ALT 1995. LNCS, vol. 997, pp. 239–248. Springer, Heidelberg (1995)
[2] Feldman, V.: Evolvability from learning algorithms. In: STOC 2008, pp. 619–628. ACM, New York (2008)
[3] Feldman, V.: Robustness of Evolvability. In: COLT 2009 (2009)
[4] Furst, M.L., Jackson, J.C., Smith, S.W.: Improved learning of AC0 functions. In: COLT 1991, pp. 317–325 (1991)
[5] Hancock, T., Mansour, Y.: Learning monotone kμ DNF formulas on product distributions. In: COLT 1991, pp. 179–183 (1991)
[6] Kalai, A.T., Teng, S.-H.: Decision trees are PAC-learnable from most product distributions: a smoothed analysis. CoRR, abs/0812.0933 (2008)
[7] Kearns, M.: Efficient noise-tolerant learning from statistical queries. J. ACM 45(6), 983–1006 (1998)
[8] Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory. MIT Press, Cambridge (1994)
[9] Michael, L.: Evolvability via the Fourier Transform (2009)
[10] Reischuk, R., Zeugmann, T.: A Complete and Tight Average-Case Analysis of Learning Monomials. In: Meinel, C., Tison, S. (eds.) STACS 1999. LNCS, vol. 1563, pp. 414–423. Springer, Heidelberg (1999)
[11] Ros, J.P.: Learning Boolean functions with genetic algorithms: A PAC analysis. In: FGA, San Mateo, CA, pp. 257–275. Morgan Kaufmann, San Francisco (1993)
[12] Servedio, R.A.: On learning monotone DNF under product distributions. Inf. Comput. 193(1), 57–74 (2004)
[13] Simon, H.-U.: Learning decision lists and trees with equivalence-queries. In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 322–336. Springer, Heidelberg (1995)
[14] Valiant, L.G.: Evolvability. In: Kučera, L., Kučera, A. (eds.) MFCS 2007. LNCS, vol. 4708, pp. 22–43. Springer, Heidelberg (2007)
A Generic Algorithm for Approximately Solving Stochastic Graph Optimization Problems

Ei Ando¹*, Hirotaka Ono¹,², and Masafumi Yamashita¹,²

¹ Dept. Computer Sci. and Communication Eng., Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, Fukuoka 819-0395, Japan
² Institute of Systems, Information Technologies and Nanotechnologies, Fukuoka SRP Center Building 7F, 2-1-22 Momochihama, Sawara-ku, Fukuoka 814-0001, Japan
Abstract. Given a (directed or undirected) graph G = (V, E), mutually independent random variables Xe, each obeying a normal distribution, that represent the edge weights e ∈ E, and a property P on graphs, a stochastic graph maximization problem asks for the distribution function FMAX(x) of the random variable XMAX = max_{P∈P} Σ_{e∈A} Xe, where the property P is identified with the set of subgraphs P = (U, A) of G having P. This paper proposes a generic algorithm for computing an elementary function F̃(x) that approximates FMAX(x). It is applicable to any P and runs in time O(T_{AMAX}(P) + T_{ACNT}(P)), provided the existence of an algorithm AMAX that solves the (deterministic) graph maximization problem for P in time T_{AMAX}(P) and an algorithm ACNT that outputs an upper bound on |P| in time T_{ACNT}(P). We analyze the approximation ratio and apply the algorithm to three graph maximization problems. In case no efficient algorithm is known for solving the graph maximization problem for P, an approximation algorithm AAPR can be used instead of AMAX to reduce the time complexity, at the expense of an increase in the approximation ratio. Our algorithm can be modified to handle minimization problems.
1
Introduction
1.1
Motivation and Contribution
Given a (directed or undirected) graph G = (V, E), a function w : E → R that assigns a real number w(e) to each edge e ∈ E as its weight, and a property P on graphs, a typical graph optimization problem asks for a subgraph P = (U, A) of G having P that maximizes (or minimizes) its weight w(P), where w(P) = Σ_{e∈A} w(e). Examples of P include path, matching, spanning tree, edge cut and so on. The set of subgraphs P = (U, A) of G having P is called the feasible set. In this paper, we identify a property P with its feasible set and abuse P to denote it, so |P| is the cardinality of the feasible set.

A promising approach to solving a computational problem that we encounter in the real world is to model it as a weighted graph and then to solve it using a
Supported by Research Fellowships of the Japan Society for the Promotion of Science for Young Scientists.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 89–103, 2009. © Springer-Verlag Berlin Heidelberg 2009
known graph optimization algorithm. However, there are two main concerns. First, many graph optimization problems are known to be NP-hard, and there may be no efficient algorithm for solving the graph optimization problem that we want to solve. In such a case, an approximation algorithm is frequently used instead. Second, the quantities in real applications that are modeled as edge weights tend to vary or to be uncertain. In order to model variable and uncertain quantities, we can introduce for each edge e ∈ E a random variable Xe denoting its edge weight and formulate the problem as a stochastic graph optimization problem, which typically asks for the distribution function FMAX(x) (or the value FMAX(d) or F̃^{-1}MAX-free constant evaluations FMAX^{-1}(a) for a constant d or a) of the random variable XMAX = max_{(U,A)∈P} Σ_{e∈A} Xe (or the distribution function FMIN(x) of the random variable XMIN = min_{(U,A)∈P} Σ_{e∈A} Xe), given a graph G = (V, E), a distribution that Xe obeys for each e ∈ E, and a graph property P.

This paper proposes a generic algorithm ALG to compute an elementary function F̃(x) that approximates FMAX(x), provided that the random variables Xe are mutually independent and obey normal distributions N(μe, σe²). ALG is applicable to any property P. It assumes the existence of two algorithms: an algorithm AMAX to solve the graph maximization problem for P in time T_{AMAX}(P), and an algorithm ACNT to approximate |P|, i.e., the cardinality of the feasible set for P, in time T_{ACNT}(P). We compute μmax = max_{(U,A)∈P} Σ_{e∈A} μe and σ²max = max_{(U,A)∈P} Σ_{e∈A} σe² using AMAX, and an upper bound c on |P| using ACNT. Then an elementary formula representing F̃(x) is determined from μmax, σ²max and c. Once the formula is computed, we can evaluate F̃(d) in O(log c) time for any d. A basic approximation property of F̃(x) is the inequality F̃(x) ≤ F(x) for any x ≥ μmax, or equivalently, F^{-1}(y) ≤ F̃^{-1}(y) for any y ≥ 1/2. The value F̃^{-1}(a) is more important than F̃(d) in many applications.
To see this, let us consider a typical application, circuit delay analysis. In this context, F(x) is the probability that the circuit delay of a product is less than or equal to x, and a main target of the delay analysis is to estimate F^{-1}(a) for a given a, since F^{-1}(a) = d means that the delay of a product is less than or equal to d with probability at least a. Since F^{-1}(a) ≤ F̃^{-1}(a), taking F̃^{-1}(a) as an estimate of F^{-1}(a) is a safe choice to guarantee the success ratio a. Interestingly, we can calculate F̃^{-1}(a) even faster than F̃(d); F̃^{-1}(a) is computable in O(1) time. Of course, applications require not only that F̃(x) ≤ F(x) holds, but also a guarantee on the approximation ratio. Let h = max_{(U,A)∈P} |A|, ν = min_{e∈E} μe, ρ² = max_{e∈E} σe², and let c be the output of ACNT. We show that, for any constant a ≥ 1/2, the approximation ratio F̃^{-1}(a)/F^{-1}(a) is bounded from above by O((ρ/ν)√(log c / h)). For guaranteeing F^{-1}(a) > 0, we will assume that μe > 0 for any e ∈ E. Note that in O((ρ/ν)√(log c / h)) we consider a as a constant (although we could evaluate the ratio as a function of a, we omit this for simplicity). One might ask why we do not adopt F(d)/F̃(d) but rather F̃^{-1}(a)/F^{-1}(a) as the approximation measure. An immediate reason is that many applications are
more interested in F^{-1}(a) than F(d), as explained in the previous paragraph. Another reason is that a constant approximation ratio is trivially achievable if we define it as F(d)/F̃(d) (or F̃(d)/F(d)), since F(x) ≤ 1. The third reason is that this definition of the approximation ratio can be considered a natural extension of that for (deterministic) maximization problems (see e.g., [21]). A graph maximization problem is a special case of the corresponding stochastic graph maximization problem in which all variances of edge weights are 0. The distribution function FMAX(x) is then a step function, and the optimal cost is the point at which FMAX(x) jumps from 0 to 1. The counterpart of this point in the general stochastic problem is FMAX^{-1}(y) for y (0 < y < 1). Note that we do not deal with the case y < 1/2, since FMAX^{-1}(y) may be 0 or negative.

The following four facts are worth emphasizing. We discuss Items 3 and 4 further in Section 6.

1. The stochastic graph optimization problem is very hard to solve. The problem of correctly computing the distribution function has been discussed only for a couple of properties under very restricted conditions, and usually Monte Carlo simulation is conducted to evaluate (i.e., to approximate) FMAX(d) at a single point d. Monte Carlo simulation is very slow when the precision must be guaranteed. Algorithm ALG, which is applicable to any property, outputs a function F̃(x) that approximates FMAX(x), and its quality is formally guaranteed.

2. For many properties P, we can derive a good upper bound h' on h such that h' = O(h). In such cases, since c = (me/h')^{h'} ≥ |P| is an upper bound on |P|, ALG can guarantee a near-constant approximation ratio (ρ/ν)√(log c / h) ≤ (ρ/ν)√(h' log(me)/h) = O(√(log m)) = O(√(log n)), where n = |V|, m = |E| and e is Napier's constant.

3. If the graph optimization problem for P is difficult, we can use an approximation algorithm AAPR for the problem instead of AMAX.
Since 2^m is a trivial upper bound on |P|, we can tune ALG so that it meets a given time constraint, e.g., linear time, by using a linear-time approximation algorithm A_APR instead of A_MAX.

4. This paper discusses maximization problems under the assumption that each random variable X_e obeys a normal distribution. However, we can modify ALG to solve minimization problems. Also, we can handle other distributions such as the gamma distributions.

To demonstrate how ALG is applied to properties, we investigate three graph maximization problems: the longest path problem for directed acyclic graphs (DAGs), the maximum spanning tree problem, and the maximum weight perfect matching problem for planar graphs. We choose these problems since they have efficient algorithms A_MAX and A_CNT, but, as was emphasized above, the existence of efficient algorithms A_MAX and A_CNT is not a necessary condition for ALG to be applicable to a problem.
1.2 Related Works
The study of stochastic optimization problems has a considerably long history, starting at least in the 1950s (see, e.g., [22]). Nevertheless, not so many works have been done on computing (or approximating) the distribution function of the optimal value. PERT [17] and Critical Path Planning [13] are classical approaches to approximately solving the stochastic longest path problem for DAGs. Critical Path Planning transforms the stochastic problem into a (deterministic) longest path problem by taking some definite values as edge weights, and PERT tries to approximate the expected longest path length, whose accuracy is discussed later in [6]. Martin [16] proposed an algorithm for symbolically computing the exact distribution function, provided that the random variables obey distributions whose density functions are polynomials. His algorithm needs to consider all distinct paths and hence its time complexity is Ω(2ⁿ) for some DAGs of order n.

There are several works on approximating the distribution function of the longest path length in a DAG, from the viewpoint of estimating the project completion time of a PERT network. Kleindorfer [14] proposed an algorithm that computes upper and lower bounds of the distribution function. Dodin [7] improved Kleindorfer's bounds. However, it is not clear how close their bounds are to the exact value. Badinelli [3] proposed to measure the quality of an approximation F̃(x) on an interval [x_L, x_U] by

  (1/(x_U − x_L)) ∫_{x_L}^{x_U} |F̃(t) − F(t)| / F(t) dt,

where F(x) is the function to be approximated. Ludwig et al. [15] evaluated the project completion time of stochastic project networks empirically. These results again do not give any theoretical approximation guarantee. A natural idea is to discretize the ranges of the random variables and then to apply a combinatorial optimization algorithm.
Hagstrom [8], however, showed that computing a value of the distribution function of the project duration in the PERT problem is #P-complete, and hence this approach does not work efficiently for large instances. Motivated by the importance of circuit delay analysis, many heuristic algorithms to approximate the distribution function of the longest path length have been proposed [4, 9, 19]. They run fast, but they also have no approximation guarantee. Recently, Ando et al. [1] proposed an algorithm for approximating the distribution function of the longest path length in a DAG; it runs in linear time and formally guarantees its approximation ratio.

Another line of work on stochastic graphs asks an algorithm to compute a feasible set that optimizes a given measure. For example, Ishii et al. [11] proposed a polynomial time algorithm for the following problem: given a graph G and a real number a, compute a spanning tree T of G that minimizes the value d subject to the constraint that the probability that the weight of T is at most d is at least a. Recently, a substantial number of works have been done on the recourse model (see, e.g., [18]).
All of the methods and algorithms mentioned above are designed for particular problems, and their purposes differ from ours in this paper, which presents a generic algorithm. The only general method we have in mind is Monte Carlo simulation. It has many advantages, but it still has drawbacks: it takes a long computation time to guarantee stable behavior, it cannot formally guarantee an approximation ratio, it cannot give us a formula that approximates the distribution function, and so on.

The paper is organized as follows: Section 2 prepares definitions and notation. In Section 3, we first give an inequality that bounds F_MAX(x) from below by utilizing an enumeration of all P ∈ P, and then, using this inequality, we present an algorithm ALG to calculate an elementary formula F̃(x) that approximates F_MAX(x). We also propose an efficient method to compute F̃⁻¹(a) in O(1)-time. Section 4 derives the approximation ratio. Three application examples are given in Section 5. In Section 6, we conclude the paper by discussing how to extend algorithm ALG to more general cases and giving some open problems.
2 Preliminaries
Let X be a random variable. The probability P(X ≤ x) is called the (cumulative) distribution function of X. We call the derivative of the distribution function with respect to x the (probability) density function. A random variable X is normally distributed, or obeys a normal distribution, when its distribution function is given as

  P(X ≤ x) = ∫_{−∞}^{x} (1/(σ√(2π))) exp(−(1/2)((t − μ)/σ)²) dt,

where μ and σ² are the mean and the variance of X, respectively. A normal distribution with mean μ and variance σ² is denoted by N(μ, σ²). The normal distribution N(0, 1) is called the standard normal distribution, and its distribution function and density function are denoted by Φ(x) and φ(x), respectively. The distribution function F(x) of N(μ, σ²) is given as F(x) = Φ((x − μ)/σ).

It is well known that the sum of two independent normally distributed random variables also obeys a normal distribution. Let random variables X1 and X2 obey N(μ1, σ1²) and N(μ2, σ2²), respectively. The distribution of the sum X1 + X2 is given by N(μ1 + μ2, σ1² + σ2²). The distribution of max{X1, X2} is not normal even if X1 and X2 are normally distributed; rather, the distribution function of max{X1, X2} is the product of the distribution functions of X1 and X2. Let F1(x) and F2(x) be the distribution functions of X1 and X2, and suppose that F3(x) is the distribution function of max{X1, X2}. Then we have F3(x) = F1(x)F2(x). Sometimes we assume that mutually independent random variables X and X′ have distribution functions F(x) and F′(x) that satisfy F(x) = F′(x) for any x. In this case, X and X′ are mutually independent random variables with a common distribution.

We denote an undirected edge between two vertices a and b by {a, b}. A directed edge from a vertex a to b is denoted by (a, b). An undirected graph is denoted by
a pair G = (V, E), where V is its vertex set V = {v0, v1, ..., v_{n−1}} and E is its undirected edge set E ⊆ {{v_i, v_j} | v_i, v_j ∈ V}. A directed graph is denoted by a pair G⃗ = (V, E), where V is again its vertex set and E is its directed edge set E ⊆ {(v_i, v_j) | v_i, v_j ∈ V}. Each edge e ∈ E is associated with an edge weight, a random variable X_e that obeys a normal distribution N(μ_e, σ_e²), where μ_e > 0 and σ_e > 0. We identify a graph property P with its feasible set, i.e., the set of subgraphs P = (U, A) of G that have P. We also identify a subgraph P = (U, A) with its edge set A, as long as this is clear from the context. Let G = (V, E) be a graph with edge weights. This paper investigates the problem of approximating the distribution function F_MAX(x) of the random variable X_MAX = max_{P∈P} Σ_{e∈P} X_e.
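The product rule for the distribution function of a maximum of independent random variables can be checked numerically. The following is our own illustrative sketch (not part of the paper); the chosen means and variances are arbitrary placeholders:

```python
import math
import random

def normal_cdf(x, mu, var):
    # distribution function of N(mu, var), via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def max_cdf(x, params):
    # For mutually independent X_i ~ N(mu_i, var_i),
    # P(max_i X_i <= x) = prod_i P(X_i <= x).
    p = 1.0
    for mu, var in params:
        p *= normal_cdf(x, mu, var)
    return p

random.seed(0)
params = [(1.0, 4.0), (2.0, 1.0)]
exact = max_cdf(3.0, params)  # Phi(1)^2, about 0.708
trials = 200_000
hits = sum(
    max(random.gauss(mu, math.sqrt(var)) for mu, var in params) <= 3.0
    for _ in range(trials)
)
assert abs(exact - hits / trials) < 0.01  # Monte Carlo agrees with the product formula
```

The Monte Carlo check relies on the mutual independence of the X_i, which is exactly the assumption made throughout the paper.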
3 Algorithm ALG

3.1 A Lower Bound on F_MAX(x)
We start with a series of lemmas to obtain a lower bound on F_MAX(x).

Lemma 1. Let X1, X2, Z and Z′ be mutually independent random variables. Suppose that Z and Z′ have a common distribution. Then

  P(max{X1 + Z, X2 + Z′} ≤ x) ≤ P(max{X1 + Z, X2 + Z} ≤ x).   (1)

Since Lemma 1 seems to be well known, we omit the proof due to the space limit (for a proof see, e.g., [1]).

Lemma 2. For each e ∈ E, let X_e and X′_e be mutually independent random variables with a common distribution N(μ_e, σ_e²). For any P1, P2 ⊆ E, we have

  P(max{Σ_{e∈P1} X_e, Σ_{e∈P2} X′_e} ≤ x) ≤ P(max{Σ_{e∈P1} X_e, Σ_{e∈P2} X_e} ≤ x).   (2)

Proof. We give the proof by applying Lemma 1. Let X1 = Σ_{e∈P1\(P1∩P2)} X_e, X2 = Σ_{e∈P2\(P1∩P2)} X_e, X′2 = Σ_{e∈P2\(P1∩P2)} X′_e, Z = Σ_{e∈P1∩P2} X_e and Z′ = Σ_{e∈P1∩P2} X′_e. The right-hand side of (2) equals P(max{X1 + Z, X2 + Z} ≤ x). The left-hand side of (2) equals P(max{X1 + Z, X′2 + Z′} ≤ x). Since X2 and X′2 are mutually independent random variables with a common distribution, we have P(max{X1 + Z, X′2 + Z′} ≤ x) = P(max{X1 + Z, X2 + Z′} ≤ x). Since the random variables X1, X2, Z and Z′ are mutually independent, the proof completes by Lemma 1. □

Let P = {P1, P2, ..., P_|P|}. We define mutually independent random variables X_{P1}, X_{P2}, ..., X_{P_|P|}; for i = 1, ..., |P|, X_{Pi} obeys N(μ_i, σ_i²), where μ_i = Σ_{e∈Pi} μ_e and σ_i² = Σ_{e∈Pi} σ_e². Notice that the X_{Pi}'s and the X_e's are mutually independent, although the distributions of the X_{Pi}'s are defined using the μ_e's and σ_e's.
Lemma 3.

  P(max_{Pi∈P} Σ_{e∈Pi} X_e ≤ x) ≥ P(max_{Pi∈P} X_{Pi} ≤ x).

Proof. For each e ∈ E, let X_e^(1), X_e^(2), ..., X_e^(|P|) and X_e be mutually independent random variables with a common distribution. Note that the X_e^(k)'s are mutually independent for any combination of 1 ≤ k ≤ |P| and e ∈ E. Since the X_{Pi}'s and the X_e's are mutually independent, we have

  P(max_{Pi∈P} X_{Pi} ≤ x) = P(max_{Pi∈P} Σ_{e∈Pi} X_e^(i) ≤ x).

Partition P into two subsets P^(i) and Q^(i), where P^(i) = {P1, P2, ..., Pi} and Q^(i) = P \ P^(i). By Lemma 2, we have

  P(max_{Pi∈P} Σ_{e∈Pi} X_e ≤ x) = P(max{Σ_{e∈P1} X_e, max_{Pi∈Q^(1)} Σ_{e∈Pi} X_e} ≤ x)
    ≥ P(max{Σ_{e∈P1} X_e^(1), max_{Pi∈Q^(1)} Σ_{e∈Pi} X_e} ≤ x).

By repeating the same argument, we can prove this lemma. □

Since we have

  F_MAX(x) = P(max_{P∈P} Σ_{e∈P} X_e ≤ x) ≥ P(max_{P∈P} X_P ≤ x) = Π_{P∈P} P(X_P ≤ x) = Π_{P∈P} F_P(x)

by Lemma 3, we can state a lower bound on F_MAX(x).

Theorem 1. For any P ∈ P, let F_P(x) be the distribution function of X_P. Then

  F_MAX(x) ≥ Π_{P∈P} F_P(x).   (3)
The inequality (3) does not directly lead to an efficient calculation of a lower bound on F_MAX(x), since calculating (3) requires enumerating all P ∈ P, and the cardinality of P can be exponential in n or more. For example, the number of spanning trees can be n^{n−2}.

3.2 Constructing F̃(x)
Let F_B(x) = Φ((x − μ_max)/σ_max) be a distribution function, where μ_max = max_{P∈P} {Σ_{e∈P} μ_e} and σ_max² = max_{P∈P} {Σ_{e∈P} σ_e²}. For any P ∈ P, it is easy to show that F_B(x) ≤ F_P(x) if x ≥ μ_max. Since F_P(x) is the distribution function of Σ_{e∈P} X_e, we have F_P(x) = Φ((x − μ_P)/σ_P), where μ_P = Σ_{e∈P} μ_e and σ_P² = Σ_{e∈P} σ_e². Since Φ(x) is monotonically increasing, F_P(x) = Φ((x − μ_P)/σ_P) ≥ Φ((x − μ_max)/σ_P) for all x. Again, since Φ(x) is monotonically increasing, we have Φ((x − μ_max)/σ_P) ≥ Φ((x − μ_max)/σ_max) = F_B(x) for x ≥ μ_max, which implies that F_B(x) ≤ F_P(x) for x ≥ μ_max. By Theorem 1, we have (F_B(x))^|P| ≤ F_MAX(x) if x ≥ μ_max.

We now have a straightforward algorithm to approximate F_MAX(d) from below for a given d ≥ μ_max: first calculate μ_max, σ_max and an upper bound c on |P|, and then calculate (F_B(d))^c by accessing a table of the standard normal distribution. There are, however, two issues that we need to discuss further.

The first issue is about algorithms to compute μ_max, σ_max and c. Recall that we assume the existence of an algorithm A_MAX that solves the graph maximization problem for P in time T_{A_MAX}(P) and an algorithm A_CNT that outputs an upper bound on |P| in time T_{A_CNT}(P). To compute μ_max (resp. σ_max), we solve the graph maximization problem for G = (V, E), whose edge weight w(e) of each edge e ∈ E is μ_e (resp. σ_e²), using A_MAX, and c is computed by A_CNT. Thus, in time O(T_{A_MAX}(P) + T_{A_CNT}(P)), we can calculate μ_max, σ_max and c.

The second is about the complexity of computing (F_B(d))^c. We introduce a new family of elementary functions L(x; μ, σ²) and show that L(x; μ_max, σ_max²) is a lower bound on F_B(x) if x ≥ μ_max, and hence L^c(x; μ_max, σ_max²) is a lower bound on F_MAX(x) if x ≥ μ_max. Furthermore, L^c(x; μ_max, σ_max²) is easy to calculate, since L(x; μ, σ²) is an elementary function given as an explicit formula. Lemma 4 can be proved by a standard calculus technique.

Lemma 4. Let F(x; μ, σ²) = Φ((x − μ)/σ) and define L(x; μ, σ²) by

  L(x; μ, σ²) = exp(−exp(−(1/2)((x − μ)/σ)² + ln ln 2)).

Then L(x; μ, σ²) ≤ F(x; μ, σ²) for any x ≥ μ.

We propose L^c(x; μ_max, σ_max²) as F̃(x). Since μ_max and σ_max² are computed by A_MAX in T_{A_MAX}(P)-time and c is computed by A_CNT in T_{A_CNT}(P)-time, (an explicit elementary representation of) F̃(x) is composed in O(T_{A_MAX}(P) + T_{A_CNT}(P))-time. Once μ_max, σ_max² and c have been computed, evaluation of F̃(d) for any d is possible in O(log c)-time.¹
3.3 Efficient Evaluation of F̃⁻¹(a)
In the Introduction, we explained why efficient calculation of F̃⁻¹(a) is required in many applications. In this section, we show that F̃⁻¹(a) can be evaluated efficiently: it takes constant time.

¹ Some approximation formulas proposed so far (see, e.g., [10]) take mostly the same amount of time.
The equation L(x; μ, σ²) = a has two real solutions for 1/2 ≤ a < 1. We denote the larger solution by L⁻¹(a; μ, σ²). By an easy calculation,

  L⁻¹(a; μ, σ²) = μ + σ √(−2 ln(−ln a) + 2 ln ln 2).

Then F̃⁻¹(a) = L⁻¹(a^{1/c}; μ_max, σ_max²), where

  L⁻¹(a^{1/c}; μ_max, σ_max²) = μ_max + σ_max √(2(ln c − ln(−ln a)) + 2 ln ln 2).   (4)

Unlike the evaluation of F̃(d), evaluation of F̃⁻¹(a) is possible in O(1)-time once we have computed μ_max, σ_max² and c. Note also that only ln c (not c) is required here, which suggests the possibility of using a more efficient algorithm to compute an upper bound on ln |P| instead of A_CNT.
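The construction of Sections 3.2 and 3.3 can be written down directly. The following is our own sketch (not part of the paper); the values of μ_max, σ_max² and c are hypothetical placeholders, and the first loop spot-checks the inequality of Lemma 4 numerically:

```python
import math

def normal_cdf(x, mu, var):
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def L(x, mu, var):
    # elementary lower bound on the normal CDF for x >= mu (Lemma 4)
    z = (x - mu) / math.sqrt(var)
    return math.exp(-math.exp(-0.5 * z * z + math.log(math.log(2.0))))

def F_tilde(x, mu_max, var_max, c):
    # F~(x) = L(x; mu_max, var_max)^c, a lower bound on F_MAX(x) for x >= mu_max
    return L(x, mu_max, var_max) ** c

def F_tilde_inv(a, mu_max, var_max, c):
    # closed-form inverse, Eq. (4): O(1) time once mu_max, var_max and c are known
    s = 2.0 * (math.log(c) - math.log(-math.log(a))) + 2.0 * math.log(math.log(2.0))
    return mu_max + math.sqrt(var_max) * math.sqrt(s)

# Lemma 4 spot check: L(x) <= Phi((x - mu)/sigma) for x >= mu
for i in range(60):
    x = 0.1 * i
    assert L(x, 0.0, 1.0) <= normal_cdf(x, 0.0, 1.0) + 1e-12

# round trip with hypothetical mu_max, sigma_max^2 and c
mu_max, var_max, c = 50.0, 9.0, 1000
x = F_tilde_inv(0.9, mu_max, var_max, c)
assert abs(F_tilde(x, mu_max, var_max, c) - 0.9) < 1e-6
```

The round trip works because F̃⁻¹ returns the larger of the two solutions of L(x)^c = a, which is the relevant one for a ≥ 1/2.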
4 Bounding Approximation Ratio

This section shows the following theorem. Let h = max_{P∈P} |P|, ν = min_{e∈E} μ_e, ρ² = max_{e∈E} σ_e², and let c be the output of A_CNT.

Theorem 2. For any a ≥ 1/2,

  F̃⁻¹(a) / F⁻¹_MAX(a) = O((ρ/ν) √(log c / h)).
We first prepare three lemmas. Let F(x) and f(x) be a normal distribution function and its density function with mean μ and variance σ², respectively.

Lemma 5. Let

  U(x) = exp(−exp(−((1/2)((x − μ)/σ)² − 2 ln((x − μ)/σ + 1) − 3))).

Then for x ≥ μ, U(x) > F(x) holds. Lemma 5 can be proved by a standard calculus technique.

Lemma 6. Let

  W(y) = μ + σ √(2((−ln(−ln y) + 3) − 2 ln(−ln(−ln y) + 3))).

Then for 1/2 ≤ y < 1, U⁻¹(y) ≥ W(y) holds.

Proof. We define a function W1(x) by its inverse function

  W1⁻¹(y) = μ + σ √(2((y + 3) − 2 ln(y + 3))),

for W1(x) ≥ −1, and observe that W1⁻¹(y) is monotone if y ≥ −1, and hence the function W1(x) is well-defined. Note that W(y) = W1⁻¹(−ln(−ln y)). We show that U⁻¹(y) ≥ W1⁻¹(−ln(−ln y)) = W(y) whenever −ln(−ln y) ≥ −1, i.e., y ≥ exp(−e) = 0.06598...; this proves the lemma. We define a function W2(x) as follows:

  W2(x) = (1/2)((x − μ)/σ)² − 2 ln((x − μ)/σ + 1) − 3.

Since

  W2(W1⁻¹(y)) = y − 2 ln(y + 3) − 2 ln(√(2((y + 3) − 2 ln(y + 3))) + 1) < y,

by replacing y with W1(x), we have W2(x) < W1(x). Thus exp(−exp(−W1(x))) > exp(−exp(−W2(x))) = U(x), which implies that U⁻¹(y) ≥ W1⁻¹(−ln(−ln y)) = W(y). □

Lemma 7. For a real number r, let F(x) = (Φ((x − μ)/σ))^r. Then for 2^{−r} ≤ y < 1,

  F⁻¹(y) = Θ(μ + σ √(2(ln r − ln(−ln y)) + 2 ln ln 2)).

Proof. By Lemmas 4, 5 and 6, we have F⁻¹(y) = Ω(W(y^{1/r})) and F⁻¹(y) = O(L⁻¹(y^{1/r}; μ, σ²)) for 2^{−r} ≤ y < 1. Since the ratio W(y^{1/r})/L⁻¹(y^{1/r}; μ, σ²) converges to 1 as r tends to ∞, we have W(y^{1/r}) = Ω(L⁻¹(y^{1/r}; μ, σ²)), which shows the lemma. □

Proof of Theorem 2. Since σ_max² ≤ hρ² by definition, F̃⁻¹(a) = Θ(μ_max + ρ√(h log c)) by Lemma 7, where ρ² = max_{e∈E} σ_e². We show that 1/F⁻¹_MAX(a) = O(1/μ_max), which is sufficient to complete the proof, since μ_max ≥ hν > 0, where ν = min_{e∈E} μ_e. Let Q ∈ P be a subset that achieves μ_max = Σ_{e∈Q} μ_e, and let X_Q = Σ_{e∈Q} X_e. Since F_MAX(x) = P(max_{P∈P} Σ_{e∈P} X_e ≤ x), we have

  F_Q(x) = P(Σ_{e∈Q} X_e ≤ x) ≥ F_MAX(x)

for all x, which implies that F_Q⁻¹(y) ≤ F⁻¹_MAX(y) for all y. Since F_Q⁻¹(a) = Ω(μ_max) for a ≥ 1/2 by Lemma 7, it follows that 1/F⁻¹_MAX(a) = O(1/μ_max). □
5 Three Examples

5.1 The Stochastic Longest Path Problem in a DAG
Let us consider the stochastic longest path problem. Given a DAG G⃗ = (V, E) and the normal distributions N(μ_e, σ_e²) that the random variables X_e assigned to the edges e ∈ E obey, we consider G⃗ with a weight function w : E → R defined by w(e) = μ_e (resp. σ_e²) for each e ∈ E, and run a known longest path algorithm for a DAG (like PERT [17]) to compute μ_max (resp. σ_max²) in O(n + m) time, where n = |V| and m = |E|. As for an upper bound c on |P|, m^n is a trivial upper bound. However, we can use dynamic programming to compute the exact value of |P| in O(n + m)-time. We first topologically sort V = {v1, v2, ..., vn} and assume that v_i < v_{i+1} for 1 ≤ i ≤ n − 1; that is, v1 is the source and vn is the sink. Since the number of paths from the source to v_i is the sum of the numbers of paths from the source to the parents of v_i, a dynamic programming algorithm can count the number of paths from the source to the sink. Thus our algorithm runs in O(n + m)-time. Whereas the approximation ratio proved in [1] is O((ρ/ν)√(h log d)), the approximation ratio of our algorithm is O((ρ/ν)√((log |V|)/h)). This result is also mentioned in the authors' previous paper [2].

5.2 The Stochastic Maximum Spanning Tree Problem
Next, consider the stochastic maximum spanning tree problem. Currently the fastest maximum spanning tree algorithm is due to Chazelle [5], and it runs in O(mα(m, n))-time, where α(m, n) is the inverse of Ackermann's function. We can compute μ_max and σ_max using this algorithm. As for an upper bound c on |P|, a trivial upper bound is n^{n−2}. We can also count |P| exactly by the Matrix-Tree Theorem in O(T_DET(n − 1))-time, where T_DET(n − 1), which is O(n³), is the time complexity of calculating the determinant of an (n − 1) × (n − 1) matrix. Thus our algorithm runs in O(n³)-time.

5.3 The Stochastic Maximum Weight Perfect Matching Problem in a Planar Graph
Finally, consider the stochastic maximum weight perfect matching problem for planar graphs. Currently the fastest maximum weight perfect matching algorithm for planar graphs is due to Varadarajan [20], and it runs in O(n^{3/2} log⁵ n)-time. The number of perfect matchings can be obtained, by an algorithm due to [12], as the square root of the determinant of an (n − 1) × (n − 1) matrix whose elements are 1, −1 or 0. Thus our algorithm runs in O(n³)-time.
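Two of the counting subroutines used as A_CNT above, path counting in a topologically sorted DAG (Section 5.1) and spanning tree counting via the Matrix-Tree Theorem (Section 5.2), can be sketched as follows. This is our own minimal Python sketch; the input conventions (vertex numbering, edge lists) are our assumptions:

```python
from collections import defaultdict
from fractions import Fraction

def count_st_paths(n, edges):
    # Number of source-to-sink paths in a DAG whose vertices 0..n-1 are
    # already in topological order (Section 5.1): O(n + m) dynamic program.
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    paths = [0] * n
    paths[0] = 1  # one (empty) path from the source to itself
    for v in range(1, n):
        # paths to v = sum of paths to its parents
        paths[v] = sum(paths[u] for u in preds[v])
    return paths[n - 1]

def count_spanning_trees(n, edges):
    # Matrix-Tree Theorem (Section 5.2): the number of spanning trees equals
    # any cofactor of the graph Laplacian; delete the last row and column
    # and take the determinant exactly (fraction-free via Fraction).
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1; L[v][v] += 1
        L[u][v] -= 1; L[v][u] -= 1
    M = [row[:n - 1] for row in L[:n - 1]]
    det = Fraction(1)
    for i in range(n - 1):
        p = next((r for r in range(i, n - 1) if M[r][i] != 0), None)
        if p is None:
            return 0
        if p != i:
            M[i], M[p] = M[p], M[i]
            det = -det
        det *= M[i][i]
        for r in range(i + 1, n - 1):
            f = M[r][i] / M[i][i]
            for col in range(i, n - 1):
                M[r][col] -= f * M[i][col]
    return int(det)

# diamond DAG 0->1->3, 0->2->3 plus the edge 0->3 has 3 source-to-sink paths
assert count_st_paths(4, [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)]) == 3
# the complete graph K4 has 4^{4-2} = 16 spanning trees (Cayley's formula)
K4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
assert count_spanning_trees(4, K4) == 16
```

Exact rational arithmetic keeps the determinant free of floating-point error, which matters because these counts feed into c, the upper bound on |P|.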
6 Discussions and Conclusions

Given a graph G = (V, E), mutually independent random variables X_e obeying normal distributions N(μ_e, σ_e²) for the edges e ∈ E, and a graph property P, we proposed a method ALG to compute an elementary function F̃(x) (as an explicit formula) that approximates the distribution function F_MAX(x) of the random variable X_MAX = max_{(U,A)∈P} Σ_{e∈A} X_e in time O(T_{A_MAX}(P) + T_{A_CNT}(P)), provided the existence of an algorithm A_MAX that solves the graph maximization problem for P in time T_{A_MAX}(P) and an algorithm A_CNT that outputs an upper bound on |P| in time T_{A_CNT}(P). Given the formula F̃(x), F̃(d) is computed in O(log c)-time for any d, and, interestingly, F̃⁻¹(a) is computable in O(1)-time (faster than F̃(d)) for any a, where c is the upper bound on |P| computed by A_CNT. We showed that, for any a ≥ 1/2, the approximation ratio F̃⁻¹(a)/F⁻¹_MAX(a) is bounded from above by O((ρ/ν)√(log c/h)), where h = max_{(U,A)∈P} {|A|}, ν = min_{e∈E} {μ_e} and ρ² = max_{e∈E} {σ_e²}.

There are several issues that we did not discuss in this paper:

1. a method to modify ALG so that it can handle minimization problems;
2. a method to use an approximation algorithm A_APR instead of A_MAX when A_MAX is slow;
3. a method to handle cases in which edge weights obey other distributions;
4. a method to handle cases in which edge weights are not mutually independent;
5. a method to use a combination of a wider variety of elementary functions to construct a better approximation F̃(x) of F(x).

As we promised in Section 1, we briefly discuss items (1)–(3) below. However, the discussion is still at a preliminary stage, and we leave thorough discussion as future work.

6.1 Handling Minimization Problems
Given a graph G = (V, E), a distribution that X_e obeys for each e ∈ E, and a graph property P, consider how to approximate the distribution function F_MIN(x) of the random variable X_MIN = min_{(U,A)∈P} Σ_{e∈A} X_e. Observe that the complementary distribution function of the weight of any edge set P ∈ P is an upper bound on the complementary distribution function of X_MIN. We thus consider how to bound F_MIN(x) from above.

For constructing the algorithm to approximate F_MIN(x) from above, we must show an upper bound on F_MIN(x). Let P = {P1, ..., P_|P|}, and let X_{P1}, ..., X_{P_|P|} be random variables that obey N(μ_i, σ_i²), where μ_i = Σ_{e∈Pi} μ_e and σ_i² = Σ_{e∈Pi} σ_e². The following theorem is symmetric to Theorem 1.

Theorem 3. Let F_P(x) be the distribution function of X_P. Then we have

  F_MIN(x) ≤ 1 − Π_{P∈P} (1 − F_P(x)).

Let μ_min = min_{(U,A)∈P} {Σ_{e∈A} μ_e} and F(x; μ, σ²) = Φ((x − μ)/σ). For any P ∈ P, F(x; μ_min, σ_max²) ≥ F_P(x) if x ≤ μ_min, and hence 1 − (1 − F(x; μ_min, σ_max²))^|P| ≥ F_MIN(x) if x ≤ μ_min by Theorem 3.

The following is thus a straightforward approximation algorithm for F_MIN(x): first calculate μ_min, σ_max and an upper bound c on |P|, and then output 1 − (1 − F(x; μ_min, σ_max²))^c. We, however, make use of the more sophisticated function L given in Section 3 and propose 1 − (1 − L(x; μ_min, σ_max²))^c as F̃(x). By the same arguments as in the proofs of Lemmas 4 and 5, we can show the next lemma.

Lemma 8. Let L(x; μ, σ²) be the function defined in Lemma 4. Then for any x ≤ μ, L(x; μ, σ²) ≤ 1 − F(x; μ, σ²).

Let L⁻¹_min(a^{1/c}; μ_min, σ_max²) be the smaller root of the equation L(x; μ_min, σ_max²) = a. Then L⁻¹_min(1 − (1 − a)^{1/c}; μ_min, σ_max²) gives a lower bound on F_MIN(x) if x ≤ F⁻¹(a^{1/c}; μ_min, σ_max²). As for estimating the approximation ratio, for convenience we assume that all μ_e for e ∈ E are negative, which allows us to obtain exactly the same approximation ratio as for the maximization problem.
6.2 Using A_APR Instead of A_MAX
Let us discuss the effect of using an approximation algorithm A_APR instead of A_MAX. A_APR is used to compute μ_max and σ_max. Let μ_apr and σ_apr be the outputs, and assume that μ_max ≤ μ_apr ≤ αμ_max and σ_max² ≤ σ_apr² ≤ ασ_max², where α > 1. We should be careful about this approximation, since an "approximation algorithm" in the usual sense for a maximization problem approximates μ_max or σ_max from below (see, e.g., [21]). However, we must have μ_apr and σ_apr greater than μ_max and σ_max, respectively; that is, we ask A_APR to approximate μ_max and σ_max from above. This is possible as long as there is an approximation algorithm, because an approximation ratio is shown as the ratio between the approximation and an upper bound that can be obtained in polynomial time; we utilize that upper bound as the output of A_APR, since we only use the values of μ_max and σ_max (we do not need the structures that give the values). We then propose L^c(x; μ_apr, σ_apr²) as F̃(x). Since, by (4),

  L⁻¹(a^{1/c}; μ_apr, σ_apr²) / L⁻¹(a^{1/c}; μ_max, σ_max²) ≤ α,
the approximation ratio is estimated as follows:

  F̃⁻¹(a) / F⁻¹_MAX(a) = O(α (ρ/ν) √(log c / h)).

As an example, we consider the following stochastic knapsack problem. Consider a set S = {a1, ..., an} of items. Each item ai ∈ S is associated with a profit Xi, a normally distributed random variable that obeys N(μi, σi²). We assume that the profits are mutually independent. Each item ai ∈ S is also associated with a size s(ai) ∈ R, which is a deterministic value. Let ν be the maximum mean of the profit of an item. According to [21], we can approximately compute μ_max and σ_max² in O(n²((α − 1)ν)/n) and O(n²((α − 1)ρ)/n) time, respectively, with approximation ratio α. The upper bound on the number of possible solutions can be obtained trivially as 2^n. By applying our algorithm for the stochastic optimization problem, we can approximately compute the value of F⁻¹_MAX(y) with approximation ratio O(α(ρ/ν)√(n/h)) for the stochastic knapsack problem.

6.3 Handling Other Distributions
The Gamma Distribution. Our approach seems to be applicable to many other cases in which the edge weights obey other distributions, although the approximation guarantee has to be examined separately for each distribution. Let us first consider the gamma distribution with shape k and scale θ. Its density function f(x; k, θ) is given as

  f(x; k, θ) = 0 for x ≤ 0,  and  f(x; k, θ) = x^{k−1} exp(−x/θ) / (θ^k Γ(k)) for x > 0.

If every |P| for P ∈ P is sufficiently large and all edge weights are random variables with a common distribution, the central limit theorem works, and the algorithm in this paper becomes applicable and guarantees the same approximation ratio. However, if |P| is small, or the edge weights do not have a common distribution, then for the gamma distribution function F(x; k, θ) we need to discover a function L(x; μ_max, σ_max²) such that (1) L(x; μ_max, σ_max²) < F(x; k, θ) for every x larger than a certain known value, and (2) the exact or approximate value of L⁻¹(a^{1/c}; μ_max, σ_max²) is efficiently computable. We obtain the approximation ratio by evaluating L⁻¹(a^{1/c}; μ_max, σ_max²)/F⁻¹_MAX(a). This method would produce a good approximation ratio when the distribution functions of the edge weights do not have "heavy tails"; investigation of the cases in which some distribution functions have heavy tails is left as future work.
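For reference, the gamma density above, together with the mean kθ and variance kθ² that a CLT-based normal surrogate would match, can be written down directly. This is our own illustrative sketch, not part of the paper:

```python
import math

def gamma_density(x, k, theta):
    # density of the gamma distribution with shape k and scale theta
    if x <= 0:
        return 0.0
    return x ** (k - 1) * math.exp(-x / theta) / (theta ** k * math.gamma(k))

def gamma_mean_var(k, theta):
    # the moments a normal approximation via the central limit theorem would use
    return k * theta, k * theta ** 2

# shape 1 reduces to the exponential distribution with rate 1/theta
assert abs(gamma_density(0.5, 1.0, 1.0) - math.exp(-0.5)) < 1e-12
# the density integrates to 1 (crude Riemann sum over a wide range)
total = sum(gamma_density(0.01 * i, 2.0, 1.5) * 0.01 for i in range(1, 5000))
assert abs(total - 1.0) < 1e-2
```

A sum of edge weights with such densities is what the CLT argument in the text approximates by N(μ_P, σ_P²) when |P| is large.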
References

[1] Ando, E., Nakata, T., Yamashita, M.: Approximating the Longest Path Length of a Stochastic DAG by a Normal Distribution in Linear Time. Journal of Discrete Algorithms (2009), doi:10.1016/j.jda.2009.01.001
[2] Ando, E., Ono, H., Sadakane, K., Yamashita, M.: A Counting-Based Approximation of the Distribution Function of the Longest Path Length in Directed Acyclic Graphs. In: Proceedings of the 7th Forum on Information Technology, pp. 7–8 (2008)
[3] Badinelli, R.D.: Approximating Probability Density Functions and Their Convolutions Using Orthogonal Polynomials. European Journal of Operational Research 95, 211–230 (1996)
[4] Berkelaar, M.: Statistical delay calculation, a linear time method. In: Proceedings of the International Workshop on Timing Analysis (TAU 1997), pp. 15–24 (1997)
[5] Chazelle, B.: A Minimum Spanning Tree Algorithm with Inverse-Ackermann Type Complexity. Journal of the ACM 47(6), 1028–1047 (2000)
[6] Clark, C.E.: The PERT model for the distribution of an activity time. Operations Research 10, 405–406 (1962)
[7] Dodin, B.: Bounding the Project Completion Time Distribution in PERT Networks. Networks 33(4), 862–881 (1985)
[8] Hagstrom, J.N.: Computational Complexity of PERT Problems. Networks 18, 139–147 (1988)
[9] Hashimoto, M., Onodera, H.: A performance optimization method by gate sizing using statistical static timing analysis. IEICE Trans. Fundamentals E83-A(12), 2558–2568 (2000)
[10] Hastings, C.: Approximations for Digital Computers. Princeton University Press, Princeton (1955)
[11] Ishii, H., Shiode, S., Nishida, T., Namasuya, Y.: Stochastic spanning tree problem. Discrete Appl. Math. 3, 263–273 (1981)
[12] Kasteleyn, P.W.: Graph theory and crystal physics. In: Harary, F. (ed.) Graph Theory and Theoretical Physics, pp. 43–110. Academic Press, London (1967)
[13] Kelley Jr., J.E.: Critical-path planning and scheduling: Mathematical basis. Operations Research 10, 912–915 (1962)
[14] Kleindorfer, G.B.: Bounding Distributions for a Stochastic Acyclic Network. Operations Research 19, 1586–1601 (1971)
[15] Ludwig, A., Möhring, R.H., Stork, F.: A Computational Study on Bounding the Makespan Distribution in Stochastic Project Networks. Annals of Operations Research 102, 49–64 (2001)
[16] Martin, J.J.: Distribution of the time through a directed, acyclic network. Operations Research 13, 46–66 (1965)
[17] Malcolm, D.G., Roseboom, J.H., Clark, C.E., Fazar, W.: Application of a technique for research and development program evaluation. Operations Research 7, 646–669 (1959)
[18] Ravi, R., Sinha, A.: Hedging Uncertainty: Approximation algorithms for stochastic optimization problems. Mathematical Programming 108(1), 97–114 (2006)
[19] Tsukiyama, S., Tanaka, M., Fukui, M.: An Estimation Algorithm of the Critical Path Delay for a Combinatorial Circuit. In: IEICE Proceedings of the 13th Workshop on Circuits and Systems, pp. 131–135 (2000) (in Japanese)
[20] Varadarajan, K.R.: A divide-and-conquer algorithm for min-cost perfect matching in the plane. In: Proc. 38th Annual IEEE Symposium on Foundations of Computer Science (1998)
[21] Vazirani, V.V.: Approximation Algorithms. Springer, Berlin (2001)
[22] Wets, R.J.-B.: Stochastic Programming. In: Nemhauser, G.L., Rinnooy Kan, A.H.G., Todd, M.J. (eds.) Handbooks in Operations Research and Management Science, vol. 1: Optimization. North-Holland, Amsterdam (1991)
How to Design a Linear Cover Time Random Walk on a Finite Graph

Yoshiaki Nonaka¹, Hirotaka Ono¹,², Kunihiko Sadakane³, and Masafumi Yamashita¹,²

¹ Department of Informatics, Kyushu University, Japan
[email protected], {ono,mak}@csce.kyushu-u.ac.jp
² Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Japan
³ Principles of Informatics Research Division, National Institute of Informatics, Japan
[email protected]

Abstract. A random walk on a finite graph G = (V, E) is a random token circulation on the vertices of G: a token on a vertex in V moves to one of its adjacent vertices according to a transition probability matrix P. It is known that both the hitting time and the cover time of the standard random walk, in which the token moves to an adjacent vertex chosen uniformly at random, are bounded by O(|V|³). This estimation is tight in the sense that there exist graphs for which the hitting and cover times of the standard random walk are Ω(|V|³). Thus the following questions naturally arise: is it possible to speed up a random walk, that is, to design a transition probability for G that achieves a faster cover time? Or, how large (or small) is the lower bound on the cover time of random walks on G? In this paper, we investigate how we can/cannot design a faster random walk in terms of the cover time. We give necessary conditions for a graph G to have a linear cover time random walk, i.e., one whose cover time on G is O(|V|). We also present a class of graphs that have a linear cover time. As a byproduct, we obtain the lower bound Ω(|V| log |V|) on the cover time of any random walk on trees.
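Before the formal definitions below, the standard random walk and its cover time can be illustrated with a small simulation. This is our own sketch, not part of the paper; for the cycle C₄ the exact cover time from any vertex is n(n − 1)/2 = 6, a classical fact we use as a sanity check:

```python
import random

def estimate_cover_time(adj, start, trials=4000, seed=1):
    """Monte Carlo estimate of the cover time C_G(P0; start) of the
    standard random walk: at each step the token moves to a neighbour
    chosen uniformly at random. `adj` maps vertices to neighbour lists."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        visited = {start}
        v, steps = start, 0
        while len(visited) < len(adj):
            v = rng.choice(adj[v])
            visited.add(v)
            steps += 1
        total += steps
    return total / trials

# cycle C_4: 0-1-2-3-0; the exact cover time is n(n-1)/2 = 6
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
est = estimate_cover_time(cycle4, 0)
assert 5.5 < est < 6.5
```

Simulations like this only estimate C_G(P; u) for one start vertex; the paper's interest is in designing P so that the worst-case cover time C_G(P) becomes small.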
1 Introduction

1.1 Random Walks
Given a finite undirected and connected graph G = (V, E) and a transition probability matrix P = (p_uv)_{u,v∈V} such that p_uv > 0 only if (u, v) ∈ E, a random walk ω on G starting at a vertex u ∈ V under P is an infinite sequence ω = ω0, ω1, ... of random variables ω_i whose domain is V, such that ω0 = u with probability 1 and the probability that ω_{i+1} = w given that ω_i = v is p_vw for i = 0, 1, .... The standard random walk on a graph is the one in which the transition probability is uniform; that is, the transition probability matrix P0 is defined by, for any u, v ∈ V,

  p_uv = 1/deg(u) if v ∈ N(u), and p_uv = 0 otherwise,

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 104–116, 2009.
© Springer-Verlag Berlin Heidelberg 2009
where N(u) and deg(u) are the set of vertices adjacent to a vertex u and the degree of u, respectively. The hitting time HG(P; u, v) from u ∈ V to v ∈ V is the expected number of transitions needed for a random walk ω starting from u to reach v for the first time under P, and the hitting time HG(P) of G is defined to be HG(P) = max_{u,v∈V} HG(P; u, v). The cover time CG(P; u) from u ∈ V is the expected number of transitions needed for a random walk ω starting from u to visit all vertices in V under P, and the cover time CG(P) of G is defined to be CG(P) = max_{u∈V} CG(P; u). Let n = |V| and m = |E|. Then the cover time of the standard random walk on any graph G with n vertices and m edges satisfies CG(P0) ≤ 2m(n − 1) [1, 2], a result later refined by Feige [5, 6]:

(1 − o(1)) n log n ≤ CG(P0) ≤ (1 + o(1)) (4/27) n^3.

Since there is a graph L (called a Lollipop) such that

HL(P0) = (1 − o(1)) (4/27) n^3,

both the hitting and cover times of standard random walks are Θ(n^3) in the worst case [4].
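As a sanity check on these quantities, the sketch below (code and function names ours, not the paper's) estimates the cover time of the standard random walk on the cycle Cn by simulation; for the cycle the exact value n(n − 1)/2 is classical, so for n = 10 the Monte Carlo estimate should land near 45.

```python
import random

def standard_step(rng, adj, v):
    # Standard random walk: move to a uniformly chosen neighbour.
    return rng.choice(adj[v])

def cover_time(rng, adj, start):
    # Number of transitions until every vertex has been visited once.
    visited, v, steps = {start}, start, 0
    while len(visited) < len(adj):
        v = standard_step(rng, adj, v)
        visited.add(v)
        steps += 1
    return steps

def avg_cover_time_cycle(n, trials=3000, seed=1):
    rng = random.Random(seed)
    adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    return sum(cover_time(rng, adj, 0) for _ in range(trials)) / trials
```

The same helper works for any adjacency dictionary, so it can also be used to observe the much larger cover times of path-like topologies.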
1.2 Related Work
Previous studies of the standard random walk have given a tight analysis of its hitting and cover times. These results imply that, to achieve a faster random walk, we need to adopt a transition probability different from that of the standard random walk. Ikeda et al. [11] proposed a transition probability matrix P1 = (puv)_{u,v∈V} defined, for any u, v ∈ V, by

puv = deg(v)^{−1/2} / Σ_{w∈N(u)} deg(w)^{−1/2} if v ∈ N(u), and puv = 0 otherwise,

and showed that the hitting and cover times of this random walk are, for any graph G, bounded by O(n^2) and O(n^2 log n) respectively, where n is the order of G. In addition, they proved that the hitting and cover times of any random walk on the path graph are Ω(n^2). It should be noted that the above random walk is defined using only local degree information. The Metropolis walk is another known random walk that uses only local degree information. Given a graph G and a target (stationary) distribution (πv)_{v∈V} on the vertices, we can construct a random walk that attains the stationary distribution (πv)_{v∈V} by applying the Metropolis–Hastings algorithm [9] to the standard random walk; the constructed random walk is called the Metropolis walk. Although we omit the precise definition, the transition probability of the Metropolis walk also depends only on local degree information. Recently the authors of this paper have shown that the hitting and cover times of the Metropolis walk are bounded by O(f n^2) and O(f n^2 log n) respectively, where f = max_{u,v} πu/πv, and also that there
exist pairs of a graph and a stationary distribution for which the hitting and cover times of the Metropolis walk are Ω(f n^2) and Ω(f n^2 log n), respectively. By relaxing the restriction that the transition probability be defined only by local degree information, we may obtain a faster random walk. In fact, it is easy to see that, for any graph G, there exists a random walk whose cover time is O(n^2), obtained by considering a random walk on a spanning tree of G. This paper follows this research direction: given the whole topological information of a graph G, by carefully designing a transition probability of a random walk on G, is it possible to attain a faster cover time?
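The degree-biased matrix P1 of Ikeda et al. quoted above is easy to build explicitly. The sketch below (helper names ours) constructs it for an arbitrary graph and illustrates that it shifts probability toward low-degree neighbours, the mechanism behind its O(n^2) hitting time.

```python
def p1_matrix(adj):
    """Build the degree-biased transition matrix of Ikeda et al.:
    p_uv proportional to deg(v)^(-1/2) over the neighbours v of u."""
    deg = {u: len(vs) for u, vs in adj.items()}
    P = {}
    for u, vs in adj.items():
        z = sum(deg[v] ** -0.5 for v in vs)  # normalising constant
        for v in vs:
            P[(u, v)] = deg[v] ** -0.5 / z
    return P
```

On a vertex with one degree-1 neighbour and one degree-3 neighbour, the walk leaving that vertex prefers the pendant (low-degree) neighbour.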
1.3 Our Results
In this paper, we investigate conditions under which a graph G has a linear cover time random walk, i.e., a random walk whose cover time is bounded by O(n). We first show that for any tree with n vertices and any transition probability matrix, the cover time is Ω(n log n). This result implies that no tree has a linear cover time random walk, and hence that we must exploit the cycle structure of a graph to obtain one. We therefore focus next on the biconnected component decomposition of a graph G, which yields necessary conditions for the existence of a linear cover time random walk. Finally, we consider the Hamiltonian component decomposition, which yields a sufficient condition for a linear cover time random walk.
2 Preliminaries
We assume standard graph-theoretic terminology. Suppose that G = (V, E) is a finite, undirected, simple, connected graph of order n = |V| and size m = |E|. For u ∈ V, let N(u) = {v | (u, v) ∈ E} and deg(u) = |N(u)| be the set of vertices adjacent to u and the degree of u, respectively. A vertex v is a pendant-vertex if deg(v) = 1. A vertex v is a cut-point if G − v is not connected. A biconnected component is a maximal subgraph that has no cut-point. An edge e ∈ E is a bridge if G − e is not connected. A graph is Hamiltonian if it has a spanning cycle. Let diam(G) be the diameter of G, that is, max_{u,v∈V} dist(u, v), where dist(u, v) denotes the shortest path length between u and v. Given a graph G = (V, E), a biconnected component decomposition is a partition of V into V1, V2, · · ·, Vk such that each induced subgraph of G on Vi (i = 1, · · ·, k) is biconnected. Let P = (puv)_{u,v∈V} be a transition probability matrix on G. That is, for any u ∈ V and v ∈ V, puv ≥ 0, puv > 0 only if (u, v) ∈ E, and Σ_{v∈V} puv = 1. Let π = (πv)_{v∈V} be the stationary distribution vector of P. That is, for any u ∈ V,

Σ_{v∈V} πv pvu = πu.
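For the standard random walk the stationary distribution is the well-known πv = deg(v)/(2m). The sketch below (pure Python, names ours) checks the defining equation Σ_{v} πv pvu = πu on a small graph.

```python
def standard_matrix(adj):
    # Standard walk: p_uv = 1/deg(u) for each neighbour v of u.
    return {(u, v): 1.0 / len(adj[u]) for u in adj for v in adj[u]}

def is_stationary(adj, P, pi, tol=1e-12):
    # Check sum_v pi_v * p_vu = pi_u for every vertex u.
    return all(
        abs(sum(pi[v] * P.get((v, u), 0.0) for v in adj) - pi[u]) < tol
        for u in adj
    )
```

For the path 0-1-2 (m = 2) the distribution deg(v)/(2m) = (1/4, 1/2, 1/4) passes the check.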
Now, for any graph G, transition probability matrix P, and its stationary distribution π, since

HG(P; u, v) = Σ_{w∈V} puw (1 + HG(P; w, v)) − puv HG(P; v, v),

we have

Σ_{u∈V} πu HG(P; u, v) = Σ_{w∈V} πw HG(P; w, v) + Σ_{u∈V} Σ_{w∈V} πu puw − πv HG(P; v, v),

which implies that

πv HG(P; v, v) = 1 for any v ∈ V. (1)
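Identity (1) says that πv times the expected return time HG(P; v, v) equals 1 (Kac's formula). A quick empirical check, with names of our choosing: on the path 0-1-2 we have π0 = 1/4, so the mean return time to vertex 0 should be 4.

```python
import random

def mean_return_time(adj, v0, steps=200000, seed=7):
    # Empirical mean time between consecutive visits to v0
    # under the standard random walk started at v0.
    rng = random.Random(seed)
    v, last, gaps = v0, 0, []
    for t in range(1, steps + 1):
        v = rng.choice(adj[v])
        if v == v0:
            gaps.append(t - last)
            last = t
    return sum(gaps) / len(gaps)
```
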
3 Lower Bound of the Cover Time on Trees
In this section, we prove that for any tree with n vertices and any transition probability matrix, the cover time of the random walk is Ω(n log n). In Section 3.1, we show that the tight lower bound on the cover time of a random walk on the star graph is n · h(n) ≈ n log n, where h(i) denotes the i-th harmonic number. Then, in Section 3.2, we show that the star graph has the minimum cover time random walk among trees with n vertices.
3.1 Lower Bound on the Cover Time for a Star Graph
A star graph is a tree whose diameter is 2. We define the star graph with n vertices as Sn = ({v1, · · ·, vn}, {(v1, vi) | i = 2, · · ·, n}); that is, n − 1 vertices are connected to the vertex v1 as pendant-vertices. We first show that for any random walk on Sn the cover time is Ω(n log n). To prove this, we consider the transition probabilities between a vertex v0 and two pendant-vertices v1, v2 connected to v0 in a general setting. Let G = (V, E) be a graph that has two pendant-vertices v1 and v2, both connected to a vertex v0. Let P be a transition probability matrix of a
Fig. 1. Star graph

Fig. 2. A graph with two pendant-vertices
random walk on G, and let p1 and p2 denote the transition probabilities from v0 to v1 and from v0 to v2, respectively. Fixing the probabilities in P other than p1 and p2, we can consider another transition probability matrix P(q1, q2), obtained by replacing p1 and p2 with q1 and q2, respectively. (Of course, q1 and q2 should satisfy 0 ≤ q1 ≤ 1, 0 ≤ q2 ≤ 1, and q1 + q2 = p1 + p2.)

Lemma 1. The cover time CG(P(q1, q2)) is minimized at q1 = q2.

Proof. Let V′ = V \ {v1, v2}, and let pi be the transition probability from v0 to wi ∈ N(v0) ∩ V′ for i = 3, · · ·, deg(v0). The hitting times from v0 to the pendant-vertices v1 and v2 are represented as follows:

H(P(q1, q2); v0, v1) = (1/q1)(1 + q2 + (pH)*), (2)
H(P(q1, q2); v0, v2) = (1/q2)(1 + q1 + (pH)*), (3)

where (pH)* def= Σ_{i=3}^{deg(v0)} pi H(wi, v0). We consider the following two cases.

(i) Starting at u ∈ V′. Hitting times between two vertices other than v1 and v2 do not depend on q1 and q2. This implies that in this case the expected number of transitions for the token to visit all the vertices other than v1 and v2 is fixed. Thus, the cover time is minimum if and only if the expected number of transitions for the token to visit both v1 and v2 is minimum. Let C′ be the expected number of transitions that the token starting from v0 takes to visit v1 and v2. Then,

C′ = 2 + (1/((1 − p*) q1 q2)) ((q1^2 + q2^2)(1 + (pH)*) + q1^3 + q2^3) + (p* + (pH)*)/(1 − p*), (4)

where p* def= Σ_{i=3}^{deg(v0)} pi. In the following, we find q1 and q2 minimizing C′, at which the cover time C(P(q1, q2); u) is minimum. Since q1 + q2 = 1 − p* by the definition of p*, we have

q1^2 + q2^2 = (q1 + q2)^2 − 2 q1 q2 = (1 − p*)^2 − 2 q1 q2,
q1^3 + q2^3 = (1 − p*)((1 − p*)^2 − 3 q1 q2).

Since p* is fixed, Eq. (4) is minimized when q1 q2 takes its maximum, i.e., when q1 = q2.

(ii) Starting at u ∈ {v1, v2}. We simply denote H(P(q1, q2); v0, v1) and H(P(q1, q2); v0, v2) by H(v0, v1) and H(v0, v2), respectively. The cover time C(P(q1, q2); u) is minimized when max{H(v0, v1), H(v0, v2)} is minimum. Let qm = max{q1, q2}; then

max{H(v0, v1), H(v0, v2)} = (1/(1 − p* − qm))(1 + qm + (pH)*). (5)

Eq. (5) is minimized when qm is minimum, that is, when q1 = q2.
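Taking the reconstructed expression (4) at face value (the reconstruction of the garbled formula above is ours), a small numerical sweep with q1 + q2 fixed to 1 − p* confirms that C′ is minimized on the diagonal q1 = q2:

```python
def c_prime(q1, q2, p_star, ph_star):
    # Expected transitions from v0 to visit both pendant vertices, Eq. (4).
    return (2
            + ((q1**2 + q2**2) * (1 + ph_star) + q1**3 + q2**3)
              / ((1 - p_star) * q1 * q2)
            + (p_star + ph_star) / (1 - p_star))

def argmin_q1(p_star, ph_star, grid=999):
    # Sweep q1 over a grid with q1 + q2 = 1 - p* fixed; return the best q1.
    s = 1 - p_star
    best = min((c_prime(s * t / (grid + 1), s * (1 - t / (grid + 1)),
                        p_star, ph_star), t / (grid + 1))
               for t in range(1, grid + 1))
    return best[1] * s
```
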
Lemma 1 can be generalized to the case in which v0 is connected with k pendant-vertices for any integer k ≥ 2. Let G = (V, E) be a graph that has k pendant-vertices v1, · · ·, vk, all of them connected to a vertex v0. Given a transition probability matrix P on G, we can define P(q1, q2, . . . , qk) similarly to P(q1, q2); that is, P(q1, q2, . . . , qk) is the transition probability matrix obtained from P by replacing the transition probability from v0 to vi in P with a variable qi. Then we have the following theorem.

Theorem 2. The cover time CG(P(q1, q2, . . . , qk)) is minimized at q1 = · · · = qk.

By applying Theorem 2 to a star graph repeatedly, we can see that the standard random walk is the fastest random walk on the star graph in terms of the cover time. Since the cover time of the standard random walk on the star graph can be evaluated via the coupon collector's problem, we obtain the following.

Theorem 3. The cover time of the standard random walk on the star graph Sn is 2n · h(n − 1), where h(i) is the i-th harmonic number.

Corollary 1. For any random walk with transition probability matrix P on the star graph Sn with n vertices, CSn(P) = Ω(n log n).
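A Monte Carlo sanity check (code and names ours): the simulated cover time of the standard walk on S20 started at the centre should be of order 2n · h(n − 1) ≈ 142, well below the Θ(n^2) behaviour of a path of the same size.

```python
import random

def star_cover_time(n, trials=2000, seed=3):
    """Average cover time of the standard random walk on the star S_n,
    started at the centre (vertex 0 here; v1 in the paper's notation)."""
    rng = random.Random(seed)
    leaves = list(range(1, n))
    total = 0
    for _ in range(trials):
        seen, steps = {0}, 0
        while len(seen) < n:
            leaf = rng.choice(leaves)  # centre -> uniformly chosen leaf
            seen.add(leaf)
            steps += 1
            if len(seen) < n:
                steps += 1             # leaf -> centre, deterministically
        total += steps
    return total / trials
```
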
3.2 Lower Bound of the Cover Time on Trees
As mentioned at the beginning of Section 3, we show in this subsection that Sn has the fastest cover time random walk among the trees with n vertices. To this end, we first present a way to "compare" two graphs having a similar structure. Consider two graphs GA = (V, EA) and GB = (V, EB) that have almost the same structure. Suppose that GA and GB have a common maximal induced subgraph, say G0 = (V0, E0), and that the induced subgraphs of GA and GB on (V \ V0) ∪ {x} are isomorphic and form the star graph Sk+1 = ({v1, . . . , vk+1}, {(v1, vi) | i = 2, . . . , k + 1}). The only difference is that x = vk+1 in GA but x = v1 in GB. Here we rename the stars of GA and GB as follows:

SA = ({y, z1, · · ·, zk}, {(y, x)} ∪ {(y, zi) | i = 1, . . . , k}),
SB = ({y, z1, . . . , zk}, {(y, x)} ∪ {(x, zi) | i = 1, . . . , k}).

Lemma 4. For any transition probability matrix P for GA and any u ∈ V0, there is a transition probability matrix P′ for GB such that CGA(P; u) ≥ CGB(P′; u).

Proof. It is sufficient to consider the case in which P = (p(A)uv)_{u,v∈V} on GA gives the minimum cover time. Thus, by Theorem 2, we may assume that in P the transition probabilities to pendant-vertices are uniform. Let p(A)xy = p and p(A)yx = q for simplicity; then

p(A)yz1 = p(A)yz2 = · · · = p(A)yzk = (1 − q)/k.
Fig. 3. GA
Fig. 4. GB
For this P, we construct P′ = (p(B)uv)_{u,v∈V} as follows:

p(B)uv = p(A)uv if u, v ∈ V0 \ {x},
p(B)uv = γ p(A)xv if u = x and v ∈ V0,
p(B)uv = p′/(k + 1) if u = x and v ∈ {y, z1, . . . , zk},
p(B)uv = 1 if v = x and u ∈ {y, z1, . . . , zk},

where p′ and 0 ≤ γ ≤ 1 are tuned later. We now show that P′ admits p′ and γ such that CGA(P; u) ≥ CGB(P′; u) holds. For GA and P, let ρ be the expected number of steps it takes for the token starting from x to visit a vertex in V0 \ {x}. Then we have

ρ = 2p / (q(1 − p)). (6)

Similarly, we define ρ′ for GB and P′, which is

ρ′ = 2p′ / (1 − p′). (7)

Since G0 and its transition probabilities are common to (GA, P) and (GB, P′), it suffices to show ρ ≥ ρ′ for the claim. Let s be the expected number of times that the token visits x to cover the vertices in V \ V0 in GA under P. We can see that

s ρ ≥ CSk+1 − 1 + 2/q,

where CSk+1 is the cover time of the standard random walk on the star graph with k + 1 vertices, which is the fastest random walk on Sk+1. For this s, we can set
p′ and γ such that s ρ′ = CSk+1 + 1. This implies that ρ ≥ ρ′, which leads to CGA(P; u) ≥ CGB(P′; u).

By applying Lemma 4 to a tree T repeatedly, we obtain the following theorem and corollary.

Theorem 5. For any tree T with n vertices and any transition probability matrix P on T, CT(P) ≥ CSn(P0) holds, where P0 is the transition probability matrix of the standard random walk on Sn.

Corollary 2. The cover time of any random walk on any tree with n vertices is Ω(n log n).
4 Necessary Conditions for a Linear Cover Time Random Walk
In the previous section, we proved that it is impossible to design a linear cover time random walk on any tree. This result implies that the underlying tree structure of a graph is essential in determining whether a linear cover time random walk is achievable. To investigate this, we decompose a graph G into biconnected components, focusing on the relationship between the derived tree structure and the existence of a linear cover time random walk.
4.1 Biconnected Component Decomposition
Suppose that a partition V1, . . . , Vk of V gives the biconnected component decomposition of G. Let Gi = (Vi, Ei) be the induced subgraph of G on Vi. By definition, each Gi is biconnected. Let α = max_{1≤i≤k} |Vi|, i.e., the maximum number of vertices in a biconnected component. We say two biconnected components Gi and Gj are adjacent if there exists an edge (u, v) ∈ E such that u ∈ Vi and v ∈ Vj (Figure 5), or if there exists a vertex (cut-point) u ∈ Vi ∩ Vj (Figure 6). In the former case, we call the cut-points u and v bridge-end-vertices; in the latter case, we call the cut-point u a shared-vertex.
Fig. 5. Bridge-end-vertex

Fig. 6. Shared-vertex
Given a graph G = (V, E), the operation of vertex-contraction of Vc to vc produces a graph G′ = (V′, E′) such that

V′ = (V \ Vc) ∪ {vc},
E′ = (E \ {(x, y) | x ∈ Vc}) ∪ {(vc, y) | ∃x ∈ Vc, (x, y) ∈ E}.
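The contraction operation transcribes directly into code; the sketch below (our own helper, with the natural proviso, also ours, that edges internal to Vc disappear rather than become self-loops) contracts a vertex set in an undirected simple graph.

```python
def vertex_contraction(V, E, Vc, vc):
    """Contract the vertex set Vc to a single new vertex vc.
    V: set of vertices; E: set of frozenset edges {x, y}."""
    V2 = (V - Vc) | {vc}
    E2 = {e for e in E if not (e & Vc)}        # edges untouched by Vc
    E2 |= {frozenset({vc, y})                  # reattach boundary edges
           for e in E if e & Vc
           for y in e - Vc}
    return V2, E2
```

Contracting a triangle {1, 2, 3} of a four-vertex graph leaves a single edge to the remaining vertex.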
If a biconnected component Vi has at most one shared-vertex u, we set Vc = Vi \ {u}. Let G′ be the graph produced by the vertex-contraction of Vc to vc. Although we omit the proof, we have the following lemma.¹

Lemma 6. For any transition probability matrix P on G, there is a transition probability matrix P′ on G′ such that CG(P) ≥ CG′(P′).

We can construct a tree TG with k vertices by applying Lemma 6 to the graph G repeatedly, as follows:

1. Set i = 1.
2. Choose a component Vi that has at most one shared-vertex.
3. If Vi includes a shared-vertex u, then set Vc = Vi \ {u}; otherwise set Vc = Vi.
4. Perform the vertex-contraction of Vc to vi.
5. If i < k, then set i = i + 1 and go to step 2.
Since TG is a tree, we obtain the following theorem, which gives necessary conditions for G to have a linear cover time random walk.

Theorem 7. If a graph G has a linear cover time random walk, then G satisfies the following conditions: (i) α = Ω(log n), (ii) k = O(n/log n), and (iii) diam(TG) = O(√n), where α, k, and diam(TG) are as defined in this subsection.

Proof. For any transition probability matrix P on G, there is a transition probability matrix P′ on TG such that CG(P) ≥ CTG(P′) (Lemma 6). Since k ≥ n/α and α ≤ n hold by definition, we have

CTG(P′) = Ω(k log k) = Ω((n/α) log(n/α))

by Corollary 2. Thus, if CG(P) = O(n), then both α = Ω(log n) and k = O(n/log n) must hold. Also, since CTG = Ω(diam(TG)^2) is shown in [11], diam(TG) = O(√n) must hold.
5 Sufficient Condition for a Linear Cover Time Random Walk
In this section, we present a technique for designing a fast random walk that exploits the cycle structures of a graph; it is based on the Hamiltonian component decomposition. Using this technique, we can show that several classes of graphs have linear cover time random walks.

¹ Interested readers can find a proof in http://tcslab.csce.kyushu-u.ac.jp/~nonaka/saga2009/proof.pdf
5.1 Linear Cover Time Random Walk on a Hamiltonian Graph
For a graph with n vertices, a simple cycle of length n is called a Hamilton cycle, and a graph is called Hamiltonian if it has a Hamilton cycle. In this paper, we regard a graph consisting of a single vertex as Hamiltonian; also, since graphs are assumed to be simple, a graph with two vertices is not Hamiltonian. It is easy to show that a Hamiltonian graph has a linear cover time random walk. Denote a Hamilton cycle of G by (v1, v2, . . . , vn, v1), and define the transition probability matrix

pij = 1 if j ≡ i + 1 (mod n), and pij = 0 otherwise. (8)

In this random walk, the token moves along the Hamilton cycle in one direction and visits all vertices in just n − 1 steps.
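The "walk" of Eq. (8) is in fact deterministic; a few lines of code (names ours) confirm the n − 1 step cover.

```python
def hamilton_cover_steps(n, start=0):
    # Token follows the Hamilton cycle deterministically, as in Eq. (8).
    visited, v, steps = {start}, start, 0
    while len(visited) < n:
        v = (v + 1) % n     # move to the successor on the cycle
        visited.add(v)
        steps += 1
    return steps
```
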
5.2 Hamiltonian Component Decomposition
A Hamiltonian component decomposition of a graph is defined as follows.

Definition 8. For a graph G = (V, E), consider a partition V1, V2, · · ·, Vk of V. The set of induced subgraphs H1, H2, . . . , Hk of G on V1, V2, · · ·, Vk is called a Hamiltonian component decomposition of size k if each Hi is Hamiltonian.

By the definition of a Hamiltonian graph, the size of each Vi (i = 1, . . . , k) in a Hamiltonian component decomposition H1, . . . , Hk is either 1 or at least 3. Note that a graph may have several Hamiltonian component decompositions, and that it is NP-hard to find a Hamiltonian component decomposition of minimum size. Now we explain how to construct a fast random walk from a Hamiltonian component decomposition H1 = (V1, E1), H2 = (V2, E2), . . . , Hk = (Vk, Ek) of G. Since each Hi is Hamiltonian, we can define an oriented subgraph Ci = (Vi, EiC) of Hi, where EiC = {(u^(i)_j, u^(i)_{j+1}) | j = 1, . . . , |Vi| − 1} ∪ {(u^(i)_{|Vi|}, u^(i)_1)} for a Hamilton cycle (u^(i)_1, . . . , u^(i)_{|Vi|}) of Hi.
Note that if (u, v) ∈ E \ ∪_{i=1}^k Ei, then u and v belong to different Hamiltonian components. We choose a set Et ⊆ E \ ∪_{i=1}^k Ei such that G′ = (V, Et ∪ ∪_{i=1}^k Ei) is connected. We propose a transition probability matrix PH based on these Hi's, Ci's, and Et. Let NH(u) = {v ∈ N(u) | (u, v) ∈ ∪_{i=1}^k EiC ∨ (u, v) ∈ Et} and d′(u) = |NH(u)|. We define the transition probability matrix PH by

puv = 1/d′(u) if v ∈ NH(u), and puv = 0 otherwise.

In this random walk, the token circulates via the tree structure of the Hamiltonian components. In the following, we show that the cover time of PH is O(nk). We first bound the hitting time H(PH; u, v) for (u, v) ∈ Et. Intuitively, each (u, v) ∈ Et is a bridge in the tree structure of the Hamiltonian components; (u, v) divides this tree structure into two (sub)tree structures, say
Fig. 7. Transition probability for a Hamiltonian component

Fig. 8. Division by edge e ∈ Et
G1 = (V1, E1) and G2 = (V2, E2). Let nuv = |V1|, muv = |E1 ∩ Et| + 1, nvu = |V2|, and mvu = |E2 ∩ Et| + 1. We then have the following lemma, whose proof we omit.

Lemma 9. H(PH; u, v) ≤ nuv + 2muv − 1. (9)

Fig. 9. The sequence of covering the graph (H: Hamiltonian component)
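A toy instance of the PH construction (graph, decomposition, and names all our own): two triangles H1, H2 joined by a single bridge in Et. Each step of the token follows the cycle orientation or, at a bridge endpoint, picks uniformly between continuing along the cycle and crossing the bridge.

```python
import random

# Oriented cycle edges of H1 = (0,1,2) and H2 = (3,4,5), plus bridge Et = {(2,3)}.
NH = {0: [1], 1: [2], 2: [0, 3], 3: [4, 2], 4: [5], 5: [3]}

def ph_cover_time(rng, start=0):
    visited, v, steps = {start}, start, 0
    while len(visited) < len(NH):
        v = rng.choice(NH[v])   # p_uv = 1/d'(u) over NH(u)
        visited.add(v)
        steps += 1
    return steps

def avg_ph_cover(trials=2000, seed=11):
    rng = random.Random(seed)
    return sum(ph_cover_time(rng) for _ in range(trials)) / trials
```

The average cover time stays small, consistent with the O(nk) bound for n = 6 and k = 2.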
Theorem 10. We can design a random walk whose cover time is O(nk), where k is the number of Hamiltonian components.

Proof. Suppose a graph G = (V, E) is decomposed into Hamiltonian components H1, . . . , Hk. We consider a traversal sequence v1 → v2 → · · · → vt that traverses the vertices according to the orientation defined by the EiC's, via the tree structure of the Hamiltonian components (Figure 9). Then the cover time is bounded by

CG(PH; v1) ≤ Σ_{i=1}^{t−1} H(PH; vi, vi+1). (10)
By Lemma 9, the hitting time H(PH; vi, vi+1) from a vertex to an adjacent vertex is estimated as follows:

H(PH; vi, vi+1) = 1 if ∃j: vi−1, vi, vi+1 ∈ Vj, and H(PH; vi, vi+1) = O(n) otherwise. (11)

In Eq. (10), the number of O(n) terms is at most 4k − 1. Therefore we have CG(PH) = O(nk).

Corollary 3. We can design a linear cover time random walk if the number of Hamiltonian components is constant.
6 Conclusion
In this paper, we investigated the existence of linear cover time random walks. We first proved that for any tree with n vertices and any transition probability matrix, the cover time is Ω(n log n). Using this fact, we presented necessary conditions for the existence of a linear cover time random walk, based on the biconnected component decomposition. We then proposed a way of designing a random walk based on the Hamiltonian component decomposition. Since the cover time of this random walk is O(nk), where k is the number of Hamiltonian components, a graph G has a linear cover time random walk whenever G has a Hamiltonian component decomposition of constant size. There remains a large gap between the sufficient condition and the necessary conditions for the existence of a linear cover time random walk; an interesting open problem is to find a condition that is both necessary and sufficient.
References

[1] Aleliunas, R., Karp, R.M., Lipton, R.J., Lovász, L., Rackoff, C.: Random walks, universal traversal sequences, and the complexity of maze problems. In: Proc. 20th Symposium on Foundations of Computer Science, pp. 218–223 (1979)
[2] Aldous, D.J.: On the time taken by random walks on finite groups to visit every state. Z. Wahrsch. verw. Gebiete 62, 361–363 (1983)
[3] Aldous, D.J., Fill, J.: Reversible Markov Chains and Random Walks on Graphs, http://www.stat.berkeley.edu/users/aldous/RWG/book.html
[4] Brightwell, G., Winkler, P.: Maximum hitting time for random walks on graphs. Journal of Random Structures and Algorithms 3, 263–276 (1990)
[5] Feige, U.: A tight upper bound on the cover time for random walks on graphs. Journal of Random Structures and Algorithms 6, 51–54 (1995)
[6] Feige, U.: A tight lower bound on the cover time for random walks on graphs. Journal of Random Structures and Algorithms 6, 433–438 (1995)
[7] Feller, W.: An Introduction to Probability Theory and its Applications, vol. 1. Wiley, New York (1967)
[8] Gilks, W.R., Richardson, S., Spiegelhalter, D.J.: Markov Chain Monte Carlo in Practice. Chapman & Hall, Boca Raton (1996)
[9] Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
[10] Ikeda, S., Kubo, I., Okumoto, N., Yamashita, M.: Impact of Local Topological Information on Random Walks on Finite Graphs. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 1054–1067. Springer, Heidelberg (2003)
[11] Ikeda, S., Kubo, I., Yamashita, M.: The hitting and cover times of random walks on finite graphs using local degree information. Theoretical Computer Science 410(1), 94–100 (2009)
[12] Lovász, L.: Random walks on graphs: a survey. In: Combinatorics, Paul Erdős is Eighty, Bolyai Society Mathematical Studies, vol. 2, pp. 1–46, Keszthely, Hungary (1993)
[13] Matthews, P.: Covering problems for Markov chains. The Annals of Probability 16(3), 1215–1228 (1988)
[14] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equations of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087–1091 (1953)
[15] Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)
[16] Nonaka, Y., Ono, H., Sadakane, K., Yamashita, M.: The Hitting and Cover Times of Metropolis Walks (submitted for publication)
[17] Sinclair, A.: Algorithms for Random Generation and Counting. Birkhäuser, Boston (1993)
Propagation Connectivity of Random Hypergraphs

Robert Berke and Mikael Onsjö

Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo, Japan
{berke9,mikael}@is.titech.ac.jp
Abstract. We consider the average-case MLS-3LIN problem, the problem of finding a most likely solution for a given system of perturbed 3LIN-equations generated with equation probability p and perturbation probability q. Our purpose is to investigate when certain message passing algorithms solve this problem whp. We consider the case q = 0 (no perturbation occurs) and analyze the execution of (a simple version of) our algorithm. To characterize the problem instances on which the execution succeeds, we consider their natural 3-uniform hypergraph representation and introduce the notion of "propagation connectivity", a generalized connectivity property of 3-uniform hypergraphs. The execution succeeds on a given system of 3LIN-equations if and only if the representing hypergraph is propagation connected. We show upper and lower bounds on the equation probability p such that propagation connectivity holds whp for random hypergraphs representing MLS-3LIN instances.
1 Introduction
This paper gives basic and preliminary investigations towards analyzing the average-case performance of message passing algorithms that find a "most likely solution" for a given system of "perturbed" linear equations over GF(2). In particular, we consider the case in which every equation has exactly three variables, such as xi ⊕ xj ⊕ xk = b; we call these 3LIN-equations. For our average-case scenario, we consider the following way to generate a set of linear equations over n variables for a given n: for a randomly chosen "planted solution", generate a set of linear equations by putting each 3LIN-equation that is consistent with the planted solution into the set independently with probability p. Then "perturb" some of those equations by flipping their right hand side values independently at random with some small probability q. We denote this random generation scheme by 3LINn,p,q. By a "most likely solution" we intuitively mean the planted solution. Our problem is essentially the same as the one that has been studied quite well as the MAX-3LIN problem or the MAX-3XORSAT problem; see, e.g., [Has01]. The problem is NP-hard in the worst case, and it is believed (see, e.g., [Ale03]) that the problem is hard even on average. Nevertheless, it seems possible to solve it in polynomial time on average to some extent, particularly by using

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 117–126, 2009.
© Springer-Verlag Berlin Heidelberg 2009
message passing algorithms such as the one obtained by following the belief propagation recipe [MMC98,Pea88]. In fact, the Bisection problem (see, for instance, [WY07]), which can be regarded as a special case of the MAX-2LIN problem, can be solved in polynomial time on average by a message passing algorithm based on belief propagation, and its performance has been analyzed [WY07] using spectral techniques. On the other hand, little investigation has been reported towards positive results for the MAX-3LIN problem, in particular via message passing algorithms. (Cf. an important improvement on the polynomial-time solvability is reported in [CCF09].) We would like to understand the situation (more specifically, the ranges of p and q) in which message passing algorithms work. We consider a generic hard-decision type message passing algorithm (see Figure 2). As a preliminary analysis, we first investigate its simplest version, namely the algorithm that determines the values of all variables from the correct assignment to two initially chosen variables. (Precisely speaking, the execution we analyze tries all 2^2 possible initial assignments; the correct one is certainly included among those trials.) We primarily consider the case q = 0, that is, the case in which no perturbation is introduced and the system of equations indeed has a complete solution. In this case, the condition under which this value propagation process succeeds in obtaining the complete solution is characterized in terms of the "propagation connectivity" of the 3-uniform hypergraph that naturally represents the given set of equations. Furthermore, from our random generation scheme of 3LIN-equations (over n variables), we may assume that the corresponding hypergraph is simply generated by the standard random hypergraph generation scheme Hn,p with hyperedge probability p.
In a nutshell, our task is to determine hyperedge probabilities p such that almost all random hypergraphs are propagation connected. Propagation connectivity is a generalization of standard connectivity for ordinary graphs; in fact, standard connectivity characterizes the solvability of 2LIN-equations by our simple message passing algorithm. It has been shown in [ER61] that p = Θ(log n/n) is the edge probability threshold for connectivity of random graphs generated by Gn,p. We obtain some evidence that a possible hyperedge probability threshold p for propagation connectivity of 3-uniform random hypergraphs generated by Hn,p lies between (log log n)^2/(n log n) and 1/(n(log n)^2). The case q = 0 applies directly to the lower bound on the proposed threshold, which is one of our main points. For non-trivial noise it is not obvious from these results that the algorithm will work when the hyperedge density is higher than the upper bound. It is intuitively clear, however, that for reasonably small (even constant) noise the situation cannot change very much. What happens as the noise approaches 1/2 more or less quickly with n is an interesting question. For full noise, i.e., q = 1/2, no algorithm can hope to recover the planted solution.
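One plausible formalization of the marking process behind propagation connectivity (our own reading of the informal description above; the paper's precise definition appears later): starting from a pair of marked vertices, any hyperedge with exactly one unmarked vertex marks that vertex, and the hypergraph is propagation connected if some starting pair eventually marks every vertex.

```python
from itertools import combinations

def propagates(n, edges, u, v):
    # Repeatedly mark the third vertex of any hyperedge that
    # already has two marked vertices, until a fixpoint.
    marked = {u, v}
    changed = True
    while changed:
        changed = False
        for e in edges:
            rest = e - marked
            if len(rest) == 1:
                marked |= rest
                changed = True
    return len(marked) == n

def propagation_connected(n, edges):
    edges = [frozenset(e) for e in edges]
    return any(propagates(n, edges, u, v)
               for u, v in combinations(range(n), 2))
```

A chain of overlapping hyperedges propagates from its first pair; two disjoint hyperedges do not.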
2 Preliminaries: Problem and Algorithm
We will mainly use standard notions and notations from the design and analysis of algorithms. One exception is that we use +1 for 0 and −1 for 1; then
the GF(2) addition, i.e., the XOR operation ⊕, is simply integer multiplication; that is, we have +1 ⊕ +1 = +1, −1 ⊕ −1 = +1, etc. This notation is somewhat convenient because we can now use 0 (for undefined) and assume that x ⊕ 0 = 0 ⊕ x = 0. First we define our target problem precisely as one of the Most Likely Solution problems (in short, MLS problems) introduced in [OW09]. For this, we begin with a way to generate problem instances, i.e., (perturbed) systems of 3LIN-equations. Our generation scheme 3LINn,p,q is specified by three parameters n, p, and q, which are respectively the number of variables, the equation generation probability, and the perturbation probability. Since the scheme 3LINn,p,q is natural, it should be easy to grasp from the outline illustrated in Figure 1; yet we define it precisely, to prepare notions and notations.

Fig. 1. 3LINn,p,q: an instance generation scheme

Let the parameter n, the number of variables,
be fixed. Let x1, ..., xn denote variables, each of which takes a value in {+1, −1}. A monotone 3XOR-clause is of the form xi ⊕ xj ⊕ xk with 1 ≤ i < j < k ≤ n. In our random model 3LINn,p,q, we first generate a set E = {e1, ..., em} of monotone 3XOR-clauses by selecting each of the C(n, 3) clauses independently with probability p; hence, m = p·C(n, 3) on average. We also fix one assignment a* = (a*1, ..., a*n) of x1, ..., xn as a planted solution; we may assume that it is chosen uniformly at random from all 2^n assignments. Then, for each 3XOR-clause es of E, define b*s as the value of es on a*, and define one 3LIN-equation "es = b*s". Finally, a set F = {f1, ..., fm} of perturbed equations is obtained by flipping each b*s to −b*s independently with probability q; that is, each fs of F is "es = bs", where bs is either b*s (with probability 1 − q) or −b*s (with probability q). Thus F is our problem instance, and 3LINn,p,q denotes this way of generating F. Note that (provided q ≠ 0) F can be generated from any assignment a. For any 3LIN-equation set F and any assignment a, let Pr[3LINn,p,q(a) = F] denote the probability that F is generated from a as explained above. Now we define our problem as follows.
R. Berke and M. Onsj¨ o
Most Likely 3LIN Solution (MLS-3LIN)
input: A set F of (perturbed) 3LIN-equations;
task: Compute a that maximizes Pr[3LIN n,p,q(a) = F].

We consider only q < 1/2, and it is easy to see that in this case the most likely solution is the one that maximizes the number of satisfied equations. Thus, in this sense, our problem is the same as the MAX-3LIN problem. We nevertheless propose to consider the MLS-3LIN problem because for this problem it is natural to assume that an input instance F is generated following 3LIN n,p,q. In fact, when p and q are in a reasonable range, e.g., the one guaranteed by Proposition 1 below, we may assume that the planted solution is the unique solution for F. Throughout this paper, by "whp" (with high probability) we mean probability 1 − o(1) with respect to n.

Proposition 1. For some sufficiently large constants c1 and c2, if we have (1) p > c1 log n / n^2, and (2) q < 1/2 − c2 √(log n/(n^2 p)), then whp the planted solution a∗ is the unique solution to the MLS-3LIN problem. (Note that the first condition implies some nontrivial range for q in the second condition.)

We can consider a natural correspondence between systems of 3LIN-equations and hypergraphs. Let E and F be, respectively, a set of monotone 3XOR-clauses on n variables and a system of 3LIN-equations obtained from E. Then the representing hypergraph H = (V, HE) is defined by identifying each variable xi with a vertex and each clause e ∈ E with a hyperedge. (For notational convenience, we express vertices and edges by indices; that is, V = {1, ..., n} and HE = { {i, j, k} : xi ⊕ xj ⊕ xk ∈ E }.) This hypergraph H is called the representation of F. Note that these hypergraphs are all 3-uniform, meaning all their hyperedges consist of three vertices. Let Hn,p denote the standard scheme that generates a random 3-uniform hypergraph with n vertices by adding each hyperedge independently with probability p.
Then it is easy to see that a hypergraph H representing F generated by 3LIN n,p,q can be regarded as a hypergraph generated by Hn,p. Later we will consider systems of 2LIN-equations, in which case a similar correspondence holds between systems of 2LIN-equations and ordinary graphs; likewise, 2LIN n,p,q corresponds to the standard random graph scheme Gn,p. Since our problem is equivalent to the MAX-3LIN problem, it is NP-hard in the worst case. Nevertheless, some message passing algorithms, including one based on (loopy) belief propagation, appear to work reasonably well on average. Here we consider the simple hard-decision message passing algorithm given in Figure 2. Note that the algorithm is stated in a slightly general form: neither MAXSTEP nor the way of selecting the set I0 of n0 initial variables is specified. In the following, we consider a variation of this base algorithm and discuss its average-case performance.
Propagation Connectivity of Random Hypergraphs
procedure BaseMessagePassingAlgo for input F = {f1, ..., fm};
  I0 ← a set of some n0 initial variables (chosen in some way);
  for all possible assignments to I0 do {
    repeat MAXSTEP times do {
      for all xi, 1 ≤ i ≤ n, in parallel do {
        xi ← sgn( Σ_{fs ∈ Fi} m_{fs→i} );   —(∗)
      }
      if stabilized then exit;
    }
    a ← the obtained solution;
    if a is better than the current best then replace the current best;
  }
  output the current best solution;
end-procedure

• Fi = the set of equations containing xi;
• m_{fs→i} = bs ⊕ xj ⊕ xk, where fs = "xi ⊕ xj ⊕ xk = bs"
  (m_{fs→i} = 0 if some of xi, xj, xk is 0);
• sgn(z) = +1 (resp., −1) if z > 0 (resp., z < 0), and 0 if z = 0;

Fig. 2. A base message passing algorithm for MLS-3LIN
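For concreteness, the generation scheme and one run of the inner loop of Figure 2 can be sketched in Python. This is an illustrative sketch, not the authors' code; in particular, keeping the initial variables I0 clamped during the parallel updates is our own assumption, made so that already-determined values persist as in the q = 0 discussion of the next section.

```python
import itertools
import random

def generate_3lin(n, p, q, rng):
    """Sample a perturbed 3LIN system following the 3LIN n,p,q scheme.

    Variables take values in {+1, -1}, so XOR is integer multiplication."""
    planted = [rng.choice((+1, -1)) for _ in range(n)]
    system = []  # (i, j, k, b) encodes the equation x_i XOR x_j XOR x_k = b
    for i, j, k in itertools.combinations(range(n), 3):
        if rng.random() < p:
            b = planted[i] * planted[j] * planted[k]
            if rng.random() < q:
                b = -b  # perturbation: flip the right-hand side
            system.append((i, j, k, b))
    return system, planted

def sgn(z):
    return (z > 0) - (z < 0)

def message_passing(system, n, init, clamped, maxstep):
    """Inner loop of the base algorithm: x_i <- sgn(sum of messages m_{f_s -> i}).

    Entries of `init` are in {+1, -1, 0}, where 0 means "undefined";
    variables in `clamped` keep their initial values."""
    x = list(init)
    for _ in range(maxstep):
        new_x = list(x)
        for idx in range(n):
            if idx in clamped:
                continue
            total = 0
            for i, j, k, b in system:
                if idx in (i, j, k):
                    u, v = (t for t in (i, j, k) if t != idx)
                    total += b * x[u] * x[v]  # message is 0 if x_u or x_v is 0
            new_x[idx] = sgn(total)
        if new_x == x:  # stabilized
            break
        x = new_x
    return x
```

For q = 0 and an initial assignment consistent with the planted solution, the values propagate along equations exactly in the manner analyzed in the next section.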
3  Propagation Connectivity
We consider the case q = 0, that is, the case in which no violation is introduced and the system of equations indeed has a complete solution. In this case, the problem is nothing but solving a given (and solvable) system of equations, which can be done efficiently by, e.g., Gaussian elimination. It may, however, still be interesting to study whether some message passing algorithm works for this easy case. For a message passing algorithm, we consider the execution of the algorithm of Figure 2 with n0 = 2, that is, the execution starting from an initial assignment to two variables. More precisely, we set MAXSTEP = n and execute the algorithm for all possible sets of two variables for I0 (and use the best solution). We discuss whether the planted solution is obtained by at least one of these executions. Consider a given set F of 3LIN-equations generated by 3LIN n,p,0. Since the algorithm in Figure 2 tries all possible initial assignments, we may consider the inner repeat loop execution with the initial assignment consistent with the planted solution. Then it is easy to see that the value of each variable does not change once it is determined. Consider the process of determining these values. First, the values of the two variables of I0 are given by the initial assignment. Then, if (and only if) there is an equation containing these two variables, the algorithm can determine the value of the third variable of that equation by the updating formula (∗). This process is repeated as long as there exists some equation whose two variables are assigned +1 or −1 and whose third variable is not yet assigned. Thus, the execution of the inner repeat loop can
be regarded as a process of assignment propagation from the assignment of two variables. From this viewpoint, we refer to this inner loop execution for a given I0 as 2-propagation from I0, and to the $\binom{n}{2}$ executions trying all possible pairs of variables for I0 as all-2-propagation. In the following we characterize the cases in which all-2-propagation solves/does not solve random 3LIN-equations with high probability. As explained in the previous section, we can represent each MLS-3LIN instance F by a hypergraph H. A propagation process corresponding to the above assignment propagation on F can be defined for the hypergraph H. That is, the process first marks two initial vertices (corresponding to the variables of I0), and then marks an unmarked vertex whenever it lies in some hyperedge of H whose other two vertices are already marked. The marking process terminates when no new vertex can be marked (including the case that all vertices in V are marked). For any set I0, we say that the marking process from I0 succeeds if all vertices are marked by the process started from the vertices of I0, and that it fails otherwise. This marking process gives rise to the following new connectivity notion for hypergraphs: for any h ≥ 2, a 3-uniform hypergraph H is h-propagation connected if the marking process succeeds from some initial vertex set I0 of size h. We simply say that H is propagation connected if it is 2-propagation connected. The following characterization is then clear from the above discussion.

Proposition 2. Let F be any set of 3LIN-equations generated according to 3LIN n,p,0. For any set I0 of two variables, 2-propagation from I0 finds the planted solution if and only if the marking process succeeds from I0. Thus, all-2-propagation finds the planted solution if and only if the hypergraph representation of F is propagation connected.
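The marking process and the resulting propagation-connectivity test can be sketched directly (an illustrative Python fragment with names of our choosing; the all-pairs loop is the brute-force analogue of all-2-propagation):

```python
from itertools import combinations

def marking_process(hyperedges, start):
    """Mark vertices of a 3-uniform hypergraph: an unmarked vertex gets
    marked whenever it lies in a hyperedge whose other two vertices are
    already marked; stop when no new vertex can be marked."""
    marked = set(start)
    changed = True
    while changed:
        changed = False
        for e in hyperedges:
            unmarked = [v for v in e if v not in marked]
            if len(unmarked) == 1:
                marked.add(unmarked[0])
                changed = True
    return marked

def propagation_connected(n, hyperedges):
    """H is (2-)propagation connected iff the marking process succeeds
    from some pair of initial vertices (cf. all-2-propagation)."""
    return any(len(marking_process(hyperedges, pair)) == n
               for pair in combinations(range(n), 2))
```

Given an MLS-3LIN instance F in the tuple encoding above, its representing hypergraph is simply `[(i, j, k) for (i, j, k, b) in F]`.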
As explained in the previous section, hypergraphs corresponding to 3LIN-equations generated by 3LIN n,p,0 are random hypergraphs generated by the standard random hypergraph scheme Hn,p. Thus, in the rest of this section, we discuss the propagation connectivity of random hypergraphs following Hn,p. For the standard connectivity of ordinary random graphs following Gn,p, it is known [ER61] that Θ(log n/n) is a threshold for p such that a random graph following Gn,p is connected whp. We believe that a similar threshold exists for our generalized connectivity, which is of some interest in itself. Here we give some supporting evidence by showing upper and lower bounds on p such that a random hypergraph following Hn,p is propagation connected whp. For our analysis, the following expansion lemma plays an important role. It states that either the marking process terminates early or it succeeds in marking the entire hypergraph.

Lemma 1. Suppose that d(n) and k(n) are such that n ≥ d(n) > n^{−1/4}, k(n) tends to infinity as n → ∞, and d(n)k(n) > c log k(n) for a sufficiently large constant c. Then a random hypergraph following Hn,d(n)/n whp contains no propagation connected component C with k(n) ≤ |C| < n.

Proof. Assume that n is sufficiently large, and let d = d(n) and p = d/n. A propagation connected component C is a maximal (w.r.t. vertex inclusion) subset
of vertices of V such that C is propagation connected. For any k with k(n) ≤ k ≤ n − 1, we estimate the probability ρ(k) that H ∼ Hn,p contains a propagation connected component C with k vertices. Let us first bound the probability that a subset S of k vertices of V is propagation connected. We observe that S needs to be connected (in the usual sense), and thus S contains a spanning hypertree T; denote the number of edges of T by e(T). Note that if S is indeed propagation connected, then it certainly contains at least k − 2 hyperedges, as each hyperedge in the marking process can mark at most one new vertex. Hence S also contains another k − 2 − e(T) hyperedges besides those of T; an upper bound for the probability of this event is $\binom{\binom{k}{3}}{k-2-e(T)} p^{k-2-e(T)}$. Thus the probability that a given set S with k vertices is propagation connected is at most

$$\sum_{T:\ \text{hypertree on } k \text{ vertices}} p^{e(T)} \binom{\binom{k}{3}}{k-2-e(T)} p^{k-2-e(T)} \;\le\; (k/2^{1/2})^{k-2}\, \binom{\binom{k}{3}}{k/2}\, p^{k-2}.$$

The last bound is obtained by noting that e(T) ≥ (k − 1)/2, and that the number of (vertex labeled) 3-uniform hypertrees with k vertices is at most (k/2^{1/2})^{k−2} [SP06]. We now bound ρ(k) by additionally taking into account that S actually forms a propagation connected component C, and that therefore no vertex outside of S is marked. We have

$$\rho(k) \;\le\; \binom{n}{k} \cdot (k/2^{1/2})^{k-2} \cdot p^{k-2} \cdot \binom{\binom{k}{3}}{k/2} \cdot (1-p)^{\binom{k}{2}(n-k)}.$$

Then, by considering the two cases k ≤ n/2 and n/2 < k ≤ n − 1, we can show that ρ(k) ≤ e^{−k} or ρ(k) ≤ e^{−√n}, respectively. From these bounds we can conclude that the total probability of having some component C of size ≥ k(n) (with C ≠ V) is o(1).

Now we prove our upper bound result: we show that if p > c(log log n)^2/(n log n) for a constant c, then whp propagation connectivity holds. Let us roughly estimate the function k(n) of the lemma for this range of p. Since p in Lemma 1 is defined as d(n)/n, we may assume that d(n) ≥ (log log n)^2/log n; hence, in order to satisfy the condition of the lemma, it is enough to consider k(n) = c log n for some constant c > 0. Thus, it suffices to show that whp the marking process can mark more than c log n vertices. Before starting our analysis, let us introduce some random variables and give some preliminary analysis. First note that our marking process can be split into stages: at each stage we only use the vertices marked in the previous stage for propagation. For any t ≥ 1, let Lt (resp., Kt) denote the number of newly marked (resp., all marked) vertices at stage t, and let K0 = L0 = 2. Kt can be written as $K_t = \sum_{i=0}^{t} L_i$. Note that L_{t+1} ∼ Bin(Kt Lt (n − Kt), p); but since we are considering relatively small Kt, e.g., O(log n), we may well approximate it by Bin(Kt Lt n, p).
Theorem 1. For any constant c, if we have p > c(log log n)^2/(n log n), then H ∼ Hn,p is whp propagation connected.
Proof. Suppose that pn = o(1), which is intuitively the most difficult case; the other case is left to the reader. Consider the first step of the marking process and estimate L1. Note that L1 ∼ Bin(4n, p), that μ1 = E[L1] > (log log n)^2/log n, and that μ1 = o(1) by assumption. We show that the probability that L1 is much larger than this expectation is not too small. For choosing the two starting vertices, split the vertex set into subsets of cardinality n′ (to be fixed later). Consider any one of the subsets; for any x, let Ex be the event that there is no starting pair of vertices in the subset such that L′1 = x, where L′1 is the number of vertices marked inside the subset from this starting pair. Let p′ = Pr[L′1 = x], which for sufficiently large n is bounded from below as $\binom{4n'}{x} p^x > \exp(x \log(pn'/x))$. Then

$$\Pr[E_x] = (1-p')^{\binom{n'}{2}} \approx \exp\Bigl(-\frac{n'^2}{2}\,\exp\bigl(x \log(pn'/x)\bigr)\Bigr),$$

which is less than a given ε if x log((pn′)^{−1}) + x log x < 2 log(n′/log(ε^{−1})). This condition is satisfied, e.g., by the following choice of parameters: n′ = n/(4 log n), ε = n^{−4}, and x = log n/log log n. Therefore, we conclude that whp there are at least 4 log n starting vertex pairs such that from each of them we may mark, in the first step of the marking process, a set of size log n/log log n that is disjoint from all the others. Now consider L2, this time without restricting the expansion to a particular subset. The expectation is E[L2] = L1^2 np > c log n, and the probability of not reaching this expectation is obviously ≤ 1/2. Thus, with probability > 1 − (1/2)^{4 log n} = 1 − n^{−4}, there is some choice of starting vertices such that L2 is larger than c log n. But by applying Lemma 1 with, e.g., k(n) = c log n, we see that if c log n vertices can be marked, then so can the entire graph. Hence H is whp propagation connected.

With the same line of reasoning, but applying upper estimates on the Li, we can show the asymptotic absence of propagation connectivity when p is somewhat smaller.

Theorem 2. For any constant c, if we have p < c/(n(log n)^2), then H ∼ Hn,p is whp not propagation connected.
Proof. Suppose (for induction) that for a given i it holds that Ki = O(log n) and Li = o(log n). Then, since L_{i+1} ∼ Bin(Ki Li n, p), we have μ = E[L_{i+1}] = Ki Li np = o(1). For some δ = δ(n) that tends to infinity, the standard Chernoff bound gives Pr[L_{i+1} ≥ μδ] ≤ exp(−μδ log δ). This is less than n^{−10}, e.g., if δ is chosen so that δμ = (11 log n)/log log n. Hence, if Ki = O(log n) for some i, then we may assume that indeed all L_{i′} with i′ ≤ i + 1 are o(log n). Note that $K_{i+1} = \sum_{j=0}^{i+1} L_j$ and that $\Pr[L_j = x_j] \le (o(1)\,e/x_j)^{x_j} < x_j^{-x_j}$ for any xj (where 0^0 = 1 in this notation). Using these, we estimate Pr[K_{i+1} = s] for a given number s. For this, consider the ways of picking i + 1 non-negative integers x1, ..., x_{i+1}
such that their sum becomes s; there are $\binom{s+i}{i} < \exp(i \log(e(s+i)/i))$ ways of doing so. In each case we have

$$\prod_{j=1}^{i+1} \Pr[L_j = x_j] \;\le\; \exp\Bigl(-\sum_{j=0}^{i+1} x_j \log x_j\Bigr) \;\le\; \exp\Bigl(-s \log \frac{s}{i+1}\Bigr).$$

Hence, by a union bound,

$$\Pr[K_{i+1} = s] \;\le\; \exp\Bigl(i \log \frac{e(s+i)}{i} - s \log \frac{s}{i+1}\Bigr),$$

and again this is less than n^{−10} if, e.g., i ≤ 10 log n and s = 40 log n. But then, by recursive application of the union bound, we have with probability > 1 − n^{−9} that Ki = O(log n) and Li = o(log n) for all i ≤ 10 log n, which was the assumption made at the beginning of this proof. Hence, with probability > 1 − n^{−9}, we are in a situation where in each of the 10 log n steps the expectation of the increment E[Li] is o(1). Therefore, with probability > 1 − n^{−9} − (o(1))^{10 log n} > 1 − n^{−8}, one of the increments is zero, which means that the propagation has stopped.
4  Some Notes
In a more elaborate setting, we consider the algorithm with 2 log n initial variables in I0, chosen randomly; it is used on instances in which a respectable number of the equations are perturbed. Some message passing algorithms have been analyzed by spectral techniques (see, e.g., [WY07]), using their computational similarity to the power method for computing the largest eigenvector (i.e., the eigenvector with the largest eigenvalue). It seems difficult, however, to apply this approach directly to algorithms for kLIN with k ≥ 3, since no useful spectral notion is known for hypergraphs. Finally, it should be noted that a polynomial-time algorithm has recently been shown (see [CCF09], Theorem 27) that can be used for much smaller p = Ω(log^{O(1)} n/n^{1.5}).
Acknowledgement. This research was supported in part by the JSPS Global COE program "Computationism as a Foundation for the Sciences".
References

Ale03. Alekhnovich, M.: More on average case vs approximation complexity. In: Proc. FOCS 2003, pp. 298–307. IEEE, Los Alamitos (2003)
CCF09. Coja-Oghlan, A., Cooper, C., Frieze, A.M.: An efficient sparse regularity concept. In: Proc. SODA 2009, pp. 207–216 (2009)
ER61. Erdős, P., Rényi, A.: On the evolution of random graphs. Bull. Inst. Internat. Statist. 38, 343–347 (1961)
Has01. Håstad, J.: Some optimal inapproximability results. J. ACM 48, 798–859 (2001)
MMC98. McEliece, R., MacKay, D., Cheng, J.: Turbo decoding as an instance of Pearl's "Belief Propagation" algorithm. IEEE Journal on Selected Areas in Communications 16(2), 140–152 (1998)
OW09. Onsjö, M., Watanabe, O.: Finding most likely solutions. In: Theory of Computing Systems. On-line version (2009)
Pea88. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco (1988)
SP06. Shannigrahi, S., Pal, S.: Efficient Prüfer-like coding and counting labelled hypertrees. In: Asano, T. (ed.) ISAAC 2006. LNCS, vol. 4288, pp. 588–597. Springer, Heidelberg (2006)
WY07. Yamamoto, M., Watanabe, O.: Belief propagation and spectral methods. Research Report C-248, Dept. of Math. Comput. Sci., Tokyo Inst. of Tech. (2007)
Graph Embedding through Random Walk for Shortest Paths Problems

Yakir Berchenko^1 and Mina Teicher^{1,2}

1 The Leslie and Susan Gonda Multidisciplinary Brain Research Center
2 Dept. of Mathematics, Bar Ilan University, Ramat Gan 52900, Israel
[email protected]
Abstract. We present a new probabilistic technique for embedding graphs in Z^d, the d-dimensional integer lattice, in order to find the shortest paths and shortest distances between pairs of nodes. In our method the nodes of a breadth first search (BFS) tree, starting at a particular node, are labeled as the sites found by a branching random walk on Z^d. After describing a greedy algorithm for routing (distance estimation), which uses the ℓ1 distance (ℓ2 distance) between the labels of nodes, we approach the following question: assuming that the shortest distance between nodes s and t in the graph is the same as the shortest distance between them in the BFS tree corresponding to the embedding, what is the probability that our algorithm finds the shortest path (distance) between them correctly? Our key result comprises the following two complementary facts: (i) by choosing d = d(n) (where n is the number of nodes) large enough, our algorithm is successful with high probability, and (ii) d does not have to be very large – in particular, d = O(polylog(n)) suffices. We also suggest an adaptation of our technique to finding an efficient solution for the all-sources all-targets (ASAT) shortest paths problem, using the fact that a single embedding finds not only the shortest paths (distances) from its origin to all other nodes, but also those between several other pairs of nodes. We demonstrate its behavior on a specific non-sparse random graph model and on real data, the PGP network, and obtain promising results. The method presented here is likely to prove useful less as an attempt to find more efficient solutions for ASAT problems than as the basis for a new approach to algorithms and protocols for routing and communication. In this approach, noise and the resulting corruption of data delivered over various channels might actually be useful when trying to infer the optimal way to communicate with distant peers.

Keywords: Graph embedding, shortest paths problem.
1  Introduction
The problem of finding the shortest paths (or, in a relaxed version, just the shortest distances) between pairs of nodes in a graph has been studied extensively in

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 127–140, 2009.
© Springer-Verlag Berlin Heidelberg 2009
many disciplines. Traditionally, researchers from pure mathematical disciplines have been more interested in the efficiency of the solution. After the optimal O(n^2) solution to the single-source all-targets problem (SSAT) was found, an effort was made to find a better solution to the all-sources all-targets problem (ASAT) than the naive O(n^3) solution obtained by simple repetition of the SSAT solution. A well-known result [1] provided an O(n^γ) solution to the ASAT problem (where γ ≈ 2.37 is the exponent of matrix multiplication). Even if we allow for small errors of size ς, most solutions cost O(n^{2+f(1/ς)}) (where f is some increasing function), thus giving rise to a leading order of n which is > 2 [4]. On the other hand, researchers in other disciplines such as physics and sociology have tried to explain the navigability found in many networks (i.e., the ability to find the shortest paths using local information alone, as in the famous work by Milgram [11]), usually by assuming that the network is embedded a priori in some spatial or geographical domain [7, 9]. Recently, however, both these issues have attracted increased attention, partially due to the interest in efficient distributed algorithms [6, 5, 10, 2]. The work presented here is aimed at studying decentralized routing and solutions for shortest paths problems using local information alone. However, the techniques and results are compelling even for the general case and compare well with known results. The core of our work is the following simple and efficient embedding technique (more details in section 2): given a (simple, connected, non-directed and un-weighted) graph G with n nodes and m edges, we select (possibly at random) a node o and label it as the origin of Z^d, the d-dimensional integer lattice (i.e., by a vector of d zeros). In subsequent steps, using breadth first search, each labeled node, vi, labels its unlabeled neighbors, UN(vi).
This is done by (randomly) assigning to each vk ∈ UN(vi) one of the lattice sites adjacent to the site occupied by vi. Simply stated, the idea is that if we want to deliver a package from s to t, we should simply pass it from s to a node which has the minimal ℓ1 distance to t (more precisely: to the site which t occupies). In addition, if we are asked only for their graph distance (the number of edges in the shortest path from s to t), this can be estimated from their ℓ2 distance. At first glance this scheme, although very intuitive, seems too optimistic even for the SSAT problem, since we have no guarantee that assigning the label o (as the origin) to the single source will lead to efficient and successful routing. Nevertheless, our key result (section 2) demonstrates the following two complementary facts: by choosing d = d(n) large enough we are successful with high probability (w.h.p., i.e., with probability → 1 as n → ∞), and d does not have to be very large – in particular, d = O(polylog(n)) suffices. It could be argued that our O(m + n · polylog(n)) solution, though distributed, does not make a strong contribution to the SSAT problem. However, it is easy to see that after a single embedding we find not only the shortest paths (distances) from o to all other nodes, but also those between several other pairs of nodes (in
particular, pairs of nodes for which the shortest path between them in the graph is the same as the shortest path between them in the resulting BFS tree after the embedding). This observation suggests that we might be able to use our approach to obtain an O(m · polylog(n)) solution for the ASAT problem, which makes it extremely interesting. Indeed, in section 3, which is devoted to the adaptation and application of the technique to the ASAT problem, we present some heuristics showing why it is reasonable to expect "good behavior" and when this is less reasonable. In section 3.1 we briefly study the performance of the method on a particular non-sparse random graph model [15], suggested as a model for complex networks in which degree power laws are observed, and report promising results. We then examine the behavior of the method on a real-world network, the PGP network [3], which could be expected to perform badly. However, our simulation of routing and distance estimation produces very good performance (section 3.2). We conclude and discuss our findings in section 4.
2  Foundations and Fundamental Results
We begin with a few definitions and notations, and formulate our basic assumption (sec. 2.1). It is important to note here that although our method is suitable for distributed implementation, we wish to compare it to the general case, and hence we will not deal with issues such as how to perform synchronous distributed BFS [4]. In section 2.2 we describe the embedding procedure. There are several questions to address, perhaps the most pressing of which is what happens if the same lattice site is assigned to two different nodes. A number of solutions are available, depending on personal inclinations and the researcher's objectives; however, as a corollary of lemma 1, the "throw everything away and start from the beginning" approach still takes only O(m + n · polylog(n)). In sections 2.3-4 we present our key results: theorem 1 concerning the shortest paths, and theorem 2 for finding the shortest distances using the labels of the source and target alone. Although by theorem 1 it is possible to obtain the distances as well, we present a more efficient method (described in theorem 2) which only utilizes the labels of s and t. The motivation for this is the adaptation of the method to ASAT: for a given graph, create a small number, r (where r = o(n^α), α > 0), of embeddings; then, for each pair of nodes, estimate the distance using theorem 2. The main question now is to determine when a small number of BFS trees covers all the shortest paths appropriately, resulting in an O(r · n^2) algorithm for ASAT. This is addressed in section 3 for a particular non-sparse random graph model (sec. 3.1) and for a real-world network, the PGP network (sec. 3.2).

2.1  Definitions and Notations
Our notations are standard; by f(n) = o(g(n)) we mean that as n → ∞ the ratio f(n)/g(n) → 0.

f(n) = O(g(n)) means that as n → ∞ the ratio f(n)/g(n) → C, where 0 ≤ C ≤ constant.

f(n) = Θ(g(n)) means that as n → ∞ the ratio f(n)/g(n) → constant.

f(n) = ω(g(n)) means that as n → ∞ the ratio f(n)/g(n) → C, where constant ≤ C ≤ ∞.
Z^d is the standard d-dimensional integer lattice; however, we say that two sites x := (x1, x2, ..., xd) and y := (y1, y2, ..., yd) are neighbors if and only if they differ by ±1 in each coordinate. Let G be a simple, connected, non-directed and un-weighted graph with n nodes and m edges. If m = m(n) = O(n) we say that G is sparse, if G is not sparse we say that G is non-sparse, and if m = Θ(n^2) we say that G is dense. The distance between a pair of nodes is the number of edges in the shortest path between them. When we refer to the ℓ1 distance between a pair of nodes, we mean that the nodes are labeled as sites x, y in Z^d and we look at Σ_i |xi − yi|. In each case it is clear from the context which definition of "distance" is meant. From now on we assume that the length of the longest shortest path in G (also known as the diameter of G) is O(log(n)), though with a bit of work we can relax this assumption. The largest degree of G, not necessarily bounded, is denoted by Δ.

2.2  The Embedding Procedure
Given a graph G, we select (possibly at random) a node o and label it as the origin of Z^d, the d-dimensional integer lattice (i.e., by a vector of d zeros). In the subsequent steps, using breadth first search, each labeled node labels its unlabeled neighbors. This is done by (randomly) assigning to each one of them one of the 2^d sites on the lattice which are the neighbors of the site, x, occupied by the labeling node. In other words, in each step we add +1 or −1, with equal probability, to every coordinate of x and use the result as the new label for one of the neighbors of the node with label x. As stated earlier, we will not deal here with issues such as how to perform synchronous distributed BFS [4]. Clearly, this procedure takes O(m + d · n) time (and memory); but what if we do not want to allow different nodes to have the same label (duplicates)?

Lemma 1. If d = ω(log^{1+ε}(n)) then the probability of having duplicates → 0 as n → ∞.

Proof. When embedding any graph with n nodes, we are less (or equally) likely to create duplicates than when embedding a node placed at the origin and connected to all n − 1 other nodes, followed by its neighbors. The latter, however, is simply the problem of placing n − 1 balls randomly in 2^d cells. Since (n − 1)^2 = o(2^d), the probability of having a "cell" with two "balls" → 0.

Remark. Note that we can refer to the "resulting BFS tree": each edge (v, u) belongs to the BFS tree iff node v labeled node u or node u labeled node v.
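A small Python sketch of the embedding, together with the greedy ℓ1 routing discussed in the next subsection (hypothetical code, not the authors'; for simplicity, the router here greedily scans all graph neighbors rather than only the BFS-tree neighbors TN(v)):

```python
import random
from collections import deque

def embed(adj, origin, d, rng):
    """BFS embedding into Z^d: the origin gets the all-zeros label; each newly
    discovered node gets its parent's label with +1 or -1 added, independently
    and with equal probability, to every coordinate."""
    labels = {origin: (0,) * d}
    parent = {origin: None}
    queue = deque([origin])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in labels:
                labels[u] = tuple(x + rng.choice((+1, -1)) for x in labels[v])
                parent[u] = v
                queue.append(u)
    return labels, parent

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def greedy_route(adj, labels, s, t, max_hops):
    """Greedy routing: repeatedly move to the neighbor whose label is
    l1-closest to the target's label."""
    path = [s]
    v = s
    for _ in range(max_hops):
        if v == t:
            break
        v = min(adj[v], key=lambda u: l1(labels[u], labels[t]))
        path.append(v)
    return path
```

By construction, labels of adjacent tree nodes differ by exactly 1 in every coordinate, which is the tree-neighbor test |w − v| = (1, ..., 1) used for TN(v).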
2.3  Routing
Is it possible to use the embedding for routing with local information alone? More specifically, is it possible to deliver a (memoryless) package without routing tables, where each node only knows the label of the target and the labels of its immediate neighbors? The affirmative answer utilizes the following simple scheme: when a node v needs to deliver a package to t, it simply passes it to its neighbor w ∈ TN(v) with the minimal ℓ1 distance to t (i.e., minimal Σ_i |wi − ti|), where TN(v) are the neighbors of v in the BFS tree.

Remark 1. The node v "decides" that its neighbor w is in TN(v), i.e., w ∈ TN(v), if |w − v| = (1, 1, ..., 1) coordinate-wise.

Remark 2. It might be beneficial to relax the condition w ∈ TN(v), and indeed this is possible, but for simplicity's sake in proving theorem 1 this is omitted.

Remark 3. The routing scheme, as stated above, does not deal with the case where several neighbors have the minimal ℓ1 distance to t. This can be resolved by choosing one of them randomly; however, as the proof of theorem 1 shows, the probability of this case occurring is negligible.

We now have:

Theorem 1. Given (i) a graph with diameter O(log(n)), (ii) a single random embedding, as described in sec. 2.2, with d = log^α(n) (where α will be determined below), and (iii) a pair of nodes s, t: if the shortest path between s and t in the graph is the same as the shortest path between them in the resulting BFS tree after the embedding, then our routing scheme will find it w.h.p. (i.e., with probability → 1 as n → ∞).

Remark. Naturally, the shortest path between s and t in the graph need not be unique.

Proof. Let P = t, v^1, v^2, v^3, ..., c, u, ..., s be the shortest path between s and t in the graph which is the same as the shortest path between them in the BFS tree. Assume we want to deliver the package from a certain node, u, to t (with possibly u = s).
Assume, without loss of generality, that the label of t consists entirely of zeros, and (without loss of generality) that all the non-zero entries of u are positive. First, we need the following proposition:

Proposition 1. The number of entries of u which are nonzero is w.h.p. Θ(d).

Proof of proposition 1. Since P is part of the BFS tree, every node v^j either labeled its neighbor v^{j+1} or was labeled by it. Thus, the value of a given coordinate (e.g., the first) of a node v^j (i.e., v^j_1) is distributed like the value of a symmetric random walk of length j. Given that it is always true that P(ti − ui = 0) ≤ 0.5,
the Chernoff bound guarantees us with high probability that Θ(d), i.e., C · d, of the coordinates of the vector t − u are nonzero. In fact, it is easy to strengthen proposition 1:
Proposition 2. There is a constant C > 0 such that w.h.p. the labels of every node in P have at least C · d nonzero entries (with the exception of t).
Proof of proposition 2 (sketch). Recall that the length of P is O(log(n)). Using the union bound and the result of proposition 1, proposition 2 follows immediately.
Now, let c be the label of the "correct neighbor", i.e., the node to which u should pass the package en route to t. We are interested in the distribution of its i-th component, c_i. Given the labels of the target node, t, and of the current location, u, what is the probability of a successful delivery to c? In other words, we are actually interested in the conditional distribution of c_i. Since P is part of the BFS tree we can consider u_i and t_i as the starting and finishing points of a symmetric random walk; from Martingale theory we know that

E[u_i − c_i] = (u_i − t_i)/|steps|

(where E[·] is the expectation and |steps| is the number of steps it takes to get from u to t). Thus, for 0 = t_i < u_i we get:

E[u_i − c_i] = ω(1/log(n)).   (1)
Because the random variable (u_i − c_i) is ±1, from (1) we get the next lemma:
Lemma 2. For large enough n and 0 = t_i < u_i there is a constant C̃ > 0 such that

P(c_i < u_i) ≥ (1 + C̃/log(n))/2.
Proof of lemma 2. Consider the following system of two equations:
(i) E[u_i − c_i] = P(u_i − c_i = 1) − P(u_i − c_i = −1)
(ii) P(u_i − c_i = 1) + P(u_i − c_i = −1) = 1
and use (1) to solve it.
Remark. Lemma 2 states that the probability that the value of the i-th coordinate will decrease, provided it belongs to the next node on the correct path (i.e., c), is ≥ 1/2 + C̃/(2 log(n)). Although this seems a humble increment in comparison to the unconditioned probability (1/2), this result, lemma 2, is the key feature of our method and will allow the "gap" between the "right" and "wrong" choices to emerge. Now, in order to pass the package in the correct direction, to c, we simply need to have a relatively large number of "decreasing" coordinates of c, i.e.
Graph Embedding through Random Walk for Shortest Paths Problems
133
having 0 = t_i ≤ c_i < u_i. Let C > 0 be the fraction of nonzero entries of u and let w ∈ TN(u) be a neighbor of u in the BFS tree (w ≠ c). It is easy to see that the number of decreasing coordinates in w (i.e., having 0 = t_i ≤ w_i < u_i) is distributed Binomial B(1/2, Cd) with an average of (1/2)Cd = C log^α(n)/2. Lemma 2 gives the following important lower bound: the number of decreasing coordinates in c is larger than a Binomial B((1 + C̃/log(n))/2, Cd).
What is the probability of passing the package incorrectly to w? Consider the following steps:
a) In the first step, instead of calculating

P(b(1/2, Cd) > b((1 + C̃/log(n))/2, Cd))
we actually calculate:

P( b(1/2, Cd) > Cd/2 + C·C̃·log^{α−1}(n)/4  ∨  b((1 + C̃/log(n))/2, Cd) ≤ Cd/2 + C·C̃·log^{α−1}(n)/4 ),

where ∨ stands for OR, and b(◦, •) stands for the value of a random variable distributed according to B(◦, •). This means we are actually calculating the probability of being closer to the "other" mean.
b) Next, we replace the expression above, P(• ∨ ◾), with P(•) + P(◾). We estimate P(•) ≈ P(◾) and calculate only P(•). As usual, we discard constants (i.e., the 2).
c) Using the Chernoff bound we get

P(•) = O(exp(−(log^{α−1}(n))²/log^α(n))) = O(exp(−log^{α−2}(n))).

Thus we get that the probability of passing the package to a specific wrong neighbor w ∈ TN(u) at a specific step → 0 if α > 2. When considering all other potential wrong neighbors in that step, it is possible to replace P(•) with ΔP(•); however, even when Δ is unbounded we can use α > 3 and have an adequately fast convergence to 0. After steps a-c we finish by estimating the probability of a successful delivery from u to t as (1 − probability of error in a single step)^{|steps|}. If we use the result of step c with α > 3 we get, as desired, that this probability is bounded below by

(1 − o(1/log(n)))^{log(n)} −→ 1.
2.4 Distance Estimation
Clearly, once the nodes are labeled it is much more costly to find the distance between a pair s-t by finding the shortest path (via theorem 1) than to estimate it directly from their labels. The question is how to estimate the distance from the labels, and how good this estimate is.
The first question is quite easy. If the shortest path between s and t in the graph is the same as the shortest path between them in the resulting BFS tree, then the differences (s_i − t_i) in each coordinate are independent identically distributed random variables (iid r.v.). The distribution of these random variables, RW(1/2, k), is obviously the well studied distribution for the location of a random walker after k steps, starting at the origin and with a probability of 1/2 of jumping right (left) in each step. Since we have N = log^α(n) samples, we can use standard tools of statistical inference to estimate the parameter k. It is very common to see examples dealing with this issue, usually applying the maximum likelihood (ML) method to infer p from a sample distributed RW(p, k_0) when k_0 is known. However, in order to be able to apply the ML method in a simple way, the set of data values which has positive probability should not depend on the unknown parameter. This is obviously not the case for the problem we are studying, and therefore we used the method of moments. The general procedure in the method of moments is as follows: the moments of the empirical distribution are determined (the sample moments), equal in number to the number of parameters to be estimated; they are then equated to the corresponding moments of the probability distribution, which are functions of the unknown parameters; the system of equations thus obtained is solved for the parameters and the solutions are the required estimates [14]. In practice, this means we need to find only the sample variance (which is actually related to the ℓ2 distance). The second question, how good is the estimate which utilizes the method of moments, is not much more difficult.
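Since the variance of RW(1/2, k) after k steps equals k, the moment estimator reduces to the sample variance of the coordinate differences. A minimal sketch, ours rather than the paper's code; the synthetic labels below are generated directly as random-walk differences instead of by an actual embedding:

```python
import random

def estimate_distance(s_label, t_label):
    # Method of moments: each coordinate difference s_i - t_i is distributed
    # RW(1/2, k), whose variance equals the distance k, so the sample
    # variance of the differences is the moment estimator of k.
    diffs = [si - ti for si, ti in zip(s_label, t_label)]
    mean = sum(diffs) / len(diffs)
    return sum((x - mean) ** 2 for x in diffs) / len(diffs)

# Synthetic check (our construction): draw the d coordinate differences
# directly as symmetric random walks of length k = 7; the estimate
# should concentrate near 7.
random.seed(0)
d, k = 100_000, 7
s = [sum(random.choice((-1, 1)) for _ in range(k)) for _ in range(d)]
t = [0] * d
```

With d = N samples the standard deviation of the estimate scales like 1/√N, matching the discussion below.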
Under fairly general conditions the method of moments allows one to find estimators that are asymptotically normal, have a mathematical expectation that differs from the true value of the parameter only by a quantity of order 1/N and a standard deviation of order 1/√N. Here we only need the fact that our estimator for the variance is concentrated near the true value of the variance, i.e., the variance of the estimator of the variance is o(1). This can be shown, for example, by adapting Chapter 5, sec. 10 of [8]. And now we have:
Theorem 2. Given i. a graph with diameter O(log(n)), ii. a single random embedding, as described in sec. 2.2, with d = log^α(n) (α > 3 for brevity), and iii. a pair of nodes s, t. If the shortest distance between s and t in the graph is the same as the shortest distance between them in the resulting BFS tree after the embedding, then our distance estimation scheme will find it with probability → 1 as n → ∞.
Proof. Denote by ξ the variance of the estimator of the variance. Since ξ = o(1) we can write ξ = O(1/f(n)) for some increasing f(n). Recalling that k is an integer (and in addition we know whether it is odd or even) entails that our estimator need only have an absolute error of less than 1 in order for our scheme to succeed. The probability of the complement is P(|estimator − true variance| ≥ 1).
But now, using Chebyshev's inequality we get:

P(|estimator − true variance| ≥ 1) ≤ ξ = O(1/f(n)) = o(1).

3 Adaptation and Application for ASAT
So far we have presented an O(m + n · polylog(n)) solution to the SSAT problem. This result, despite the possibility of implementing it in a distributed manner, is not too interesting. However, it is easy to see that after a single embedding we find not only the shortest paths (distances) from the source to all other nodes, but also between several other pairs of nodes (in particular, pairs of nodes whose shortest path in the graph is the same as their shortest path in the BFS tree). Consider, for example, the two extreme scenarios: if G is a tree then a single embedding suffices in order to find all the shortest paths and distances, while if G is the complete graph then a single embedding will not help us to discover the shortest distances, but it will have an error of size only 1, and it will help us in routing. The next, third, example is especially interesting: if G is a cycle and we perform a single embedding and choose two nodes s and t at random, then it is easy to see that with probability ≥ 1/2 the shortest path between s and t in the graph is the same as the shortest path between them in the BFS tree, and we find the path/distance precisely. In addition, it is also easy to see that in case of error, the larger the error, the less likely it is. These observations and examples suggest that we might be able to use our approach to obtain an O(m · polylog(n)) solution for the ASAT problem, which makes it extremely interesting for dense graphs, by the following procedure: for a given graph G create a small number r (e.g., r = log(n)) of embeddings starting at random nodes; then, for each pair of nodes, estimate the distance r times using theorem 2 and take the minimum. The main question now is when a small number of BFS trees will cover all the shortest paths appropriately, resulting in an O(r · n²) algorithm for ASAT. This is addressed below for a particular non-sparse random graph model (sec.
3.1) and tested on a real-world network, the PGP network (sec. 3.2).

3.1 Non-sparse Random Graphs
In this section we rely on the random graph model and results of [15] and explain heuristically why we might conjecture that our algorithm will perform well in this model. The paper by van der Hofstad, Hooghiemstra and Znamenski [15] gives a rigorous derivation for the random fluctuations of the distance between two arbitrary nodes in a graph with i.i.d. degrees with infinite mean. This model with i.i.d. degrees is a variation of the configuration model, which was originally proposed by Newman, Strogatz and Watts [13], where the degrees originate from a given
deterministic sequence (for references and applications of the original configuration model see [12]). Arguably, this model, where the tail of the distribution function of the degrees is regularly varying with exponent τ ∈ (1, 2), may serve as a model for complex networks where degree power laws are observed [15]. Remark: Note, however, that the graphs obtained in this model are not simple. The main result of [15] is that the graph distance converges for τ ∈ (1, 2) to a random variable with a probability mass exclusively on the points 2 and 3. The heuristic explanation for this result is that most of the edges belong to a finite number of nodes which have giant degrees (the so-called giant nodes). A basic fact in the configuration model is that two nodes with large sets of stubs are connected with high probability. Thus the giant nodes are all attached to each other, forming a complete graph of giant nodes. Each stub is attached to a giant node with a probability close to 1, and therefore, the distance between any two nodes is, with high probability, at most 3. In fact this distance equals 2 when the two nodes are connected to the same giant node, and is 3 otherwise. Returning to ASAT we might conjecture:
Conjecture: In the model described above, it is possible to create r = log(n) embeddings and get an O(r · n²) solution to the ASAT distances problem.
The idea is that since there is a finite number of giant nodes (with roughly the same degree) we need only a finite number of trials to hit each one of them (or a non-giant neighbor of it) as the selection for the origin of the embedding. This will guarantee we are covering all the shortest paths with BFS trees. Note that we need r = o(d); otherwise, because we take the minimum of all the estimated distances (per pair), although ξ is o(1), we will underestimate the distance. Although this section is mostly speculative, we feel it would be interesting to check the effect of "simplifying" the graphs in this model, i.e.
erasing loops and double edges. This will probably not affect the results of [15] or the performance of our algorithm; the main question now is: how will it affect the (non-)sparseness of G?

3.2 The PGP Network
The realm of social networks and peer-to-peer networks is a natural place to implement our method, due to their decentralized nature. In order to evaluate the performance of our algorithm on such a network we tested it on the PGP network (its largest connected component). The Pretty-Good-Privacy web of trust (PGP) can be considered as one of the first social networks. Its largest connected component has the following properties: n = 10680, mean degree = 4.6, power law degree distribution, typical length of paths: ∼ 6 − 12, maximal degree = 206 and a so-called ”hierarchical” / ”community” structure, where there are many groups of nodes with many connections inside a group and few connections between groups. Note that the latter two properties should make it more difficult for our method to avoid errors. In the first experiment we tested the results of section 2.3, theorem 1. For this purpose we selected a node randomly, used it as the origin of the embedding
and tried to find the shortest path from it to every other node according to sec. 2.3 with d = 500 < 0.05n ≈ log^{2.5}(n) (with the necessary addition that if it does not find the target within 45 steps it terminates). A typical result is displayed in Figure 1: the x axis gives the index ∈ [1, 10680] of the target node; blue circles mark the true distance from the source, green stars the distance that was found, and red stars the error. It can be seen that the vast majority of packages found the shortest path, and all but four of the errors are zero.
Fig. 1. Routing: the x axis gives the index ∈ [1, 10680] of the target node; blue circles mark the true distance from the source, green stars the distance that was found, and red stars the error. The vast majority of the packages found the shortest path; all but four errors were zero.
In the second experiment we tested the results of section 2.4, theorem 2. Again, we selected a node randomly, used it as the origin of the embedding and tried to find the shortest distances from it to every other node, now according to sec. 2.4, with d = 500 < 0.05n ≈ log^{2.5}(n). Figure 2 shows a typical histogram of the errors: a Gaussian-like distribution with a standard deviation of ∼ 0.7. It is interesting to note that when we ran this experiment with d = 9 (data not shown) the distribution of the error had a standard deviation of ∼ 4, exhibiting the scaling by 1/√N = 1/√d. Our third experiment tested the scheme suggested in section 3. We selected six nodes v1-v6 at random and used five of them (v1-v5) as the origins of five different embeddings, while the sixth, v6, served as the source from which all the distances should be estimated. First we restricted the estimation to use only the labels induced by the embedding with v1 as its origin. The results, as expected, were quite poor, with
Fig. 2. Distance estimation - a typical histogram of the errors, a Gaussian-like distribution with a standard deviation of ∼ 0.7
Fig. 3. Distance estimation from a node which is not an origin of an embedding, for the ASAT problem. Red line - using only the labels induced by a single embedding. The mean error size is ∼ 2 and the standard deviation is ∼ 1.7. Blue line - using labels induced by five different embeddings. Mean error size is ∼ 0.47 and the standard deviation is ∼ 1.
a mean error size of ∼ 2 and a standard deviation of ∼ 1.7 (figure 3: red line). However, when we used all the labels induced by the different embeddings, the result was remarkably better, with a mean error size of ∼ 0.47 and a standard deviation of ∼ 1 (figure 3: blue line); bearing in mind that the distance is an integer, this means we estimated most of the distances accurately.
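The minimum-over-embeddings estimate used in this third experiment can be sketched as follows. This is our illustration: the per-embedding label dictionaries are an assumed encoding, and the synthetic check generates walk differences directly rather than performing actual embeddings.

```python
import random

def estimate_distance(s_label, t_label):
    # Sample variance of the coordinate differences (moment estimator, sec. 2.4).
    diffs = [si - ti for si, ti in zip(s_label, t_label)]
    mean = sum(diffs) / len(diffs)
    return sum((x - mean) ** 2 for x in diffs) / len(diffs)

def asat_estimate(s, t, labelings):
    # labelings: one label dictionary per embedding (r of them);
    # take the minimum of the r per-embedding estimates (sec. 3).
    return min(estimate_distance(L[s], L[t]) for L in labelings)

# Synthetic check (our construction): under the first hypothetical embedding
# the s-t coordinate differences look like walks of length 3, under the
# second, length 9; the minimum should recover the shorter distance.
random.seed(1)
d = 50_000

def walk(k):
    return sum(random.choice((-1, 1)) for _ in range(k))

L1 = {"s": [walk(3) for _ in range(d)], "t": [0] * d}
L2 = {"s": [walk(9) for _ in range(d)], "t": [0] * d}
```

Taking the minimum is what forces the requirement r = o(d) discussed in sec. 3.1: each estimate has o(1) variance, but the minimum of too many of them would be biased downward.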
4 Conclusions and Discussion
We presented a new approach to finding shortest paths and shortest distances in graphs. We focused mostly on our key result that labels of length log^{3+ε}(n) should suffice in order to have effective routing (and distance estimation), and on the interesting idea of using several different embeddings for ASAT. However, despite the promising observations regarding a particular random graph model (sec. 3.1) and a real-world network (sec. 3.2), there are many issues that need to be resolved. Perhaps the most interesting and important is briefly the following: given three randomly selected nodes s, t and u, if we perform an embedding starting at u, what is the probability that the shortest path between s and t in the BFS tree is a shortest path between them in the graph as well? We feel, however, that the method presented here should prove useful not as an attempt to find more efficient solutions for ASAT problems, but rather as the basis for a new approach to algorithms and protocols for routing and communication. In this approach, noise, information theory's public enemy number one, and the resulting corruption of data delivered in various channels, might actually be useful. When trying to infer the optimal way to communicate with distant peers some version of one of our theorems could be adapted and applied. As far as we are aware, this counterintuitive idea has not yet been implemented.
References
[1] Alon, N., Galil, Z., Margalit, O., Naor, M.: Witnesses for Boolean matrix multiplication and for shortest paths. In: Proceedings of the 33rd IEEE Symposium on Foundations of Computer Science, Pittsburgh, PA, pp. 417-426 (1992)
[2] Barriere, L., Fraigniaud, P., Kranakis, E., Krizanc, D.: Efficient Routing in Networks with Long Range Contacts. In: Welch, J.L. (ed.) DISC 2001. LNCS, vol. 2180, pp. 270-284. Springer, Heidelberg (2001)
[3] Boguna, M., Pastor-Satorras, R., Diaz-Guilera, A., Arenas, A.: Models of social networks based on social distance attachment. Phys. Rev. E 70, 056122 (2004)
[4] Elkin, M.: Computing almost shortest paths. In: Proceedings of the 20th ACM Symposium on Principles of Distributed Computing, Newport, RI, pp. 53-63 (2001)
[5] Fraigniaud, P., Gavoille, C.: Polylogarithmic Network Navigability Using Compact Metrics with Small Stretch. In: SPAA 2008: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, pp. 62-69. ACM, New York (2008)
[6] Fraigniaud, P., Gavoille, C., Kosowski, A., Lebhar, E., Lotker, Z.: Universal Augmentation Schemes for Network Navigability: Overcoming the sqrt(n)-Barrier. In: Proceedings of the Nineteenth Annual ACM Symp. on Parallel Algorithms and Architectures, pp. 1-7. ACM, New York (2007)
[7] Kleinberg, J.M.: Navigation in a small world. Nature 406, 845 (2000)
[8] Lass, H., Gottlieb, P.: Probability and statistics. Addison-Wesley, Reading (1971)
[9] Liben-Nowell, D., Novak, J., Kumar, R., Raghavan, P., Tomkins, A.: Geographical routing in social networks. Proceedings of the National Academy of Science 102, 11623-11628 (2005)
[10] Martel, C., Nguyen, V.: Analyzing Kleinberg's (and other) small-world models. In: PODC 2004: Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, pp. 179-188. ACM Press, New York (2004)
[11] Milgram, S.: The small world problem. Psychology Today 2, 60-67 (1967)
[12] Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence. Random Struct. Algorithms 6, 161-179 (1995)
[13] Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64, 026118 (2001)
[14] Prohorov, Y.V., Rozanov, Y.A.: Probability theory, basic concepts. Limit theorems, random processes. Springer, Heidelberg (1969) (Translated from Russian)
[15] van den Esker, H., van der Hofstad, R., Hooghiemstra, G., Znamenski, D.: Distances in random graphs with infinite mean degrees. Extremes 8, 111-140 (2006)
Relational Properties Expressible with One Universal Quantifier Are Testable Charles Jordan and Thomas Zeugmann Division of Computer Science Hokkaido University, N-14, W-9, Sapporo 060-0814, Japan {skip,thomas}@ist.hokudai.ac.jp
Abstract. In property testing a small, random sample of an object is taken and one wishes to distinguish with high probability between the case where it has a desired property and the case where it is far from having the property. Much of the recent work has focused on graphs. In the present paper three generalized models for testing relational structures are introduced and relationships between these variations are shown. Furthermore, the logical classification problem for testability is considered and, as the main result, it is shown that Ackermann’s class with equality is testable. Keywords: property testing, logic.
1 Introduction
Property testing is an application of induction. Given a large object such as a graph or database, we wish to state a conclusion about the entire object after examining a small, randomly selected sample. Lovász [19] has described it as the "third reincarnation" of this approach, after statistics and machine learning. Property testers are probabilistic approximation algorithms that examine only a small part of their input. Our goal is always to distinguish inputs that have some desired property from inputs that are far from having it. We are especially interested in classification, i.e., the testability of large classes of properties. The paper is structured as follows. In Subsection 1.1, we outline the history of testing, focusing on results that influence our approach. Much recent work has focused on graphs, while we seek a general framework that we call relational property testing. Definitions and notation are in Section 2. In Section 3 we show the relationships between variations of our framework. We use the framework from previous sections to state the classification problem for testability in Section 4, and show that Ackermann's class with equality is testable (cf. Theorem 4) in all of the variations considered in previous sections.
Supported by a Grant-in-Aid for JSPS Fellows under Grant No. 2100195209. Supported by MEXT Grant-in-Aid for Scientific Research on Priority Areas under Grant No. 21013001.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 141–155, 2009. c Springer-Verlag Berlin Heidelberg 2009
1.1 History of Property Testing
We begin with a brief history and overview of property testing. There are also a number of surveys of property testing, see for example Fischer [12] or Ron [25]. Property testing is a form of approximation where we trade accuracy for efficiency. Probabilistic machines seem to have been first formalized by de Leeuw et al. [18], who showed that such machines cannot compute uncomputable properties under reasonable assumptions. However, they mention the possibility that probabilistic machines could be more efficient than deterministic machines. An early example of such a result is Freivalds’ [14] matrix multiplication checker. Property testing itself began in program verification (see Blum et al. [8] and Rubinfeld and Sudan [26]). Goldreich et al. [16] first considered the testability of graph properties and showed the existence of testable NP-complete properties. An approach using incidence lists to represent bounded-degree graphs was introduced by Goldreich and Ron [15]. Parnas and Ron [22] generalized this approach and attempted to move away from the functional representation of structures. For other types of structures, Alon et al. [4] showed that the regular languages are testable and that there exist untestable context-free languages. Chockler and Kupferman [11] extended the positive result to the ω-regular languages. There is also recent work on testing properties of (usually uniform) hypergraphs. In particular, Fischer et al. [13] defined a general model that is roughly equivalent to one of our models, namely Tr based on Definition 8, and showed that hypergraph partition problems are testable in this framework. However, much of the recent work has been focused on graphs and Alon and Shapira [6] survey some of the recent results in testing graph properties. Alon et al. [2] began a logical characterization of the testable (graph) properties, see Section 4. 
Alon and Shapira [5] gave a (near) characterization of a natural subclass of the testable graph properties, which Rödl and Schacht [24] generalized to hypergraphs. Alon et al. [3] showed a combinatorial characterization of the graph properties testable with a constant number of queries.
2 Preliminaries
Instead of restricting our attention to, for example, graphs, we focus on property testing in a general setting. We begin by defining vocabularies.
Definition 1. A vocabulary τ is a tuple of distinct predicate symbols R_i together with their arities a_i,

τ := (R_1^{a_1}, . . . , R_s^{a_s}).

Two examples of vocabularies are τ_G := (E^2), the vocabulary of directed graphs, and τ_S := (S^1), the vocabulary of binary strings.
Definition 2. A structure A of type τ is an (s + 1)-tuple

A := (U, R_1^A, . . . , R_s^A),

where U is a finite universe and each R_i^A ⊆ U^{a_i} is a predicate corresponding to the predicate symbol R_i of τ.
We identify U with the non-negative integers {0, . . . , n − 1} and use n = #(A) for the size of the universe of a structure A. The universe U of a binary string is the set of bit positions, which we identify as {0, . . . , n − 1} from left to right. For i ∈ U, we interpret i ∈ S as "bit i of the string is 1."
The set of all structures of type τ and universe size n is STRUC^n(τ) and the set of all structures of type τ is STRUC(τ) := ∪_{0≤n} STRUC^n(τ). A property of type τ is any subset of STRUC(τ). For A ∈ P, we say A has P. We use language to refer to string properties, P to denote properties and P_1\P_2 for set difference.

2.1 Property Testing Definitions
We wish to distinguish, with high probability, between inputs that have a desired property and inputs that are far from having the property. We begin by defining a distance measure between structures. Changing the definition of distance results in a different model for relational testing. The symbol ⊕ is exclusive-or.
Definition 3. Let A, B ∈ STRUC(τ) be any structures such that #(A) = #(B) = n. The distance between structures A and B is

dist(A, B) := Σ_{1≤i≤s} |{x | x ∈ U^{a_i} and R_i^A(x) ⊕ R_i^B(x)}| / Σ_{i=1}^{s} n^{a_i}.
The dist distance is the fraction of assignments on which the two structures disagree. It is equivalent to the definition that would result from mapping relational structures to binary strings and using the usual definitions for testing strings. We now give the remaining definitions for testing, and will then give alternatives to Definition 3 (cf. Definitions 8 and 12).
Definition 4. Let P be a property of structures with vocabulary τ and let A be such a structure with a universe of size n. Then,

dist(A, P) := min_{A′ ∈ P ∩ STRUC^n(τ)} dist(A, A′).
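Definition 3 is straightforward to compute. A minimal sketch, ours rather than the paper's: structures are encoded as dictionaries mapping predicate names to an arity and a set of tuples, which is an assumed representation, not notation from the paper.

```python
def dist(A, B, n):
    # Definition 3: A and B are structures over universe {0, ..., n-1},
    # encoded as {predicate_name: (arity, set_of_tuples)}.  The distance
    # is the fraction of assignments on which the two structures disagree.
    disagreements = sum(len(A[name][1] ^ B[name][1]) for name in A)  # symmetric differences
    assignments = sum(n ** arity for arity, _ in A.values())
    return disagreements / assignments
```

For example, two structures of type (E², S¹) over a universe of size 3 that disagree on one pair of E and two elements of S are at distance 3/(3² + 3) = 0.25.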
Definition 5. An ε-tester for property P is a randomized algorithm given an oracle which answers queries for the universe size and truth values of relations on desired tuples in a structure A. The tester must accept with probability at least 2/3 if A has P and must reject with probability at least 2/3 if dist(A, P ) ≥ ε. Definition 6. Property P is testable if for all ε > 0 there are ε-testers making a number of queries which is upper-bounded by a function depending only on ε. We allow different ε-testers for each ε > 0. The situation is similar to that familiar from circuit complexity (cf. Straubing [30]), where we have uniform and non-uniform cases, see, e.g., Alon and Shapira [7]. Our results hold in both cases and so we will not distinguish between them.
2.2 Logical Definitions
We use a predicate logic with equality that does not contain function symbols. There are no ordering symbols such as ≤ or arithmetic relations such as PLUS. The first-order logic of vocabulary τ is built from the atomic formulas x_i = x_j and R_i(x_1, . . . , x_{a_i}) for variable symbols x_j and predicate symbols R_i ∈ τ by using the Boolean connectives and quantifiers ∃ and ∀ in the usual way. Formula ϕ of vocabulary τ is interpreted as usual and defines property P := {A | A ∈ STRUC(τ) and A |= ϕ}. Lower-case Greek letters ϕ, ψ and γ refer to first-order formulas and x, y, and z to first-order variables. Our classification definitions are from Börger et al. [9] except that we omit function symbols. The following is for completeness, where N = {0, 1, . . .} is the set of natural numbers.
Definition 7. A prefix vocabulary class is specified as [Π, p]_e, where Π is a string over the four-character alphabet {∃, ∀, ∃*, ∀*}, p is either the special phrase 'all' or a sequence over N and the first infinite ordinal ω, and e is '=' or λ. The first-order sentence ϕ := π_1 x_1 π_2 x_2 . . . π_r x_r : ψ in prenex normal form, with quantifiers π_i and quantifier-free ψ, is a member of the prefix vocabulary class given by [Π, (p_1, p_2, . . .)]_e, where p_i ∈ N ∪ {ω}, iff
1. the string π_1 π_2 . . . π_r is contained in the language specified by Π when Π is interpreted as a regular expression;
2. if p is not 'all', at most p_i distinct predicate symbols of arity i appear in ψ;
3. equality (=) appears in ψ only if e is '='.
Here, Π is the pattern of quantifiers, p is the maximum number of predicate symbols of each arity and e determines if the equality symbol is permitted. A prefix class is testable if every formula in it expresses a testable property for every vocabulary in which it is evaluable. An extension of a vocabulary τ is any vocabulary formed by adding a new, distinct predicate symbol to τ.
Lemma 1.
Let ϕ be a formula in the first-order logic of vocabulary τ and let τ′ be any extension of τ. If ϕ defines a property that is testable in the context of τ, then the property of type τ′ defined by ϕ is also testable.
Proof. Let ϕ define property P of type τ and property P′ of type τ′. Assume the "new" predicate symbol in τ′ is N of arity a. Let T_ε^τ be an ε-tester for P. We will show that it is also an ε-tester for P′. Assume A ∈ STRUC(τ′) has property P′. Removing the N predicate, the corresponding A′ ∈ STRUC(τ) has property P and so T_ε^τ accepts with probability at least 2/3, as desired. Assume that dist(A, P′) ≥ ε and again let A′ be the structure of type τ formed by removing the N predicate from A. By the definition of distance,

dist(A′, P) = min_{B∈P} Σ_{1≤i≤s} |{x | x ∈ U^{a_i} and R_i^{A′}(x) ⊕ R_i^B(x)}| / Σ_{i=1}^{s} n^{a_i}
≥ min_{B∈P′} Σ_{1≤i≤s} |{x | x ∈ U^{a_i} and R_i^A(x) ⊕ R_i^B(x)}| / (n^a + Σ_{i=1}^{s} n^{a_i}) = dist(A, P′) ≥ ε.

The tester rejects such A with probability at least 2/3, as desired.
Testable properties remain testable when the vocabulary is extended. So it suffices to consider the minimal relevant vocabulary. A prefix class is untestable if it contains an untestable property. Simple modifications of the proof of Lemma 1 give the corresponding results for the variations considered in the next section.
3 Variations of Relational Property Testing
In Definition 3, any difference in low-arity relations is asymptotically dominated by the number of high-arity tuples. However, there are situations where this is not ideal. Consider (not necessarily admissible, vertex) 3-colored graphs with the vocabulary τ_C := (E^2, R^1, G^1, B^1). We might wish to test if the given coloring is admissible. In large graphs, this is equivalent to testing if the graph is 3-colorable and ignores the given coloring. We need a different model for our task. Here we give two alternate definitions for the distance between structures. In testing we wish to distinguish structures that have a desired property and those that are far from the property, and so modifying the definition of distance changes the task of testing. As in Definition 3, the symbol ⊕ is exclusive-or.
Definition 8. Let A, B ∈ STRUC^n(τ) be structures. Then, the r-distance is

rdist(A, B) := max_{1≤i≤s} |{x | x ∈ U^{a_i} and R_i^A(x) ⊕ R_i^B(x)}| / n^{a_i}.
While Definition 3 gave equal weight to each tuple regardless of its arity, the above gives equal weight to each relation. However, loops in graphs and other subtypes of relations are similar to low-arity relations.
Definition 9. Let R be a relation with arity a. The subtypes of R are the partitions of {1, . . . , a}.
For example, {{1}, {2}} is a subtype of the edge predicate E for graphs. This corresponds to the set of pairs of E for which the element in position 1 of the pair occurs only in position 1 and the element in position 2 occurs only in position 2. That is, this subtype is the set of edges that are not loops. The subtype {{1, 2}} corresponds to the set of loops. This is more formally defined as follows.
Definition 10. Let R be a relation with arity a and S be a subtype of R, i.e.,

S = {{t_1^1, . . . , t_{b_1}^1}, . . . , {t_1^{|S|}, . . . , t_{b_{|S|}}^{|S|}}}.

Tuple (x_1, . . . , x_a) belongs to S if for all t_1^i it is the case that x_{t_1^i} = x_{t_j^i} for all j and, if x_u = x_v for some u, v, then u and v occur in the same element of S.
We define the S-distance between structures, for a subtype S of relation R_i.
Definition 11. Let A, B ∈ STRUC^n(τ) be structures with universe U, and let S be a subtype of relation R_i ∈ τ. Then, the S-distance between A and B is

S-dist(A, B) := |{x | x ∈ U^{a_i}, x belongs to S and R_i^A(x) ⊕ R_i^B(x)}| / n^{|S|}.
146
C. Jordan and T. Zeugmann
If the S-dist for all subtypes of all relations is small, no query has a high probability of finding a difference. Denote the set of subtypes of R by SUB(R).

Definition 12. For A, B ∈ STRUC^n(τ), the mrdist is

  mrdist(A, B) := max_{1 ≤ i ≤ s} max_{S ∈ SUB(R_i)} S-dist(A, B).
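Definitions 9-12 can be prototyped in the same style; the partition generator and the structure encoding below are our own scaffolding:

```python
from itertools import product

def partitions(positions):
    """All partitions of a list of positions (the subtypes of Definition 9)."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for p in partitions(rest):
        for i in range(len(p)):           # put `first` into an existing block
            yield p[:i] + [p[i] + [first]] + p[i + 1:]
        yield [[first]] + p               # or into a block of its own

def belongs(x, S):
    """Definition 10 (0-indexed): components agree exactly within blocks of S."""
    if any(len({x[p] for p in block}) != 1 for block in S):
        return False
    reps = [x[block[0]] for block in S]
    return len(reps) == len(set(reps))    # distinct blocks carry distinct values

def mrdist(A, B, arities, n):
    """Definition 12: maximize the S-distance over all subtypes of all relations."""
    best = 0.0
    for R, a in arities.items():
        for S in partitions(list(range(a))):
            diff = sum(1 for x in product(range(n), repeat=a)
                       if belongs(x, S) and ((x in A[R]) != (x in B[R])))
            best = max(best, diff / n ** len(S))   # normalize by n^{|S|}
    return best

# One differing loop: a vanishing fraction of all n^2 pairs, but 1/n of the loops.
arities = {"E": 2}
A = {"E": {(0, 0), (0, 1)}}
B = {"E": {(0, 1)}}
print(mrdist(A, B, arities, 4))  # the loop subtype gives 1/4
```

The example repeats the motivation for subtypes: a single differing loop is an o(1) fraction of all n² pairs, but a 1/n fraction of the n possible loops.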
We let T be the set of testable properties using the dist definition, Tr be the set of testable properties using the rdist definition and Tmr be the set of testable properties using the mrdist definition. It is easy to show the following.

Theorem 1. Let τ be a vocabulary and A, B ∈ STRUC^n(τ). Then, dist(A, B) ≤ rdist(A, B) ≤ mrdist(A, B).

Assume a tester distinguishes between structures A having some property P and those for which mrdist(A, P) ≥ ε. Theorem 1 trivially implies that it also distinguishes between structures A that have P and those for which rdist(A, P) ≥ ε. The case with rdist and dist is analogous, which proves the following.

Corollary 1. Tmr ⊆ Tr ⊆ T.

Of course it is always desirable to show that such containments are strict. We show the separations by encoding the following language of binary strings, where rev(u) denotes the usual reversal of string u.

Theorem 2 (Alon et al. [4]). The language L = {u rev(u) v rev(v)}, where u and v are strings over {0, 1}, is not testable with complexity o(√n).

In some vocabularies, e.g., binary strings and loop-free graphs, all three definitions are equivalent. However, we will show the following.

Theorem 3. Tmr ⊂ Tr ⊂ T.

Proof. The inclusions are by Corollary 1 and so only the separations remain. We first show that T \ Tr is not empty. We use the vocabulary τC := (E², S¹). We will show P1 ∈ T \ Tr, where P1 ⊆ STRUC(τC) is the set of structures where the S assignments encode the language L of Theorem 2. That is, A has P1 if there is some 0 ≤ k ≤ n/2 such that for all 0 ≤ i < k, S(i) is true iff S(2k−1−i) is true and for all 0 ≤ j < (n − 2k)/2, S(2k + j) is true iff S(n − 1 − j) is true. The property uses only the low-arity relation S; the E relation is for "padding" to make P1 testable under the dist definition for distance. We first show that P1 is in T. A structure with a universe of odd size cannot have P1. A tester can begin by checking the parity of n and rejecting if it is odd and so we assume in the following that the size of the universe is even.

Lemma 2. Property P1 is testable under the dist definition for distance.
Relational Properties Expressible with One ∀ Quantifier Are Testable
147
Proof. For any (even) n, 1^n is of the form u rev(u) v rev(v). Given A, we create A′ by changing all S(i) assignments to be true. This involves at most n modifications and so dist(A, P1) ≤ dist(A, A′) = O(n)/Θ(n²) < ε, where the final inequality holds for sufficiently large n. Let N(ε) be the smallest value of n for which it holds. The following is an ε-tester for P1, where the input has universe size n.
1. If n < N(ε), query all assignments and output whether the input has P1.
2. Otherwise, accept.

If A has P1, we accept with zero error. If dist(A, P1) ≥ ε, then n < N(ε). In this case we query all assignments and reject with zero error. ∎ (Lemma 2)
It remains to show that P1 is not testable when using the rdist definition for distance. We do this by showing that it would contradict Theorem 2.

Lemma 3. Property P1 is not testable under the rdist definition for distance.

Proof. Suppose there exist Tr-type ε-testers T_ε for all ε > 0. We will show that the following is an ε-tester using Definition 3 for the language L of Theorem 2. Let the input be w, a binary string of length n.

1. Run T_ε and intercept all queries.
2. When a query is made for S(i), return the value of S(i) in w.
3. When a query is made for E(i, j), return 0.
4. Output the decision of T_ε.
We run T_ε on the A ∈ STRUC^n(τC) that agrees with w on S and where all E assignments are false. If w ∈ L, then any such A has property P1 and so our tester accepts with probability at least 2/3. Assume dist(w, L) ≥ ε. Then, rdist(A, P1) = dist(w, L) ≥ ε and so our tester rejects with probability at least 2/3. These are testers for the untestable language of Theorem 2, and so P1 is untestable under the rdist definition. ∎ (Lemma 3)

Lemmata 2 and 3, together with Corollary 1, show Tr ⊂ T. The separation Tmr ⊂ Tr is shown in a similar way, using a property with sufficient "padding" to make Tr testing simple while Tmr testing would contradict Theorem 2. An example is the property of graphs where the "loops" E(i, i) encode the language from Theorem 2. We omit the details due to space. ∎ (Theorem 3)
There exist properties that are testable in the rdist sense but not in the mrdist sense. However, the definition of subtypes and Tmr testability allow for a simple mapping between vocabularies such that rdist-testability of certain classes of properties implies mrdist-testability of the same classes. For these classes, proving testability in the rdist sense is equivalent to proving it in the mrdist sense, and so it suffices to use whichever definition is more convenient. Lemma 4 is given in the context of the classification problem for first-order logic, but it is not difficult to prove similar results in other contexts. Our main result is the testability of Ackermann's class with equality, which is of the form required by the lemma. A formula is testable if the property it defines is testable.
Lemma 4. Let C := [Π, all]= be a prefix vocabulary class. Then, C is testable in the rdist sense iff it is testable in the mrdist sense.

Proof. Recalling Theorem 3, Tmr testability implies Tr testability. We prove Tr testability of such prefix classes implies Tmr testability using Lemma 5. In the following, S(n, k) is the Stirling number of the second kind.

Lemma 5. Let C = [Π, (p_1, p_2, . . .)]= be a prefix vocabulary class and let q_j = Σ_{i ≥ j} p_i S(i, j). If C′ = [Π, (q_1, q_2, . . .)]= is Tr testable, then C is Tmr testable.

Proof. Let ϕ ∈ C be arbitrary and assume that the predicate symbols of ϕ are {R_1^1, R_2^1, . . . , R_{p_1}^1, R_1^2, . . .}, where the arity of R_j^i is i. We construct a ϕ′ ∈ C′ and show that Tr testability of ϕ′ implies Tmr testability of ϕ. In ϕ′ we will use a distinct predicate symbol for each subtype of each R_j^i in ϕ. A subtype S of R_j^i such that |S| = k is a partition of the integers {1, . . . , i} into k non-empty sets and so there are S(i, k) such subtypes. We therefore require a total of q_k distinct predicate symbols of arity k. For example, we will map the "loops" in a binary predicate E to a new monadic predicate and the non-loops to a separate binary predicate. Formally, we let t map the subtypes of a predicate to the sets of tuples comprising the subtypes. For our example of a binary predicate, (0, 1) ∈ t({{1}, {2}}) and (0, 0) ∈ t({{1, 2}}). Next, we let r be a bijection from the subtypes of predicates to their new names, the predicate symbols that we will use in ϕ′. We create ϕ′ by modifying ϕ. Replace all occurrences of R_j^i(x_1, . . . , x_i) with

  ∨_{S ∈ SUB(R_j^i)} [ (x_1, . . . , x_i) ∈ t(S) ∧ r(S, R_j^i)(y) ].

Note that (x_1, . . . , x_i) ∈ t(S) is an abbreviation for a simple conjunction, e.g., x_1 ≠ x_2 ∧ x_1 ≠ x_3 ∧ · · · . Likewise, y is an |S|-ary tuple, formed by removing the duplicate components of (x_1, . . . , x_i). The implicit mapping from (x_1, . . . , x_i) is invertible given S. To continue our example of a binary predicate E, we would replace all occurrences of E(x, y) in ϕ with [x = y ∧ E_1(x)] ∨ [x ≠ y ∧ E_2(x, y)]. We assume that ϕ′ is Tr testable, and so there exists an ε-tester T_ε for it. We run this tester and intercept all queries. For a query to r(S, R_j^i)(y), we return the value of R_j^i(x_1, . . . , x_i). This is possible because r is a bijection, and so we can retrieve S and R_j^i using its inverse. Then, we can reconstruct the full i-ary tuple (x_1, . . . , x_i) from y and S.
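The arity counts q_k required by Lemma 5 are straightforward to compute; the following sketch (function names ours) uses the standard recurrence for Stirling numbers of the second kind:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind S(n, k): partitions of an
    n-element set into exactly k non-empty blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def image_arities(p):
    """q_j = sum_{i >= j} p_i * S(i, j), where p[i-1] counts the i-ary
    predicate symbols of the original vocabulary (as in Lemma 5)."""
    m = len(p)
    return [sum(p[i - 1] * stirling2(i, j) for i in range(j, m + 1))
            for j in range(1, m + 1)]

# A single binary predicate E (p_1 = 0, p_2 = 1) splits into one monadic
# symbol (the loops, subtype {{1,2}}) and one binary symbol ({{1},{2}}).
print(image_arities([0, 1]))  # [1, 1]
```

For a single ternary predicate the split is [1, 3, 1]: one monadic symbol for the fully collapsed subtype, three binary ones, and one ternary one.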
The tester implicitly defines a map¹ from structures A, which we wish to test for ϕ, to structures A′, which we can test for ϕ′. Given an A |= ϕ, the corresponding A′ |= ϕ′ and so T_ε will accept with probability at least 2/3. We map each subtype S to a distinct predicate symbol with arity |S|. Therefore, for any structures A, B, the implicit mapping to A′, B′ is such that mrdist(A, B) = rdist(A′, B′). For an A such that mrdist(A, P = {B | B |= ϕ}) ≥ ε, we simulate T_ε on an A′ such that rdist(A′, P′ = {B′ | B′ |= ϕ′}) ≥ ε. The tester T_ε rejects with probability at least 2/3, as desired. ∎ (Lemma 5)
Proving Tr testability for [Π, all]= implies proving it for all (q_1, . . .) that are "images" of some (p_1, . . .), and so Lemma 5 is stronger than required. ∎ (Lemma 4)
4 The Classification Problem for Testability
Here we consider the classification problem of first-order logic for testability, inspired by the classification problem for decidability and results in testability such as those by Alon et al. [2]. The goal is a complete classification of the prefix vocabulary classes of first-order logic into testable and untestable classes. We first outline the traditional classification problem, focusing on results with parallels to results in testing. See Börger et al. [9] for the complete classification and proofs. Then we prove the testability of Ackermann's class with equality.

4.1 Classification Similarities
Löwenheim [20] proved the decidability of monadic first-order logic, and McNaughton and Papert [21] showed that it (with ordering and some arithmetic) characterizes the star-free regular languages. The testability of this logic is then implied by a result of Alon et al. [4]. Using instead Büchi's [10] result that monadic second-order logic characterizes the regular languages, the parallel is with Skolem's [28] extension of Löwenheim's result to second-order logic. Skolem [29] showed that [∀∗∃∗, all] is a reduction class. Alon et al. [2] found an untestable graph property (an encoding of graph isomorphism) expressible in [∀∗∃∗, (0, 1)]=, a class close enough to Skolem's [29] to be interesting. Alon et al. [2] also proved that [∃∗∀∗, (0, 1)]= is testable. The class [∃∗∀∗, all]= is known as Ramsey's class and its decidability was shown by Ramsey [23].

¹ Explicitly, map A to an A′ with the same universe size, where y ∈ r(S, R_j^i) in A′ if (x_1, . . . , x_i) ∈ R_j^i in A. Note that we have not yet defined the assignments of tuples y with duplicate components. By construction, the assignments of these tuples do not affect ϕ′ and so any reasonable convention will do. We define that z ∉ Q, where z is any tuple with at least one duplicate component and Q is any predicate symbol in ϕ′. The resulting map is injective but not necessarily surjective.
5 Ackermann's Class with Equality
Above, we saw several similarities between the classifications for decidability and testability. Here we give an additional example: [∃∗∀∃∗, all]=. Ackermann [1] proved the decidability of this class without equality. If we allow equality and a unary function symbol, the result is Shelah's class, which Shelah [27] proved decidable. Unlike the decidable classes above, Shelah's class does not have the finite model property and it would be interesting to determine if it is testable. Kolaitis and Vardi [17] showed the satisfiability problem for Ackermann's class with equality is complete for NEXPTIME and that a 0-1 law holds for existential second-order logic where the first-order part belongs to [∃∗∀∃∗, all]=.

The main goal of this section is Theorem 4. Recalling Theorem 3, this also implies that such properties are testable in the dist and rdist senses. If the vocabulary consists of a single relation, the rdist and dist definitions are equivalent to the dense hypergraph model. We therefore obtain the corresponding results in the dense hypergraph and dense graph models as special cases. We denote the set of monadic predicate symbols in a vocabulary τ by M := {R_i | R_i ∈ τ and a_i = 1}. The set of assignments of the symbols in M for an element in a universe is called the color of the element and there are 2^{|M|} possible colors. We define Col(A, c) to be the set of colors that occur at least c times in A.

Theorem 4. All formulas in [∃∗∀∃∗, all]= define properties that are in Tmr.

Proof. Ackermann's class with equality is [∃∗∀∃∗, all]= and so it suffices to show the testability of property P of type τ = (R_1^{a_1}, . . . , R_s^{a_s}) defined by formula ϕ := ∃x_1 . . . ∃x_a ∀y ∃z_1 . . . ∃z_b : ψ, where ψ is quantifier-free. We can trivially test any ϕ that has only finitely-many models with a constant number of queries and zero error, and so it suffices to assume that ϕ has infinitely-many models.
The class [∃∗∀∃∗, all]= is of the form required by Lemma 4 and so it is mrdist-testable iff it is rdist-testable. It therefore suffices to show that P is testable in the rdist sense. We will show that the following is an ε-tester in the rdist sense for P on input A ∈ STRUC^n(τ). Here, k := k(τ, ε) is the number of elements queried and N := N(ϕ, τ, ε) is a constant, both of which are determined below. Note the actual number of queries in step 2 is not exactly k, but rather a constant multiple of it depending on τ. Finally, we explicitly give κ := κ(ϕ, τ) below.

1. If n < N, query all of A and decide exactly whether A has P.
2. Uniformly and independently choose k members of the universe of A and query all monadic predicates on the members in this sample.
3. Search over all A′ ∈ STRUC^κ(τ). Accept if we find an A′ such that A′ |= ϕ and the colors in our sample are a subset of Col(A′, a + 1).
4. Otherwise, reject.

We must show that the tester accepts if A |= ϕ and rejects if rdist(A, P) > ε, with probability at least 2/3 in both cases. We do this by showing that with probability at least 2/3, we get a "good" sample in step 2 (cf. Lemma 6).
A “good” sample is one that contains all colors of A that occur on at least an ε/(2 · 2|M| ) fraction of the elements of A and no colors that occur on at most a elements. We then show that the tester is correct if it obtains a good sample. Lemma 6. There is a constant k such that, with probability at least 2/3, the tester obtains a sample that contains all colors that occur at least εn/(2 · 2|M| ) times in A and no colors that occur at most a times. Proof. The probability that any particular query misses a fixed color that occurs on at least an ε/(2 · 2|M| ) fraction of A is at most (1 − ε/(2 · 2|M| )). Moreover, the probability that we miss such a fixed color after k independent queries is at most (1 − ε/(2 · 2|M| ))k . There are at most 2|M| such colors, and so the probability of our sample containing at least one representative of all such colors is at least
k 2|M | ε/2 1 − 1 − |M| . 2
The |M | is a constant, and we take k such that this probability is at least 2/3. Next, the probability of a particular query seeing a fixed color that occurs at most a times is at most a/n and the probability that k independent queries miss it is at least (1 − a/n)k . There are at most 2|M| such colors, and so the 2|M | probability that we miss all of them is at least (1 − a/n)k . The k and |M | are now constant, and we let N be such that for n > N this probability is at least 2/3. 2 Lemma 6
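The constant k of Lemma 6 can be found by direct numeric search on the displayed bound; the function below (names ours) takes the required success probability as a parameter:

```python
def sample_size(eps, num_monadic, target):
    """Smallest k with (1 - (1 - eps/(2*2^|M|))^k)^(2^|M|) >= target,
    i.e. enough queries that every frequent color is sampled."""
    miss = 1 - eps / (2 * 2 ** num_monadic)   # per-query miss probability
    k = 1
    while (1 - miss ** k) ** (2 ** num_monadic) < target:
        k += 1
    return k

# With |M| = 2 monadic predicates (4 colors) and eps = 0.1:
k = sample_size(0.1, 2, 2 / 3)
print(k)  # a constant depending on eps and tau only, not on n
```

Since only monadic predicates are queried in step 2 of the tester, the query complexity stays a constant depending on ε and τ alone.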
The probability of a “good” sample is at least 2/3 = 2/3. We now show that if A |= ϕ, the tester will accept if it obtains a good sample. We begin with Lemma 7. Lemma 7. Let A be a model of ϕ such that #(A) > N and let s ai ai ai −j + 2|M| (a + 1). κ := a + 3b a + 2 i=1 j=1 ( j )a Then, there is an A |= ϕ such that #(A ) = κ and Col(A, a+1) ⊆ Col(A , a+1). Proof. Assume that N > κ. The structure A is a model of ϕ, and so there exists at least one tuple of a elements (u1 , . . . , ua ) such that ϕ is satisfied when the existential quantifiers bind ui to xi . We consider the xi and the substructure induced by them to be fixed, and refer to this substructure as Ax . s ai ai ai −j j=1 ( j )a i=1 There are at most κ2 := a + 2 many distinct structures constructed by adding an element labeled y to Ax when we include the structures where the label y is simply placed on one of the xi . We let v ≤ κ2 be the number of such structures that occur in A and assume there is an enumeration of them. For each of these v substructures there exist b elements, w1 , . . . , wb , such that when we label wi with zi , the structure induced by (x1 , . . . , xa , y, z1 , . . . , zb )
models ψ. We construct A_{i,j} for 1 ≤ i ≤ 3 and 1 ≤ j ≤ v such that A_{i,j} is a copy of the w_1, . . . , w_b used for the j-th structure. We connect each A_{i,j} to A_x in the same way as in A, modifying assignments on tuples in (A_x ∪ A_{i,j})^{a_k}. For each w_h in A_{i,j}, we consider the case where y is bound to w_h. By construction the substructure induced by (x_1, . . . , x_a, y) occurs in A. We assume it is the k-th structure and use the elements of A_{i+1 mod 3, k} to construct a structure satisfying ψ. We modify the assignments of tuples as needed to create a structure identical to that in A satisfying ψ. Note that by construction all of these assignments are of tuples that contain w_h and at least one element from A_{i+1 mod 3, k}. The resulting structure, which we call A_1, is a model of ϕ. Before this step we have not modified any assignments "spanning" the "rows" A_{i,j} of A_1 and so there are no assignments that we modify more than once. However, there may be some color from Col(A, a+1) that does not appear a+1 times in A_1. We therefore add a new block, denoted A_e, of at most 2^{|M|}(a + 1) elements which consists of a + 1 copies of each color from Col(A, a + 1). Each of these colors occurred at least a + 1 times in A, and so for each such color C, there is an element q in A with color C such that q is not part of A_x. If the substructure induced by (A_x, q) in A is the j-th structure in our enumeration, then we do the following for each member p of A_e that has the same color as q. First, we make the substructure induced by (A_x, p) identical to that induced by (A_x, q) in A. Next, we make the substructure induced by (p, A_{1,j}) identical to that induced by q and the corresponding z_i in A. All of these modifications are on tuples containing a p ∈ A_e and so we do not modify any tuples more than once. We call this structure A_2. Finally, so far we only have an upper bound on the size of A_2 while the lemma states it to be exactly of size κ.
We therefore pad in the following simple way². We know that N > κ > 2^{|M|} a and so there is a color that occurs at least a + 1 times in A. If #(A_2) < κ, we simply make an additional κ − #(A_2) many copies of this color in A_e and modify the assignments of tuples containing these new elements in the same manner as above. The resulting A′ has size κ and satisfies the requirements of the lemma. ∎ (Lemma 7)
For structures A such that #(A) > N, the colors in a good sample are a subset of Col(A, a + 1). If A |= ϕ and our tester obtains a good sample, then Lemma 7 implies that our tester will find an A′ satisfying the conditions of step 3 and will therefore accept. The tester obtains a good sample with probability at least 2/3, and so the tester accepts such A with at least the same probability. Next, assume that rdist(A, P) ≥ ε. In this case we must show that the tester rejects with probability at least 2/3. It is easiest to show the contrapositive: if the tester accepts with probability strictly greater than 1/3, then rdist(A, P) < ε. If we accept a structure A with probability strictly greater than 1/3, then we must accept it when we obtain a good sample. We construct a B such that B |= ϕ and rdist(A, B) < ε from the A′ that the tester must find to accept. We begin with Lemma 8, which we will use to "grow" smaller models.
² One could instead change the tester to search structures with size at most κ.
Lemma 8. Let ϕ := ∃x_1 . . . ∃x_a ∀y ∃z_1 . . . ∃z_b : ψ be a formula with vocabulary τ where ψ is quantifier-free and A ∈ STRUC(τ) be such that A |= ϕ. Additionally, let B ∈ STRUC(τ) be any structure containing A as an induced substructure such that #(B) = #(A) + 1. If the additional element of B has a color that occurs at least a + 1 times in A, then we can construct a B′ |= ϕ by modifying at most a constant number of non-monadic assignments in B.

Proof. B contains an induced copy of A and one additional element, which we will denote by q. By assumption, A is a model of ϕ and therefore contains an a-tuple (u_1, . . . , u_a) such that the formula is satisfied when x_i is bound to u_i. In addition, there are at least a + 1 elements in A that have the same color as q. Therefore, there is at least one such element p that is not one of the u_i. We will make q equivalent to p without modifying any monadic assignments. We first modify the assignments needed to make the structure induced by (x_1, . . . , x_a, q) identical to that induced by (x_1, . . . , x_a, p). This requires at most Σ_{i=1}^{s} Σ_{j=1}^{a_i} (a_i choose j) a^{a_i − j} = O(1) modifications, all of which are non-monadic. There must be (v_1, . . . , v_b) in A such that ψ is satisfied when z_i is bound to v_i and y to p. We modify the assignments needed to make the structure induced by (q, v_1, . . . , v_b) identical to that induced by (p, v_1, . . . , v_b)³. This requires at most Σ_{i=1}^{s} Σ_{j=1}^{a_i} (a_i choose j) b^{a_i − j} = O(1) modifications, all of which are non-monadic. The result has #(A) + 1 elements, models ϕ and was constructed from B by making a constant number of modifications to non-monadic assignments.
∎ (Lemma 8)

Let A be the structure that the tester is running on and A′ be the structure found in step 3 of the tester. As mentioned above, we will construct a B from A′ such that B |= ϕ and rdist(A, B) < ε. Note that there must exist at least one color in Col(A, εn/(2 · 2^{|M|})) and assume that N is large enough that εn/(2 · 2^{|M|}) ≥ a + 1. We first make a constant sized portion of A identical to A′. This requires at most O(1)-many modifications to each relation. All colors in Col(A, εn/(2 · 2^{|M|})) occur at least a + 1 times in A′, allowing us to recursively apply Lemma 8 and add the elements of A that have colors in Col(A, εn/(2 · 2^{|M|})). This entails making O(1)-many modifications to non-monadic relations (and none to monadic relations) at each step, for a total of O(n) modifications to the non-monadic relations. Finally, we consider the elements of A that have colors occurring at most εn/(2 · 2^{|M|}) times. There are at most 2^{|M|} such colors and at most εn/2 elements with these colors. We change the monadic assignments on such elements as required to give them colors contained in Col(A, εn/(2 · 2^{|M|})). This requires at most εn/2 modifications to each of the monadic assignments. We again recursively apply Lemma 8 to A, making O(1) modifications to non-monadic assignments at each step. The resulting structure is B and is such that B |= ϕ. We now show that rdist(A, B) < ε. If R_i is a monadic relation, then the i-th term of the maximum in Definition 8 is at most ε/2 + o(1). If R_i has arity at least two, then the i-th term of the maximum is O(n)/Ω(n²) = o(1). All o(1) terms can be made arbitrarily small by choosing N(ϕ, τ, ε) appropriately and
³ The case where v_i = p can be handled by replacing v_i with q in (q, v_1, . . . , v_b).
so we can assume that all terms are strictly less than ε. The maximum is then strictly less than ε and so rdist(A, B) < ε as desired. ∎ (Theorem 4)
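The O(1) bounds in Lemma 8's proof are counts of tuples that involve the new element at least once; a quick sanity check (helper name ours) uses the identity Σ_{j≥1} (a_i choose j) c^{a_i − j} = (c+1)^{a_i} − c^{a_i}:

```python
from math import comb

def touched_tuples(arities, c):
    """Sum over relations of the number of a_i-tuples over a (c+1)-element
    set that use the new element at least once."""
    return sum(sum(comb(a, j) * c ** (a - j) for j in range(1, a + 1))
               for a in arities)

# One binary and one ternary relation, against a = 3 fixed elements:
print(touched_tuples([2, 3], 3))  # (4^2 - 3^2) + (4^3 - 3^3) = 7 + 37 = 44
```

The bound depends only on the vocabulary and the quantifier prefix, which is what makes each application of Lemma 8 a constant-cost step.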
6 Conclusion
We considered a generalization of property testing which we call relational property testing. In Section 3 we showed the relationships between variations of our definitions. The “best” definition depends on the problem in question. Relational databases are perhaps the most obvious example of massive structures where it would be promising to consider applications of property testing. Relational property testing is a natural way to characterize this problem. In addition, properties of databases are often given by queries written in formal languages such as SQL and so it is very natural to consider the testability of properties expressible in various syntactic restrictions of formal languages. Finally, we used our framework to discuss the classification problem for testability in Section 4, inspired by the classical problem for decidability. The major result of Section 4 is the testability of Ackermann’s class with equality, in each of the variations of relational property testing that we considered. This implies the corresponding result in the dense graph and hypergraph models.
References

[1] Ackermann, W.: Über die Erfüllbarkeit gewisser Zählausdrücke. Math. Annalen 100, 638–649 (1928)
[2] Alon, N., Fischer, E., Krivelevich, M., Szegedy, M.: Efficient testing of large graphs. Combinatorica 20(4), 451–476 (2000)
[3] Alon, N., Fischer, E., Newman, I., Shapira, A.: A combinatorial characterization of the testable graph properties: It's all about regularity. In: STOC 2006: Proc. 38th Ann. ACM Symp. on Theory of Comput., pp. 251–260. ACM, New York (2006)
[4] Alon, N., Krivelevich, M., Newman, I., Szegedy, M.: Regular languages are testable with a constant number of queries. SIAM J. Comput. 30(6), 1842–1862 (2001)
[5] Alon, N., Shapira, A.: A characterization of the (natural) graph properties testable with one-sided error. In: Proc. 46th Ann. IEEE Symp. on Foundations of Comput. Sci., FOCS 2005, Washington, DC, USA, pp. 429–438. IEEE Comput. Soc., Los Alamitos (2005)
[6] Alon, N., Shapira, A.: Homomorphisms in graph property testing. In: Klazar, M., Kratochvíl, J., Loebl, M., Matoušek, J., Thomas, R., Valtr, P. (eds.) Topics in Discrete Mathematics. Algorithms and Combinatorics, vol. 26, pp. 281–313. Springer, Heidelberg (2006)
[7] Alon, N., Shapira, A.: A separation theorem in property testing. Combinatorica 28(3), 261–281 (2008)
[8] Blum, M., Luby, M., Rubinfeld, R.: Self-testing/correcting with applications to numerical problems. J. of Comput. Syst. Sci. 47(3), 549–595 (1993)
[9] Börger, E., Grädel, E., Gurevich, Y.: The Classical Decision Problem. Springer, Heidelberg (1997)
[10] Büchi, J.R.: Weak second-order arithmetic and finite automata. Z. Math. Logik Grundlagen Math. 6, 66–92 (1960)
[11] Chockler, H., Kupferman, O.: ω-regular languages are testable with a constant number of queries. Theoret. Comput. Sci. 329(1–3), 71–92 (2004)
[12] Fischer, E.: The art of uninformed decisions. Bulletin of the European Association for Theoretical Computer Science 75, 97–126 (2001); Columns: Computational Complexity
[13] Fischer, E., Matsliah, A., Shapira, A.: Approximate hypergraph partitioning and applications. In: Proc. 48th Ann. IEEE Symp. on Foundations of Comput. Sci., FOCS 2007, pp. 579–589. IEEE Comput. Soc., Los Alamitos (2007)
[14] Freivalds, R.: Fast probabilistic algorithms. In: Becvar, J. (ed.) MFCS 1979. LNCS, vol. 74, pp. 57–69. Springer, Heidelberg (1979)
[15] Goldreich, O., Ron, D.: Property testing in bounded degree graphs. Algorithmica 32, 302–343 (2002)
[16] Goldreich, O., Goldwasser, S., Ron, D.: Property testing and its connection to learning and approximation. J. ACM 45(4), 653–750 (1998)
[17] Kolaitis, P.G., Vardi, M.Y.: 0-1 laws and decision problems for fragments of second-order logic. Inf. Comput. 87(1–2), 302–338 (1990)
[18] de Leeuw, K., Moore, E.F., Shannon, C.E., Shapiro, N.: Computability by probabilistic machines. In: Shannon, C., McCarthy, J. (eds.) Automata Studies, pp. 183–212. Princeton University Press, Princeton (1956)
[19] Lovász, L.: Some mathematics behind graph property testing. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds.) ALT 2008. LNCS (LNAI), vol. 5254, p. 3. Springer, Heidelberg (2008)
[20] Löwenheim, L.: Über Möglichkeiten im Relativkalkül. Math. Annalen 76, 447–470 (1915)
[21] McNaughton, R., Papert, S.: Counter-Free Automata. M.I.T. Press, Cambridge (1971)
[22] Parnas, M., Ron, D.: Testing the diameter of graphs. Random Struct. Algorithms 20(2), 165–183 (2002)
[23] Ramsey, F.P.: On a problem of formal logic. Proc. London Math. Soc. 30(2), 264–286 (1930)
[24] Rödl, V., Schacht, M.: Property testing in hypergraphs and the removal lemma. In: STOC 2007: Proc. 39th Ann. ACM Symp. on Theory of Comput., pp. 488–495. ACM, New York (2007)
[25] Ron, D.: Property testing. In: Rajasekaran, S., Pardalos, P.M., Reif, J.H., Rolim, J. (eds.) Handbook of Randomized Computing, vol. II, pp. 597–649. Kluwer Academic Publishers, Dordrecht (2001)
[26] Rubinfeld, R., Sudan, M.: Robust characterizations of polynomials with applications to program testing. SIAM J. Comput. 25(2), 252–271 (1996)
[27] Shelah, S.: Decidability of a portion of the predicate calculus. Israel J. Math. 28(1–2), 32–44 (1977)
[28] Skolem, T.: Untersuchungen über die Axiome des Klassenkalküls und über Produktations- und Summationsprobleme, welche gewisse Klassen von Aussagen betreffen. Videnskapsselskapets skrifter, I. Mat.-natur kl. (3), 37–71 (1919)
[29] Skolem, T.: Logisch-kombinatorische Untersuchungen über die Erfüllbarkeit oder Beweisbarkeit mathematischer Sätze nebst einem Theorem über dichte Mengen. Videnskapsselskapets skrifter, I. Mat.-natur kl. (4), 1–26 (1920)
[30] Straubing, H.: Finite Automata, Formal Logic, and Circuit Complexity. Birkhäuser, Boston (1994)
Theoretical Analysis of Local Search in Software Testing Andrea Arcuri Simula Research Laboratory, P.O. Box 134, Lysaker, Norway
[email protected]
Abstract. The field of search based software engineering lacks theoretical foundations. In this paper we theoretically analyse local search algorithms applied to software testing. We consider an infinitely large class of software that has an easy search landscape. Although the search landscape is easy, the software can be arbitrarily complex and large. We prove that Hill Climbing asymptotically has a strictly better runtime than Random Search. However, we prove that a variant of Hill Climbing that is very fast on software of reasonable size actually does not scale up properly. Although that variant has an exponential runtime, we prove that asymptotically it is still better than Random Search. We show that even on the easiest software testing problems, more sophisticated algorithms than local search are still required to get better performance.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 156–168, 2009.
© Springer-Verlag Berlin Heidelberg 2009

1 Introduction

Although there has been a lot of research on search based software engineering in recent years (e.g., in software testing [1]), there exist few theoretical results [2]. The only exceptions we are aware of are on computing unique input/output sequences for finite state machines [3, 4], the application of the Royal Road theory to evolutionary testing [5, 6], and our work on the Triangle Classification problem [7, 8, 9] and on the length of test sequences [10]. To get a deeper understanding of the potential and limitations of the application of search algorithms in software engineering, it is essential to complement the existing experimental research with theoretical investigations. Runtime Analysis is an important part of this theoretical investigation, and brings the evaluation of search algorithms closer to how algorithms are classically evaluated. The goal of analysing the runtime of a search algorithm on a problem is to determine, via rigorous mathematical proofs, the time the algorithm needs to find an optimal solution. In general, the runtime depends on characteristics of the problem instance, in particular the problem instance size. Hence, the outcome of runtime analysis is usually expressions showing how the runtime depends on the instance size. This will be made more precise in the next sections. In this paper we theoretically analyse search based test data generation for an infinitely large class of software. This class is independent of the used programming language, and it can be of any arbitrary size and complexity. However, the resulting fitness functions for the search algorithms yield easy search landscapes (a more precise
Theoretical Analysis of Local Search in Software Testing
157
definition in the next sections). In other words, we practically analyse many types of software that should be easy to handle with search algorithms. In this paper we analyse Random Search (RS) and two variants of Hill Climbing (HC0 and HC1). RS is commonly used in literature of software testing as a baseline for comparisons. Although the testing problems should be easy, we theoretically prove some unexpected and non-obvious results. We prove that HC is asymptotically faster than RS. However, for small size the fastest algorithm is HC1, but we prove that its runtime is exponential. This is a clear example of a search algorithm that performs well in practical applications but that would not scale up for larger software. Although the runtime of HC1 is exponential, we prove that it is still strictly better than the one of RS. We also carried out an empirical analysis to support the theoretical findings. The main contributions of the paper are: – We give first runtime analyses of search algorithms for test data generation for an infinitely large class of software. This class represents programs that are supposed to be easy to handle with search algorithms. However, the programs in this class can be of any arbitrary complexity and size. – We formally prove that two variants of local search are asymptotically faster than RS. – We formally prove that although HC1 is the fastest algorithms we analysed for small size of software, it does have an exponential runtime in the number of inputs, whereas the runtime of HC0 is polynomial. Considering that the search landscape is supposed to be easy, the fact that HC1 has an exponential runtime is not intuitive. The paper is organised as follows. Section 2 gives background information on runtime analysis. A description of search based software testing follows in Section 3. Section 4 describes the class of software we use in our analyses. 
The analysed search problem is described in Section 5, whereas the employed search algorithms are discussed in Section 6. The theoretical analysis follows in Section 7, whereas the empirical one is in Section 8. A discussion of the obtained results follows in Section 9. Finally, Section 10 concludes the paper.
2 Runtime Analysis

To make the notion of runtime precise, it is necessary to define time and size. We defer the discussion on how to define problem instance size for software testing to the next section, and define time first. Time can be measured as the number of basic operations in the search heuristic. Usually, the most time-consuming operation in an iteration of a search algorithm is the evaluation of the cost function. We therefore adopt the black-box scenario [11], in which time is measured as the number of times the algorithm evaluates the cost function.

Definition 1 (Runtime [12, 13]). Given a class F of cost functions fi : Si → ℝ, the runtime TA,F(n) of a search algorithm A is defined as

TA,F(n) := max {TA,f | f ∈ F with ℓ(f) = n} ,
A. Arcuri
where ℓ(f) is the problem instance size, and TA,f is the number of times algorithm A evaluates the cost function f until the optimal value of f is evaluated for the first time.

A typical search algorithm A is randomised. Hence, the corresponding runtime TA,F(n) will be a random variable. The runtime analysis will therefore seek to estimate properties of the distribution of the random variable TA,F(n), in particular the expected runtime E[TA,F(n)] and the success probability Pr[TA,F(n) ≤ t(n)] for a given time bound t(n). More details can be found in [8].
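The black-box measure of runtime can be made concrete with a small sketch: a wrapper that counts cost-function evaluations and records the count at the first evaluation of an optimal solution. All names here are our own illustration, not from the paper.

```python
import random


class CountingCostFunction:
    """Wraps a cost function and counts how often it is evaluated.

    In the black-box scenario, the runtime T_{A,f} of algorithm A on f
    is the value of `evaluations` the first time an optimal solution
    (cost 0 in this sketch) is evaluated.
    """

    def __init__(self, f):
        self.f = f
        self.evaluations = 0
        self.first_hit = None  # evaluation count at the first optimum

    def __call__(self, x):
        self.evaluations += 1
        value = self.f(x)
        if value == 0 and self.first_hit is None:
            self.first_hit = self.evaluations
        return value


# Example: random search on f(x) = |x - 3| over {0, ..., 7}
random.seed(0)
f = CountingCostFunction(lambda x: abs(x - 3))
while f.first_hit is None:
    f(random.randrange(8))
# f.first_hit now holds the runtime of this single run
```

Averaging `first_hit` over many seeded runs estimates the expected runtime E[TA,F(n)].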
3 Search Based Software Testing

In this section, we briefly describe how search algorithms can be applied to test data generation for the fulfilment of white box criteria, in particular branch coverage. Further details can be found in [1]. The objective is to find a set of test cases that execute all the branches in the code. In other words, we want each predicate in the source code (e.g., in if and loop statements) to be evaluated at least once as true and once as false. For each branch, we do a search to find the inputs that execute it. The search space of candidate solutions is defined by the inputs. For example, if the function takes as input one 32 bit integer, then the search space is composed of 2^32 different inputs. To apply search algorithms to this test data generation problem, we need to define a fitness function f. For simplicity, let us say that we need to minimise f. Let us assume that a test case t is composed of only a set of inputs I, i.e. f(t) = f(I) (in general a test case could be more complex, because it could contain a sequence of function calls). If the target branch is covered when a test case t is executed, then f(t) = 0, and the search is finished. Otherwise, f should heuristically assign a value that tells us how far the branch is from being covered. That information is exploited by the search algorithm to reward “better” solutions. The most famous heuristic is based on two measures: the approach level A and the branch distance δ [1]. The measure A is used to see in the control flow graph how close a computation comes to executing the node on which the target branch depends. The branch distance δ heuristically evaluates how far a predicate is from obtaining its opposite value. For example, if the predicate is true, then δ tells us how far the input I is from an input that would make that predicate false. For a target branch z, we have that the fitness function fz is:

fz(I) = Az(I) + ω(δw(I)) .
Note that the branch distance δ is calculated on the node of diversion, i.e. the last node in which a critical decision (not taking the branch w) is made that makes the execution of z impossible. For example, branch z could be nested within a node N (in the control flow graph) in which branch w represents the then branch. If the execution flow reaches N but the else branch is taken, then N is the node of diversion for z. The search hence gets guided by δw to enter the nested branches. The branch distance used is defined in Table 1.
Table 1. Example of how to apply the function δ to some predicates. k can be any arbitrary positive constant value. A and B can be any arbitrary expressions, whereas a and b are the actual values of these expressions based on the values in the input set I.

Predicate θ   Function δθ(I)
A             if a is TRUE then 0 else k
A = B         if abs(a − b) = 0 then 0 else abs(a − b) + k
A ≠ B         if abs(a − b) ≠ 0 then 0 else k
A < B         if a − b < 0 then 0 else (a − b) + k
A ≤ B         if a − b ≤ 0 then 0 else (a − b) + k
A > B         δB<A(I)
A ≥ B         δB≤A(I)
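The distance function of Table 1 translates almost directly into code. The following sketch is our own rendering (the constant `K` and the function name are illustrative); `a` and `b` are the runtime values of the expressions A and B.

```python
K = 1.0  # the arbitrary positive constant k of Table 1


def branch_distance(op, a, b=None):
    """Branch distance delta for a single predicate, following Table 1.

    Returns 0 when the predicate already holds, otherwise a value that
    grows with how far the inputs are from satisfying it.
    """
    if op == "bool":                  # predicate is just a boolean A
        return 0 if a else K
    if op == "==":
        return 0 if a == b else abs(a - b) + K
    if op == "!=":
        return 0 if a != b else K
    if op == "<":
        return 0 if a - b < 0 else (a - b) + K
    if op == "<=":
        return 0 if a - b <= 0 else (a - b) + K
    if op == ">":                     # A > B is handled as B < A
        return branch_distance("<", b, a)
    if op == ">=":                    # A >= B is handled as B <= A
        return branch_distance("<=", b, a)
    raise ValueError(op)
```

For example, `branch_distance("==", 5, 8)` returns `4.0` (i.e. abs(5 − 8) + k with k = 1), so an input making the two sides closer gets a lower, better distance.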
4 Class of Software

In this paper we consider a class C0 of software for our analyses. Programs in C0 are single-thread functions that take as input any arbitrary number v of integer variables. The return values of these functions can be anything. The result of the computation of these programs should not be affected by external entities (e.g., the operating system and networks). The programming language can be any that it is possible to address with search based techniques (e.g., Java and C/C++). Given a branch to cover, there will be ρ predicates that need to be satisfied, one for each node of diversion in the control flow graph. Let us consider them in disjunctive normal form. In general, each of these predicates is composed of d disjunctions ∨. Each disjunction is composed of c conjunctions ∧. The basic components of these conjunctions are called clauses. We choose to consider the same number of disjunctions and conjunctions for each predicate to simplify the notation. In particular, for the class C0 analysed in this paper, we consider only d = 0. The internal code of the programs in C0 can be of any arbitrary complexity. However, the relations between the ρ predicates and the v inputs are simple. The outcome of each clause can depend on only one of the input variables, and v = ρ · (c + 1), i.e. there is an input variable for each clause. Each clause i has the following form:

gi(xi) op ci ,   (1)

where ci is a constant, and xi is the input on which the clause depends. The operator op can be: =, ≠, >, ≥, <, ≤. The function gi represents the value used in that clause that is based on the computation of the program. The computation of gi(xi) can
be of any arbitrary complexity, but in the class C0 we consider only strictly monotonic functions, i.e. x > y ⇔ gi(x) > gi(y). There are further constraints on the values that ci can assume, but those will be explained in the next section. Let z be the number of clauses involving {=, <, ≤}. From v = ρ · (c + 1) it follows that z ≤ v. To simplify the notation, we consider a program with lower v as smaller, although actually there is no direct connection between the size of software (measured in lines of code) and v.
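For concreteness, a hypothetical member of C0 with ρ = 1 predicate, c = 1 conjunction, and hence v = 2 inputs could look as follows. The monotonic functions g0, g1 and the constants are made up for illustration and are not from the paper.

```python
def target_branch(x0, x1):
    """A made-up program of class C0: one predicate with two clauses,
    each clause depending on exactly one input through a strictly
    monotonic function g_i, compared against a constant c_i = g_i(k)."""
    g0 = lambda x: 3 * x + 1    # strictly monotonic
    g1 = lambda x: 2 * x - 5    # strictly monotonic
    c0, c1 = g0(4), g1(9)       # constants c_i = g_i(k) with 0 < k < n - 1

    if g0(x0) == c0 and g1(x1) <= c1:   # the target branch
        return True                      # branch covered
    return False
```

Here both clauses use operators from {=, <, ≤}, so z = 2; the branch is covered, e.g., by the input (4, 9).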
5 Search Problem

The programs in the class C0 take integer variables as input. In search based software testing, it is common to put constraints on the range of the input variables to speed up the search. However, once constraints are put in place, there is no guarantee that any global optimum lies within the restricted search space. For each integer variable, we consider the range of values R = {0, . . . , n − 1}. Therefore, the search space S is composed of |S| = n^v elements. The size of the problem is hence defined by two variables. For simplicity, n is a power of two, i.e. n = 2^t where t is a positive integer. If no global optimum lies in the constrained search space, no search algorithm would ever find an optimal solution. Therefore, we are interested in studying the cases in which n is big enough that at least one global optimum exists. Studying the runtime of the search algorithms as a function of the value n helps to understand the consequences of choosing too high a value for n. To guarantee that at least one global optimum exists, regarding Equation 1 we just need to ensure that for each clause we have gi(0) < ci < gi(n − 1). Furthermore, for the class C0 let us consider ci = gi(k) where 0 < k < n − 1 (this guarantees that clauses involving equalities can be solved).
6 Search Algorithms

In this paper we consider three different algorithms, i.e. RS, HC0 and HC1. In RS we just sample random solutions with uniform probability. HC0 and HC1 are steepest ascent local search algorithms. They start from a random solution, and then they evaluate all the solutions in their neighbourhood. The algorithms move to the best neighbour. If no better neighbour solution is found, a restart from a new random solution is made. In HC0 the neighbourhood is defined by adding ±1 to each variable. Hence, the neighbourhood has size 2v. In HC1 we consider a bit-string representation. Neighbours are defined by flipping a single bit. Therefore, the neighbourhood size is v log2 n. HC1 flips bits from the leftmost (i.e., the most significant) to the rightmost. If an optimal solution is found, HC1 stops before completing the visit of the entire neighbourhood.
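The behaviour of HC1 can be sketched as follows. This is our own simplification (representation and tie-breaking details are assumptions): a steepest-ascent climber on a bit string of v · log2(n) bits, scanning the neighbourhood from the most significant bit, with a random restart whenever no neighbour improves.

```python
import random


def hc1(fitness, v, t, max_evals=100_000):
    """Steepest-ascent hill climbing on a bit string of v variables,
    each encoded with t = log2(n) bits (most significant bit first).
    Restarts from a random solution when stuck in a local optimum.
    Returns (solution_bits, number_of_fitness_evaluations)."""
    evals = 0
    while evals < max_evals:
        x = [random.randrange(2) for _ in range(v * t)]   # random restart
        fx = fitness(x); evals += 1
        if fx == 0:
            return x, evals
        improved = True
        while improved:
            improved = False
            best_i, best_f = None, fx
            for i in range(v * t):            # leftmost bit first
                x[i] ^= 1                     # flip one bit
                fy = fitness(x); evals += 1
                if fy == 0:                   # optimum found: stop early
                    return x, evals
                if fy < best_f:
                    best_i, best_f = i, fy
                x[i] ^= 1                     # undo the flip
            if best_i is not None:            # move to the best neighbour
                x[best_i] ^= 1
                fx = best_f
                improved = True
        # no strictly better neighbour: loop back and restart
    return None, evals
```

With a fitness such as w(x) = |decode(x) − k| for a single clause with operator =, this sketch reproduces the kind of behaviour analysed in Theorem 3, including the occasional restart from a local optimum.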
7 Theoretical Runtime Analysis

We first analyse the runtime of the three search algorithms on single clauses, i.e. v = 1. Then we consider the cases for v > 1.

Lemma 1. Given g global optima in a search space of |S| elements, the expected time for RS to find an optimal solution is E[TRS] = |S|/g.

Proof. The probability of sampling an optimal solution is p = g/|S|. The behaviour of RS can therefore be described as a Bernoulli process, where the probability of getting a global optimum for the first time after t steps is geometrically distributed, Pr[TRS = t] = (1 − p)^(t−1) · p. Hence, the expected time for RS to find a global optimum is E[TRS] = 1/p = |S|/g.

Theorem 1. For the class C0 when v = 1, for the clauses involving {≠, >, ≥} the expected runtime of RS is Θ(1). For the other cases {=, <, ≤} its runtime is Θ(n).

Proof. For {≠, >, ≥}, there is only a constant number t of solutions that are not optimal, because ci is a constant in Equation 1 (in particular we have t = 1 for the operator ≠). The probability of sampling a global optimum is (n − t)/n. By Lemma 1 we have the expected runtime Θ(1). For {=, <, ≤}, there is only a constant number k of solutions that are globally optimal, because ci is a constant in Equation 1 (in particular we have k = 1 for the operator =). The probability of sampling a global optimum is k/n. By Lemma 1 we have the expected runtime Θ(n).

Theorem 2. For the class C0 when v = 1, for the clauses involving {≠, >, ≥} the expected runtime of HC0 is Θ(1). For the other cases {=, <, ≤} its runtime is Θ(n).

Proof. For {≠, >, ≥}, there is only a constant number t of solutions that are not optimal, because ci is a constant in Equation 1 (in particular we have t = 1 for the operator ≠). The probability of sampling a global optimum is (n − t)/n. Because gi(xi) is strictly monotonic, in at most 2(t + 1) iterations HC0 reaches a global optimum.
Therefore, the expected runtime is at most ((n − t)/n) + ((2t^2 + 2t)/n) = Θ(1). For {=, <, ≤}, there is only a constant number k of solutions that are globally optimal, because ci is a constant in Equation 1 (in particular we have k = 1 for the operator =). The probability of sampling a global optimum is k/n. Because gi(xi) is strictly monotonic, in at most 2n iterations HC0 reaches a global optimum. Therefore, the expected runtime is bounded from above by (k/n) + 2n((n − k)/n) = O(n). With probability 1/2, the starting point x is higher than n/2. Therefore, the expected runtime is bounded from below by (k/n) + (1/2)(2(n/2 − k + 1)) = Ω(n). In conclusion, the expected runtime is Θ(n).

Theorem 3. For the class C0 when v = 1, for the clauses involving {≠, >, ≥} the expected runtime of HC1 is Θ(1). For the other cases {=, <, ≤} its runtime is Θ((log2 n)^2).
Proof. For {≠, >, ≥}, there is only a constant number t of solutions that are not optimal, because ci is a constant in Equation 1 (in particular we have t = 1 for the operator ≠). The probability of sampling a global optimum is (n − t)/n. Because ci = gi(k) is a constant and we are considering the asymptotic behaviour of HC1 with respect to the variable n, we can consider its values for n > 2k. In these cases, the leftmost bit of these non-optimal solutions is necessarily set to 0. At the first iteration, HC1 flips that bit, and the resulting neighbour is necessarily optimal. Therefore, the expected runtime is (n − t)/n + t/n = Θ(1).

For ci = gi(k), let j be the position of the leftmost bit of k that is set to 1 (counting from the right, where the rightmost bit has position j = 0). It follows that k ≤ 2^(j+1) − 1. For {<, ≤}, the input for which each bit on the left side of j is set to 0 (j included) is an optimal solution. Because gi(xi) is strictly monotonic, flipping to 0 any of these bits yields a better neighbour solution. In particular, by the branch distance in Table 1, the flipping of the leftmost 1 bits is preferred over the others (i.e., it results in better fitness because the distance is reduced more). Therefore, the expected runtime is bounded from above by log2 n · (log2 n − j − 1) = O((log2 n)^2). With probability 1/2, at least half of the bits before j are set to 1. Therefore, the expected runtime is bounded from below by (1/2)(log2 n · ((1/2)(log2 n − j − 1))) = Ω((log2 n)^2). In conclusion, the expected runtime is Θ((log2 n)^2).

For the operator =, we need to obtain the same bits as k. Because gi(xi) is strictly monotonic, we can map the distance function in Table 1 to the function w(x) = |x − k|, where x is the input we want to calculate the fitness of. With probability 1/2, at least half of the bits before j are set to 1. Therefore, the expected runtime is bounded from below by (1/2)(log2 n · ((1/2)(log2 n − j − 1))) = Ω((log2 n)^2).
Calculating an upper bound for the operator = requires a few more steps. Let t = j + 1. With probability bounded from below by a constant, we have x ≥ 2^(t+1). As long as x > k, we can flip bits only from 1 to 0 (otherwise the fitness value would increase, because flipping a 0 bit in position p gives |x − k + 2^p| > |x − k|). Flipping bits on the left side of t has preference. Consider 1 bits in positions a > t and b ≤ t; it follows that x ≥ 2^a + 2^b. Flipping the 1 bit in position a is preferred over b because |x − 2^a − k| < |x − 2^b − k|. If x − 2^a − k > 0, then:

|x − 2^a − k| = x − 2^a − k < x − 2^b − k = |x − 2^b − k| ,

otherwise:

|x − 2^a − k| = k − (x − 2^a) ≤ k − 2^b < 2^a − k < 2^a + 2^b − k ≤ x − 2^b − k = |x − 2^b − k| .

Therefore, before accepting the flipping of bits on the right of position t, we need to flip all the 1 bits on the left of t. In the worst case, there are log2 n − t − 1 = O(log2 n) of them. Once they are all flipped to 0, they cannot be flipped back to 1, because the distance from k would increase. There is a constant number of bits on the right side of t. Even if HC1 explores all their possible combinations, this number of combinations would still be a constant. Therefore, in O((log2 n)^2) iterations HC1 reaches either a global or a local optimum.

There can be local optima. For example, in the bit string representation, if bin(k) = 0^α 1 0 1^β for some constants α ≥ 1 and β ≥ 1, then l with bin(l) = 0^α 1 1 0^β would be a local optimum. This can be proved by considering that in this case we have k = 2^β − 1 + 2^(β+1) and l = 2^(β+1) + 2^β. Therefore, l = k + 1 and w(l) = |l − k| = 1. Now,
in l we cannot flip any bit from 0 to 1, because doing so would increase the value of l, and already l > k, so the distance |l − k| would increase. There are only two bits that are set to 1 in bin(l) = 0^α 1 1 0^β. On the one hand, if we flip the rightmost one to 0 we obtain a neighbour solution N1 such that bin(N1) = 0^α 1 0^(β+1) and N1 = 2^(β+1). This neighbour solution cannot be accepted because w(N1) = 2^β − 1 ≥ 1. On the other hand, if we flip the leftmost one to 0 we obtain a neighbour solution N2 such that bin(N2) = 0^(α+1) 1 0^β and N2 = 2^β. Nor can this neighbour solution be accepted, because w(N2) = 2^(β+1) − 1 > 1. However, to prove this theorem we do not need to characterise all the local optima. In fact, with probability bounded from below by a constant (i.e., 1/2^(j+1)), the j + 1 rightmost bits of a random x are equal to the ones of k. Because in these cases the bits on the right of position j are not flipped until all the bits on the left of j are set to 0, in these cases HC1 converges to a global optimum. Because that probability is bounded from below by a constant, HC1 makes on average at most a constant number of restarts (this can in fact be modelled as a Bernoulli process, see the proof of Lemma 1). The expected runtime is hence bounded from above by O((log2 n)^2). Therefore, the expected runtime of HC1 on the operator = is Θ((log2 n)^2).

Theorem 4. For the class C0, the expected runtime of RS is Θ(n^z).

Proof. Each of the v clauses needs to be satisfied (remember that there are no disjunctions, i.e. d = 0). The probability of having all clauses satisfied is ∏ pi, where pi is the probability of sampling a global optimum for the clause i. By Lemma 1 we have the expected runtime Θ(1/(∏ pi)) = Θ(∏(1/pi)). Therefore we can just multiply the expected runtimes of the clauses. By Theorem 1, this leads to the expected runtime Θ(n^z).

Theorem 5. For the class C0, the expected runtime of HC0 is Θ(vzn).

Proof. Each clause is independent of the others.
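The local optimum constructed above (bin(k) = 0^α 1 0 1^β, bin(l) = 0^α 1 1 0^β) can be checked numerically. The choice α = 1, β = 2, giving k = 11 and l = 12 on 5 bits, is ours, purely for illustration:

```python
# k with bin(k) = 0^a 1 0 1^b  (a = 1, b = 2)  ->  01011 = 11
# l with bin(l) = 0^a 1 1 0^b                  ->  01100 = 12
k, l = 0b01011, 0b01100
w = lambda x: abs(x - k)        # distance function w(x) = |x - k|

assert l == k + 1 and w(l) == 1  # l is only one away from the optimum
# flipping any single bit of l strictly increases the distance,
# so l is a local optimum for HC1:
assert all(w(l ^ (1 << i)) > w(l) for i in range(5))
```

In particular, the two neighbours named in the proof behave as stated: N1 = 8 has w(N1) = 3 = 2^β − 1, and N2 = 4 has w(N2) = 7 = 2^(β+1) − 1.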
HC0 does not make any restarts (see the proof of Theorem 2). Although HC0 starts by optimising the variables that appear in predicates that can be executed, all variables are considered when the neighbourhood is visited. By Theorem 2, this means that there are z variables that are optimised in Θ(n) steps and v − z variables that are optimised in Θ(1) steps. Because for each step we need to evaluate 2v neighbour solutions, it follows that the expected runtime is Θ(vzn).

Theorem 6. For the class C0, the expected runtime of HC1 is Θ(vz(log2 n)^2 b^z), where b is a positive constant, b ≥ 1.

Proof. It is possible that HC1 needs to make restarts for the clauses involving the operator = (see the proof of Theorem 3). This depends on the actual values of the constants k in ci = gi(k). However, for a particular clause, the probability that no restart is necessary is bounded from below by a constant 1/p (see the proof of Theorem 3). Given t = ℓz (with 0 ≤ ℓ ≤ 1) the number of clauses for which a restart can be necessary, we have that the expected number of restarts r is bounded in 0 ≤ r ≤ p^t = (p^ℓ)^z. Therefore, there exists a constant b ≥ 1 such that b^z represents the number of runs of HC1 before it reaches a global optimum. However, the actual value of b depends on the internal implementation of the analysed software (i.e., the values ci).
Following the same type of reasoning as in the proof of Theorem 5, by Theorem 3 in HC1 there are z variables that require Θ(log2 n) successful steps to be optimised (the other variables are optimised in Θ(1)), and each step requires the evaluation of v log2 n neighbour solutions. Therefore, the expected runtime is Θ(vz(log2 n)^2 b^z).

Theorem 7. For the class C0, for v = O(z^t) for any constant t ≥ 1, the expected runtime of RS is asymptotically higher than that of HC1 with respect to the variable z, i.e. vz(log2 n)^2 b^z = o(n^z).

Proof. We first prove that n > b. b^z represents the expected number of runs that HC1 makes before reaching a global optimum. Following the proof of Theorem 3, we do not need to make a restart on a clause with probability P ≥ (j + 2)/2^(j+1) = 1/w, where k ≤ 2^(j+1) − 1 ≤ n, ci = gi(k) and w < n. In fact, 1/2^(j+1) is the probability of sampling a particular combination of j + 1 bits, and besides the global optimum k there are at least j + 1 neighbours at a distance of one bit that are not local optima. Following the proof of Theorem 6, there are at most z variables for which we need to make a restart, therefore the probability of not making a restart is P ≥ (1/w)^z. Therefore, there would be at most w^z restarts. Because w < n and b^z ≤ w^z, it follows that b < n. Since n > b, there exists a positive ε such that n = b^(1+ε). Therefore, we just need to show the following (see [14]):

lim_{z→∞} vz(log2 n)^2 b^z / n^z = lim_{z→∞} vz(log2 n)^2 b^z / b^((1+ε)z) = lim_{z→∞} vz(log2 n)^2 / b^(εz) = 0 .
8 Empirical Analysis

To run empirical experiments, we automatically implemented software of class C0. For simplicity, we used gi(x) = x. The constants ci = gi(k) = k are uniformly chosen in 1 ≤ ci < 8. Following [15], for each setting of algorithm and problem instance size, we fitted different models to the observed runtimes using non-linear regression with the Gauss-Newton algorithm. Each model corresponds to a two term expression α · w(n, β) of the runtime, where the model parameters α and β correspond to the constants to be estimated. The residual sum of squares of each fitted model was calculated to identify the model which corresponds best with the observed runtimes. This methodology was implemented in the statistical tool R [16]. For v = 1, the models used are α n^i log(n)^j, where i, j ∈ {0, 1, 2}. For z = v and n constant, the models used are α v^i log(v)^j e^(bv) and α v^i log(v)^j e^(βv), where i, j ∈ {0, 1, 2} and b ∈ {0, 1}. Note that we had to make the distinction between b and β to cope with singular gradient errors (see [16]). For each single clause and search algorithm, we considered 17 different values of n, where n = 2^i and i ranges from 4 to 20. For each of these sizes, we ran 1,000 experiments with different random seeds, for a total of 6 · 3 · 17 · 1000 = 306,000 runs. The resulting best models are shown in Table 2. For z = v and n = 1024, we generated software of class C0 with different values of v. Because the runtimes of RS and HC1 are exponential in v (see Theorems 4 and 6), we
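The model-selection step (fit each candidate model by least squares, keep the one with the smallest residual sum of squares) can be mimicked outside of R. The sketch below is our own illustration with synthetic data; since each candidate model here is linear in its single constant α, α has a closed-form least-squares solution and no Gauss-Newton iteration is needed.

```python
import math
import random

random.seed(0)
ns = [2.0 ** e for e in range(4, 21)]                 # n = 2^4 .. 2^20
# synthetic "observed" runtimes, roughly 0.4 * (log2 n)^2 plus noise
ys = [0.4 * math.log2(n) ** 2 * random.gauss(1.0, 0.05) for n in ns]


def rss(i, j):
    """Least-squares fit of alpha in the model alpha * n^i * (log2 n)^j,
    and the residual sum of squares of that fit."""
    phi = [n**i * math.log2(n)**j for n in ns]
    alpha = sum(p * y for p, y in zip(phi, ys)) / sum(p * p for p in phi)
    return sum((y - alpha * p) ** 2 for p, y in zip(phi, ys))


# pick the candidate model with the smallest residual sum of squares
best = min(((i, j) for i in range(3) for j in range(3)),
           key=lambda ij: rss(*ij))
# with this synthetic data, the (log2 n)^2 model (i, j) = (0, 2) should win
```

The paper's actual fits were done with R's nls routine; this sketch only shows the selection criterion, not the original tooling.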
have been able to run only a limited set of experiments. In particular, we consider only clauses involving the operator = (which is the most problematic one). For RS we only consider values of v less than or equal to 6, whereas for HC0 and HC1 we consider a number of inputs up to v = 16. For each value of v, we ran 1,000 experiments with different random seeds. The average numbers of fitness evaluations that are required to reach a global optimum are shown in Figure 1. The resulting best models are shown in Table 3.
9 Discussion

Table 2. Empirical results compared to the theoretical results for single clauses

Search Algorithm   Operator   Theoretical      Empirical
RS                 =          Θ(n)             1.043n
                   <          Θ(n)             0.378n
                   ≤          Θ(n)             0.013n log2 n
                   ≠          Θ(1)             1.007
                   >          Θ(1)             1.053
                   ≥          Θ(1)             1.040
HC0                =          Θ(n)             0.983n
                   <          Θ(n)             0.974n
                   ≤          Θ(n)             0.989n
                   ≠          Θ(1)             1.008
                   >          Θ(1)             1.231
                   ≥          Θ(1)             1.164
HC1                =          Θ((log2 n)^2)    0.770(log2 n)^2
                   <          Θ((log2 n)^2)    0.404(log2 n)^2
                   ≤          Θ((log2 n)^2)    0.391(log2 n)^2
                   ≠          Θ(1)             1.007
                   >          Θ(1)             1.110
                   ≥          Θ(1)             1.072

Table 2 shows that in all cases but one the correct models were derived. The same happens in Table 3, where for HC1 we get v(log2 v)^2 b^v instead of v^2 b^v. The point is that for values of v in 1 ≤ v ≤ 16, the function (log2 v)^2 is very similar to v (they are
Table 3. Empirical results compared to the theoretical results for class C0 when z = v and n is a constant

Search Algorithm   Theoretical    Empirical
RS                 e^(Θ(v))       e^(2.772v)
HC0                Θ(v^2)         1013v^2
HC1                Θ(v^2 b^v)     43.760v(log2 v)^2 1.724^v
Fig. 1. Comparison of performance of RS, HC0 and HC1 with variable number of inputs (x-axis: Number of Variables; y-axis: Fitness Evaluations). Data were collected from 1,000 runs for each number of variables and search algorithm.
equal for v = 16). So it is reasonable to obtain such a wrong model. This model would likely not have been obtained if more experiments with higher values of v had been carried out. However, because HC1 has an exponential runtime (see Theorem 6), using high values of v is not practical. This shows a clear limitation of empirical analyses applied to search algorithms with exponential runtime. On the other hand, theoretical analyses do not suffer from this limitation. Figure 1 shows that for values v ≤ 5, HC1 is the fastest although its runtime is exponential. This happens because its runtime model has a very low constant compared to the runtime of HC0 (see Table 3). This is a clear example of a search algorithm that would be considered the best if experiments were carried out only on small software, while it actually has an exponential runtime. The results of the empirical analysis in Figure 1 show that both RS and HC1 seem to have an exponential runtime. By considering only the empirical experiments, we could hence think that HC1 is equivalent to RS. However, we have formally proved in Theorem 7 that HC1 is asymptotically better than RS.
10 Conclusion

In this paper we have theoretically derived the runtime of three different search algorithms for test data generation for an infinitely large class of software C0. This class is characterised by the fact that the search landscape is relatively simple. We found that local search algorithms have a better runtime than RS. However, we formally proved the non-obvious result that a variant of local search has an exponential runtime, while it is the fastest among the considered search algorithms for comparatively smaller software, i.e. for small values of v.
Although at the moment we cannot state any relation between C0 and software used in real-world applications, this paper gives results on an infinite number of programs of arbitrary complexity and size, whereas the empirical analyses in the literature are limited to finite sets of programs. This is one advantage of theoretical analysis. However, theoretical analyses can be more difficult to carry out, particularly in cases of complex search landscapes. Therefore, theoretical analyses are not meant to replace empirical investigations. Empirical analyses can give us information on the time search algorithms require to find optimal solutions, but usually they cannot explain the reason why search algorithms behave in a particular way. For example, why does HC1 have an exponential runtime? It would be difficult to answer this question with only an empirical investigation. On the other hand, when it is possible to carry out a theoretical analysis, we can explain the dynamics underlying the search process. The goal in the long term would be to get insight knowledge [17] to design more sophisticated search algorithms that are specialised to solve the specific problem we want to address. In the future, it will be important to extend the theoretical analysis to a broader class of software. For example, we could consider negative values for integer inputs, the use of disjunctions, correlations between input variables, etc. Another important task will be to analyse more sophisticated search algorithms that are commonly used in the literature on search based software testing, such as Simulated Annealing and Genetic Algorithms. Although the theoretical analysis of these algorithms is very complex, results on common combinatorial problems have been appearing in the literature in recent years [18]. HC1 performs very well on single clauses (Theorem 3) but can end up with an exponential runtime for larger software (Theorem 6).
Therefore, it could be possible that more sophisticated search algorithms would be as good as HC1 on single clauses, but would be able to escape from local optima without doing restarts (which is what makes the runtime of HC1 exponential, see the proof of Theorem 6). This would mean that even on the simplest software testing problems we need to analyse more sophisticated techniques than local search.
Acknowledgements The author is grateful to Per Kristian Lehre for insightful discussions.
References [1] McMinn, P.: Search-based software test data generation: A survey. Software Testing, Verification and Reliability 14(2), 105–156 (2004) [2] Lehre, P.K., Yao, X.: Runtime analysis of search heuristics on software engineering problems. Frontiers of Computer Science in China 3(1), 64–72 (2009) [3] Lehre, P.K., Yao, X.: Runtime analysis of (1+1) ea on computing unique input output sequences. In: IEEE Congress on Evolutionary Computation (CEC), pp. 1882–1889 (2007) [4] Lehre, P.K., Yao, X.: Crossover can be constructive when computing unique input output sequences. In: Li, X., Kirley, M., Zhang, M., Green, D., Ciesielski, V., Abbass, H.A., Michalewicz, Z., Hendtlass, T., Deb, K., Tan, K.C., Branke, J., Shi, Y. (eds.) SEAL 2008. LNCS, vol. 5361, pp. 595–604. Springer, Heidelberg (2008)
[5] Harman, M., McMinn, P.: A theoretical and empirical study of search based testing: Local, global and hybrid search. IEEE Transactions on Software Engineering (to appear) [6] Harman, M., McMinn, P.: A theoretical & empirical analysis of evolutionary testing and hill climbing for structural test data generation. In: Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pp. 73–83 (2007) [7] Arcuri, A., Lehre, P., Yao, X.: Theoretical runtime analysis in search based software engineering. Technical Report CSR-09-04, University of Birmingham [8] Arcuri, A., Lehre, P.K., Yao, X.: Theoretical runtime analyses of search algorithms on the test data generation for the triangle classification problem. In: International Workshop on Search-Based Software Testing (SBST), pp. 161–169 (2008) [9] Arcuri, A.: Full theoretical runtime analysis of alternating variable method on the triangle classification problem. In: International Symposium on Search Based Software Engineering (SSBSE), pp. 113–121 (2009) [10] Arcuri, A.: Longer is better: On the role of test sequence length in software testing. Technical Report CSR-09-03, University of Birmingham (2009) [11] Droste, S., Jansen, T., Wegener, I.: Upper and lower bounds for randomized search heuristics in black-box optimization. Theory of Computing Systems 39(4), 525–544 (2006) [12] Droste, S., Jansen, T., Wegener, I.: On the analysis of the (1+1) evolutionary algorithm. Theoretical Computer Science 276, 51–81 (2002) [13] He, J., Yao, X.: A study of drift analysis for estimating computation time of evolutionary algorithms. Natural Computing 3(1), 21–35 (2004) [14] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press and McGraw-Hill (2001) [15] Jansen, T.: On the brittleness of evolutionary algorithms. In: Stephens, C.R., Toussaint, M., Whitley, L.D., Stadler, P.F. (eds.) FOGA 2007. LNCS, vol. 4436, pp. 54–69. 
Springer, Heidelberg (2007) [16] R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2008) ISBN 3-900051-07-0 [17] Arcuri, A.: Insight knowledge in search based software testing. In: Genetic and Evolutionary Computation Conference (GECCO), pp. 1649–1656 (2009) [18] Oliveto, P.S., He, J., Yao, X.: Time complexity of evolutionary algorithms for combinatorial optimization: A decade of results. International Journal of Automation and Computing 4(3), 281–293 (2007)
Firefly Algorithms for Multimodal Optimization

Xin-She Yang

Department of Engineering, University of Cambridge,
Trumpington Street, Cambridge CB2 1PZ, UK
[email protected]
Abstract. Nature-inspired algorithms are among the most powerful algorithms for optimization. This paper intends to provide a detailed description of a new Firefly Algorithm (FA) for multimodal optimization applications. We will compare the proposed firefly algorithm with other metaheuristic algorithms such as particle swarm optimization (PSO). Simulations and results indicate that the proposed firefly algorithm is superior to existing metaheuristic algorithms. Finally we will discuss its applications and implications for further research.
1 Introduction
Biologically inspired algorithms are becoming powerful in modern numerical optimization [1, 2, 4, 6, 9, 10], especially for NP-hard problems such as the travelling salesman problem. Among these biology-derived algorithms, multi-agent metaheuristics such as particle swarm optimization form a hot research topic in state-of-the-art algorithm development for optimization and other applications [1, 2, 9].

Particle swarm optimization (PSO) was developed by Kennedy and Eberhart in 1995 [5], based on swarm behaviour such as fish and bird schooling in nature, the so-called swarm intelligence. Although particle swarm optimization has many similarities with genetic algorithms, it is much simpler because it does not use mutation/crossover operators. Instead, it uses real-number randomness and global communication among the swarming particles. In this sense, it is also easier to implement as it uses mainly real numbers.

This paper aims to introduce the new Firefly Algorithm and to provide a comparison study of the FA with PSO and other relevant algorithms. We will first outline particle swarm optimization, then formulate the firefly algorithm, and finally compare the performance of these algorithms. FA optimization seems more promising than particle swarm optimization in the sense that FA can deal with multimodal functions more naturally and efficiently. In addition, particle swarm optimization is just a special class of the firefly algorithms, as we will demonstrate in this paper.

O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 169–178, 2009.
© Springer-Verlag Berlin Heidelberg 2009
2 Particle Swarm Optimization

2.1 Standard PSO
The PSO algorithm searches the space of the objective function by adjusting the trajectories of individual agents, called particles, as the piecewise paths formed by positional vectors in a quasi-stochastic manner [5, 6]. There are now as many as about 20 different variants of PSO. Here we only describe the simplest and yet most popular standard PSO.

The particle movement has two major components: a stochastic component and a deterministic component. A particle is attracted toward the position of the current global best g* and its own best location x_i* in history, while at the same time it has a tendency to move randomly. When a particle finds a location that is better than any previously found location, it updates this as the new current best for particle i. There is a current global best for all n particles. The aim is to find the global best among all the current best solutions until the objective no longer improves or after a certain number of iterations.

For the particle movement, we use x_i* to denote the current best for particle i, and g* ≈ min or max{f(x_i)} (i = 1, 2, ..., n) to denote the current global best. Let x_i and v_i be the position vector and velocity of particle i, respectively. The new velocity vector is determined by the following formula

v_i^{t+1} = v_i^t + α ε_1 ⊙ (g* − x_i^t) + β ε_2 ⊙ (x_i* − x_i^t),   (1)

where ε_1 and ε_2 are two random vectors, with each entry taking values between 0 and 1. The Hadamard product of two matrices, u ⊙ v, is defined as the entrywise product, that is, [u ⊙ v]_{ij} = u_{ij} v_{ij}. The parameters α and β are the learning parameters or acceleration constants, which can typically be taken as, say, α ≈ β ≈ 2. The initial values of x_i^{t=0} can be taken as the bounds or limits a = min(x_j), b = max(x_j), and v_i^{t=0} = 0. The new position can then be updated by

x_i^{t+1} = x_i^t + v_i^{t+1}.   (2)

Although v_i can take any values, it is usually bounded in some range [0, v_max]. There are many variants which extend the standard PSO algorithm; the most noticeable improvement is probably the use of an inertia function θ(t), so that v_i^t is replaced by θ(t) v_i^t, where θ takes values between 0 and 1. In the simplest case, the inertia function can be taken as a constant, typically θ ≈ 0.5 ∼ 0.9. This is equivalent to introducing a virtual mass to stabilize the motion of the particles, and thus the algorithm is expected to converge more quickly.
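As a concrete illustration, the update rules of Eqs. (1) and (2) can be sketched in a few lines of Python. This is not the author's Matlab implementation; the function name `pso`, the constant inertia θ, the box clipping, and the fixed iteration budget are our own assumptions:

```python
import numpy as np

def pso(f, bounds, n_particles=40, n_iter=200, alpha=2.0, beta=2.0, theta=0.7):
    """Minimize f over a box via the standard PSO update, Eqs. (1)-(2)."""
    lo, hi = bounds
    d = len(lo)
    rng = np.random.default_rng(0)
    x = rng.uniform(lo, hi, size=(n_particles, d))   # initial positions in [a, b]
    v = np.zeros((n_particles, d))                   # v_i^{t=0} = 0
    pbest = x.copy()                                 # x_i^*: best location of particle i
    pval = np.array([f(xi) for xi in x])
    g = pbest[pval.argmin()].copy()                  # g^*: current global best
    for _ in range(n_iter):
        e1 = rng.random((n_particles, d))            # epsilon_1, entries in [0, 1]
        e2 = rng.random((n_particles, d))            # epsilon_2
        # Eq. (1), with the inertia function theta(t) taken as a constant
        v = theta * v + alpha * e1 * (g - x) + beta * e2 * (pbest - x)
        x = np.clip(x + v, lo, hi)                   # Eq. (2), kept inside the box
        val = np.array([f(xi) for xi in x])
        better = val < pval                          # update per-particle bests
        pbest[better], pval[better] = x[better], val[better]
        g = pbest[pval.argmin()].copy()              # update global best
    return g, float(pval.min())
```

Since the per-particle bests can only improve, the returned value never exceeds the best initial sample; on a simple sphere function the routine homes in on the origin.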
3 Firefly Algorithm

3.1 Behaviour of Fireflies
The flashing light of fireflies is an amazing sight in the summer sky in the tropical and temperate regions. There are about two thousand firefly species, and
most fireflies produce short and rhythmic flashes. The pattern of flashes is often unique to a particular species. The flashing light is produced by a process of bioluminescence, and the true functions of such signaling systems are still under debate. However, two fundamental functions of such flashes are to attract mating partners (communication) and to attract potential prey. In addition, flashing may also serve as a protective warning mechanism. The rhythmic flash, the rate of flashing and the amount of time form part of the signal system that brings both sexes together. Females respond to a male's unique pattern of flashing in the same species, while in some species, such as Photuris, female fireflies can mimic the mating flashing pattern of other species so as to lure and eat the male fireflies who may mistake the flashes for those of a potential suitable mate.

We know that the light intensity at a particular distance r from the light source obeys the inverse square law. That is to say, the light intensity I decreases as the distance r increases, in terms of I ∝ 1/r². Furthermore, the air absorbs light, which becomes weaker and weaker as the distance increases. These two combined factors make most fireflies visible only to a limited distance, usually several hundred meters at night, which is usually good enough for fireflies to communicate.

The flashing light can be formulated in such a way that it is associated with the objective function to be optimized, which makes it possible to formulate new optimization algorithms. In the rest of this paper, we will first outline the basic formulation of the Firefly Algorithm (FA) and then discuss the implementation as well as its analysis in detail.

3.2 Firefly Algorithm
Now we can idealize some of the flashing characteristics of fireflies so as to develop firefly-inspired algorithms. For simplicity in describing our new Firefly Algorithm (FA), we now use the following three idealized rules: 1) all fireflies are unisex, so that one firefly will be attracted to other fireflies regardless of their sex; 2) attractiveness is proportional to their brightness; thus, for any two flashing fireflies, the less bright one will move towards the brighter one. The attractiveness is proportional to the brightness, and both decrease as their distance increases. If there is no firefly brighter than a particular firefly, it will move randomly; 3) the brightness of a firefly is affected or determined by the landscape of the objective function. For a maximization problem, the brightness can simply be proportional to the value of the objective function. Other forms of brightness can be defined in a similar way to the fitness function in genetic algorithms.

Based on these three rules, the basic steps of the firefly algorithm (FA) can be summarized as the pseudo code shown in Fig. 1.

In a certain sense, there is some conceptual similarity between the firefly algorithm and the bacterial foraging algorithm (BFA) [3, 7]. In BFA, the attraction among bacteria is based partly on their fitness and partly on their distance, while in FA, the attractiveness is linked to the objective function and a monotonic decay of the attractiveness with distance. However, the agents in FA have
Firefly Algorithm
Objective function f(x), x = (x_1, ..., x_d)^T
Generate initial population of fireflies x_i (i = 1, 2, ..., n)
Light intensity I_i at x_i is determined by f(x_i)
Define light absorption coefficient γ
while (t < MaxGeneration)
    for i = 1 : n (all n fireflies)
        for j = 1 : i (all n fireflies)
            if (I_j > I_i), move firefly i towards j in d dimensions; end if
            Attractiveness varies with distance r via exp[−γr]
            Evaluate new solutions and update light intensity
        end for j
    end for i
    Rank the fireflies and find the current best
end while
Postprocess results and visualization
Fig. 1. Pseudo code of the firefly algorithm (FA)
adjustable visibility and are more versatile in attractiveness variations, which usually leads to higher mobility; thus, the search space is explored more efficiently.

3.3 Attractiveness
In the firefly algorithm, there are two important issues: the variation of light intensity and the formulation of the attractiveness. For simplicity, we can always assume that the attractiveness of a firefly is determined by its brightness, which in turn is associated with the encoded objective function.

In the simplest case, for maximum optimization problems, the brightness I of a firefly at a particular location x can be chosen as I(x) ∝ f(x). However, the attractiveness β is relative; it should be seen in the eyes of the beholder, or judged by the other fireflies. Thus, it will vary with the distance r_ij between firefly i and firefly j. In addition, light intensity decreases with the distance from its source, and light is also absorbed in the media, so we should allow the attractiveness to vary with the degree of absorption. In the simplest form, the light intensity I(r) varies according to the inverse square law I(r) = I_s / r², where I_s is the intensity at the source. For a given medium with a fixed light absorption coefficient γ, the light intensity I varies with the distance r as I = I_0 e^{−γr}, where I_0 is the original light intensity. In order to avoid the singularity at r = 0 in the expression I_s / r², the combined effect of both the inverse square law and absorption can be approximated using the following Gaussian form

I(r) = I_0 e^{−γr²}.   (3)
Sometimes, we may need a function which decreases monotonically at a slower rate. In this case, we can use the following approximation

I(r) = I_0 / (1 + γr²).   (4)
At a shorter distance, the above two forms are essentially the same. This is because the series expansions about r = 0,

e^{−γr²} ≈ 1 − γr² + (1/2)γ²r⁴ + ...,   1/(1 + γr²) ≈ 1 − γr² + γ²r⁴ + ...,   (5)
are equivalent to each other up to the order of O(r³).

As a firefly's attractiveness is proportional to the light intensity seen by adjacent fireflies, we can now define the attractiveness β of a firefly by

β(r) = β_0 e^{−γr²},   (6)

where β_0 is the attractiveness at r = 0. As it is often faster to calculate 1/(1 + r²) than an exponential function, the above function, if necessary, can conveniently be replaced by β = β_0 / (1 + γr²). Equation (6) defines a characteristic distance Γ = 1/√γ over which the attractiveness changes significantly, from β_0 to β_0 e^{−1}.

In the implementation, the actual form of the attractiveness function β(r) can be any monotonically decreasing function, such as the following generalized form
β(r) = β_0 e^{−γr^m},   (m ≥ 1).   (7)
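To make the near-equivalence of the Gaussian form (6) and the cheaper rational form concrete, here is a small numerical check (a sketch; the function names are ours):

```python
import math

def beta_gauss(r, beta0=1.0, gamma=1.0):
    # Eq. (6): Gaussian form of the attractiveness
    return beta0 * math.exp(-gamma * r * r)

def beta_rational(r, beta0=1.0, gamma=1.0):
    # cheaper alternative beta0 / (1 + gamma r^2), often faster to evaluate
    return beta0 / (1.0 + gamma * r * r)

# by the expansions in Eq. (5), the two forms agree up to O(r^3) near r = 0
for r in (0.0, 0.05, 0.1):
    assert abs(beta_gauss(r) - beta_rational(r)) < 1e-3
```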
For a fixed γ, the characteristic length becomes Γ = γ^{−1/m} → 1 as m → ∞. Conversely, for a given length scale Γ in an optimization problem, the parameter γ can be used as a typical initial value, that is, γ = 1/Γ^m.

3.4 Distance and Movement
The distance between any two fireflies i and j, at x_i and x_j respectively, is the Cartesian distance

r_ij = ||x_i − x_j|| = sqrt( Σ_{k=1}^{d} (x_{i,k} − x_{j,k})² ),   (8)

where x_{i,k} is the kth component of the spatial coordinate x_i of the ith firefly. In the 2-D case, we have r_ij = sqrt((x_i − x_j)² + (y_i − y_j)²).

The movement of a firefly i attracted to another, more attractive (brighter) firefly j is determined by

x_i = x_i + β_0 e^{−γ r_ij²} (x_j − x_i) + α (rand − 1/2),   (9)

where the second term is due to the attraction, while the third term is randomization, with α being the randomization parameter; rand is a random number generator uniformly distributed in [0, 1]. For most cases in our implementation,
we can take β_0 = 1 and α ∈ [0, 1]. Furthermore, the randomization term can easily be extended to a normal distribution N(0, 1) or other distributions. In addition, if the scales vary significantly in different dimensions, such as −10⁵ to 10⁵ in one dimension while, say, −0.001 to 0.01 along the other, it is a good idea to replace α by α S_k, where the scaling parameters S_k (k = 1, ..., d) in the d dimensions should be determined by the actual scales of the problem of interest.

The parameter γ now characterizes the variation of the attractiveness, and its value is crucially important in determining the speed of convergence and how the FA algorithm behaves. In theory, γ ∈ [0, ∞), but in practice γ = O(1) is determined by the characteristic length Γ of the system to be optimized. Thus, in most applications, it typically varies from 0.01 to 100.

3.5 Scaling and Asymptotic Cases

It is worth pointing out that the distance r defined above is not limited to the Euclidean distance. We can define many other forms of the distance r in the n-dimensional hyperspace, depending on the type of problem of interest. For example, for job scheduling problems, r can be defined as the time lag or time interval. For complicated networks such as the Internet and social networks, the distance r can be defined as a combination of the degree of local clustering and the average proximity of vertices. In fact, any measure that can effectively characterize the quantities of interest in the optimization problem can be used as the 'distance' r.

The typical scale Γ should be associated with the scale of the optimization problem of interest. If Γ is the typical scale for a given optimization problem, then for a very large number of fireflies n ≫ m, where m is the number of local optima, the initial locations of these n fireflies should be distributed relatively uniformly over the entire search space, in a similar manner as the initialization of quasi-Monte Carlo simulations. As the iterations proceed, the fireflies will converge into all the local optima (including the global ones) in a stochastic manner. By comparing the best solutions among all these optima, the global optima can easily be achieved. At the moment, we are trying to formally prove that the firefly algorithm will approach the global optima when n → ∞ and t ≫ 1. In reality, it converges very quickly, typically within fewer than 50 to 100 generations, as will be demonstrated using various standard test functions later in this paper.

There are two important limiting cases: γ → 0 and γ → ∞. For γ → 0, the attractiveness is constant β = β_0 and Γ → ∞; this is equivalent to saying that the light intensity does not decrease in an idealized sky. Thus, a flashing firefly can be seen anywhere in the domain, and a single (usually global) optimum can easily be reached.
This corresponds to a special case of particle swarm optimization (PSO) discussed earlier; consequently, the efficiency of this special case is the same as that of PSO. On the other hand, the limiting case γ → ∞ leads to Γ → 0 and β(r) → δ(r) (the Dirac delta function), which means that the attractiveness is almost zero in the sight of other fireflies, or that the fireflies are short-sighted. This is equivalent to the case where the fireflies fly randomly in a very foggy region. No other fireflies
can be seen, and each firefly roams in a completely random way. Therefore, this corresponds to a completely random search method. As the firefly algorithm usually lies somewhere between these two extremes, it is possible to adjust the parameters γ and α so that it can outperform both random search and PSO. In fact, FA can find the global optima as well as all the local optima simultaneously in a very effective manner. This advantage will be demonstrated in detail later in the implementation. A further advantage of FA is that different fireflies work almost independently; it is thus particularly suitable for parallel implementation. It is even better than genetic algorithms and PSO in this respect, because fireflies aggregate more closely around each optimum (without jumping around as in the case of genetic algorithms), and the interactions between different subregions are minimal in a parallel implementation.
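Putting the pseudo code of Fig. 1 together with the movement rule of Eq. (9), the whole algorithm fits in a short routine. The following Python sketch is ours, not the paper's Matlab code; it compares every pair of fireflies (rather than the triangular loop of Fig. 1), clips movements to the search box, and treats lower objective values as brighter, all of which are our own simplifying assumptions:

```python
import numpy as np

def firefly(f, bounds, n=40, n_iter=100, beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Minimize f with a basic Firefly Algorithm (Fig. 1 plus Eq. (9))."""
    lo, hi = bounds
    d = len(lo)
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, size=(n, d))   # initial firefly locations
    I = np.array([f(xi) for xi in x])      # light intensity from f (lower = brighter here)
    for _ in range(n_iter):
        for i in range(n):
            for j in range(n):
                if I[j] < I[i]:            # j is brighter: move i towards j
                    r2 = float(np.sum((x[i] - x[j]) ** 2))
                    # Eq. (9): attraction term plus uniform randomization
                    x[i] += (beta0 * np.exp(-gamma * r2) * (x[j] - x[i])
                             + alpha * (rng.random(d) - 0.5))
                    x[i] = np.clip(x[i], lo, hi)
                    I[i] = f(x[i])         # evaluate the new solution
    best = int(I.argmin())                 # rank fireflies, return current best
    return x[best], float(I[best])
```

Note that the brightest firefly has no brighter neighbour and therefore does not move, so the best value found is monotone over the iterations.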
4 Multimodal Optimization with Multiple Optima

4.1 Validation
In order to demonstrate how the firefly algorithm works, we have implemented it in Matlab. We will use various test functions to validate the new algorithm. As an example, we now use the FA to find the global optimum of the Michalewicz function

f(x) = − Σ_{i=1}^{d} sin(x_i) [sin(i x_i² / π)]^{2m},   (10)

where m = 10 and d = 1, 2, .... The global minimum f* ≈ −1.801 in 2-D occurs at (2.20319, 1.57049), which can be found after about 400 evaluations for 40 fireflies after 10 iterations (see Fig. 2 and Fig. 3). Now let us use the FA to find the optima of some tougher test functions. This is much more efficient than most existing metaheuristic algorithms. In the above simulations, the values of the parameters are α = 0.2, γ = 1 and β_0 = 1.
Fig. 2. Michalewicz’s function for two independent variables with a global minimum f∗ ≈ −1.801 at (2.20319, 1.57049)
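For reference, Eq. (10) translates directly into code; the function name is ours, and the index i is taken 1-based as in the formula:

```python
import math

def michalewicz(x, m=10):
    """Michalewicz function, Eq. (10): global minimum about -1.801 in 2-D."""
    return -sum(math.sin(xi) * math.sin((i + 1) * xi * xi / math.pi) ** (2 * m)
                for i, xi in enumerate(x))
```

Evaluating at the 2-D optimum (2.20319, 1.57049) reproduces the value f* ≈ −1.801 quoted above.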
Fig. 3. The initial 40 fireflies (left) and their locations after 10 iterations (right)
We have also used much tougher test functions. For example, Yang described a multimodal function which looks like a standing-wave pattern [11],

f(x) = [ e^{−Σ_{i=1}^{d} (x_i/a)^{2m}} − 2 e^{−Σ_{i=1}^{d} x_i²} ] · Π_{i=1}^{d} cos² x_i,   m = 5,   (11)

which is multimodal with many local peaks and valleys. It has a unique global minimum f* = −1 at (0, 0, ..., 0) in the region −20 ≤ x_i ≤ 20, where i = 1, 2, ..., d and a = 15. The 2D landscape of Yang's function is shown in Fig. 4.

4.2 Comparison of FA with PSO and GA
Various studies show that PSO algorithms can outperform genetic algorithms (GA) [4] and other conventional algorithms for solving many optimization problems. This is partially due to the fact that the broadcasting ability of the current best estimates gives better and quicker convergence towards the optimality. A general framework for evaluating the statistical performance of evolutionary algorithms has been discussed in detail by Shilane et al. [8]. Now we will compare the Firefly Algorithm with PSO and genetic algorithms for various standard test functions. For the genetic algorithms, we have used the standard version with no elitism, with a mutation probability of p_m = 0.05 and a crossover probability of 0.95. For the particle swarm optimization, we have also used the standard version with the learning parameters α ≈ β ≈ 2 and without the inertia correction [4, 5, 6]. We have used various population sizes from n = 15 to 200, and found that for most problems it is sufficient to use n = 15 to 50. Therefore, we have used a fixed population size of n = 40 in all our simulations for comparison.

After implementing these algorithms in Matlab, we have carried out extensive simulations; each algorithm has been run at least 100 times so as to carry out a meaningful statistical analysis. The algorithms stop when the variations of function values are less than a given tolerance of 10⁻⁵. The results are summarized in Table 1, where the global optima are reached. The numbers are in the format: average number of evaluations (success
Fig. 4. Yang’s function in 2D with a global minimum f∗ = −1 at (0, 0) where a = 15
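Eq. (11) can likewise be coded directly. A sketch (function name ours), with the defaults a = 15 and m = 5 from the text:

```python
import math

def yang(x, a=15.0, m=5):
    """Standing-wave test function, Eq. (11); global minimum f* = -1 at the origin."""
    s1 = sum((xi / a) ** (2 * m) for xi in x)
    s2 = sum(xi * xi for xi in x)
    prod = 1.0
    for xi in x:
        prod *= math.cos(xi) ** 2
    return (math.exp(-s1) - 2.0 * math.exp(-s2)) * prod
```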
Table 1. Comparison of algorithm performance

Functions/Algorithms    GA                      PSO                     FA
Michalewicz's (d=16)    89325 ± 7914 (95%)      6922 ± 537 (98%)        3752 ± 725 (99%)
Rosenbrock's (d=16)     55723 ± 8901 (90%)      32756 ± 5325 (98%)      7792 ± 2923 (99%)
De Jong's (d=256)       25412 ± 1237 (100%)     17040 ± 1123 (100%)     7217 ± 730 (100%)
Schwefel's (d=128)      227329 ± 7572 (95%)     14522 ± 1275 (97%)      9902 ± 592 (100%)
Ackley's (d=128)        32720 ± 3327 (90%)      23407 ± 4325 (92%)      5293 ± 4920 (100%)
Rastrigin's             110523 ± 5199 (77%)     79491 ± 3715 (90%)      15573 ± 4399 (100%)
Easom's                 19239 ± 3307 (92%)      17273 ± 2929 (90%)      7925 ± 1799 (100%)
Griewank's              70925 ± 7652 (90%)      55970 ± 4223 (92%)      12592 ± 3715 (100%)
Shubert's (18 minima)   54077 ± 4997 (89%)      23992 ± 3755 (92%)      12577 ± 2356 (100%)
Yang's (d=16)           27923 ± 3025 (83%)      14116 ± 2949 (90%)      7390 ± 2189 (100%)
rate), so 3752 ± 725 (99%) means that the average number (mean) of function evaluations is 3752, with a standard deviation of 725, and the success rate of finding the global optima for this algorithm is 99%. We can see that the FA is much more efficient in finding the global optima, with higher success rates. Each function evaluation is virtually instantaneous on a modern personal computer. For example, the computing time for 10,000 evaluations on a 3 GHz desktop is about 5 seconds. Even with graphics for displaying the locations of the particles and fireflies, it usually takes less than a few minutes. It is worth pointing out that more formal statistical hypothesis testing can be used to verify such significance.
5 Conclusions
In this paper, we have formulated a new firefly algorithm and analyzed its similarities and differences with particle swarm optimization. We then implemented and compared these algorithms. Our simulation results for finding the global optima of various test functions suggest that particle swarm often outperforms traditional algorithms such as genetic algorithms, while the new firefly algorithm is superior to both PSO and GA in terms of both efficiency and success rate. This implies that FA is potentially more powerful in solving NP-hard problems, which will be investigated further in future studies.

The basic firefly algorithm is very efficient, but we can see that the solutions still change as the optima are approached. It is possible to improve the solution quality by reducing the randomness gradually. A further improvement on the convergence of the algorithm is to vary the randomization parameter α so that it decreases gradually as the optima are approached. These could form important topics for further research. Furthermore, as a relatively straightforward extension, the Firefly Algorithm can be modified to solve multiobjective optimization problems. In addition, the application of firefly algorithms in combination with other algorithms may form an exciting area for further research.
References

[1] Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, Oxford (1999)
[2] Deb, K.: Optimisation for Engineering Design. Prentice-Hall, New Delhi (1995)
[3] Gazi, K., Passino, K.M.: Stability analysis of social foraging swarms. IEEE Trans. Sys. Man Cyber. Part B – Cybernetics 34, 539–557 (2004)
[4] Goldberg, D.E.: Genetic Algorithms in Search, Optimisation and Machine Learning. Addison Wesley, Reading (1989)
[5] Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. of IEEE International Conference on Neural Networks, Piscataway, NJ, pp. 1942–1948 (1995)
[6] Kennedy, J., Eberhart, R., Shi, Y.: Swarm Intelligence. Academic Press, London (2001)
[7] Passino, K.M.: Biomimicry of Bacterial Foraging for Distributed Optimization. University Press, Princeton (2001)
[8] Shilane, D., Martikainen, J., Dudoit, S., Ovaska, S.J.: A general framework for statistical performance comparison of evolutionary computation algorithms. Information Sciences: An Int. Journal 178, 2870–2879 (2008)
[9] Yang, X.S.: Nature-Inspired Metaheuristic Algorithms. Luniver Press (2008)
[10] Yang, X.S.: Biology-derived algorithms in engineering optimization. In: Olarius, S., Zomaya, A. (eds.) Handbook of Bioinspired Algorithms and Applications, ch. 32. Chapman & Hall / CRC (2005)
[11] Yang, X.S.: Engineering Optimization: An Introduction with Metaheuristic Applications. Wiley, Chichester (2010)
Economical Caching with Stochastic Prices

Matthias Englert¹, Berthold Vöcking², and Melanie Winkler²

¹ DIMAP and Department of Computer Science, University of Warwick
[email protected]
² Department of Computer Science, RWTH Aachen University
{voecking,winkler}@cs.rwth-aachen.de
Abstract. In the economical caching problem, an online algorithm is given a sequence of prices for a certain commodity. The algorithm has to manage a buffer of fixed capacity over time. We assume that time proceeds in discrete steps. In step i, the commodity is available at price ci ∈ [α, β], where β > α ≥ 0 and ci ∈ N. One unit of the commodity is consumed per step. The algorithm can buy this unit at the current price ci , can take a previously bought unit from the storage, or can buy more than one unit at price ci and put the remaining units into the storage. In this paper, we study the economical caching problem in a probabilistic analysis, that is, we assume that the prices are generated by a random walk with reflecting boundaries α and β. We are able to identify the optimal online algorithm in this probabilistic model and analyze its expected cost and its expected savings, i.e., the cost that it saves in comparison to the cost that would arise without having a buffer. In particular, we compare the savings of the optimal online algorithm with the savings of the optimal offline algorithm in a probabilistic competitive analysis and obtain tight bounds (up to constant factors) on the ratio between the expected savings of these two algorithms.
1 Introduction
We study a stochastic version of the economical caching problem, dealing with the management of a storage for a commodity whose price varies over time. An online algorithm is given a sequence of prices for a certain commodity. The algorithm has to manage a buffer of fixed capacity over time. We assume that time proceeds in discrete steps. In step i, the commodity is available at price c_i ∈ [α, β], where β > α ≥ 0 and c_i ∈ N. One unit of the commodity is consumed per step. The algorithm can buy this unit at the current price c_i, can take a previously bought unit from the storage, or can buy more than one unit at price c_i and put the remaining units into the storage.

A slightly more general version of this problem has been introduced in [2], where it is assumed that the consumption per step is not fixed to one but varies over time. The motivation for introducing this problem was the battery management of a hybrid car with two engines, one operated with petrol, the other
Supported by the DFG GK/1298 “AlgoSyn” and EPSRC grant EP/F043333/1.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 179–190, 2009.
© Springer-Verlag Berlin Heidelberg 2009
with electrical energy. An algorithm has to decide at which time the battery for the electrical energy should be recharged by the combustion engine. In this context, prices correspond to the combustion efficiency which, e.g., depends on the rotational speed. Obviously, the economical caching problem is quite generic, and one can think of multiple other applications beyond hybrid cars, e.g., purchasing petrol at gas stations where prices vary on a daily basis, caching data streams in a mobile environment in which services that operate at different price levels are available in a dynamic fashion, etc. The problem is related to the well-known one-way trading problem [1].
In contrast, in the aforementioned applications, the prices are set by a process that is (at least to a large extend) independent of the decisions of the online algorithm. We believe that a model in which prices are determined by a stochastic process can lead to more algorithms that perform better in practice. In this paper, we assume that prices are generated by a random walk with reflecting barriers α and β. We do not claim that this is exactly the right process for any of the aforementioned applications. We hope, however, that our analysis is a first step towards analyzing more realistic, and possibly more complicated input distributions. Before stating our results, we give a formal description of the problem. 1.1
Formal Description of the Problem
We consider a finite process of n steps. In step i ∈ {1, . . . , n}, the algorithm can purchase the commodity at price c_i per unit. Prices c_1, c_2, . . . , c_n, c_{n+1} are natural numbers generated by a random walk with reflecting boundaries α ∈ N and β ∈ N, β > α ≥ 0. Price c_{n+1} is used to value the units in the storage at the end of the process.
We assume that the random walk starts at a position chosen uniformly at random from {α, . . . , β}. This is justified by the fact that this allocation corresponds to the steady-state distribution of the random walk. If c_i ∈ {α + 1, . . . , β − 1} then Pr[c_{i+1} = c_i − 1] = Pr[c_{i+1} = c_i + 1] = 1/2; if c_i = α then Pr[c_{i+1} = α] = Pr[c_{i+1} = α + 1] = 1/2; and if c_i = β then Pr[c_{i+1} = β − 1] = Pr[c_{i+1} = β] = 1/2. It is well known that, under these conditions, for every 1 ≤ i ≤ n + 1 and α ≤ j ≤ β, Pr[c_i = j] = t⁻¹, where t = |{α, . . . , β}| = β − α + 1.

The capacity of the storage is denoted by b ∈ N. Let s_i ∈ [0, b] denote the amount stored in the storage at the end of step i. Initially, the storage is assumed to be empty, that is, s_0 = 0. Consider an algorithm A. Let u_i ∈ {0, . . . , b − s_{i−1} + 1} denote the number of units that A purchases in step i. In each step, one unit of the commodity is consumed. Thus, s_i = s_{i−1} + u_i − 1 ≤ b. Furthermore, if s_{i−1} = 0, at least one unit has to be purchased in step i. The cost of the algorithm is Σ_{i=1}^{n} u_i · c_i − s_n · c_{n+1}. Observe that we value the units in the buffer at the end of the process with the price c_{n+1}.

An algorithm is called online when u_i, the amount purchased in step i, depends only on the prices c_1, . . . , c_i but not on the prices c_{i+1}, . . . , c_{n+1}. An algorithm that is allowed to take into account knowledge about the full price sequence is called offline.
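The price process just described is easy to simulate; a sketch in Python (our own helper, with a fixed seed for reproducibility):

```python
import random

def price_walk(n, alpha, beta, seed=0):
    """Prices c_1, ..., c_{n+1} from a random walk on {alpha, ..., beta}
    with reflecting boundaries, started in the uniform steady state."""
    rng = random.Random(seed)
    c = rng.randint(alpha, beta)       # steady-state start: uniform on {alpha, ..., beta}
    prices = []
    for _ in range(n + 1):
        prices.append(c)
        step = rng.choice((-1, 1))
        if alpha <= c + step <= beta:  # a step leaving [alpha, beta] is rejected (price stays)
            c += step
    return prices
```

At c = α, the proposed step −1 is rejected, so the next price is α or α + 1 with probability 1/2 each, matching the boundary rule above (and symmetrically at c = β).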
1.2 Preliminaries
Algorithm NoBuffer is an example of a very simple online algorithm. It sets u_i = 1, regardless of how the prices are chosen. As its name suggests, this algorithm does not exploit the storage. For this reason, its expected cost can be calculated easily: since Pr[c_i = j] = t^{−1}, the expected price in every step i is (α + β)/2. Hence, by linearity of expectation, the expected cost of algorithm NoBuffer is n · (α + β)/2.
The savings of an algorithm A are defined to be the cost of NoBuffer minus the cost of A. The savings are of particular relevance as this quantity describes the reduction in cost due to the availability of a buffer. For an algorithm A, let φ_i(A), 0 ≤ i ≤ n, denote the savings accumulated until step i, i.e., the savings of A assuming that A as well as NoBuffer terminate after step i. Let Δφ_i(A) = φ_i(A) − φ_{i−1}(A) be the savings achieved in step i.

Proposition 1. For 1 ≤ i ≤ n, Δφ_i(A) = s_i (c_{i+1} − c_i).

Proof. The increase in cost of the NoBuffer algorithm during step i is c_i. The increase in cost of A during the same step, taking into account the change in the value of the storage, is

$$u_i c_i - s_i c_{i+1} + s_{i-1} c_i = c_i - s_i (c_{i+1} - c_i),$$

where the equality follows from s_i = s_{i−1} + u_i − 1. Subtracting the increase in cost of A from the increase in cost of NoBuffer gives the proposition.
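The cost model and Proposition 1 can be made concrete in a few lines of Python (a sketch; the helper names are ours). The second function checks the summed identity φ_n = Σ_i s_i (c_{i+1} − c_i) against the direct definition of the savings:

```python
def cost(prices, us):
    """Cost of an algorithm buying us[i] units in step i+1.
    prices = [c_1, ..., c_{n+1}]; leftover storage is valued at c_{n+1}.
    Returns the cost and the storage levels s_1, ..., s_n."""
    n = len(us)
    s, total, storage = 0, 0, []
    for i in range(n):
        total += us[i] * prices[i]
        s += us[i] - 1              # one unit is consumed per step
        assert s >= 0, "storage must never go negative"
        storage.append(s)
    return total - s * prices[n], storage

def savings(prices, us):
    """Savings of an algorithm relative to NoBuffer (u_i = 1 in every step)."""
    n = len(us)
    nobuffer_cost, _ = cost(prices, [1] * n)
    alg_cost, storage = cost(prices, us)
    phi = nobuffer_cost - alg_cost
    # Proposition 1, summed over all steps: phi_n = sum_i s_i * (c_{i+1} - c_i)
    assert phi == sum(storage[i] * (prices[i + 1] - prices[i]) for i in range(n))
    return phi
```

For example, with prices (1, 2, 3, 2, 3) an algorithm that buys two units in the cheap first step saves one unit of cost over NoBuffer.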
M. Englert, B. Vöcking, and M. Winkler
The proposition shows that the savings achieved in step i correspond to the (possibly negative) increase of the value of the storage due to the change of the price from c_i to c_{i+1} during the step.

1.3 Our Results
We are able to identify the optimal online algorithm, i.e., the algorithm that achieves the smallest possible expected cost among all online algorithms. This algorithm works as follows: It fills the storage completely if ci = α and uses units from the storage if ci > α, cf. Algorithm 1. The algorithm is called Cautious
Algorithm 1. (Cautious)
Input: c_i
1: if c_i = α then
2:   u_i := b − s_{i−1} + 1
3: else
4:   u_i := 1 − min{1, s_{i−1}}
5: end if
as it only stores units bought at the lowest possible price. Obviously, there is no risk in buying these units. Somewhat surprisingly, this approach is the best possible online strategy.
We estimate the expected cost of the Cautious algorithm in the limiting case, i.e., assuming that n is sufficiently large. Per step, the expected cost of Cautious is

$$\alpha + t \cdot 2^{-\Theta(b/t^2)}.$$

Observe that if b is of order ω(t²) then 2^{−Θ(b/t²)} = o(1). That is, the expected cost per step approaches the lower boundary α. If b = O(t²) then 2^{−Θ(b/t²)} = Θ(1). Thus, in this case, the expected cost of Cautious is α + Θ(t), which, however, is not a very significant result as it does not show the dependence of the expected cost on b, for b = O(t²).
To get more meaningful results we investigate the expected savings rather than the expected cost of the Cautious algorithm and compare them with the expected savings of an optimal offline algorithm. We get tight lower and upper bounds on the expected savings for both of these algorithms. The expected savings of the Cautious algorithm are Θ(min{b/t, t}) per step, while the expected savings of the optimal offline algorithm are Θ(min{√b, t}) per step. Thus the competitive factor, i.e., the ratio between the expected savings of Cautious and the expected savings of the optimal offline algorithm, is of order
$$\frac{\min\{b/t,\, t\}}{\min\{\sqrt{b},\, t\}} = \Theta\!\left(\min\left\{\frac{\sqrt{b}}{t},\, 1\right\}\right).$$
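Algorithm 1 admits a direct transcription into code. The sketch below (Python; function name ours) performs one step of Cautious and returns the purchase u_i together with the new storage level s_i:

```python
def cautious_step(c_i, s_prev, b, alpha):
    """One step of the Cautious algorithm (Algorithm 1)."""
    if c_i == alpha:
        u = b - s_prev + 1       # fill the storage and cover this step's demand
    else:
        u = 1 - min(1, s_prev)   # use a stored unit if available, else buy one
    s = s_prev + u - 1           # one unit is consumed in every step
    return u, s
```

At price α the storage is filled completely (s_i = b); at any higher price Cautious buys only when the storage is empty.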
2 Cautious Is the Optimal Online Algorithm
In this section, we show that Cautious achieves the smallest possible expected cost among all online algorithms.

Theorem 1. For every n ∈ ℕ, there is no online algorithm achieving smaller expected cost than Cautious.

Proof. Consider any online algorithm A. We assume that A is reasonable, i.e., it completely fills the storage in every step in which the price per unit is α. We can make this assumption w.l.o.g. since buying units at cost α instead of buying them later at possibly higher cost, or buying additional units that are valued at a price of c_{n+1} ≥ α after step n, cannot increase the cost of the algorithm.
We will show that, for every 1 ≤ i ≤ n, Cautious achieves at least as large expected savings in step i as A, that is, E[Δφ_i(A)] ≤ E[Δφ_i(Cautious)]. By linearity of expectation, this implies E[φ_n(A)] ≤ E[φ_n(Cautious)]. In words, the expected savings of A over all steps are not larger than the expected savings of Cautious, which implies the theorem.
Towards this end, we study the expected savings of A per step. Firstly, we analyze E[Δφ_i(A)] conditioned on c_i being fixed to some value. For k ∈ {α + 1, …, β − 1}, Proposition 1 gives

$$E[\Delta\varphi_i(A) \mid c_i = k] = (E[c_{i+1} \mid c_i = k] - k) \cdot E[s_i \mid c_i = k] = 0$$

because E[c_{i+1} | c_i = k] = k. In contrast,

$$E[\Delta\varphi_i(A) \mid c_i = \alpha] = (E[c_{i+1} \mid c_i = \alpha] - \alpha) \cdot E[s_i \mid c_i = \alpha] = \tfrac{1}{2} \cdot E[s_i \mid c_i = \alpha]$$

because E[c_{i+1} | c_i = α] = α + 1/2, and

$$E[\Delta\varphi_i(A) \mid c_i = \beta] = (E[c_{i+1} \mid c_i = \beta] - \beta) \cdot E[s_i \mid c_i = \beta] = -\tfrac{1}{2} \cdot E[s_i \mid c_i = \beta]$$

because E[c_{i+1} | c_i = β] = β − 1/2. Applying these equations, the expected savings of A in step i can be calculated as

$$E[\Delta\varphi_i(A)] = \Pr[c_i = \alpha] \cdot E[\Delta\varphi_i(A) \mid c_i = \alpha] + \Pr[c_i = \beta] \cdot E[\Delta\varphi_i(A) \mid c_i = \beta] = \frac{1}{2t}\big(E[s_i \mid c_i = \alpha] - E[s_i \mid c_i = \beta]\big).$$

Now observe that E[s_i | c_i = α] = b because of our initial assumption that A is reasonable. Thus, the expected savings are maximized if E[s_i | c_i = β] is minimized, which is achieved if A has the property that u_j ≤ β − c_j − s_{j−1} + 1 for every step j with c_j > α. Cautious has this property. Thus Theorem 1 is shown.
3 Expected Cost of Cautious
We now calculate the expected cost of Cautious.

Theorem 2. As n tends to infinity, the expected cost per step of Cautious tends to α + t · 2^{−Θ(b/t²)}.

Proof. To simplify notation we assume that (β + α)/2 is integral. Fix an arbitrary step i > b. We will show that E[c_i − (s_{i−1} − s_i)(c_i − α)] = α + t · 2^{−Θ(b/t²)}, which corresponds to the expected cost of Cautious per step in the limiting case, that is, if we ignore an additive error that does not depend on the sequence length n (but may depend on b, α, and β). To see that E[c_i − (s_{i−1} − s_i)(c_i − α)] is indeed related to the expected cost per step, we take the sum over all steps j > b and obtain

$$\sum_{j=b+1}^{n} E[c_j - (s_{j-1} - s_j)(c_j - \alpha)] = E\!\left[\sum_{j=b+1}^{n} (c_j - (s_{j-1} - s_j)(c_j - \alpha))\right] = E\!\left[\sum_{j=b+1}^{n} (u_j (c_j - \alpha) + \alpha)\right] = E\!\left[\sum_{j=b+1}^{n} u_j c_j - (s_n - s_b) \cdot \alpha\right],$$

which approximates the expected total cost of Cautious, E[Σ_{j=1}^{n} u_j c_j − s_n c_{n+1}], within an additive error of 2b · β.
From the definition of the Cautious algorithm it follows immediately that c_i − (s_{i−1} − s_i)(c_i − α) equals α if s_{i−1} > 0 and equals c_i otherwise. Therefore E[c_i − (s_{i−1} − s_i)(c_i − α)] is equal to

$$\sum_{x=\alpha}^{\beta} \Pr[c_i = x] \cdot \big(\Pr[s_{i-1} = 0 \mid c_i = x] \cdot x + (1 - \Pr[s_{i-1} = 0 \mid c_i = x]) \cdot \alpha\big) = \alpha + \sum_{x=\alpha}^{\beta} \Pr[c_i = x] \cdot \Pr[s_{i-1} = 0 \mid c_i = x] \cdot (x - \alpha) = \alpha + \frac{1}{t} \sum_{x=\alpha}^{\beta} \Pr[s_{i-1} = 0 \mid c_i = x] \cdot (x - \alpha).$$
Note that Pr[s_{i−1} = 0 | c_i = x] is the probability that the price was strictly larger than α for the last b steps, i.e., c_{i−b} > α, …, c_{i−1} > α. In the following we will show that Pr[s_{i−1} = 0 | c_i = x] = 2^{−Θ(b/t²)}, for x ≥ (α + β)/2, which implies the theorem since on one hand

$$\alpha + \frac{1}{t} \sum_{x=\alpha}^{\beta} \Pr[s_{i-1} = 0 \mid c_i = x] \cdot (x - \alpha) \le \alpha + \frac{1}{t} \sum_{x=\alpha}^{\beta} \Pr[s_{i-1} = 0 \mid c_i = \beta] \cdot (x - \alpha) = \alpha + \frac{2^{-\Theta(b/t^2)}}{t} \sum_{x=\alpha}^{\beta} (x - \alpha) = \alpha + t \cdot 2^{-\Theta(b/t^2)}$$

and on the other hand

$$\alpha + \frac{1}{t} \sum_{x=\alpha}^{\beta} \Pr[s_{i-1} = 0 \mid c_i = x] \cdot (x - \alpha) \ge \alpha + \frac{1}{t} \sum_{x=(\alpha+\beta)/2}^{\beta} \Pr[s_{i-1} = 0 \mid c_i = (\alpha+\beta)/2] \cdot (x - \alpha) = \alpha + \frac{2^{-\Theta(b/t^2)}}{t} \sum_{x=(\alpha+\beta)/2}^{\beta} (x - \alpha) = \alpha + t \cdot 2^{-\Theta(b/t^2)}.$$
First we show that Pr[s_{i−1} = 0 | c_i = x] = 2^{−Ω(b/t²)} for any x. For simplicity we assume that b is a multiple of 2(t − 1)², and we divide the b previous steps into b/(2(t − 1)²) phases of length 2(t − 1)². It is well known that if the random walk starts with a price x in the beginning of a phase, the expected time until the random walk reaches a price of α is (x − α) · (2β − α − x − 1) [3, p. 349]. Using Markov's inequality we know that the probability not to reach α in twice as many steps is at most 1/2. The number of steps is maximized for x = β, which gives us that the probability not to reach α in 2 · (β − α) · (β − α − 1) ≤ 2(t − 1)² steps is at most 1/2. In other words, if we fix one of the phases, the probability that the price is strictly larger than α in every step of this phase is at most 1/2, and because our argument does not depend on the price x a phase starts with, this holds independently for every phase. Therefore, the probability not to have a price of α in any of the b/(2(t − 1)²) phases is bounded from above by 2^{−b/(2(t−1)²)} = 2^{−Ω(b/t²)}.
It only remains to show that Pr[s_{i−1} = 0 | c_i = x] = 2^{−O(b/t²)} for x ≥ (α + β)/2. For this we divide the b previous steps into 16 · b/(t − 1)² phases of length at most (t − 1)²/16. We call a phase starting with some price above (α + β)/2 successful if the price at the end of the phase is above (α + β)/2 as well and no step of the phase has a price of α. If we can show that the success probability for a fixed phase can be bounded from below by some constant c > 0, we can conclude that the probability for all 16 · b/(t − 1)² phases to be successful is at least c^{16·b/(t−1)²} = 2^{−O(b/t²)}, which is the desired result and concludes the proof of the theorem.
Fix a phase, let A denote the event that no step in the phase has a price of α, and let B denote the event that the price at the end of the phase is above (α + β)/2. We will show that the probability for the phase to be successful, Pr[A ∧ B] = Pr[B] · Pr[A | B], is at least 1/64. Clearly Pr[B] ≥ 1/2. Now assume for contradiction that Pr[A | B] ≤ 1/32. Let X be the random variable describing the number of steps it takes a random walk starting from (α + β)/2 to hit a price of α. We already know that E[X] = 3(t − 1)²/4 (as before this follows from [3, p. 349]) and that Pr[X ≥ 2j(t − 1)²] ≤ 2^{−j}, for any integral j > 1. Furthermore, the assumption Pr[A | B] ≤ 1/32 implies Pr[X ≤ (t − 1)²/16] ≥ 31/32, which gives Pr[X > (t − 1)²/16] ≤ 1/32. Thus, we can conclude

– Pr[X ∈ [0, (t − 1)²/16]] ≤ 1,
– Pr[X ∈ ((t − 1)²/16, 16 · (t − 1)²]] ≤ 1/32, and
– for any integral j ≥ 1, Pr[X ∈ (2 · j · (t − 1)², 2 · (j + 1) · (t − 1)²]] ≤ 2^{−j}.

This gives us the following contradiction:

$$E[X] \le \frac{(t-1)^2}{16} + \frac{(t-1)^2}{2} + \sum_{j=8}^{\infty} \frac{(j+1) \cdot (t-1)^2}{2^{j-1}} < \frac{3(t-1)^2}{4} = E[X],$$

which completes the proof of the theorem.
Note that the above arguments in the proof of Theorem 2 also directly imply the following lemma, which will be useful later in the proof of Theorem 3.

Lemma 1. The probability that a random walk starting at a price of β will not see a price of α for at least (t − 1)² steps is at least (1/64)^16.

Proof. The proof is analogous to the above arguments if we consider only 16 instead of 16 · b/(t − 1)² phases of length at most (t − 1)²/16.
4 Expected Savings of Cautious
Theorem 3. E[φ_n(Cautious)] = Θ(n · min{b/t, t}).

Proof. Fix an arbitrary time step i. In Theorem 1 it was already shown that E[Δφ_i(Cautious)] = (b − E[s_i | c_i = β])/(2t). The number of units s_i in the storage only depends on the number of steps ℓ := i − i′ that have passed since the last previous step i′ that had a price of α. More precisely, the storage is empty if ℓ ≥ b and otherwise the storage contains b − ℓ units, where we define ℓ := b if there was no previous step i′ with a price of α. Hence, s_i = b − min{b, ℓ} and thus

$$E[\Delta\varphi_i(\text{Cautious})] = \frac{E[\min\{b, \ell\} \mid c_i = \beta]}{2t}.$$
Clearly, E[min{b, ℓ} | c_i = β]/(2t) ≤ b/(2t) = O(b/t). Since the total savings over all steps are obviously bounded by n · t,

$$E[\varphi_n(\text{Cautious})] \le \min\left\{\sum_{i=1}^{n} E[\Delta\varphi_i(\text{Cautious})],\; n \cdot t\right\} = O\!\left(n \cdot \min\left\{\frac{b}{t},\, t\right\}\right).$$

It remains to show that E[min{b, ℓ} | c_i = β]/(2t) = Ω(min{b/t, t}). The probability that a random walk starting at a price of β does not reach a price of α for at least (t − 1)² steps is equal to Pr[ℓ > (t − 1)² | c_i = β]. Hence, due to Lemma 1, Pr[ℓ > (t − 1)² | c_i = β] ≥ (1/64)^16. It follows

$$\frac{E[\min\{b, \ell\} \mid c_i = \beta]}{2t} \ge \frac{\min\{b, (t-1)^2\}}{64^{16} \cdot 2t} = \Omega\!\left(\min\left\{\frac{b}{t},\, t\right\}\right).$$

Taking the sum over all steps gives E[φ_n(Cautious)] = Σ_{i=1}^{n} E[Δφ_i(Cautious)] = Ω(n · min{b/t, t}).
5 Expected Savings of the Optimal Offline Algorithm
The cost of an optimal offline algorithm is

$$\sum_{i=1}^{n} \min\{c_j \mid j \ge 1,\; i - b \le j \le i\} + \sum_{i=n-b+1}^{n} (\min\{c_i, \ldots, c_n, c_{n+1}\} - c_{n+1})$$

(which can be formally proven with arguments analogous to [2, Lemma 2.1]). To get a measure of the quality of the savings of our optimal online algorithm Cautious, we calculate E[φ_n(Off)] for the optimal offline algorithm.

Theorem 4. E[φ_n(Off)] = Θ(n · min{√b, t}).

Proof. We will show that E[c_i − min{c_{i−b}, …, c_i}] = Θ(min{√b, t}), for every step i > b. The theorem follows because Σ_{i=1}^{b} (c_i − min{c_j | j ≥ 1, j ∈ [i − b, i]}) ≤ b(β − α) and Σ_{i=n−b+1}^{n} (min{c_i, …, c_n, c_{n+1}} − c_{n+1}) ∈ [b(α − β), 0], and therefore

$$E[\varphi_n(\text{Off})] = E\!\left[\sum_{i=1}^{n} c_i\right] - E\!\left[\sum_{i=1}^{n} \min\{c_j \mid j \ge 1,\; j \in [i-b, i]\}\right] - E\!\left[\sum_{i=n-b+1}^{n} (\min\{c_i, \ldots, c_n, c_{n+1}\} - c_{n+1})\right]$$
$$= \sum_{i=b+1}^{n} E[c_i - \min\{c_{i-b}, \ldots, c_i\}] + E\!\left[\sum_{i=1}^{b} (c_i - \min\{c_j \mid j \ge 1,\; j \in [i-b, i]\})\right] - E\!\left[\sum_{i=n-b+1}^{n} (\min\{c_i, \ldots, c_n, c_{n+1}\} - c_{n+1})\right]$$
$$= \Theta(n \cdot \min\{\sqrt{b},\, t\}).$$
In this proof we will make use of the well-known fact that it is possible to map a symmetric unrestricted random walk c′_1, …, c′_{n+1} to the random walk c_1, …, c_{n+1} with reflecting boundaries at α and β. More precisely, let c′_1 be chosen uniformly at random from {0, …, t − 1} and set Pr[c′_{i+1} = c′_i + 1] = Pr[c′_{i+1} = c′_i − 1] = 1/2, for any c′_i. Setting c_i := α + min{(c′_i mod 2t), 2t − 1 − (c′_i mod 2t)} for each i maps this unrestricted random walk to the reflected random walk.
Fix a time step i > b. We have to show that E[c_i − min{c_{i−b}, …, c_i}] = Θ(min{√b, t}), and we start with the upper bound. It is easy to see that c_i − c_j ≤ |c′_i − c′_j| for every j, and it is known [3, Theorem III.7.1] that

$$\Pr\big[|c'_i - \min\{c'_{i-b}, \ldots, c'_i\}| = \ell\big] = \frac{1}{2^{b-1}} \cdot \left(\binom{b}{\frac{1}{2}(b+\ell)} + \binom{b}{\frac{1}{2}(b+\ell+1)}\right).$$

Thus,

$$E[c_i - \min\{c_{i-b}, \ldots, c_i\}] \le E\big[|c'_i - \min\{c'_{i-b}, \ldots, c'_i\}|\big] = \sum_{\ell=1}^{b} \ell \cdot \frac{1}{2^{b-1}} \cdot \left(\binom{b}{\frac{1}{2}(b+\ell)} + \binom{b}{\frac{1}{2}(b+\ell+1)}\right) \le 2 \cdot \sum_{\ell=1}^{b} \ell \cdot \frac{1}{2^{b-1}} \binom{b}{\frac{1}{2}(b+\ell)} = O(\sqrt{b}).$$
Since obviously c_i − min{c_{i−b}, …, c_i} ≤ t, we obtain E[c_i − min{c_{i−b}, …, c_i}] = O(min{√b, t}).
To show the lower bound we, once again, start by considering the symmetric unrestricted random walk and obtain

$$\Pr\!\left[c'_i - c'_{i-b} > \frac{\sqrt{b}}{4}\right] = \frac{1}{2} \cdot \Pr\!\left[|c'_i - c'_{i-b}| > \frac{\sqrt{b}}{4}\right] = \frac{1}{2} \cdot \left(1 - \Pr\!\left[|c'_i - c'_{i-b}| \le \frac{\sqrt{b}}{4}\right]\right) = \frac{1}{2} \cdot \left(1 - 2^{-b} \cdot \sum_{\ell=-\lceil\sqrt{b}/4\rceil}^{\lceil\sqrt{b}/4\rceil} \binom{b}{\frac{1}{2}(b+\ell)}\right) \ge \frac{1}{2} \cdot \left(1 - 2^{-b} \cdot \sum_{\ell=-\lceil\sqrt{b}/4\rceil}^{\lceil\sqrt{b}/4\rceil} \binom{b}{\lceil b/2 \rceil}\right) \ge \frac{1}{4}.$$

Since the step size of the random walk is one, c′_i − c′_{i−b} > √b/4 implies that there exists a 0 ≤ k ≤ b such that c′_i − c′_{i−k} = ⌈√b/4⌉. Therefore we can conclude for our reflecting random walk that

$$\Pr\!\left[c_i - \min\{c_{i-b}, \ldots, c_i\} \ge \frac{\sqrt{b}}{4} \;\Big|\; c_i = j\right] \ge \frac{1}{4}$$

for any j ≥ α + ⌈√b/4⌉. By exploiting this lower bound on the probability that there is a large difference between the current price c_i and the minimum price over the last b steps, min{c_{i−b}, …, c_i}, we get
$$E[c_i - \min\{c_{i-b}, \ldots, c_i\}] = \sum_{j=\alpha}^{\beta} \Pr[c_i = j] \cdot E[c_i - \min\{c_{i-b}, \ldots, c_i\} \mid c_i = j]$$
$$\ge \frac{1}{t} \sum_{j=\alpha+\lceil\sqrt{b}/4\rceil}^{\beta} E[c_i - \min\{c_{i-b}, \ldots, c_i\} \mid c_i = j] \ge \frac{\lceil\sqrt{b}/4\rceil}{t} \sum_{j=\alpha+\lceil\sqrt{b}/4\rceil}^{\beta} \Pr\!\left[c_i - \min\{c_{i-b}, \ldots, c_i\} \ge \frac{\sqrt{b}}{4} \;\Big|\; c_i = j\right]$$
$$\ge \frac{1}{t} \sum_{j=\alpha+\lceil\sqrt{b}/4\rceil}^{\beta} \frac{\sqrt{b}}{16} = \frac{\sqrt{b}}{16t} \cdot \left(t - \left\lceil\frac{\sqrt{b}}{4}\right\rceil\right).$$

Hence, for b ≤ t², E[c_i − min{c_{i−b}, …, c_i}] = Ω(√b). Since the amount c_i − min{c_{i−b}, …, c_i} is monotonic in b, we also have E[c_i − min{c_{i−b}, …, c_i}] = Ω(t) for b > t², which concludes the proof of the lower bound.
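The closed-form expression for the optimal offline cost at the start of this section translates directly into code. A sketch (Python; function name ours), with prices 0-indexed as c[0] = c_1, …, c[n] = c_{n+1}:

```python
def offline_cost(prices, b):
    """Optimal offline cost: every consumed unit is bought at the cheapest
    price within the b steps preceding its consumption, and profitable
    end-of-horizon purchases are valued at c_{n+1}."""
    c = prices
    n = len(c) - 1
    # first sum: unit consumed in step i is bought at the minimum over [i-b, i]
    total = sum(min(c[max(0, k - b):k + 1]) for k in range(n))
    # second sum: end effect for the last b steps (each term is non-positive)
    for k in range(max(0, n - b), n):
        total += min(c[k:n + 1]) - c[n]
    return total
```

For instance, with prices (3, 1, 2, 2, 1) and b = 2 the offline optimum pays 6 (stocking up at the price-1 step), whereas NoBuffer pays 3 + 1 + 2 + 2 = 8.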
6 Conclusions
We considered the Economical Caching problem with stochastic prices generated by a random walk with reflecting boundaries at α and β. We identified an optimal online algorithm for this setting and gave bounds on its expected cost and savings.
However, modeling the price development as a random walk might be a serious simplification of many real-world scenarios. Therefore this work should be seen as a first step towards an analysis of more realistic input models. In reality, prices may increase or decrease by more than one unit in each time step, and the probabilities for an increase and a decrease might differ and depend on the current price. Furthermore, the assumption that the upper and lower bounds α and β are known to the algorithm is unrealistic in most applications. In fact, in general, these strict bounds do not even exist.
References
[1] Borodin, A., El-Yaniv, R.: Online computation and competitive analysis. Cambridge University Press, New York (1998)
[2] Englert, M., Röglin, H., Spönemann, J., Vöcking, B.: Economical caching. In: Proceedings of the 26th Symposium on Theoretical Aspects of Computer Science (STACS), pp. 385–396 (2009)
[3] Feller, W.: An Introduction to Probability Theory and Its Applications, 3rd edn. Wiley, Chichester (1971)
[4] Sleator, D.D., Tarjan, R.E.: Amortized efficiency of list update and paging rules. Commun. ACM 28(2), 202–208 (1985)
Markov Modelling of Mitochondrial BAK Activation Kinetics during Apoptosis
C. Grills, D.A. Fennell, and S.F.C. Shearer (née O'Rourke)
Centre for Cancer Research and Cell Biology, Queen's University, Belfast, BT9 7AB, N. Ireland
[email protected]
Abstract. The molecular mechanism underlying mitochondrial BAK activation during apoptosis remains highly controversial. Two seemingly conflicting models, known as the agonism model and the de-repressor model, have been proposed. In the agonism model, BAK requires activator BH3-only proteins to initiate a series of events that results in cell apoptosis. In the de-repressor model, the antagonism of pro-survival BCL-2 family proteins alone is sufficient for BAK activation kinetics to promote apoptosis. To gain a better understanding of the kinetic implications of these models and reconcile these opposing, but highly evidence-based, theories, we have formulated Markov chain models which capture the molecular mechanisms underlying both the agonism and de-repressor models. Our results indicate that both pure agonism and dissociation are mutually exclusive mechanisms capable of initiating mitochondrial apoptosis by BAK activation.

Keywords: BAK activation kinetics; apoptosis; BH3-only proteins; BCL-2 family proteins; agonism model; de-repressor model.
1 Introduction
Apoptosis is a regulated process designed to rid the body of damaged or unwanted cells. This is critical for normal development and organism homeostasis [1]. Apoptosis is induced via two main routes involving either the mitochondria (the intrinsic pathway) or the activation of death receptors (the extrinsic pathway). In this paper we study the intrinsic pathway of apoptosis, since there is considerable debate in the literature about how the molecular mechanism underlying this pathway works. Understanding it is of vital importance to oncologists because apoptosis is a hallmark of cancer and a pivotal factor underlying tumour resistance to systemic anti-cancer therapy.
The mitochondrial apoptotic pathway is largely governed by BCL-2 family proteins, which include pro-apoptotic members that promote mitochondrial permeability and anti-apoptotic members that inhibit the effects of pro-apoptotic members or inhibit the mitochondrial release of cytochrome c. One of the most
Corresponding author.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 191–205, 2009. c Springer-Verlag Berlin Heidelberg 2009
C. Grills, D.A. Fennell, and S.F.C. Shearer
important members of the BCL-2 family that is crucial in the regulation of the mitochondrial pathway of apoptosis is BAK [2]. When BAK is oligomerised, this enables the release of mitochondrial factors to activate caspases which irreversibly execute the cell death program [3]. BAK is activated by a subclass of pro-apoptotic BCL-2 proteins which contain only a BH3 domain [2]. However, at present there exists considerable controversy as to whether BAK is activated directly or indirectly. Pure agonism models, which support direct BAK activation [4, 5], vs de-repressor models, which support indirect BAK activation, reflect contrasting thermodynamic representations of BAK regulation. In the agonism model, a subclass of activator BH3-only proteins, comprising BID, BIM and arguably PUMA, interact with BAX [5, 6, 7], resulting in a conformational change and oligomerization [7, 8]. This makes the case that this subclass of BH3-only proteins activate BAK directly. In the de-repressor model, the pro-survival proteins BCL-2, BCL-XL, MCL-1 and VDAC have a neutralizing effect on BAK [7, 9, 10]. BAK may then be activated by the small molecule BAD BH3 mimetic ABT737 or undergo auto-activation, in the absence of activator BH3s [7]. This has led to the hypothesis that direct activation is not essential for apoptosis, but that antagonism of pro-survival BCL-2 family proteins alone is sufficient [7].
In view of this apparent disagreement, we have adopted a mathematical approach to examine the probability of concentration-dependent effects of activator BH3-only proteins and de-repressor BH3-only proteins on the maximum rate of BAK activation. Previously we modelled these systems deterministically by obtaining non-linear rate equations using the Law of Mass Action and solving numerically [7]. However, the stochastic cellular environment motivates us to study this problem using a different mathematical framework.
A stochastic approach will allow us to describe the time evolution of the BAK activation kinetics by taking into account the fact that the molecules in the reacting system exhibit some degree of randomness in their dynamical behaviour. Ghosh et al. [11] have pointed out that although stochastic simulators may be used to solve a system of non-linear differential equations for reaction kinetics to include stochasticity in the model, they suffer from simulation stiffness and require heavy computational overheads. Bearing this in mind, our approach is to adopt a new discrete time Markov chain based model to simulate the behaviour of BAK activation kinetics in the mitochondria during apoptosis. This method has the advantage that it has both low computational and memory overheads. It has never been applied to BAK activation kinetics, but Markov chains have been used to model cellular signals and signalling pathways [12]. Using Markov chain analysis allows us to determine the quantitative contribution of each stage to the overall outcome of the reaction. It allows us to establish a correlation between statistical and physical models and extract accurate temporal behaviour as well as looking for convergence. Markov chains provide a concise way to represent information, and the results from our Markov models suggest that both the agonism
Mitochondrial BAK Activation Kinetics during Apoptosis
and dissociation models reflect valid mechanisms for BAK activation provided that strict constraints are applied.
The structure of our paper is as follows. In Section 2 we present a discrete Markov modelling approach to study the BAK activation kinetics of apoptosis. Each stage of the kinetic reaction represents a particular state. We develop three models in this section: (i) the agonist Markov model, (ii) the de-repressor Markov model and (iii) the combined agonist and de-repressor Markov model. In Section 3 we use these models to analyse the probabilities of reaching the stage at which apoptosis is irreversibly initiated. We conclude in Section 4 with a critical discussion of our model predictions and their biological interpretation.
2 Markov Chain Model of BAK Activation Kinetics
In a stochastic biochemical system the state of the system at any time is defined by the number of molecules of each type. The transition from one state to another is obtained from the probability of the reactions in the current state, and the resulting next state is the new molecular state. As the molecular reactions in a biological process occur due to the random collision of molecules, the state transition parameters are random and the state space is discrete. Stochastic models of reacting systems can be formulated as a Markov process [13]. We convert the mitochondrial BAK activation kinetics into discrete Markov chain models so that we obtain a domain of molecular states with corresponding state transition probabilities. Since each stage of the reaction equation is only dependent on the previous stage, the systems we consider satisfy the "memoryless" property of a Markov process.

2.1 Theoretical Calculations
The agonist Markov model, the de-repressor Markov model and the combined agonist and de-repressor Markov model that we consider may be represented by a set of states given by S = {s_1, s_2, s_3, …, s_r}, where s_i represents a stage in the reaction and the transition between each stage is referred to as a step. If the system is in a state s_i and moves to a state s_j, it does so with a probability p_ij. This probability does not depend on any states previous to the current one and obeys

$$P[s_n \mid s_0, \cdots, s_{n-1}] = P[s_n \mid s_{n-1}]. \quad (1)$$

The probabilities p_ij are referred to as the transition probabilities and are given by the Chapman-Kolmogorov equation [14] as

$$p_{ij}^{m+n} = \sum_{k=1}^{r} p_{ik}^{m}\, p_{kj}^{n}, \quad (2)$$

where m and n are integers and P denotes the transition matrix of the Markov chain. The ijth entry p_ij^n of the matrix P^n gives the probability that the Markov
chain starting in the state s_i will be in the state s_j after n steps. The matrix P is stochastic provided that the following restrictions hold:

$$p_{ij} \ge 0, \quad i, j \in S, \quad (3)$$

$$\sum_{i \in S} p_{ij} = 1 \quad \text{for all } j \in S. \quad (4)$$

One of the most important quantities that we are interested in from a Markov chain model are the limiting probabilities

$$\lim_{n \to \infty} p_{ij}^{n} = \lim_{n \to \infty} P[s_n = j \mid s_0 = i]. \quad (5)$$

We can look at these by examining the limit of P^n, since any finite stochastic matrix P has exactly one eigenvalue equal to 1 and the remaining eigenvalues are less than 1. Therefore P^n converges to a matrix where each row is an eigenvector of P corresponding to the eigenvalue 1. In this paper we deal only with stochastic transition matrices and thus we are able to calculate limiting probabilities for each system considered.
Most current approaches to simulate chemical kinetics involve mass action approximations, and the efficient stochastic methods that have been devised assume populations of indistinguishable molecules in each state [13]. Here we apply the Markov chain method to analyse the molecule distribution throughout the system; the probabilities we obtain make us aware of how many molecules are likely to be in a particular state after a given number of iterations.
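Equation (2) and the limiting behaviour of P^n are easy to check numerically. The sketch below (Python with exact fractions; the 2-state row-stochastic chain is an illustrative example of ours, not one of the models in this paper) verifies P^{m+n} = P^m P^n:

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two square matrices of Fractions."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """P^n; entry (i, j) is the n-step transition probability p_ij^n."""
    R = [[F(int(i == j)) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        R = mat_mul(R, P)
    return R

# Illustrative 2-state row-stochastic chain
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

# Chapman-Kolmogorov (eq. 2): P^(m+n) = P^m P^n
assert mat_pow(P, 5) == mat_mul(mat_pow(P, 2), mat_pow(P, 3))
```

For this chain, each row of P^n approaches the stationary vector (1/3, 2/3), illustrating the limiting probabilities of eq. (5).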
2.2 System 1: Agonist Driven BAK Activation
In the agonism model, activator BH3-only proteins (namely BID and BIM, labelled as b1) interact with a putative activation binding site analogous to BAX [5], leading to conformation change and eventually oligomerization [2]. These activators can be bound to the pro-survival BCL-2 family proteins, which prevents them forming a transient complex with BAK [6]. Under certain conditions described as "priming for death", dissociator BH3 proteins such as BAD and NOXA (labelled b2_1 and b2_2) can free the bound BID and occupy the offending anti-apoptotic proteins, leaving BID free to activate BAK [8]. The behaviour of the agonist model of BAK activation can be captured in essence in four stages and is represented schematically in Figure 1.

Fig. 1. The reaction scheme showing the interaction between the agonists and BAK:
B + A1.b1 + b2_1 + A2.b1 + b2_2 ⇌ A1.b2_1 + b1 + B + A2.b2_2 ⇌ A1.b2_1 + B.b1 + A2.b2_2 ⇌ A1.b2_1 + b1 + A2.b2_2 + B*

Fig. 2. The Markov chain representing the agonist model (states s_1 ⇌ s_2 ⇌ s_3 ⇌ s_4). The transition probabilities between each stage are labelled.

From this reaction scheme we see that
in the first stage of the reaction the anti-apoptotic proteins (A1 and A2) are initially held in complex with the agonists (b1), whilst BAK (B) and the dissociator BH3s (b2_1 and b2_2) are free molecules. In the second stage, the complex between the anti-apoptotic proteins and the agonists is broken up by the dissociator BH3s and BAK remains unbound. The complex formed between the anti-apoptotic proteins and the dissociator BH3s is unstable and can be broken up again. This occurs in the third stage of the reaction scheme above. Here the interaction of A1 and b2_1 corresponds to the restricted binding of BAD to mitochondrial BCL-2, BCL-XL or BCL-W, whereas the A2 interaction with b2_2 corresponds to MCL-1 or A1 interaction with NOXA. If the agonists remain free they can bind with BAK while the complex of the anti-apoptotic proteins and the dissociator BH3s remains the same. BAK and the agonists combine in a reversible complex; thus they may separate into their original components. If they stay in union, the complex goes through a conformation change and BAK opens (B*), as shown in stage 4 of the reaction. It is assumed the recycling of b1 follows the dissociation of B.b1, allowing it to be reused in subsequent reactions.
In Figure 2, s_1, s_2, s_3, and s_4 correspond to the initial, second, third and final stages of the reaction in Figure 1. Initially we assume forward and backward reactions are equally probable, since the control exerted by pro- or anti-apoptotic BCL-2 members is equal in an unprimed system. Dominance is only possible if there is a greater amount of pro- or anti-apoptotic factors [15]. Assigning probabilities to the movement between states is influenced by the concentration of the individual species; any variation in this is accounted for through a probability increase or decrease. The choice of parameter values was governed by forward and backward reactions being equally likely, and these values were recommended by clinicians working in this field [16].
Figure 3 illustrates the probabilities of the system being in the s_4 state when the system begins in state s_1. We observe that the probabilities are larger, with more variance, at the beginning of the reaction because the system is most active at this stage, but as energy is depleted and various BCL-2 family members get locked in complexes, the system tends to a limit. The probability of starting in state s_1 and ending in the desired s_4 state (the B* state, where BAK can oligomerise) is p_14^n = 0.1818. When the anti-apoptotic proteins are hindered, it becomes more likely that the reaction will drive forward. It is only from the s_3 state that the B* state can be reached, so if it is more likely to reach s_3, it follows that it is more likely the s_4 state will be achieved. Adjusting our transition matrix to reflect these new probabilities and finding its limit as n → ∞, we discover that the probability associated with reaching the B* state (s_4) when the system has reached completion and has stabilised is 0.24. The same effect is produced when the influence of the agonists is made greater, which would correspond to priming the system with more b1. From Figure 3 we can see that altering the system to make cell death more likely not only raises the probability of BAK opening, it also takes fewer iterations for this higher limit to be reached, implying a more efficient system.
Fig. 3. This shows the probability of reaching the B* state with respect to the number of iterations
2.3 System 2: Derepressor Model of BAK Activation
It has been shown that BAK which is neutralized by BCL-2, BCL-XL, MCL-1 or VDAC2 can be activated by the small molecule BAD BH3 mimetic ABT737 [17]. This has resulted in the conclusion that direct activator BH3 dependent agonism is not essential for BAK activation but that the antagonism of pro-survival BCL-2 family proteins is sufficient [17] and BAK auto-activation may drive this reaction forwards [18]. This corresponds to the double BID/BIM knockout with PUMA knockdown model system reported by Willis et al. [17]. Schematically the reaction for the de-repressor model is represented in Figure 4. Initially in
Fig. 4. The reaction scheme when agonists are omitted: B.A1 + b2_1 + B.A2 + b2_2 ⇌ A1.b2_1 + B + A2.b2_2 → A1.b2_1 + B* + A2.b2_2

Fig. 5. Each state of the reaction is represented by a state in the Markov chain (s1 → s2 with probability 1; s2 → s1 and s2 → s3 with probability 0.5 each); the transition probabilities are labelled
the de-repressor model BAK is held in complex with the anti-apoptotic proteins (B.A1 and B.A2), but the dissociator BH3s displace BAK from this union and combine with the anti-apoptotic proteins themselves (A1.b2_1 and A2.b2_2). These new complexes are not stable and the anti-apoptotic proteins may free themselves and bind with BAK again; however, if that does not happen, the free BAK will open (B*) and the anti-apoptotic proteins will remain bound to the dissociator proteins. Again, each stage of this reaction is only dependent on
Mitochondrial BAK Activation Kinetics during Apoptosis
197
Fig. 6. The probability of reaching the B* state
the previous stage, and the jumps between them are instantaneous, so it can be modelled as a discrete Markov chain. Diagrammatically these stages and transitions are represented in Figure 5, where s1, s2 and s3 correspond to the first, intermediate and final stages of the de-repressor model described by Figure 4, respectively. The stochastic transition matrix representing our de-repressor model incorporates the equality displayed by pro- and anti-apoptotic family members. This matrix tends to a limit with increasing iterations. s3 is an absorbing state: once reached, the system remains there, so the probability of starting in state s1 and ending in state s3 (the B* state) is p^n_{13} = 1 as n → ∞. In Figure 6 we show the probabilities of being in the s3 state after beginning in the s1 state against the number of iterations required; since the limiting probability is always 1, the number of iterations provides the most relevant information. The B* state is an absorbing state, and the lack of reliance on an agonist makes the system more self-sufficient. Organising our initial probability transition matrix into its canonical form we get

P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0.5 & 0 & 0.5 \\ 0 & 0 & 1 \end{pmatrix}.  (6)

The matrix I − Q has an inverse N, known as the fundamental matrix for P. The ij-th entry n_{ij} of the matrix N is the expected number of times the chain is in the transient state s_j given that it starts in the transient state s_i. Calculating N we get

N = (I − Q)^{−1} = \begin{pmatrix} 2 & 2 \\ 1 & 2 \end{pmatrix}.  (7)
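The absorbing-chain bookkeeping above is easy to check numerically. A minimal sketch with numpy (the matrix entries are exactly those of Eq. (6); N·R and the row sums of N are the standard absorption probabilities and expected step counts used in Eq. (8) below):

```python
import numpy as np

# Transition matrix of the de-repressor chain in canonical form (Eq. (6)):
# transient states s1, s2 first, absorbing state s3 last.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])

Q = P[:2, :2]                      # transitions among transient states
R = P[:2, 2:]                      # transient -> absorbing transitions
N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix (Eq. (7))

print(N)              # ≈ [[2, 2], [1, 2]]
print(N @ R)          # absorption probabilities: both rows ≈ 1
print(N.sum(axis=1))  # expected steps before absorption: ≈ [4, 3]
```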
198
C. Grills, D.A. Fennell, and S.F.C. Shearer
We know our reaction begins at s1; therefore, beginning at state 1, the expected number of times the system is in states 1 and 2 before being absorbed is n_{11} = 2 and n_{12} = 2. To analyse the number of steps taken before the chain is absorbed we let t_i be the expected number of steps taken before absorption given that the chain starts in s_i, and let t be the column vector whose i-th entry is t_i. Then t = Nc, where c is a column vector all of whose entries are one. Adding all entries in the i-th row of N gives the expected number of times in any transient state starting from s_i; thus t_i is the sum of the entries of the i-th row of N. So in our case

t = Nc = \begin{pmatrix} 2 & 2 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 3 \end{pmatrix}.  (8)

Thus the expected number of steps taken starting at s1 before the chain is absorbed is 4. Reducing the effects of the anti-apoptotic proteins in this system increases the likelihood of BAK being released from its initial complex, which in turn leads to a decline in the number of iterations required for the reaction to reach and stay in the B* state. We altered the transition matrix accordingly and the iterations required to reach the B* state were halved. The expected number of times in states 1 and 2 fell, with n_{11} and n_{12} reducing from 2 to 4/3, and the expected number of steps taken also lessened, with t_1 reducing from 4 to 8/3.

2.4 System 3: The Unified Model
There currently exists considerable controversy as to how BAK activation occurs, hence the two conflicting ideas we presented. Physical evidence, when present, cannot be dismissed, especially since data supporting both ideas are frequently published. The inconsistencies between the two models indicate that they are missing vital details, so in a bid to reconcile them we adopted a new approach and combined the need for activator BH3s and BAK auto-activation into one novel system, thus taking into account all proven concepts. We also include the idea that once in the open state BAK can still be bound by an anti-apoptotic protein [18], so that BH3s are required to free it to allow it to oligomerise; this accounts for how dissociator BH3s directly contribute to the final open BAK concentration. Schematically the reaction produced by combining the agonist and de-repressor models can be seen in Figure 7. When this reaction begins, BAK (B) and the dissociator BH3s (b2_1 and b2_2) are free but the anti-apoptotic proteins are holding the agonists in complex

Fig. 7. The reaction scheme when agonists are present, auto-activation is included and B* binding is possible: B + A1.b1 + b2_1 + A2.b1 + b2_2 ⇌ A1.b2_1 + b1 + B + A2.b2_2 → A1.b2_1 + B + B.b1 + A2.b2_2 → A1.b2_1 + b1 + B* + A2.b2_2 ⇌ B*.A1 + b2_1 + b1 + b2_2 + B*.A2
(A1.b1 and A2.b1). The dissociator BH3s displace the agonists to form a complex with the anti-apoptotic proteins (A1.b2_1 and A2.b2_2); in doing so, both the agonists and BAK are freed from any unions. The complex between the anti-apoptotic proteins and the dissociator BH3s is not stable, and the anti-apoptotic proteins may break free and rebind with the agonists, returning to stage 1 of the reaction. If that does not happen, the reaction moves forward to stage 3, where the agonists and BAK form a complex (B.b1) while the dissociator BH3s and the anti-apoptotic proteins remain in union. Stage 4 of the reaction shows that the union between BAK and the agonist causes BAK to open (B*), and any remaining agonist (b1) is released back into the system. In the final stage of the reaction, open BAK binding is also incorporated into this model, so B* may be bound by the anti-apoptotic proteins that have broken free from their union with the dissociator proteins, to form B*.A1 and B*.A2. Stage 5 of the reaction also shows that this may be reversed if the dissociator BH3s displace B* from the complex with the anti-apoptotic proteins to return to A1.b2_1 and A2.b2_2, leaving B* free to begin the apoptotic process.

Fig. 8. Markov chain representing the reaction scheme for the unified model (states s1-s5; the transition probabilities, 0.5 and 1, are labelled on the arrows)
Similarly to the agonist and de-repressor models, this unified model also has the memoryless property required for it to be modelled as a Markov chain; the chain for the unified model is shown in Figure 8. We formed the transition matrix for the Markov chain representing the unified model and initially solved the system for arbitrary equal values, because they reflect the equality of the binding strength displayed by both pro- and anti-apoptotic BCL-2 family members; we later alter these values to represent different binding scenarios. The graph of the probabilities of being in the B* state, dependent on the number of iterations the system has gone through, is shown in Figure 9. Again the matrix is stochastic, so it tends to a limit as n → ∞. We know our system begins in the s1 state and the state we want it to reach is s4, so we concentrate on the probabilities associated with these. The probabilities for this system rise steeply at the beginning of the reaction, when energy levels are high, but plateau as the reaction reaches completion. As n → ∞ the probability of reaching the state s4 (the B* state) is p^n_{14} = 0.5, with the remainder of the B* that has been created still being bound by anti-apoptotic proteins. To increase the efficiency of this system we reduce the effects of the anti-apoptotic proteins, thus freeing up activator BH3s, as well as decreasing the likelihood of B* being bound. Changing our transition probabilities to mirror this, we saw that the final probability of ending up in state s4 after starting in state s1 rose to 0.75, with a 0.25 chance of being in state s5.
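The limiting probabilities quoted throughout are obtained from powers of the stochastic transition matrix. A generic sketch of that computation follows; since the numerical entries of the 5-state unified matrix are not reproduced in the text, it is demonstrated on the 3-state de-repressor matrix of Eq. (6) purely as an illustration:

```python
import numpy as np

def limiting_probability(P, start, target, tol=1e-9, max_iter=10_000):
    """Iterate the start-state row of P^n until the distribution settles.
    Returns the limiting target-state probability and iterations used."""
    p = np.zeros(P.shape[0])
    p[start] = 1.0
    for n in range(1, max_iter + 1):
        p_next = p @ P
        if np.max(np.abs(p_next - p)) < tol:   # whole distribution converged
            return p_next[target], n
        p = p_next
    raise RuntimeError("no convergence")

# Illustration on the de-repressor chain (Eq. (6)); s3 is absorbing,
# so p_{13}^n -> 1 as n -> infinity.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
prob, iters = limiting_probability(P, start=0, target=2)
print(round(prob, 6))   # -> 1.0
```

The iteration count returned alongside the probability is the quantity the text uses to compare the efficiency of the altered systems.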
Fig. 9. The probability of being in the B* state when the system begins at s1
We also increased the potency of b1, so that the driving force of the reaction was made stronger. Doing this resulted in a probability of 0.5 of being in the B* state, the same likelihood as for the unaltered system; however, the number of iterations required to achieve this value was only a third of what it had been, so the system is more efficient when primed with greater levels of the agonist.
3 Results and Discussion
In a previous paper [7] we modelled these reactions deterministically. We found that the cellular dynamics displayed by the BCL-2 family exhibit a duality of stochastic and deterministic features, and the results from the two modelling approaches complement each other [19]: our previous findings were concentration based, whereas here we study molecule distribution at a localised level through the use of Markov chains.

3.1 Agonist Only BAK Activation
Looking at the agonist BAK activation model involving a bimolecular reaction between b1 and B to yield B*, as seen in system 1, we calculated that the probability of the model moving forward with each step, as desired, and reaching the B* state after three steps is 0.25. However, this value is not maintained, because some b1 is reused by the system, which thus moves to the s1 state again. The probability of being in the B* state fluctuates between high and low values, especially at the beginning of the reaction, but as the number of iterations increases the range of probabilities narrows because energy in the system dwindles. The lower values are caused by the backward reactions making the B* state less likely to be reached. The probability of B* being created and the reaction being in the s4 state when the reaction reaches a steady state is 0.1818.
3.2 Effect of Prosurvival Proteins and Agonists on B* Formation
Decreasing the effects of the pro-survival proteins (A1 and A2) makes it more likely that the system will drive forward, and as the reaction reaches completion the probability of being in the B* state rises to 0.24; thus for every 100 molecules of BAK, 24 reach the B* state, ultimately leading to death. We observe a 32% increase in the number of molecules that reach the B* state. The number of iterations required to complete the reaction falls dramatically, by 41.3%, meaning we have a more efficient system and faster reaction times due to this adjustment. Limiting the effects of the anti-apoptotic proteins would be expected to have a positive effect, since BAK is neutralized by BCL-2, BCL-XL, MCL-1 or VDAC2. Even if the impact of A1 or A2 is great, it does not prevent b1-driven B* formation if b1 is free, but it does sequester a lot of b1, and production of B* will switch off if all b1 is monopolised. This is consistent with the rheostat model originally described by Korsmeyer and colleagues [9]. Increasing the initial levels of agonists present involves priming the cell with more b1, and has the same impact as reducing the effects of the anti-apoptotic proteins, with the probability of being in the B* state when the reaction has reached completion also being 0.24. This increase in B* production is in agreement with experimental results which show activator BH3s binding with BAK, after which a conformation change and oligomerisation can take place [5].

3.3 BAK Activation in the Absence of an Agonist
In opposition to those supporting activator BH3-driven BAK activation, genetic evidence has recently been published showing an activation model for BAK which does not require the presence of activator BH3s (BID, BIM or PUMA); this system is shown in Figure 4. It suggests that overcoming an activation energy for the conformation change of BAK is not required; rather, activation involves de-repression and spontaneous transit to a lower energy state. This may be made possible by the small molecule BAD BH3 mimetic ABT737 [17], and so has led to the hypothesis that activator BH3s are not essential. Since there is no b1 to be considered or reused, once the system reaches the B* state it remains there with no further movement, so the limiting probability for this system is 1. Interestingly, decreasing the effects of the anti-apoptotic proteins, A1 and A2, sees the number of steps required to reach the B* state fall by 46.7%, as the system behaves more efficiently. This corresponds to the double BID/BIM knockout model presented in [17].

3.4 Combined Agonist and Derepressor Model
The conflicting agonist and de-repressor models have both been validated with experimental evidence, so neither can be dismissed; the conflicting ideas they present need to be reconciled, since both are equally possible. Bearing this in mind, and aware that adenoviral E1B 19K and BCL-2 have been reported to interact with BAK in its open conformation [18], system 3
was set up as a combination of the ideas of both systems 1 and 2, as well as incorporating B* binding to A1 and A2. Again the dissociator BH3s are required to overcome A1 and A2, so freeing up both B and B* in this system. Once B is freed it can either form a complex with b1, thus creating B*, or direct activation from B to B* is also permitted. We found that the probability of reaching the B* state as the reaction reached an end was 0.5. Priming this system with additional agonists saw the probability of being in the B* state remain at 0.5, but the number of iterations required before a steady state was reached was reduced by a factor of 3. Incorporating this instantaneous concentration jump is consistent with pharmacological exposure in a cell-free system and is relevant to modelling the action of small-molecule BH3 peptidomimetics [8]. The probability was increased to 0.75, a 50% rise, when the effect of A1 and A2 was inhibited by b2_1 and b2_2, because this enhances b1 efficacy. In addition to this probability increase, the number of steps taken to reach completion fell by 37.1%, showing that A1 and A2 are two important variables when it comes to the control of B* production. Combining the freeing effect displayed by the dissociator BH3s and the driving influence of the agonists demonstrates synergy, and our results reflect experimental evidence [2]. To date there have been no attempts to apply Markov modelling to BAK activation. In silico deterministic modelling has been reported in relation to BAX; however, these studies were generally limited to considerable simplification of the interactions between one anti-apoptotic protein, an activator BH3 and BAX, and have attempted to incorporate both mitochondrial translocation of BAX/activator BH3 and outer mitochondrial permeabilization events, both of which are poorly defined at the molecular level [10].
We demonstrate that the within-membrane interactions of mitochondrial BCL-2 proteins and BH3 domains, although complex, are amenable to this mathematical modelling approach. A considerable body of evidence has amassed implicating activator BH3 domains as being capable of triggering multidomain BCL-2 family protein activation [2]. The ability of activator BH3s to drive BAK conformation change and oligomerization strongly suggests that as yet unidentified trigger residues also reside in BAK. Because mitochondria can be isolated and respond to an activator BH3 peptide domain in a manner observed in vivo [2], our model incorporates a rapid (near instantaneous) concentration jump of activator BH3, leading to initiation of the reaction producing B*. This is consistent with pharmacological exposure in a cell-free system, and is relevant to modelling the action of small-molecule BH3 peptidomimetics [8]. In our bid to reconcile the conflicting ideas presented by systems 1 and 2, we used Markov analysis to gain further insights into these experimentally observed phenomena. Since B* binding is observed for adenoviral E1B 19K and BCL-2, we included both agonist-driven activation and the ability of dissociator BH3s to free B* from any unions with anti-apoptotic proteins, to explain how both the activating and dissociating BH3s help control free B* levels. A central paradox in the de-repressor model is how B* forms when no activator BH3s are present (i.e. in genetic knockouts). This may be explained in at least two ways. Firstly,
unknown death signals may be capable of driving BAK conformation change in the absence of the known activator BH3s BID, BIM and possibly PUMA. Secondly, conversion of B to B* may happen spontaneously, albeit with much lower probability compared with activator BH3-driven conversion. In the absence of activator BH3s, pro-survival BCL-2 proteins would therefore provide an efficient safety mechanism for blocking BAK activation after it has undergone its conformation change. When B* is released from its complex with either A1 and/or A2 by b2_1 and/or b2_2, it can auto-activate other B molecules to form oligomers. Furthermore, b1, which is also freed from A1 and/or A2, is then capable of causing ballistic conversion of B to B* in the outer mitochondrial membrane. What is clear from experimental studies is that the pharmacological characteristics of BH3 domains differ significantly: BAD and NOXA, although able to sensitize, do not mimic the activator BH3 BID [2]. Our Markov model reconciles and supports a general model in which the interplay between BAK and multiple pro-survival proteins determines susceptibility to dissociator BH3 domains (and therefore mimetics such as ABT737 and obatoclax), in both the presence and absence of an activator BH3. Markov chain analysis supports the view that the interplay between BAK, the anti-apoptotic proteins and the dissociators determines whether the cell lives or dies in the presence or absence of activators. This approach has implications for a better understanding of the complex molecular mechanisms that underlie BCL-2 family interactions and of mitochondrial priming for death, as well as giving information on the factors controlling sensitivity or resistance to BH3 peptidomimetics now entering the clinic.
4 Conclusions
Mathematical modelling is becoming increasingly important in biomedical applications, since it allows hypotheses to be tested and weak points to be identified [20]. We developed a mathematical framework for the mitochondrial apoptosis regulatory network, viewing the reaction as a Markov chain with each stage of the reaction representing a particular Markov state. We modelled conflicting ideas involving apoptosis and reconciled these ideas into a unified model. We used the models to examine the response of the systems to changes in the probabilities of moving from one state to another, which allows us to see what initial conditions make ending in the B* state most probable. We explored the likelihood of the system driving forward and the open form of BAK (B*) being obtained; at this state cytochrome c is released and apoptosis begins. We discovered that there was agreement between the results produced by our Markov models and the deterministic solutions [7]. Agreement between stochastic and deterministic results is not uncommon in the literature [21, 22]: despite the fact that they provide different information, the same conclusions can still be drawn. With Markov models we concentrate on the probability of being in a particular state at a particular time, whilst the deterministic model gave concentration-dependent results; both, however, gave us the opportunity to optimize the systems, since if
it is more likely that the B* state will be reached, then it follows that the concentration produced will also be greater. We found that priming with additional agonist (b1), hindering the effects of the anti-apoptotic proteins (A1 and A2) and increasing the effects of the de-repressor BH3s all had positive effects on the likelihood that BAK would be converted to its open form. In the deterministic case these changes all resulted in a higher B* concentration. The concurrence between the models and their agreement with experimental evidence indicates that they are accurate representations of the underlying biology. Therapeutically, the models can be used to inform clinicians of the likely short- and long-term effects of the drugs being administered, since we can look at the reaction stage by stage. We can also optimize their efficiency, now that we are aware of the positive effects of particular family members, and clinicians are aware of which priming and knockout techniques they need to perfect.
Acknowledgements. One of us (S.F.C. Shearer) acknowledges financial support from the Leverhulme Trust (Grant No. F/00203/K) and the MRC (Grant No. G0701852).
References

[1] Adams, J.M., Cory, S.: The Bcl-2 protein family: arbiters of cell survival. Science 281, 1322–1326 (1998)
[2] Letai, A., et al.: Pharmacological manipulation of Bcl-2 family members to control cell death. J. Clin. Invest. 115, 2648–2655 (2005)
[3] Green, D.R.: Apoptotic pathways: ten minutes to dead. Cell 121, 671–674 (2005)
[4] Scorrano, L., Korsmeyer, S.J.: Mechanisms of cytochrome c release by proapoptotic BCL-2 family members. Biochemical and Biophysical Research Communications 304, 437–444 (2003)
[5] Walensky, L.D., Pitter, K., Morash, J., Oh, K.J., Barbuto, S., et al.: A stapled BID BH3 helix directly binds and activates BAX. Mol. Cell 24(2), 199–210 (2006)
[6] Cheng, E.H., et al.: BCL-2, BCL-XL sequester BH3 domain-only molecules preventing BAX- and BAK-mediated mitochondrial apoptosis. Mol. Cell 8, 705–711 (2001)
[7] Grills, C., Crawford, N., Chacko, A., Johnston, P.G., O'Rourke, S.F.C., Fennell, D.A.: Dynamical systems analysis of mitochondrial BAK activation kinetics predicts resistance to BH3 domains. PLoS ONE 3(8), 30–38 (2008)
[8] Oltersdorf, T., Elmore, S.W., Shoemaker, A.R., Armstrong, R.C., Augeri, D.J., et al.: An inhibitor of Bcl-2 family proteins induces regression of solid tumours. Nature 435(7042), 677–681 (2005)
[9] Walensky, L.D., Kung, A.L., Escher, I., Malia, T.J., Barbuto, S., et al.: Activation of apoptosis in vivo by a hydrocarbon-stapled BH3 helix. Science 305(5689), 1466–1470 (2004)
[10] Chen, C., Cui, J., Lu, H., Wang, R., Zhang, S., et al.: Modeling of the role of a BAX-activation switch in the mitochondrial apoptosis decision. Biophys. J. 92(12), 4304–4315 (2007)
[11] Ghosh, S., Ghosh, P., Basu, K., Das, S., et al.: Modeling the stochastic dynamics of gene expression in single cells: a birth and death Markov chain analysis. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, Los Alamitos (2007)
[12] Asavathiratham, C., Roy, S., Lesieutre, B., Verghese, G.: The influence model. IEEE Control Systems Magazine 21(6), 52–64 (2001)
[13] Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 (1977)
[14] Stirzaker, D.: Stochastic Processes and Models. Oxford University Press, Oxford (2005)
[15] Ruddon, R.: Cancer Biology, 3rd edn. Oxford University Press, Oxford (1995)
[16] Centre for Cancer Research and Cell Biology, http://www.qub.ac.uk/research-centres/CentreforCancerResearchCellBiology
[17] Willis, S.N., Fletcher, J.I., Kaufmann, T., van Delft, M.F., Chen, L., et al.: Apoptosis initiated when BH3 ligands engage multiple Bcl-2 homologs, not Bax or Bak. Science 315(5813), 856–859 (2007)
[18] Ruffolo, S.C., Shore, G.C.: BCL-2 selectively interacts with the BID-induced open conformer of BAK, inhibiting BAK auto-oligomerization. J. Biol. Chem. 278(27), 25039–25045 (2003)
[19] Falcke, M.: Deterministic and stochastic models of intracellular Ca2+ waves. New Journal of Physics 5, 96.1–96.28 (2003)
[20] Kitano, H.: Computational systems biology. Nature 420, 206–210 (2002)
[21] Gonze, D., Halloy, J., Goldbeter, A.: Deterministic versus stochastic models for circadian rhythms. J. Biol. Phys. 28, 637–653 (2002)
[22] Srivastava, R., You, L., Summers, J., Yin, J.: Stochastic versus deterministic modeling of intracellular viral kinetics. J. Theor. Biol. 218, 309–321 (2002)
Stochastic Dynamics of Logistic Tumor Growth

S.F.C. Shearer (née O'Rourke), S. Sahoo, and A. Sahoo

Centre for Cancer Research and Cell Biology, Queen's University, Belfast, BT9 7AB, N. Ireland
[email protected]
Abstract. We investigate the effect of Gaussian white noises on the logistic growth of tumors. The model consists of the logistic growth dynamics under the influence of two Gaussian white noises, one multiplicative and the other additive; the two noises are correlated with each other. We study diverse aspects of the probability distribution of the size of the growing tumor, both in the transient and in the steady-state regime. The simulation is based on the solution of the time-dependent Fokker-Planck equation associated with the stochastic dynamics, using B-spline functions.

Keywords: Noise; fluctuations; avascular tumor growth; dynamical probability.
1 Introduction
In recent years, much attention has been directed towards the application of nonlinear physics to uncover biological complexities. Studies have confirmed the role of noise in nonlinear stochastic systems [1]. In this context, the Fokker-Planck equation (FPE) has become one of the important approaches in studies of nonlinear dynamics in stochastic form [2], especially in the study of the effect of environmental fluctuations on a dynamical system. Tumor growth is a complex process, and it has been a challenge for many years to find a suitable growth law for tumors. Mathematical models based on simple mathematical equations, such as the deterministic exponential, logistic and Gompertzian equations, are used as a basic tool for describing tumor growth [3, 4]. In a recent work, Ferreira et al. [4] analysed the effect of distinct chemotherapeutic strategies on the growth of tumors; these strategies confirmed that an environment such as chemotherapy vitally affects the growth of cancer. It may also be noted that the presence of noise in biological systems gives rise to a rich variety of dynamical effects. Random fluctuations may be regarded not only as a mere source of disorder but also as a factor which introduces positive and organising, rather than disruptive, changes in a system's dynamics. Notable examples of noise-induced effects are stochastic resonances in biological systems. The effect of cell-mediated immune response against cancer [5] may be a specific illustration of the coupling between noise and a biological system.
Corresponding author.
O. Watanabe and T. Zeugmann (Eds.): SAGA 2009, LNCS 5792, pp. 206–220, 2009. © Springer-Verlag Berlin Heidelberg 2009
The simplest model describing regulated single-species population growth is the logistic equation; it provides the basic platform from which more detailed models can be formulated. Ai et al. [6, 7] and Behera and O'Rourke [8, 9] applied the logistic model of tumor growth to analyse the growth of tumors driven by correlated noises. However, these works focused exclusively on analysing the steady-state behavior of tumor growth, yet there are many biological systems in which the interesting dynamics take place before the system has had time to reach the steady state. For example, the multi-peak structure of the probability density function generated by cell fission as time evolves was predicted by Lo [10] in a study based on the stochastic Gompertz model of tumor cell growth; in that work Lo [10] employed a functional Fokker-Planck equation for which an analytical solution exists. Studies examining the behavior of the growth of the tumor before it reaches the steady state under the influence of correlated noise are still sparse in the literature. In the present work, we address this problem by studying the non-steady-state behavior of the tumor under the influence of correlated additive and multiplicative noise. Our calculation is based on the solution of the time-dependent Fokker-Planck equation. It is worth mentioning that obtaining the time-dependent solution of the FPE for a specific system is a much more complicated problem than the steady-state analogue, as it is necessary to solve an eigenvalue problem; this often requires numerical implementation. The numerical solution of the FPE, and in particular of the nonlinear form of this equation, is still a challenging problem. In the present article, we use an eigenfunction expansion method to solve the time-dependent FPE, where the eigenfunctions are constructed using piece-wise polynomial functions called B-splines. Solving the FPE using the B-spline method is a new development in the literature.
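As background for the B-spline expansion, the basis functions can be generated by the Cox-de Boor recursion. The following pure-Python sketch (the clamped cubic knot vector on [0, 1] is an illustrative choice, not the one used in this paper) verifies the partition-of-unity property that makes the basis convenient for expansion methods:

```python
def bspline(i, k, t, x):
    """Cox-de Boor recursion for the i-th B-spline of degree k on knots t."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + k] != t[i]:       # guard against 0/0 at repeated knots
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline(i, k - 1, t, x)
    if t[i + k + 1] != t[i + 1]:
        right = (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * bspline(i + 1, k - 1, t, x)
    return left + right

# Clamped cubic knot vector on [0, 1] (illustrative choice).
k = 3
t = [0.0] * (k + 1) + [0.25, 0.5, 0.75] + [1.0] * (k + 1)
n_basis = len(t) - k - 1    # 7 basis functions here

# B-splines on a clamped knot vector sum to 1 everywhere on [0, 1).
for x in (0.0, 0.1, 0.4, 0.73, 0.99):
    total = sum(bspline(i, k, t, x) for i in range(n_basis))
    print(f"{x}: {total:.6f}")   # each line prints 1.000000
```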
The B-spline approximation is numerically a very powerful technique and has been widely applied in atomic physics to calculations of the electronic structure of atoms, ions and plasmas [11, 12]. In Section 2, we present the theoretical formulation of our model; we also introduce the B-spline approximation and discuss the procedure for the eigenvalue solution of the FPE. In Section 3, we present and discuss our results. We summarise the results in Section 4.
2 Theoretical Formulation

2.1 The Model
The logistic avascular tumor growth mechanism may be described by the differential equation

dx/dt = ax − bx^2,  (1)

with the initial condition x(0) = x_0. Within the context of tumor growth, x(t) represents the density of cancer cells at time t, x(0) is the density at initial
time, which is identified with the instant when the illness has been diagnosed. The intrinsic growth rate of the tumor, denoted by a, is a parameter related to the initial mitosis rate; b is the growth deceleration factor. These regulation parameters characterize the evolution of different types of tumors. It may be noted that Eq. 1 is based on the modelling assumptions that (i) the tumor contains only one cell type, (ii) it is spatially independent, (iii) nutrients, growth factors and host vasculature are not explicitly represented, and (iv) tumor volume is proportional to x(t). We extend Eq. 1 to consider stochastic effects due to external factors such as temperature, drugs or radiotherapy by introducing (i) a Gaussian multiplicative noise to represent the effect of the treatment in altering the tumor cell dynamics, and (ii) a negative additive Gaussian noise which may represent fluctuations due to the treatment resulting in cell death. Here we have implicitly assumed that the multiplicative and additive noises are correlated, since they have a common origin. As a result, we obtain the Langevin-type stochastic differential equation corresponding to the deterministic logistic growth law (Eq. 1) driven by correlated Gaussian white noise,

dx/dt = ax − bx^2 + x ε(t) − Γ(t),  (2)

where ε(t) and Γ(t) are Gaussian white noises with the following properties:

⟨Γ(t)⟩ = ⟨ε(t)⟩ = 0,  (3a)

⟨ε(t)ε(t′)⟩ = 2D δ(t − t′),  (3b)

⟨Γ(t)Γ(t′)⟩ = 2α δ(t − t′),  (3c)

⟨ε(t)Γ(t′)⟩ = ⟨Γ(t)ε(t′)⟩ = 2λ √(Dα) δ(t − t′).  (3d)
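In a discrete-time simulation, white-noise increments with the statistics (3a)-(3d) can be built from two independent standard normals mixed so that their correlation equals λ. A sketch (the sample size and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, alpha, lam, dt = 0.5, 0.3, 0.6, 1e-3
n = 200_000

# Independent standard normals, mixed so that corr(eta1, eta2) = lam.
xi1, xi2 = rng.standard_normal((2, n))
eta1 = xi1
eta2 = lam * xi1 + np.sqrt(1 - lam**2) * xi2

# White-noise increments with variances 2*D*dt and 2*alpha*dt (Eqs. 3b-3d).
dW_eps = np.sqrt(2 * D * dt) * eta1
dW_Gamma = np.sqrt(2 * alpha * dt) * eta2

# Sample correlation, close to lam = 0.6.
print(float(np.corrcoef(dW_eps, dW_Gamma)[0, 1]))
```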
Here α and D are the strengths of the additive and multiplicative noise respectively, and λ denotes the degree of correlation between ε(t) and Γ(t), with −1 < λ ≤ 1. Following the procedure outlined in [13], we may convert the Langevin Eq. 2 to a Fokker-Planck equation (FPE),

∂p(x, t)/∂t = −∂/∂x [A(x)p(x, t)] + ∂²/∂x² [B(x)p(x, t)],  (4)

defined for x ∈ [x_min, x_max] and t ≥ t_0. Here x_min and x_max are arbitrary choices of integration domain. The drift and diffusion coefficients, denoted by A(x) and B(x) respectively, are given by

A(x) = ax − bx^2 + Dx − λ√(Dα)  (5)

and

B(x) = Dx^2 − 2λ√(Dα) x + α.  (6)
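Eq. 2 can also be integrated pathwise by an Euler-Maruyama scheme; this is only an illustrative sketch (the paper itself works with the FPE rather than with sample paths). With D = α = 0 the scheme must reproduce the closed-form logistic solution x(t) = a x_0 e^{at} / (a + b x_0 (e^{at} − 1)), which provides a convenient check:

```python
import numpy as np

def euler_maruyama(x0, a, b, D, alpha, lam, T, dt, seed=0):
    """Integrate dx = (a x - b x^2) dt + x dW_eps - dW_Gamma (Eq. 2)."""
    rng = np.random.default_rng(seed)
    steps = round(T / dt)
    x = np.empty(steps + 1)
    x[0] = x0
    for i in range(steps):
        xi1, xi2 = rng.standard_normal(2)
        eta2 = lam * xi1 + np.sqrt(1 - lam**2) * xi2     # correlation (3d)
        dW_eps = np.sqrt(2 * D * dt) * xi1
        dW_Gam = np.sqrt(2 * alpha * dt) * eta2
        x[i + 1] = x[i] + (a * x[i] - b * x[i]**2) * dt + x[i] * dW_eps - dW_Gam
    return x

a, b, x0, T, dt = 1.0, 0.5, 0.1, 5.0, 1e-3
# Zero-noise run reduces to the deterministic logistic law (Eq. 1).
x = euler_maruyama(x0, a, b, D=0.0, alpha=0.0, lam=0.0, T=T, dt=dt)
exact = a * x0 * np.exp(a * T) / (a + b * x0 * (np.exp(a * T) - 1.0))
print(x[-1], exact)   # the two values agree to roughly O(dt)
```

With nonzero D and α the same function produces noisy sample paths whose statistics are what the FPE (4) describes at the density level.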
Both A(x) and B(x) are time-independent functions such that B(x) > 0 and A(x) has no singularities in the interval (xmin, xmax). The conditions imposed on the probabilistic solutions are

p(x, t) ≥ 0,   xmin < x < xmax,   t0 ≤ t,    (7)

∫_{xmin}^{xmax} p(x, t) dx = 1,   t0 ≤ t.    (8)

From Eq. 4, the second condition also takes the form

[−A(x)p(x, t) + ∂/∂x [B(x)p(x, t)]]_{x = xmin, xmax} = 0,    (9)
for t0 ≤ t. Suitable initial conditions may be used to produce the required evolution. We select the initial condition

lim_{t → t0+} p(x, t) = δ(x − x0),    (10)

for the transition probability distribution function p(x, t|x0, t0).

2.2   Eigenvalue Problem for the Fokker-Planck Equation
The probability density function p(x, t) is a solution of the one-dimensional Fokker-Planck equation of the form given by Eq. 4. To obtain an eigenvalue equation from the time-independent part of the FPE, we follow the procedure outlined in Ref. [14]. It may be shown that the stationary solution of Eq. 4 is

p_st(x) = N⁻¹ exp[−∫ [B′(x) − A(x)]/B(x) dx],    (11)

where

N = ∫_{xmin}^{xmax} exp[−∫ [B′(x) − A(x)]/B(x) dx] dx

is the normalization constant. It may be pointed out that Eq. 4 is not in standard self-adjoint form. However, if we define the function g(x, t) by

p(x, t) = √(p_st(x)) g(x, t),    (12)

it is easy to show that g(x, t) obeys an equation of the form

∂t g = Lg,    (13)

where L is defined by

Lφ = d/dx [R(x) dφ(x)/dx] − S(x)φ(x),    (14)
210
S.F.C. Shearer, S. Sahoo, and A. Sahoo
with

R(x) = B(x) > 0

and

S(x) = [B′(x) − A(x)]²/(4B(x)) − [B′(x) − A(x)]′/2,    (15)

and hence Eq. 13 is now self-adjoint [14]. Separating the variables by means of g(x, t) = γ(t)G(x), we have γ(t) = e^{−Λt}, while G is a solution of the typical Sturm-Liouville problem associated with the equation

LG(x) + ΛG(x) = 0,    (16)
subject to the boundary conditions (which can be obtained from Eq. 9)

[B′(xi) − A(xi)]G(xi) + 2B(xi)G′(xi) = 0,
[B′(xf) − A(xf)]G(xf) + 2B(xf)G′(xf) = 0.    (17)
In Eq. 17 and in what follows we denote xi = xmin and xf = xmax. It is easy to see that Λ = 0 is always an eigenvalue of problem Eq. 16 and that the corresponding eigenfunction is √(p_st(x)), with p_st(x) defined in Eq. 11. The differential problem defined by Eq. 16 and Eq. 17 has simple eigenvalues Λn that constitute an infinite, increasing sequence, and the corresponding eigenfunctions Gn(x) have n zeros in the interval (xi, xf). The lowest eigenvalue, Λ0 = 0, corresponds to the eigenfunction G0(x) = √(p_st(x)), which is positive and never vanishes in (xi, xf). These eigenfunctions constitute a complete orthonormal set of functions in L²([xi, xf]), and the general solution of Eq. 4 has the form

p(x, t) = Σ_{n=0}^{∞} cn e^{−Λn t} √(p_st(x)) Gn(x),    (18)

with c0 = 1 for normalization. The coefficients cn for a particular solution are selected by the initial condition

p(x, t0+) = p0(x),    (19)

and may be calculated from the orthonormality relations as

cn = ∫_{xi}^{xf} p0(x) Gn(x)/√(p_st(x)) dx.    (20)

By solving Eq. 16, we obtain Gn(x) and the corresponding Λn, and hence the probability density function p(x, t) at any particular time.
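The stationary solution of Eq. 11, with the drift and diffusion coefficients of Eqs. 5 and 6, can be evaluated numerically by cumulative quadrature. A minimal sketch, not the authors' code: the grid, the function name, and the λ = 0 parameter set (taken from the figure 3 caption) are our own choices.

```python
import numpy as np

def stationary_pdf(x, a=1.0, b=0.1, D=0.3, alpha=3.0, lam=0.0):
    """Evaluate p_st(x) of Eq. 11 on a grid x by cumulative trapezoidal
    integration of (B'(x) - A(x)) / B(x)."""
    sq = np.sqrt(D * alpha)
    A = a * x - b * x**2 + D * x - lam * sq      # drift, Eq. 5
    B = D * x**2 - 2.0 * lam * sq * x + alpha    # diffusion, Eq. 6
    Bp = 2.0 * D * x - 2.0 * lam * sq            # B'(x)
    integrand = (Bp - A) / B
    # phi(x) = -int_{x_min}^{x} (B' - A)/B dx', trapezoidal rule
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x)
    phi = -np.concatenate(([0.0], np.cumsum(steps)))
    p = np.exp(phi - phi.max())                  # shift avoids overflow
    norm = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x))
    return p / norm

x = np.linspace(0.0, 60.0, 6000)
p_st = stationary_pdf(x)
```

For λ = 0 the mode of p_st(x) sits where B′(x) = A(x), i.e., at x = (a − D)/b = 7 for these parameters, consistent with the peak positions visible in figure 3.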
2.3   B-Spline Implementation
A complete description of B-splines and their properties can be found in de Boor's book [15]. Here we summarize the main results from the application point of view. We start by defining a closed interval [xi, xf] and divide this interval into segments whose endpoints are given by the knot sequence {ti}, i = 1, 2, ..., N + k. A set of N B-splines of order k, B_i^k(x), can be defined recursively on this knot sequence according to the formula

B_i^1(x) = 1 if ti ≤ x < ti+1; 0 otherwise,

B_i^k(x) = (x − ti)/(ti+k−1 − ti) B_i^{k−1}(x) + (ti+k − x)/(ti+k − ti+1) B_{i+1}^{k−1}(x).    (21)

Each B_i^k(x) is a piecewise polynomial of degree k − 1 that is non-negative everywhere. A B-spline is non-zero on k consecutive intervals [ti, ti+1), so two B-splines B_i^k(x) and B_j^k(x) overlap if |i − j| < k, but they do not overlap when |i − j| ≥ k. Therefore a B-spline basis of order k > 1 is not orthogonal. At any point x, the sum of all of the B-splines that do not vanish there is unity. The set of B-splines of order k on the knot sequence {ti} forms a complete basis for piecewise polynomials of degree k − 1 on the interval spanned by the knot sequence. In this work, the knots defining the grid have single multiplicity except at the endpoints xi and xf, where the multiplicity is k-fold, i.e., t1 = t2 = ... = tk = xi and tN+1 = tN+2 = ... = tN+k = xf. When multiple knots are encountered, limiting forms of the above recursive definition of the B-splines must be used. For k > 1 the B-splines generally vanish at their endpoints; however, at x = xi the first B-spline is equal to 1 (with all others vanishing), and at x = xf the last B-spline behaves in the same way. The choice of knot sequence is a matter of convenience; for our applications the knots tk, tk+1, ..., tN are distributed on a linear scale over the interval [xi, xf]. Figure 1 shows a B-spline basis set of order k = 9 using a linear knot sequence with N = 17 in the interval [xi = 0, xf = 10]. As may be seen, the B-spline functions are distributed all along the interval [xi, xf], thus providing a similar degree of flexibility anywhere inside the interval. A wide variety of finite basis sets are commonly used in computational physics to solve eigenvalue problems, but B-splines have a number of desirable properties that make them particularly useful. One advantage is that the resulting matrices are banded and can therefore be diagonalized accurately and efficiently. Another advantage is that the knot sequence can be adjusted easily according to the problem at hand; this allows the generation of very large basis sets with no linear-dependence complications, and gives freedom in choosing the grid points between which the B-splines are defined. Once coded using the recursion relations, the integrals involved can be evaluated to machine accuracy with Gaussian quadrature, making B-splines an accurate and easily set up basis set of wide utility.
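The recursion of Eq. 21 can be transcribed almost verbatim. In the sketch below (the function name is illustrative; the knot layout with k-fold endpoints follows the text), a term whose knot span has zero width is taken as zero, which is the limiting form mentioned above for multiple knots:

```python
import numpy as np

def bspline(i, k, t, x):
    """B_i^k(x) via the Cox-de Boor recursion of Eq. 21.
    Terms with a zero-width knot span are treated as zero
    (the limiting form needed at multiple knots)."""
    if k == 1:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    val = 0.0
    if t[i + k - 1] > t[i]:
        val += (x - t[i]) / (t[i + k - 1] - t[i]) * bspline(i, k - 1, t, x)
    if t[i + k] > t[i + 1]:
        val += (t[i + k] - x) / (t[i + k] - t[i + 1]) * bspline(i + 1, k - 1, t, x)
    return val

# linear knot sequence on [0, 10] with k-fold endpoints, as in figure 1
k, N, x_i, x_f = 9, 17, 0.0, 10.0
t = np.concatenate(([x_i] * (k - 1),
                    np.linspace(x_i, x_f, N - k + 2),
                    [x_f] * (k - 1)))      # N + k knots in total
```

Two of the stated properties are easy to check with this code: the basis sums to unity at any interior point, and the first B-spline equals 1 at x = xi while all others vanish there.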
Fig. 1. B-spline basis set in the interval [0, 10] obtained with a linear knot sequence, k = 9 and N = 17. Note that the endpoints of the interval have multiplicity k, which explains the different shapes near x = 0 and 10. Note also that the first and the last B-splines are equal to 1 at x = 0 and 10, respectively.
2.4   Calculating Eigenfunctions and Eigenvalues of the Sturm-Liouville Equation Using B-Splines
We consider the Sturm-Liouville equation (Eq. 16) and rewrite it as

LG(x) + ΛG(x) = 0,    (22)

with

LG(x) = d/dx [R(x) dG(x)/dx] − S(x)G(x).

Now we expand G(x) in terms of B-splines as

G(x) = Σ_{i=2}^{N−1} ci B_i^k(x),    (23)

where the ci are the B-spline coefficients to be determined. It may be noted that the first and the last B-splines are not included in the summation in Eq. 23 because these two B-splines, B_1^k(x) and B_N^k(x), are generally used to modify B_2^k(x) and B_{N−1}^k(x) so as to satisfy the boundary conditions of Eq. 17. Substituting Eq. 23 into Eq. 22 and projecting onto B_j^k, we obtain an (N − 2) × (N − 2) symmetric generalized eigenvalue equation,

Lc = −ΛQc,    (24)

where c is the vector of expansion coefficients,

c = (c2, c3, ..., cN−1)ᵀ.    (25)
The matrices Lij and Qij are generally termed the interaction and overlap matrices, respectively:

Lij = ∫_{xi}^{xf} B_i^k(x) L B_j^k(x) dx,    (26)
and

Qij = ∫_{xi}^{xf} B_i^k(x) B_j^k(x) dx.    (27)
Next we simplify the matrix Lij by substituting the operator L from Eq. 14 into Eq. 26, obtaining

Lij = ∫_{xi}^{xf} B_i^k(x) { d/dx [R(x) dB_j^k(x)/dx] − S(x)B_j^k(x) } dx.

Using integration by parts in the first term of the above equation, we obtain

Lij = [B_i^k(x)R(x) dB_j^k(x)/dx]_{xi}^{xf}
      − ∫_{xi}^{xf} R(x) (dB_i^k(x)/dx)(dB_j^k(x)/dx) dx
      − ∫_{xi}^{xf} B_i^k(x)S(x)B_j^k(x) dx.    (28)

Because the product of B_i^k(x) and B_j^k(x) fails to vanish only when i and j differ by less than k, the matrices L and Q are sparse, diagonally dominant banded matrices. The solution of the eigenvalue problem for such matrices is numerically stable even when the matrices are very large. Solving the generalized eigenvalue equation, we obtain N − 2 real eigenvalues Λμ and N − 2 eigenvectors c^μ. The eigenvectors satisfy the orthogonality relations

Σ_{ij} c_i^μ Q_ij c_j^ν = δμν,

which lead to the orthogonality relations

∫_{xi}^{xf} Gμ(x)Gν(x) dx = δμν.    (29)
2.5   Eigenfunctions of the Sturm-Liouville Equation
The self-adjoint Sturm-Liouville problem associated with our time-dependent FPE for logistic tumor growth is determined by the differential equation Eq. 16 and the corresponding boundary conditions Eq. 17. The spectrum of this problem thus consists only of real values of the spectral parameter Λ. An eigenvalue solution of our problem may then be described by the pair (Λn, Gn(x)), where Λn is an eigenvalue for which the differential equation has a non-zero solution and Gn(x) is the corresponding eigenfunction satisfying the boundary conditions. As expected, and as shown in figure 2, the eigenfunctions Gn(x) have n zeros in the interval [xi, xf]. It may be noted that we have taken care to ensure that the first few eigenvalues in each of the cases we considered converge to an accuracy of the order of 10⁻⁵. To achieve this accuracy, we chose a linear knot sequence and used 500 B-splines in the domain [0, 2000].
Fig. 2. Lowest five eigenfunctions G1(x)-G5(x) corresponding to the correlation parameter λ = 0.9, with a = 1, b = 0.1, D = 0.3 and α = 3.0
3   Results and Discussion
In this section we obtain the eigenfunctions and eigenvalues of the Fokker-Planck equation by solving the Sturm-Liouville equation (Eq. 16) with the boundary conditions defined in Eq. 17, and use them to construct the probability density distribution function p(x, t) as given by Eq. 18. It may be noted that the boundary conditions change their values for each set of parameters we use. Below we present the results obtained from our numerical calculations. Figure 3 shows p(x, t) as a function of the cell number x at different times for the case when the correlation between the additive and multiplicative noise is zero. In the figure we also show the steady-state probability pst(x) for comparison. As shown, the tumor begins in the initial state (Eq. 10), then grows and finally relaxes to the steady-state by the time t = 5. The question, however, is whether the time scale for relaxing to the steady-state is affected by the presence of correlated noise; in other words, how does the correlation strength influence the decay of the system to the steady-state? To study this we present in
Fig. 3. p(x, t) and pst(x) (dashed curve) versus x for t = 2, 3, 4 and 5. The input model parameters are λ = 0.0, α = 3.0, D = 0.3, a = 1.0 and b = 0.1.
Fig. 4. p(x, t) and pst (x) (dashed curve) versus x for t = 1, 2, 3 and 4. The input model parameters are λ = 0.9, α = 3.0, D = 0.3, a = 1.0 and b = 0.1.
figure 4 the probability distribution function as a function of x as time evolves, with a non-zero positive correlation strength, which we take in this case to be λ = 0.9. The results in figure 4 were produced with the same parameters as in figure 3 except for the non-zero correlation parameter λ. It can be observed that at λ = 0.9 the tumor reaches the steady-state at roughly the same time as in the case of uncorrelated noise (λ = 0) in figure 3. The correlation between the additive and multiplicative noise therefore does not slow the relaxation of the system to the steady-state; the tumor undergoes no critical slowing down as a function of the correlation strength λ. This also indicates that during the dynamical growth of the tumor most of the cells escape the control imposed by the correlated noise, and the tumor cells do not respond to the treatment before reaching the steady-state. However, studies of the steady-state behavior of tumor growth [8, 9] show that the correlation strength has both negative and positive feedback on tumor cell growth: at the steady-state, a positive λ promotes stable growth of the tumor while a negative correlation causes extinction of the cell population [8]. The dynamical behavior, by contrast, does not change much with λ. Figure 5 shows the average number of cells as a function of time for both positive and negative correlation parameter λ. It is evident from the figure that the cell number at a given time changes as a function of λ. For negative correlation, the cell number decreases as we increase the magnitude of λ, indicating that the correlation between additive and multiplicative noise can cause extinction of the tumor cells. This scenario is just the opposite for positive correlation, where for these intensities of additive and multiplicative noise the correlation promotes stable growth similar to that observed in the steady-state case [8].
Fig. 5. ⟨x(t)⟩ as a function of time t for different λ, with the model parameters α = 1.0, D = 0.3, a = 1.0 and b = 0.1. Left figure: positive correlation (λ = 0.0, 0.3, 0.6, 0.9); right figure: negative correlation (λ = 0.0, −0.3, −0.6, −0.9) between the additive and multiplicative noise.
3.1   Effect of Uncorrelated Noise
In this subsection we examine the case where the correlation between the two noises is absent. First we look at the effect of the multiplicative noise on the dynamical probability distribution by varying the multiplicative noise intensity D for a fixed value of additive noise intensity α.
Fig. 6. p(x, t) and pst (x) (dashed curve) versus x for t = 2, 3, 4 and 5. The input model parameters are λ = 0.0, α = 3.0, D = 0.5, a = 1.0 and b = 0.1.
Figure 6 shows the probability distribution function as a function of cell number at different times for fixed α = 3.0 and D = 0.5; the steady-state probability distribution function is plotted in the same subfigures for comparison. We note that the probability distribution function relaxes to the steady-state as time evolves and coincides with pst(x) around time t = 5. In figure 7, we present p(x, t) as a function of x as time evolves, for a fixed value of α = 3.0 and D = 2.5. In this case we observe that the distribution
Fig. 7. p(x, t) and pst (x) (dashed curve) versus x for t = 2 and 3. The input model parameters are λ = 0.0, α = 3.0, D = 2.5, a = 1.0 and b = 0.1.
function relaxes to the steady-state more quickly than in figure 6; the system now reaches the steady-state at about t = 3. Comparing figures 6 and 7, we observe that increasing the intensity of the multiplicative noise causes tumor cell fluctuations that allow the system to decay to the steady-state more quickly. We also observe that intense multiplicative noise enhances the stable growth of the tumor cells, as may be seen from the increase in the magnitude of the probability distribution function as the noise intensity is increased. The behavior of the steady-state probability distribution function pst(x) as a function of the multiplicative noise intensity D has been analyzed previously [6, 9], where it was observed that as D is increased, the maximum of pst(x) moves from larger to smaller values of x, showing that the multiplicative noise acts as a drift term, in line with the present observations in figures 6 and 7. Now we consider increasing the intensity of the additive noise while keeping the multiplicative noise intensity fixed, in order to examine the effects of the additive noise on the dynamical probability of the system. In figure 8, we display the probability distribution as a function of cell number for a fixed value of D = 0.3 and α = 5.0, together with the steady-state results for these parameters.
Fig. 8. p(x, t) and pst (x) (dashed curve) versus x for t = 2, 3, 4 and 5. The input model parameters are λ = 0.0, α = 5.0, D = 0.3, a = 1.0 and b = 0.1.
Fig. 9. p(x, t) and pst (x) (dashed curve) versus x for t = 2, 3, 4 and 5. The input model parameters are λ = 0.0, α = 10.0, D = 0.3, a = 1.0 and b = 0.1.
We observe from the figure that with α = 5.0, the probability distribution function decays to the steady-state and stabilizes at a time just above t = 5. We now apply a more intense additive noise of intensity α = 10.0 and show the result in figure 9. The system relaxes to the steady-state a little more quickly than in the case α = 5.0. As with intense multiplicative noise, the additive noise also affects the stability of the tumor by allowing it to relax to the steady-state more quickly with increasing intensity. However, an interesting difference between the two values of α is that at α = 10.0 there is a reduction in the magnitude of the distribution function, and hence in the cell number, in comparison to the case α = 5.0. Comparing the plots for different D with those for different α, we note that intense multiplicative noise causes sufficient internal fluctuations in the tumor cells that its observed effect is to promote the growth of tumor cells, whereas intense additive noise, in addition to causing fluctuations in the tumor cells, reduces the cell number through its action as a treatment procedure.
Fig. 10. ⟨x(t)⟩ as a function of time t for different α and D, with the model parameters a = 1.0 and b = 0.1. Left figure: fixed D = 0.3 with α = 1.0, 3.0, 6.0 and 12.0; right figure: fixed α = 1.0 with D = 1.0, 1.5, 2.0 and 2.5.
The above discussion is supported by an analysis of the average cell number. We now vary both the multiplicative noise intensity D and the additive noise intensity α and analyze the average cell number as a function of time. In figure 10 we present the average cell number as a function of time for fixed D and varying α (left figure) and for fixed α and varying D (right figure). For fixed D, ⟨x(t)⟩ as a function of t shows that the average cell number decreases as we apply more intense additive noise. This also shows that the tumor cells adapt quickly to relax to the steady-state condition, with a reduction in cell number even before the steady-state is reached, which indicates a favourable condition for effective tumor treatment: if a tumor is treated as soon as it is detected, there is a higher probability of an effective treatment. However, it may be noted that additional calculations we carried out (not shown here) for α = 6.0 and α = 12.0 suggested that more intense additive noise may cause an increase in cell number, indicating that an inaccurate or improper treatment can lead to tumors with malignant growth instead of killing the cells [8]. When we instead increase the intensity of the multiplicative noise (right figure), we find that the cell numbers first increase from the initial instant and then decay to a saturation stage, which means that fluctuation of the growth parameter by means of multiplicative noise causes the tumor cells to grow.
4   Summary and Conclusions
We have solved the one-dimensional time-dependent Fokker-Planck equation to study the effect of correlated as well as uncorrelated noise on the dynamical growth of tumors. We employed a numerical technique, the B-spline approximation, to solve the Sturm-Liouville equation and obtain the eigenfunctions and eigenvalues used to construct the dynamical probability distribution functions. On analysing our results, we observed that the correlation between the two noises had no effect on the time the tumor cells take to reach the steady-state: the probability distribution functions show no slowing down during the time evolution towards the steady-state, regardless of whether the two noises are correlated strongly or weakly. The uncorrelated noise intensities, however, did influence the time taken to reach the steady-state. We found that increasing the multiplicative noise can promote growth of the tumor cells. On the other hand, increasing the additive noise intensity is favourable for the tumor cells to shrink in size at the steady-state. We also note that the average tumor size in the steady state decreases if either one of the two noise sources becomes stronger. Accordingly, within this model, we were able to investigate the effectiveness of the additive and multiplicative noise in terms of tumor treatments. It may be pointed out that the cell number in our study is a relative number denoting the concentration of the tumor cells; when these results are applied to a realistic biological system, the equation's parameters must be determined from experimental data.
References

[1] Gammaitoni, L., Hanggi, P., Jung, P., Marchesoni, F.: Stochastic resonance. Rev. Mod. Phys. 70, 223–287 (1998)
[2] Jia, Y., Li, J.R.: Reentrance phenomena in a bistable kinetic model driven by correlated noise. Phys. Rev. Lett. 78, 994–997 (1997)
[3] Molski, M., Konarski, J.: Coherent states of Gompertzian growth. Phys. Rev. E 68, 021916-1–021916-7 (2003)
[4] Ferreira Jr., S.C., Martins, M.L., Vilela, M.J.: Morphology transitions induced by chemotherapy in carcinomas in situ. Phys. Rev. E 67, 051914-1–051914-9 (2003)
[5] Garay, R.P., Lefever, R.: A kinetic approach to the immunology of cancer: stationary states properties of effector-target cell reactions. J. Theor. Biol. 73, 417–438 (1978)
[6] Ai, B.Q., Wang, X.J., Liu, G.T., Liu, L.G.: Correlated noise in a logistic growth model. Phys. Rev. E 67, 022903-1–022903-3 (2003)
[7] Ai, B.Q., Wang, X.J., Liu, G.T., Liu, L.G.: Reply to "Comment on 'Correlated noise in a logistic growth model'". Phys. Rev. E 77, 013902-1–013902-2 (2008)
[8] Behera, A., O'Rourke, S.F.C.: Comment on "Correlated noise in a logistic growth model". Phys. Rev. E 77, 013901-1–013901-2 (2008)
[9] Behera, A., O'Rourke, S.F.C.: The effect of correlated noise in a Gompertz tumor growth model. Brazilian Journal of Physics 38, 272–278 (2008)
[10] Lo, C.F.: Stochastic Gompertz model of tumour cell growth. J. Theor. Biol. 248, 317–321 (2007)
[11] Sahoo, S., Mukherjee, S.C., Walters, H.R.J.: Ionization of atomic hydrogen and He+ by slow antiprotons. J. Phys. B 37, 3227–3237 (2004)
[12] Sahoo, S., Gribakin, G.F., Naz, G.S., Kohanoff, J., Riley, D.: Compton scatter profiles for warm dense matter. Phys. Rev. E 77, 046402-1–046402-9 (2008)
[13] Risken, H.: The Fokker-Planck Equation. Springer, Berlin (1998)
[14] Petroni, N.C., De Martino, S., De Siena, S.: Exact solutions of Fokker-Planck equations associated to quantum wave functions. Phys. Lett. A 245, 1–10 (1998)
[15] de Boor, C.: A Practical Guide to Splines. Springer, New York (1978)
Author Index

Ando, Ei 89
Arcuri, Andrea 156
Berchenko, Yakir 127
Berke, Robert 117
Dashevskiy, Mikhail 31
Defourny, Boris 61
Diochnos, Dimitrios I. 74
Englert, Matthias 179
Ernst, Damien 61
Fennell, D.A. 191
Grills, C. 191
Haraguchi, Kazuya 46
Hong, Seok-Hee 46
Jordan, Charles 141
Luo, Zhiyuan 31
Nagamochi, Hiroshi 46
Nonaka, Yoshiaki 104
Ono, Hirotaka 89, 104
Onsjö, Mikael 117
Römisch, Werner 1
Sadakane, Kunihiko 104
Sahoo, A. 206
Sahoo, S. 206
Sato, Taisuke 15
Shearer, S.F.C. 191, 206
Teicher, Mina 127
Turán, György 74
V'yugin, Vladimir V. 16
Vöcking, Berthold 179
Wehenkel, Louis 61
Winkler, Melanie 179
Yamashita, Masafumi 89, 104
Yang, Xin-She 169
Zeugmann, Thomas 141