Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2827
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Andreas Albrecht Kathleen Steinhöfel (Eds.)
Stochastic Algorithms: Foundations and Applications Second International Symposium, SAGA 2003 Hatfield, UK, September 22-23, 2003 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Andreas Albrecht University of Hertfordshire Computer Science Department Hatfield, Herts AL10 9AB, UK E-mail:
[email protected] Kathleen Steinhöfel FIRST - Fraunhofer Institute for Computer Architecture and Software Engineering 12489 Berlin, Germany E-mail:
[email protected] Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): F.2, F.1.2, G.1.2, G.1.6, G.2, G.3 ISSN 0302-9743 ISBN 3-540-20103-3 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik Printed on acid-free paper SPIN: 10954935 06/3142 543210
Preface
The second Symposium on Stochastic Algorithms, Foundations and Applications (SAGA 2003), took place on September 22–23, 2003, in Hatfield, England. The present volume comprises 12 contributed papers and 3 invited talks. The contributed papers included in the proceedings present results in the following areas: ant colony optimization; randomized algorithms for the intersection problem; local search for constraint satisfaction problems; randomized local search methods for combinatorial optimization, in particular, simulated annealing techniques; probabilistic global search algorithms; network communication complexity; open shop scheduling; aircraft routing; traffic control; randomized straight-line programs; and stochastic automata and probabilistic transformations. The invited talk by Roland Kirschner provides a brief introduction to quantum informatics. The requirements and the prospects of the physical implementation of a quantum computer are addressed. Lucila Ohno-Machado and Winston P. Kuo describe the factors that make the analysis of high-throughput gene expression data especially challenging, and indicate why properly evaluated stochastic algorithms can play a particularly important role in this process. John Vaccaro et al. review a fundamental element of quantum information theory, source coding, which entails the compression of quantum data. A recent experiment that demonstrates this fundamental principle is presented and discussed. Our special thanks go to all who supported SAGA 2003, to all authors who submitted papers, to the members of the program committee, to the invited speakers, and to the members of the organizing committee.
Andreas Albrecht, Kathleen Steinhöfel
Organization
SAGA 2003 was organized by the University of Hertfordshire, Department of Computer Science, Hatfield, Hertfordshire AL10 9AB, United Kingdom.
Organization Committee

Andreas Albrecht, Mickael Hammar, Kathleen Steinhöfel, Sally Ensum, Georgios Lappas
Program Committee

Andreas Albrecht (Chair, University of Hertfordshire, UK)
Luisa Gargano (Salerno University, Italy)
Juraj Hromkovič (RWTH Aachen, Germany)
Oktay Kasim-Zade (Moscow State University, Russia)
Roland Kirschner (University of Leipzig, Germany)
Michael Kolonko (TU Clausthal-Zellerfeld, Germany)
Frieder Lohnert (DaimlerChrysler AG, Germany)
Lucila Ohno-Machado (Harvard University, USA)
Christian Scheideler (Johns Hopkins University, USA)
Jonathan Shapiro (Manchester University, UK)
Gregory Sorkin (IBM Research, NY, USA)
Kathleen Steinhöfel (FhG FIRST, Germany)
John Vaccaro (University of Hertfordshire, UK)
Lusheng Wang (City University, Hong Kong)
Peter Widmayer (ETH Zürich, Switzerland)
CK Wong (The Chinese University, Hong Kong)
Thomas Zeugmann (Medical University of Lübeck, Germany)
Table of Contents
Prospects of Quantum Informatics (Invited Talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Roland Kirschner
A Converging ACO Algorithm for Stochastic Combinatorial Optimization . . . . . . 10
Walter J. Gutjahr

Optimality of Randomized Algorithms for the Intersection Problem . . . . . . . . . . . 26
Jérémy Barbay

Stochastic Algorithms for Gene Expression Analysis (Invited Talk) . . . . . . . . . . . . 39
Lucila Ohno-Machado and Winston Patrick Kuo

Analysis of a Randomized Local Search Algorithm for LDPCC Decoding Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Osamu Watanabe, Takeshi Sawai, and Hayato Takahashi

Testing a Simulated Annealing Algorithm in a Classification Problem . . . . . . . . . . 61
Karsten Luebke and Claus Weihs

Global Search through Sampling Using a PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Benny Raphael and Ian F.C. Smith

Simulated Annealing for Optimal Pivot Selection in Jacobian Accumulation . . . . . 83
Uwe Naumann and Peter Gottschling

Quantum Data Compression (Invited Talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
John A. Vaccaro, Yasuyoshi Mitsumori, Stephen M. Barnett, Erika Andersson, Atsushi Hasegawa, Masahiro Takeoka, and Masahide Sasaki

Who’s The Weakest Link? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Nikhil Devanur, Richard J. Lipton, and Nisheeth Vishnoi

On the Stochastic Open Shop Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Roman A. Koryakin

Global Optimization – Stochastic or Deterministic? . . . . . . . . . . . . . . . . . . . . . . . . 125
Mike C. Bartholomew-Biggs, Steven C. Parkhurst, and Simon P. Wilson

Two-Component Traffic Modelled by Cellular Automata: Imposing Passing Restrictions on Slow Vehicles Increases the Flow . . . . . . . . . . 138
Paul Baalham and Ole Steuernagel

Average-Case Complexity of Partial Boolean Functions . . . . . . . . . . . . . . . . . . . . . 146
Alexander Chashkin
Classes of Binary Rational Distributions Closed under Discrete Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 Roman Kolpakov
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Prospects of Quantum Informatics

Roland Kirschner

Institut für Theoretische Physik, Univ. Leipzig, D-04109 Leipzig, Germany
[email protected]
Abstract. A minimal introduction for non-physicists to the basic notions of quantum physics relevant to quantum informatics is given. The requirements and the prospects of the physical implementation of a quantum computer are reviewed. Keywords: Quantum systems, qubits, quantum entanglement, decoherence.
1 Introduction

Modern computer technology relies essentially on the understanding of the microstructure of matter gained in the 20th century by quantum physics. The integration of thousands of transistor elements in one chip has led to an enormous increase in computer capacities. Still, much effort is being applied to push microelectronics further, to smaller and more compact structures. Information is physical, i.e. the storage and processing of information uses physical systems which can exist in several states in a controllable way. The classical case is the one of systems decomposable into subsystems, called bits, each of which allows for just two distinguished states, |0), |1). The real physical system has many more degrees of freedom, allowing for a variety of states besides the ones used for information processing. The latter may be the direction of a current or the magnetization on a small piece of the chip. The basic physical theory describing the real system is quantum theory; the classical theory is a limiting case, which applies to the mentioned bit states. The exciting idea of using the quantum features of physical systems for information storage and processing has attracted much interest in recent years. For extensive introductions we refer to [1–3]. More details on the relevant physics can be found in the collection of articles [4].
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 1–9, 2003. © Springer-Verlag Berlin Heidelberg 2003

2 Quantum Theory and Informatics

2.1 Quantum States

The states of a quantum degree of freedom are described quite differently from the classical ones. In particular, there is no case with just two distinguished states. The smallest non-trivial space of states, in our context referred to as a qubit, is represented by the set of one-dimensional subspaces of the two-dimensional complex space H_1 = C², spanned by the basis vectors |0⟩, |1⟩. The states of two qubits are accommodated in the tensor product H_1 ⊗ H_1, spanned by the products |0,0⟩ = |0⟩_1 ⊗ |0⟩_2, |0,1⟩, |1,0⟩, |1,1⟩. A mathematically trivial remark has essential consequences in
physics and therefore in quantum informatics: A system of two (or more) qubits can exist in states represented by a vector that is not decomposable into products of vectors referring to the separate qubits. Such indecomposable states are called entangled. A basis of maximally entangled two-qubit states is given by

    ψ± = (1/√2) (|0,1⟩ ± |1,0⟩),    φ± = (1/√2) (|0,0⟩ ± |1,1⟩).    (1)
The peculiar correlations between qubits in entangled states are understood as the essential resource behind the advantages of quantum informatics. We shall give illustrations in the following. Also to be discussed is the other side of the coin: the main difficulties of quantum informatics are related to this basic feature of quantum entanglement as well.

2.2 Stochastic and Quantum Computations

There is a view of the features of quantum informatics which experts concerned with stochastic algorithms may find congenial [5]. In a classical deterministic computation the elementary unit of information is the bit, being either in state |0) or in state |1). n bits represent 2^n distinct states; let us look at them as unit basis vectors in a 2^n-dimensional vector space. Operations are performed by Boolean gates; a sequence of gates forms a circuit. A few elementary gates, acting on one bit or on a pair of bits, suffice as building blocks for all Boolean operations. Typically the registers for input and output have more bits than are actually used for the computation. By including passive (ancilla) bits the gates can be made time-reversible, i.e. such that the input values can be recovered from the output ones. In this scheme stochastic computation is modeled by adding a stochastic gate acting on one bit,

    ( |0) )       ( p   1 − p ) ( |0) )
    ( |1) )  →    ( q   1 − q ) ( |1) ).    (2)
The entries of the stochastic matrix are probabilities, i.e. non-negative real numbers, where the entries in each row sum to 1. The resulting stochastic states of a bit are, instead of just two distinct ones, linear superpositions of the type p |0) + (1 − p) |1), which for 0 ≤ p ≤ 1 describe a line segment in the two-dimensional real space P_1. A multi-bit state then lies in the positive simplex of the vector space P_n spanned by the original 2^n values of the n bits,

    |α) = ∑_{x∈{0,1}^n} p_x |x),    0 ≤ p_x ≤ 1.    (3)

The allowed operations are the ones preserving this positive simplex; in particular they preserve the ℓ1 norm, ∑_x p_x = 1.
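As a small numerical sketch of this stochastic model (the probabilities p and q below are arbitrary illustrative values, not taken from the text), the one-bit gate (2) can be applied with NumPy; the result stays inside the positive simplex and the ℓ1 norm is preserved:

```python
import numpy as np

# One-bit stochastic gate of Eq. (2); p, q are illustrative values.
p, q = 0.3, 0.8
S = np.array([[p, 1 - p],
              [q, 1 - q]])
assert np.allclose(S.sum(axis=1), 1.0)      # each row sums to 1

# S acts on the basis states; on a probability vector (p_0, p_1)
# the transposed (column-stochastic) matrix acts.
state = np.array([1.0, 0.0])                # bit deterministically in |0)
state = S.T @ state                         # stochastic state p|0) + (1-p)|1)
assert np.allclose(state, [p, 1 - p])
assert np.isclose(state.sum(), 1.0)         # l1 norm (total probability) kept
assert np.all((0 <= state) & (state <= 1))  # still in the positive simplex
```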
This scheme now allows one to pass formally to quantum computation by replacing the two-dimensional real space, suitable for describing the states of a stochastic bit, by a two-dimensional complex space H_1. Vectors differing by a non-zero complex factor represent one and the same state. The 2^n distinct n-bit states are replaced by the 2^n basis vectors of the n-qubit system, called the computational basis. A generic state vector in H_n has the form

    |α⟩ = ∑_{x∈{0,1}^n} a_x |x⟩.    (4)

Allowed operations are represented by unitary operators, acting on the state vector under the condition of preserving the ℓ2 norm, ∑_x |a_x|² = 1. All time-reversible Boolean gates have quantum counterparts, which by a unitary operator perform the analogous operation on the basis elements |01...10⟩. Formally similar to the stochastic one-bit gate (2) is the following quantum gate,

    ( |0⟩ )       (  cos φ   sin φ ) ( |0⟩ )
    ( |1⟩ )  →    ( −sin φ   cos φ ) ( |1⟩ ).    (5)

All operations can be composed out of a few elementary gates, e.g. the latter one with the particular angles φ = π/4, π/8 and a two-qubit gate, e.g. the quantum CNOT gate [6],

    |x1⟩ ⊗ |x2⟩ → |x1⟩ ⊗ |x1 ⊕ x2⟩,    x1, x2 ∈ {0, 1}.    (6)
⊕ stands for addition modulo 2.

2.3 States of Open Systems

Only in the ideal situation of a quantum system isolated from the environment, or if the influence of the environment can be described in classical approximation, is the representation of states as lines in a complex space H applicable. States of open quantum systems are instead described by density matrices, i.e. positive hermitean operators ρ̂ on H with trace 1, or equivalently positive hermitean forms. The data encoded in the density matrix are equivalent to fixing an orthonormal basis |ψ_i⟩ in H_n plus a point {p_i} in the positive simplex P_n,

    ρ̂ = ∑_{i=1}^{2^n} p_i |ψ_i⟩⟨ψ_i|.    (7)

Here ⟨ψ_i| is the Dirac symbol for the dual basis, and |ψ_i⟩⟨ψ_i| denotes the projector onto the one-dimensional subspace determined by |ψ_i⟩. The extreme case of states where p_{i0} = 1 and p_i = 0 for i ≠ i0 corresponds to the ones discussed before; these states are called pure and can be represented by the vector |ψ_{i0}⟩. The generic states are called mixed. The density matrix proportional to the unit operator represents the maximally mixed state. The von Neumann entropy characterizes the degree of mixing,

    S_vN = −tr(ρ̂ ln ρ̂) = − ∑_i p_i ln p_i.    (8)
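As a numerical sketch (the weights p_i below are arbitrary illustrative values), the density matrix (7) and the entropy (8) for a single qubit mixed in the computational basis can be written out with NumPy:

```python
import numpy as np

# Mixed single-qubit state of Eq. (7) with illustrative weights p = (0.75, 0.25).
p = np.array([0.75, 0.25])
basis = np.eye(2)                       # rows are the basis vectors |0>, |1>
rho = sum(pi * np.outer(v, v.conj()) for pi, v in zip(p, basis))
assert np.isclose(np.trace(rho), 1.0)   # a density matrix has trace 1

# Von Neumann entropy, Eq. (8): from the weights ...
S_vN = -np.sum(p * np.log(p))
# ... or equivalently from the eigenvalues of rho.
ev = np.linalg.eigvalsh(rho)
assert np.isclose(S_vN, -np.sum(ev * np.log(ev)))

# Extremes: a pure state (p = (1, 0)) has S_vN = 0; the maximally mixed
# state rho = I/2 has the largest entropy, ln 2.
assert np.isclose(-2 * 0.5 * np.log(0.5), np.log(2))
```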
Operations on the state represented by the density matrix, ρ → Tρ, should preserve the trace and the positivity property. There is a stronger condition on T, complete positivity, demanding that the extensions of T to an operator on H_n ⊗ H_a by T ⊗ I_a preserve positivity. Unitary operators on state vectors act on the density matrix as T_U ρ = U† ρ U. They obey all the conditions and do not increase the degree of mixing S_vN.

2.4 On the Virtues of Entanglement

Consider the computation of a function f(x) on the positive integers x ≤ 2^n with integer values in the same range [7]. It takes two n-qubit systems H_n for input and output. Assume that an algorithm computing f has been implemented as a quantum circuit performing a unitary transformation on the n-qubit quantum states, such that the action on a computational basis vector results in another basis vector, |f(x)⟩ = U_f |x⟩. Instead of performing this operation successively on all 2^n basis vectors, one operation on the entangled tensor product state ∈ H_n ⊗ H_n,

    (I ⊗ U_f) 2^{−n/2} ∑_{x∈{0,1}^n} |x⟩ ⊗ |x⟩ = 2^{−n/2} ∑_{x∈{0,1}^n} |x⟩ ⊗ |f(x)⟩,    (9)
leads to a state that in principle contains the result. To extract the value of the function for a particular x0, further unitary transformations have to be applied, iteratively enhancing the amplitude of the corresponding basis state |x0⟩ ⊗ |f(x0)⟩ at the expense of the other amplitudes. This enhancement procedure is part of Grover's search algorithm [8]. Consider the task of finding, out of 2^n objects, one of a special type, and a function f of the object type t taking the value 0 for all types but the special one t̃ of interest; its value for this type is 1. Put the objects in some arbitrary order t_x, 0 ≤ x < 2^n. Assume that this function f has been implemented in a quantum circuit such that

    U_f |x⟩ = (−1)^{f(t_x)} |x⟩.    (10)
The search algorithm starts from the uniform superposition

    2^{−n/2} ∑_{x∈{0,1}^n} |x⟩.    (11)
The application of the above U_f just marks the contribution of the wanted item in this coherent sum by a sign. In the next step the operation acting on the basis as

    |x⟩ → ∑_y (D_n)_{xy} |y⟩,    x, y ∈ {0, 1}^n,    (12)

is applied to the result, where the unitary matrix D_n is

    (D_n)_{xy} = 2/2^n − δ_{xy}.    (13)
These two operations, D_n U_f, are then applied repeatedly, enhancing in this way the amplitude a_{x0} of the wanted item; the value of its order number x0 can then be read off by a quantum measurement.
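The amplification loop (10)-(13) can be sketched directly with matrices for a small instance (n = 3 qubits, i.e. N = 8 items; the marked index x0 is chosen arbitrarily). Two iterations already concentrate most of the probability on x0:

```python
import numpy as np

n, x0 = 3, 5
N = 2 ** n
Uf = np.eye(N)
Uf[x0, x0] = -1.0              # Eq. (10): sign flip on the marked state |x0>
D = 2.0 / N - np.eye(N)        # Eq. (13): (D)_{xy} = 2/2^n - delta_{xy}

amp = np.full(N, N ** -0.5)    # Eq. (11): uniform superposition
for _ in range(2):             # roughly (pi/4) * sqrt(N) iterations suffice
    amp = D @ (Uf @ amp)       # one D_n U_f amplification step, Eq. (12)

probs = amp ** 2               # measurement statistics |a_x|^2
assert np.isclose(probs.sum(), 1.0)   # unitary steps preserve the l2 norm
assert probs.argmax() == x0           # the marked item dominates
assert probs[x0] > 0.9                # here ~0.945 after two iterations
```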
The information contained in a classical bit can be encoded in a qubit by mapping the bit values 0, 1 to the basis vectors |0⟩, |1⟩, respectively. It is possible, however, to store and to transfer the information of up to 2 bits by one qubit [9]. This takes a second, passive qubit entangled with the first one. Let the two-qubit system be initially in the entangled state ψ− of (1), and assume that the qubits are carried by two particles movable independently. Transfer first particle 1 to the sender and particle 2 to the receiver. The sender then performs unitary operations on the qubit state of particle 1. He applies U1^(1), U1^(2), acting on the computational basis of qubit 1, (|0⟩_1, |1⟩_1)^T, by matrix multiplication with

    U1^(1) = ( e^{−iπ/2}      0      )       U1^(2) = ( 0  1 )
              (     0      e^{iπ/2} ),                 ( 1  0 ).    (14)

The two-qubit state changes by applying either U1^(1) or U1^(2) or U1^(1) U1^(2) from ψ− to either ψ+ or φ− or φ+. In this way the sender is able to produce 4 different states by manipulating qubit 1 only. Transferring particle 1 now also to the receiver and applying there measurements on the two-particle system, the receiver can read off which of the 4 options the sender had chosen. Since the sender did not interact with the second particle carrying the auxiliary qubit 2, the result is indeed that two bits have been transferred by sending only one particle carrying a single qubit.

2.5 On the Drawbacks of Entanglement

Only a part of the degrees of freedom of the real physical system are used for information processing in the role of bits or qubits. One cannot get rid of the remaining degrees of freedom, the environment, and one cannot eliminate completely their influence on the degrees of freedom of interest. In the classical case it is known how to deal with the effects of noise and heat production and how to control possible errors caused by these effects. In the quantum case the unavoidable extra degrees of freedom result, besides that, in a much more problematic effect: decoherence. The mechanism of decoherence can be understood as the result of the evolution of the whole system, the qubit system plus the environment, from a disentangled state, in which the state vectors of the qubit subsystem and of the environment enter as factors, to an entangled state. The state of the qubit subsystem is then described by the projection of this entangled state onto the subsystem. The result is a mixed state of the qubits described by a density matrix as discussed in Sect. 2.3. In a large environment the state of the qubit subsystem becomes maximally mixed after a short time, called the coherence time. In this way all quantum features of the qubit system are washed out; in particular, any trace of entanglement between qubits is lost.
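A minimal numerical illustration of this loss of coherence (using the state ψ− of (1) to model a qubit maximally entangled with a one-qubit "environment"): tracing out the environment leaves the remaining qubit in the maximally mixed state ρ = I/2.

```python
import numpy as np

# Two-qubit pure state psi_- of Eq. (1) in the basis |00>, |01>, |10>, |11>.
psi_minus = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)
rho = np.outer(psi_minus, psi_minus.conj())   # pure-state density matrix

# Partial trace over the second qubit: rho_A[i,k] = sum_j rho[(i,j),(k,j)].
rho4 = rho.reshape(2, 2, 2, 2)
rho_A = np.einsum('ijkj->ik', rho4)

# The reduced state is maximally mixed: all quantum coherence is gone.
assert np.allclose(rho_A, np.eye(2) / 2)
```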
The maximally mixed state of a qubit can be considered as the uniform probability distribution on the unit sphere and does not carry any information, like the stochastic bit in the unbiased state with p_0 = p_1 = 1/2. There is some analogy between decoherence and heat dissipation. Put a system in a heat bath (an environment in thermal equilibrium at a given temperature). The (weak)
interaction between the system and the bath drives the system towards thermal equilibrium with the bath within the relaxation time. Typically, coherence times are much shorter than relaxation times. Let us return to the ideal situation of pure qubit states (4). Theoretically, a measuring apparatus refers to a basis. Assume that we are given a measuring device adapted to the computational basis. Then the measurement of the qubit system in a generic state |α⟩ results in the value x ∈ {0,1}^n with probability |a_x|². As the effect of the measurement with result x, the state of the qubit system reduces to |x⟩. This means that in the generic case only repeated measurements on an ensemble of qubit systems equally prepared in the state |α⟩ allow one to extract the values |a_x|² for all x. A complete reconstruction of |α⟩ from the measurement results is impossible even in this stochastic sense. The situation is comfortable if we know that the system is in one of the basis states, i.e. that only one a_{x0} is non-vanishing. Then the measurement adapted to this basis tells us the value of x0 with probability 1. Therefore quantum algorithms should result in states belonging to the computational basis. The peculiarities of the measurement process are understood as the result of the interaction of the (macroscopic) apparatus with the system, producing an overall maximally entangled state.
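A sketch of this stochastic readout (the amplitudes below are illustrative, not from the text): sampling repeated measurements of identically prepared copies recovers the probabilities |a_x|², while the complex phases of the amplitudes leave no trace in the statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative normalized 2-qubit state; note the phases (j, -1) of a.
a = np.array([0.5, 0.5j, -0.5, 0.5])
probs = np.abs(a) ** 2                 # Born rule: outcome x with prob |a_x|^2
assert np.isclose(probs.sum(), 1.0)

# Repeated measurement of an ensemble of equally prepared systems.
shots = 100_000
outcomes = rng.choice(len(a), size=shots, p=probs)
estimate = np.bincount(outcomes, minlength=len(a)) / shots

# Only the moduli |a_x|^2 are recoverable; the phases are invisible here.
assert np.allclose(estimate, probs, atol=0.01)
```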
3 Problems of the Physical Implementation

3.1 Basic Requirements for a Quantum Computer

The basic requirements for the physical implementation of a quantum computer have been formulated by DiVincenzo [10]. These requirements seem to be generally accepted, and they are usually taken as the guideline in discussing the prospects of particular proposals. The physicist's vocabulary used in formulating the requirements has been explained above.

1. A scalable physical system with well characterized qubits.
2. The ability to initialize the state of the qubit system to a simple fiducial state, e.g. to the one with the factorized state vector where all qubit vectors are the basis vector |0⟩.
3. Long relevant coherence times, much longer than the gate operation time.
4. A universal set of quantum gates.
5. A qubit-specific measurement capability.
6. The ability to interconvert stationary and flying qubits.
7. The ability to faithfully transmit flying qubits.

The latter two requirements are the ones for quantum communication. This would allow the exchange of information in quantum form, i.e. without intermediate conversion into the classical form and back.
3.2 The Critical View

A straightforward comparison of contemporary experimental abilities with a quantum computer meeting the above requirements leads to a pessimistic conclusion [11]. To justify the effort, the computing capacity of the new device should be comparable with that of existing ones. The proposed quantum error correction codes [12, 13] promise to compensate errors at the level of 10^−5 per qubit and per gate. This would take about 10 ancilla qubits per working qubit for beating these potential errors. This leads to an estimate of n ≈ 10^5 qubits for doing useful computations with error control. If spins of electrons or nuclei are used as qubits, then initialization (requirement 2) could be done by cooling (below 100 mK) and by applying a strong (1 T) magnetic field. It is questionable whether this can be done with a precision of 10^−5 (= the probability of any of the qubits not being in the initial state |0⟩). Requirements 4 and 5 mean manipulating and measuring the quantum state of n = 10^5 spins, i.e. switching on and off about n² pairwise interactions of the spins at each operation of the computing algorithm. The most essential problem is the one of coherence (requirement 3). Keeping coherence during a sequence of controlled interactions of 10^5 qubits seems to be a requirement that cannot be achieved in the foreseeable future. In any case one cannot expect to achieve this by scaling up the present experiments, which are concerned with the coherence of up to 7 qubits. Most theoretical treatments of proposed qubit systems and of realizations of gates are presented in the formulation of closed systems. This ignores decoherence or assumes it to be a small correction. A physically realistic treatment of qubit systems should be formulated in the language of open systems, describing the states by density matrices and the gates by operators in the more general class beyond the unitary ones. In this way decoherence can be included inherently in the theoretical model.
In the critical note [11] it has been proposed to study a qubit system in the opposite extreme of small coherence, i.e. in almost maximally mixed states. The author refers to the classical limit of the quantum system. The classical limit of a qubit system would be that of classical spins, the state of each of which is represented by a point on the unit sphere. On the other hand, the maximally mixed qubit state can be considered as a uniform probability distribution on the unit sphere. This is not a classical spin state, which on the contrary corresponds to one distinguished point on this sphere. From this we conclude that the classical spin system does not tell us much about the situation of small coherence.

3.3 The Optimistic View

In recent years a number of experiments in different types of physical systems have demonstrated the controlled evolution of a few qubit-like quantum degrees of freedom. Actually, quantum coherence phenomena have been a topic in quantum optics ever since the laser was invented. The improvement of experimental techniques, e.g. ion and atom traps, high-quality microcavities, and the advanced technology of nuclear magnetic resonance, is the basis of the new coherence experiments. Implementations of qubit-like systems and quantum gate operations have been proposed, discussed and experimentally investigated in different physical situations, among them ions trapped in a
clever setup of static and oscillating electromagnetic fields, microcavities at optical and microwave frequencies in interaction with atoms, electron and nuclear spin states at donors in silicon, nuclear spins bound in molecules controlled by the nuclear magnetic resonance technique, and charge states of superconductors or flux states of superconducting circuits. This experimental progress is the real source of optimism about quantum informatics. Several schemes of quantum communication have been demonstrated experimentally with the qubits implemented as the polarization states of photons, in particular the dense coding mechanism described theoretically above [14]. Two entangled photon polarizations have also been used to implement secure quantum cryptography and quantum teleportation, i.e. the transmission of an (unknown) qubit state without converting the information to the classical form (compare requirements 6, 7). It is difficult at present to transform a product polarization state of photons, |0⟩_1 |0⟩_2, into an entangled one. Fortunately there are sources of entangled photon pairs. Some crystals have peculiar, highly non-linear optical properties and are able to convert a high-frequency photon into a pair of equal-frequency photons polarized orthogonally to each other. The polarization depends on the direction of emission, and the polarization states appear disentangled in general. In the particular situation of type-II conversion there are special directions of emission where the photons emerge in an entangled polarization state. By letting one of the photons pass through additional birefringent plates one compensates for the relative time delay and turns the polarization of this photon (corresponding to the transformations U1^(1), U1^(2) of (14)) in order to produce one of the 4 entangled basis states (1). The resulting two-photon polarization state can be detected by coincidence counters.
By this technique one is able to distinguish only 3 of the 4 basis states; the states φ± unfortunately result in the same coincidence signal. But this was enough to demonstrate the transport of information stored in 3 distinguished states by one photon, i.e. a channel capacity exceeding the classical one by a factor of 1.5. Ions can be kept for some time in a small region by a superposition of static and oscillating electromagnetic fields. Several ions can be placed nearby, each ion feeling the vibrations of its neighbors. Laser cooling allows one to slow down the thermal motion to 1 K. By controlled laser pulses one stimulates transitions between selected ion energy levels. The subsystem of two levels (|0⟩, |1⟩) serves as a qubit and is carried by one ion. A third ion level with a high transition rate to and from |0⟩ (but not to |1⟩) can be used to detect whether the qubit is in state |0⟩ and to initialize the qubit to |0⟩ [15]. One-bit gates are constructed from laser-stimulated interactions between the levels |0⟩ and |1⟩. Two-bit gates can work with the vibrational interaction between ions. Experiments demonstrating the controlled interaction of 4 qubits have been reported [16]. The nuclear spin qubits considered in the nuclear magnetic resonance set-up reside on one molecule. Working with a macroscopic number of molecules allows one to detect coherence effects of up to 7 qubits despite having rather low coherence. Two superconducting pieces separated by a thin oxide layer can be in discrete charge states, characterized by the number of electronic Cooper pairs in one of the pieces. This number changes by tunneling through the layer (Josephson effect). By an applied electric field two adjacent charge states become distinguished as the ones of lowest
energy; they can form the qubit subsystem. A superconducting circuit, interrupted at one place by a thin oxide layer, can be in discrete states, characterized by units of the induced magnetic flux through the circuit contour. By an applied external magnetic field two adjacent flux states are separated as the ones of lowest energy, and they can also form the qubit subsystem. In the superconducting setup the influence of degrees of freedom other than the mentioned charge and flux ones is suppressed. Coherence times can be longer, but the gate operation times are also longer here. Experiments on the controlled coupling of two charge-state qubits have been reported recently [17]. These few (and incomplete) details on some physical implementations may be sufficient to provide an illustration of the physics involved and of the challenges of quantum informatics.
References

1. Steane, A.: Quantum computing. Reports Prog. Phys. 61 (1998) 117
2. Alber, G. et al.: Quantum Information. Springer Tracts in Modern Physics, Vol. 173, Springer-Verlag, 2001
3. Preskill, J.: Lecture notes. http://www.theory.caltech.edu/people/~preskill/ph229
4. (Collection of papers): Fortschr. Phys. 48 (2000) 9-11
5. Fenner, S.A.: A physics-free introduction to the quantum computation model. Preprint arXiv: cs.CC/0304008, at http://xxx.lanl.gov/find/cs
6. Barenco, A. et al.: Phys. Rev. A 52 (1995) 3457
7. Deutsch, D.: Proc. R. Soc. London A 400 (1985) 97
8. Grover, L.: Phys. Rev. Lett. 80 (1998) 4329
9. Bennett, C.H. and Wiesner, S.J.: Phys. Rev. Lett. 69 (1992) 2881
10. DiVincenzo, D.P.: in ref. [4], pp. 771
11. Dyakonov, M.I.: Quantum computing: A view from the enemy camp. In Luryi, S., Xu, J. and Zaslavsky, A. (eds.): Future Trends in Microelectronics, Wiley, 2002
12. Steane, A.: Phys. Rev. Lett. 77 (1996) 793
13. Calderbank, A. and Shor, P.: Phys. Rev. A 54 (1996) 1098
14. Mattle, K., Weinfurter, H., Kwiat, P.G. and Zeilinger, A.: Phys. Rev. Lett. 76 (1996) 4656
15. Cirac, J.I., Zoller, P.: Phys. Rev. Lett. 74 (1995) 4091
16. Sackett, C.A. et al.: Nature 404 (2000) 256
17. Pashkin, Yu.A. et al.: Nature 421 (2003) 823
A Converging ACO Algorithm for Stochastic Combinatorial Optimization Walter J. Gutjahr Dept. of Statistics and Decision Support Systems, University of Vienna [email protected] http://mailbox.univie.ac.at/walter.gutjahr/
Abstract. The paper presents a general-purpose algorithm for solving stochastic combinatorial optimization problems with the expected value of a random variable as objective and deterministic constraints. The algorithm follows the Ant Colony Optimization (ACO) approach and uses Monte-Carlo sampling for estimating the objective. It is shown that, under rather mild conditions, including linear growth of the sample size, the algorithm converges with probability one to the globally optimal solution of the stochastic combinatorial optimization problem. Contrary to most convergence results for metaheuristics in the deterministic case, the algorithm can usually be recommended for practical application in an unchanged form, i.e., with the "theoretical" parameter schedule.
Keywords: Ant colony optimization, combinatorial optimization, convergence results, metaheuristics, Monte-Carlo simulation, stochastic optimization.
1 Introduction

In many practical applications of combinatorial optimization, a smaller or larger extent of uncertainty about the outcome, given a specific choice of the decision maker, must be taken into account. A well-established way to represent uncertainty is by using a stochastic model. If this approach is chosen, the objective function of the optimization problem under consideration becomes dependent not only on the decision, but on a random influence as well; in other words, it becomes a random variable. The aim is then to optimize a specific functional of this random variable. Usually, this functional is the expected value; e.g., if the objective function represents cost, then the quantity to be minimized can be the expected cost. (Particular applications of risk theory, especially in financial engineering, also consider other functionals, e.g. the variance of the objective. We do not deal with this situation here.) In some formally relatively simple stochastic models, the expected value of the objective function can either be represented explicitly as a mathematical expression, or at least be easily computed numerically to any desired degree of accuracy. Then, the solution of the stochastic optimization problem is not essentially different from that of a deterministic optimization problem; the stochastic structure is hidden in the representation of the expected objective function, and exact or heuristic techniques of combinatorial optimization can be used. The situation changes if it is only possible to determine estimates of the expected objective function by means of sampling or simulation. For
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 10–25, 2003.
© Springer-Verlag Berlin Heidelberg 2003
example, consider the single machine total tardiness problem, an NP-hard sequencing problem (see Du and Leung [8]): n jobs have to be scheduled on a single machine, each job has a processing time and a due date, and the objective is to minimize the sum of the tardiness values, where tardiness is defined as the positive part of the difference between completion time and due date. Although the formula for the objective in the deterministic case is simple, no closed-form expression for the expected total tardiness in the case where the processing times are random variables (following a given joint distribution) is known, and its numerical computation would be very complicated and time-consuming. However, a relatively straightforward approach to approximating the expected total tardiness is to draw a sample of random scenarios and to take the average total tardiness over these scenarios as an estimate. Examples of other problems where the same approach seems promising are stochastic vehicle routing problems (see, e.g., Bertsimas and Simchi-Levi [3]), emergency planning based on simulation (Bakuli and Smith [2]), facility location problems involving queuing models (Marianov and Serra [18]), project financing with uncertain costs and incomes (Norkin, Ermoliev and Ruszczynski [19]), manpower planning under uncertainty (Futschik and Pflug [9]), or activity crashing using PERT (Gutjahr, Strauss and Wagner [16]). For the approximate solution of hard deterministic combinatorial optimization problems, several metaheuristics have been developed. One of these metaheuristics with a currently rapidly growing number of applications is Ant Colony Optimization (ACO), rooted in work by Dorigo, Maniezzo and Colorni [7] at the beginning of the nineties and formulated more recently as a metaheuristic by Dorigo and Di Caro [6].
Like some other metaheuristics, ACO derives its basic idea from a biological analogy; in the case of ACO, the food-searching behavior of biological ant colonies is considered as an optimization process, and from this metaphor, strategies for solving a given combinatorial optimization problem by simulating walks of "artificial ants" are derived. It has been shown that certain ACO variants have the favorable property that the intermediate solutions found by the system converge to the globally optimal solution of the problem (see [12]–[14]). The aim of the present investigation is to develop an ACO algorithm that is able to treat the more general case of a stochastic combinatorial optimization problem, using the generally applicable sampling approach described above. As in the deterministic case, guarantees of convergence to the optimal solution are highly desirable. It turns out that such a convergence result is indeed possible for the algorithm presented here. It will be argued that, contrary to most convergence results for metaheuristics for deterministic problems, our algorithm can be recommended for practical use in an unchanged form, i.e., with the same parameter schedule as assumed for obtaining the convergence result. Whereas for other metaheuristic approaches, extensions to stochastic problems have already been studied intensively (see, e.g., Arnold [1] for Evolutionary Strategies or Gutjahr and Pflug [15] for Simulated Annealing), research on ACO algorithms for stochastic problems is currently only at the very beginning. An interesting first paper has been published by Bianchi, Gambardella and Dorigo [4]; it concerns the solution of the probabilistic travelling salesman problem. Nevertheless, the approach chosen in [4] is tailored to the specific problem under consideration, and it depends on the availability of a closed-form expression of the expected objective function value. The algorithm presented here has a considerably broader range of application. We think that ACO is especially promising for problems of the considered type for three reasons: First, it works with a "memory" (the pheromone trails, see below), which effects a certain robustness against noise; this is a feature shared with Evolutionary Strategies and Genetic Algorithms, but different from Simulated Annealing or Tabu Search. Second, problems with a highly constrained solution space (e.g., permutation problems) can also be encoded in a natural way. Third, problem-specific heuristics can be incorporated to improve the performance. The last two issues seem to give the ACO approach a specific advantage in the field of highly constrained combinatorial optimization.
2 Stochastic Combinatorial Optimization Problems

We deal with stochastic combinatorial optimization problems of the following general form:

Minimize F(x) = E(f(x, ω))  s.t.  x ∈ S.   (1)
Therein, x is the decision variable, f is the (deterministic) objective function, ω denotes the influence of randomness (formally: ω ∈ Ω, where (Ω, Σ, P) is the probability space specifying the chosen stochastic model), E denotes the mathematical expectation, and S is a finite set of feasible decisions. We need not assume that E(f(x, ω)) is numerically computable, since it can be estimated by sampling: draw N random scenarios ω1, . . . , ωN independently of each other. A sample estimate is given by

F̃(x) = (1/N) ∑_{ν=1}^{N} f(x, ων) ≈ E(f(x, ω)).   (2)
Obviously, F̃(x) is an unbiased estimator for F(x). For example, in the single machine total tardiness problem mentioned in Section 1, N arrays, each consisting of n random processing times for jobs 1 to n according to the given distribution(s), can be generated from independent random numbers. For each of these arrays, the total tardiness of the considered schedule (permutation) x can be computed. The average over the N total tardiness values is the sample estimate F̃(x) of F(x). Let us emphasize that, contrary to its deterministic counterpart, problem (1) can be nontrivial already for a very small number |S| of feasible solutions: even for |S| = 2, we obtain, except if F(x) can be computed directly, a nontrivial statistical hypothesis testing problem (see [19]).
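The sampling estimator (2) can be sketched in Python for the total tardiness example. This is a minimal illustration, not code from the paper: the job data and the exponential processing-time distribution are assumptions chosen only for the demonstration.

```python
import random

def total_tardiness(schedule, proc_times, due_dates):
    """Deterministic objective f(x, omega): sum of tardiness values
    of schedule x under one scenario omega of processing times."""
    t, total = 0.0, 0.0
    for job in schedule:
        t += proc_times[job]                  # completion time of this job
        total += max(0.0, t - due_dates[job])  # positive part of lateness
    return total

def estimate_expected_tardiness(schedule, due_dates, sample_proc_times, N, rng):
    """Sample estimate F~(x): average of f(x, omega_nu) over N
    independently drawn scenarios omega_1, ..., omega_N, as in (2)."""
    acc = 0.0
    for _ in range(N):
        scenario = sample_proc_times(rng)     # one random scenario
        acc += total_tardiness(schedule, scenario, due_dates)
    return acc / N

rng = random.Random(1)
due = [4.0, 6.0, 10.0]
# hypothetical model: independent exponential processing times with mean 3
sample = lambda r: [r.expovariate(1.0 / 3.0) for _ in range(3)]
print(estimate_expected_tardiness([0, 1, 2], due, sample, N=1000, rng=rng))
```

Because the estimator is a plain average of independent evaluations, increasing N reduces its variance at the usual 1/N rate, which is exactly what the sample-size schedule of Section 5 exploits.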
3 Ant Colony Optimization

For the sake of a clearer understanding of the algorithm given in the next section, we recapitulate the main ideas of ACO by presenting one particular ACO algorithm, GBAS
(see [12]), designed for deterministic problems. GBAS has been chosen since it is also the kernel of the algorithm S-ACO in Section 4. Essential general features of ACO are the following:
– Solutions are constructed randomly and step-by-step.
– Construction steps that have turned out to be part of good solutions are favored.
– Construction steps that can be expected to be part of good solutions are favored.
In GBAS (Graph-Based Ant System), the given problem instance is encoded by a construction graph C, a directed graph with a distinguished start node. For sequencing problems such as the TSP or the single-machine total tardiness problem mentioned above, the construction graph is essentially a complete graph with the items to be scheduled as nodes. For problems with other combinatorial structures (e.g., subset problems), suitable other graphs are used. The stepwise construction of a solution is represented by a random walk in the construction graph. In this walk, each node is visited at most once; already visited nodes are "tabu" (infeasible). There may also be additional rules defining particular nodes as infeasible after a certain partial walk has been traversed. When there is no feasible unvisited successor node anymore, the walk stops and is decoded as a complete solution for the problem. The encoding must be such that each walk that is feasible in the above sense corresponds to exactly one feasible solution. (Usually, the reverse also holds, but we do not make this an explicit condition.) Since, if the indicated condition is satisfied, the objective function value is uniquely determined by a feasible walk, we may denote a walk by the same symbol x as a decision or solution and consider S as the set of feasible walks.
When constructing a walk in the algorithm, the probability pkl to go from a node k to a feasible successor node l is chosen proportional to τkl · ηkl(u), where τkl is the so-called pheromone or trail level, a memory value storing how good step (k, l) has been in previous runs, and ηkl(u) is the so-called attractiveness or visibility, a pre-evaluation of how good step (k, l) will presumably be (e.g., the reciprocal of the distance from k to l in a TSP). ηkl(u) is allowed to depend on the partial walk u traversed up to now. This pre-evaluation is done in a problem-specific manner. Pheromone initialization and update are performed as follows:
Pheromone initialization: Set τkl = 1/m, where m is the number of arcs of the construction graph.
Pheromone update: First, set, for each arc, τkl = (1 − ρ)τkl, where ρ is a so-called evaporation factor between 0 and 1. This step is called evaporation. Then, on each arc of the best walk x̂ found up to now, increase τkl by ρ/L(x̂), where L(x) denotes the length of walk x, defined as the number of arcs on x. Thus, the overall amount of pheromone remains equal to unity. This step reinforces the arcs (partial construction steps) of already found good solutions.
Random walk construction and pheromone update are iterated. Instead of a single walk ("one ant"), s walks (s > 1) are usually constructed sequentially or, in parallel implementations, simultaneously ("s ants").
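The initialization and update rules above can be sketched as follows; the arc set and the choice of best walk here are illustrative, not part of GBAS itself.

```python
def init_pheromone(arcs):
    """Pheromone initialization: tau_kl = 1/m for each of the m arcs."""
    m = len(arcs)
    return {arc: 1.0 / m for arc in arcs}

def pheromone_update(tau, best_walk, rho):
    """Evaporate all arcs by factor (1 - rho), then reinforce each arc of
    the best walk x^ by rho / L(x^); total pheromone stays equal to 1."""
    for arc in tau:
        tau[arc] *= (1.0 - rho)
    L = len(best_walk)            # L(x^): number of arcs on the walk
    for arc in best_walk:
        tau[arc] += rho / L
    return tau

arcs = [(0, 1), (1, 2), (0, 2), (2, 1)]
tau = init_pheromone(arcs)
tau = pheromone_update(tau, best_walk=[(0, 1), (1, 2)], rho=0.1)
print(round(sum(tau.values()), 10))  # prints 1.0
```

The invariant that the pheromone values sum to one is what later allows the proof of Lemma 1 to bound each selection probability from below by the pheromone value alone.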
Note that for being able to perform the pheromone update as described above, the best walk found up to now has to be stored in a global variable x̂. Each time a new random walk x is completed, the objective function value of the corresponding feasible solution is computed and compared with the objective function value of x̂. If x turns out to be better than x̂, the walk stored in x̂ is replaced by x.
4 Extension of the Algorithm to Stochastic Problems

We present now an extension S-ACO of the algorithm GBAS of the last section to the case of the stochastic optimization problem (1). S-ACO leaves the basic procedure largely unchanged, but modifies the pheromone-update subprocedure by introducing a stochastic test whether the solution stored as the current best one should still be considered as optimal. In the pseudo-code formulation below, we write τkl(n) instead of τkl in order to denote the dependence on the round number n; the same for pkl(n). For τmin(n), see the comments after the procedure. Feasibility of a continuation (k, l) of a partial walk u ending with node k is defined as in Section 3: the continuation (k, l) is feasible if node l is not yet contained in u, and none of the (eventual) additional rules specifies l as infeasible after u has been traversed.

procedure S-ACO {
  do pheromone-initialization;
  for round n = 1, 2, . . . {
    for ant σ = 1, . . . , s {
      set k, the current position of the ant, equal to the start node of C;
      set u, the current walk of the ant, equal to the empty list;
      while (a feasible continuation (k, l) of the walk u of the ant exists) {
        select successor node l with probability pkl(n), where
          pkl(n) = 0, if (k, l) is infeasible,
          pkl(n) = τkl(n) ηkl(u) / ∑(k,r) τkr(n) ηkr(u), otherwise,
        the sum being over all feasible (k, r);
        set k = l, and append l to u;
      }
      set xσ = u;
    }
    from {x1, . . . , xs}, select a walk x;
    do S-pheromone-update(x, n); // see below
  }
}

procedure S-pheromone-update(x, n) {
  compute the estimate

    F̃(n) = (1/Nn) ∑_{ν=1}^{Nn} f(x, ων)
  by applying Nn independent sample scenarios ων to x;
  if (n = 1) {
    set x̂ = x;
    set F̂(n) = F̃(n);
  }
  else {
    compute the estimate

      F̂(n) = (1/Nn) ∑_{ν=1}^{Nn} f(x̂, ω′ν)

    by applying Nn independent sample scenarios ω′ν to x̂;
    if (F̃(n) < F̂(n)) set x̂ = x;
  }
  set, with L(x) denoting the length of walk x,

    τkl(n + 1) := max((1 − ρ) τkl(n) + ρ/L(x̂), τmin(n)), if (k, l) ∈ x̂,
    τkl(n + 1) := max((1 − ρ) τkl(n), τmin(n)),           otherwise;   (3)
}

Comments: The essential difference from the deterministic case is that, in the stochastic case, it is no longer possible to decide with certainty whether a current solution x is better than the solution currently considered as the best found, x̂, or not. This can only be tested by statistical sampling, which happens in the specific pheromone-update subprocedure used here, S-pheromone-update. Even the result of this test can be erroneous, due to the stochastic nature of all objective function evaluations; i.e., the test yields the correct comparison result only with a certain probability. For the same reason, it is not even possible to decide which ant has, in the current round, produced the best walk. The procedure above prescribes that one of the s produced walks is selected, according to whatever rule. A promising way to do that would be to evaluate each xσ at a random scenario drawn specifically for this round and to take x as the walk with the best objective value. The first part of the subprocedure S-pheromone-update compares the solution x selected in the present round with the solution currently considered as the best, x̂. This is done by determining sample estimates for both solutions (practically speaking: by estimating the expected costs of both solutions by means of Monte-Carlo simulation with Nn runs each). The scenarios ω′ν have to be drawn independently of the scenarios ων (i.e., the simulation runs have to be executed with two independent series of random numbers). The winner of the comparison is stored as the new x̂. The question of which sample size Nn should be chosen in round n will be dealt with in the next section. In the second part of the subprocedure, pheromone update is performed essentially in the same way as described in Section 3, but with an additional feature: if the computed pheromone value τkl(n) would fall below some predefined lower bound τmin(n),
we set τkl(n) = τmin(n). (The idea of using lower pheromone bounds in ACO is due to Stützle and Hoos [20], [21].) Again, the question of how τmin(n) should be chosen in dependence on n will be treated in the following section. The computation of the attractiveness values ηkl(u) needs some explanation. As mentioned, these values are obtained from a suitable problem-specific heuristic. Although, in principle, one could work with "zero-information" attractiveness values, all set equal to a constant, the choice of a good attractiveness heuristic will improve the performance of the algorithm considerably. In the stochastic case, there is the difficulty that certain variables possibly used by such a heuristic are not known with certainty, because they depend on the random influence ω. This difficulty can be resolved either by taking the expected values (with respect to the distribution of ω) of the required variables as the basis of the attractiveness computation (in most stochastic models, these expected values are directly given as model parameters), or by taking those variable values that result from a random scenario ω drawn for the current round. Presumably, both options will perform much better than applying zero-information attractiveness.
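The interplay of the sampled comparison, the growing sample size, and the pheromone bound in S-pheromone-update can be sketched as follows. Everything concrete here is an assumption for illustration: the schedules Nn = 20·n and τmin(n) = 0.01/log(n + 1) are just one admissible choice (cf. conditions (iv) and (v) in Section 5), and the toy objective encodes walks as tuples of arcs.

```python
import math
import random

def s_pheromone_update(tau, x, x_hat, n, f, sample_scenario, rho, rng):
    """One call of S-pheromone-update: compare x against the incumbent
    x_hat via two independent samples of size N_n, then update tau."""
    N_n = 20 * n                          # linear sample-size growth (condition (v))
    tau_min = 0.01 / math.log(n + 1)      # lower pheromone bound (condition (iv))
    # estimate F~(n) for the selected walk x
    f_tilde = sum(f(x, sample_scenario(rng)) for _ in range(N_n)) / N_n
    if x_hat is None:                     # first round: x becomes the incumbent
        x_hat = x
    else:
        # estimate F^(n) for the incumbent with an independent sample
        f_hat = sum(f(x_hat, sample_scenario(rng)) for _ in range(N_n)) / N_n
        if f_tilde < f_hat:
            x_hat = x
    L = len(x_hat)
    best_arcs = set(x_hat)
    for arc in tau:
        inc = rho / L if arc in best_arcs else 0.0
        tau[arc] = max((1.0 - rho) * tau[arc] + inc, tau_min)
    return tau, x_hat

# toy usage: cost = 5 * (number of arcs) + Gaussian noise; shorter is better
rng = random.Random(0)
tau = {("a", "b"): 0.5, ("b", "c"): 0.5}
f = lambda walk, w: 5.0 * len(walk) + w
scen = lambda r: r.gauss(0.0, 1.0)
tau, best = s_pheromone_update(tau, (("a", "b"),), None, 1, f, scen, 0.2, rng)
tau, best = s_pheromone_update(tau, (("a", "b"), ("b", "c")), best, 2, f, scen, 0.2, rng)
print(best)  # the shorter (better) walk stays the incumbent
```

Note how the comparison error shrinks as n grows: with Nn fresh scenarios per estimate, the standard deviation of each sample mean decays like 1/√Nn, which is the mechanism behind the bound g(n) in Lemma 2.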
5 Convergence

For the validity of the algorithm S-ACO presented in Section 4, we are able to give a strong theoretical justification: it is possible to prove that, under rather mild conditions, the current solutions produced by the algorithm converge with probability one to the globally optimal solution. In the sequel, we first present and then discuss these conditions.
(i) The optimal walk x∗ is unique.
(ii) The function f(x, ων) observed at random scenario ων can be decomposed into expected value and error term as follows: f(x, ων) = F(x) + εxν, where F(x) = E(f(x, ω)), εxν is normally distributed with mean 0 and variance (σ(x))², and all error variables εxν are stochastically independent.
(iii) The attractiveness values ηkl(u) satisfy ηkl(u) > 0 for all prefixes u of x∗ and for all (k, l) on x∗.
(iv) The lower pheromone bound is chosen as τmin(n) = cn / log(n + 1) with lim_{n→∞} cn > 0. (E.g., τmin(n) = c / log n for some c > 0 fulfills the condition.)
(v) The sample size Nn grows at least linearly in n, i.e., Nn ≥ C · n for some constant C > 0.
Condition (i) is in some sense the strongest of the five conditions, but it can probably be removed along the same lines as the corresponding condition in [14] for the
deterministic special case. Even if this should not be possible, a negligible change of the objective function (e.g., adding ε · i(x), where i(x) is the index of solution x according to some order, and ε is sufficiently small) makes (i) satisfied. As to Condition (ii), it should be observed that it is always possible to decompose a random variable f(x, ω) with existing expected value into this expected value and an error term. That the error terms are normally distributed is not a very restrictive assumption, since in S-ACO, the observations f(x, ων) are always used as independent terms in the sample estimate (2), where they produce (after suitable normalization and for large sample size Nn) an approximately normally distributed random variable by the central limit theorem; so it does not make an essential difference if they are assumed to be normally distributed from the beginning. If one wants to get rid of the assumption of normally distributed error terms, one can also apply stochastic dominance arguments, as, e.g., in [15] for the convergence of stochastic Simulated Annealing. By using independent simulation runs each time a value f(x, ων) is required, the condition on stochastic independence of the error terms can easily be satisfied. Condition (iii) is very weak, since it is only violated if the problem-specific attractiveness values have been chosen in such an inappropriate way that the optimal walk x∗ is blocked by them a priori. Condition (iv) is easy to satisfy. Condition (v), too, poses no problems in general (cf. Remark 1 after the theorem below).

Theorem 1. If conditions (i)–(v) are satisfied, then for the currently best found walk x̂(n) in round n, for the pheromone values τkl(n) in round n, and for the probability Pσ(n) that some fixed considered ant σ traverses the optimal walk x∗ in round n, the following assertions hold:

lim_{n→∞} x̂(n) = x∗ with probability 1,   (4)

lim_{n→∞} τkl(n) = 1/L(x∗) if (k, l) on x∗, and 0 otherwise, with probability 1,   (5)

lim_{n→∞} Pσ(n) = 1.   (6)
In informal terms: under the indicated conditions, the solutions subsequently produced by S-ACO tend to the (globally) optimal solution, pheromone concentrates on the optimal walk and vanishes outside, and the current walks of the ants concentrate on the optimal walk. We prove Theorem 1 with the help of five lemmas. In the proofs, we use the following notational conventions:
– x(n) is the walk selected in round n, i.e., the first parameter given to the procedure S-pheromone-update when it is called in round n,
– x̂(n) is the current value of x̂ before the update of x̂ in the else-branch of S-pheromone-update in round n. In particular, x̂(1) = x(1).
In all five lemmas, we always assume implicitly that conditions (i)–(v) are satisfied.
Lemma 1. For each fixed positive integer n1, there exists with probability one an integer n = n(ω) ≥ n1 such that x∗ is traversed by all ants in round n.

Proof. Because of the lower pheromone bound as given by condition (iv),

τkl(n) ≥ τmin(n) = cn / log(n + 1) ≥ c / (2 log(n + 1))   (7)
for some c > 0 and for n ≥ n0 with some n0 ∈ IN. By condition (iii), for all prefixes u of x∗ and for all (k, l) on x∗,

ηkl(u) ≥ γ > 0,
since the optimal walk x∗ contains only a finite number of arcs. Moreover, ηkl(u) ≤ Γ for some Γ ∈ IR, since there is only a finite number of feasible paths. Therefore, for the probability that the optimal walk x∗ is traversed by a fixed ant in round n, the estimate below is obtained, where uk(x∗) denotes the prefix of walk x∗ ending with node k (note that the sum of the pheromone values is unity):

∏_{(k,l)∈x∗} pkl(n, uk(x∗)) = ∏_{(k,l)∈x∗} τkl(n) ηkl(uk(x∗)) / ∑_{(k,r)} τkr(n) ηkr(uk(x∗))
  ≥ ∏_{(k,l)∈x∗} (γ/Γ) · τkl(n) / ∑_{(k,r)} τkr(n)
  ≥ ∏_{(k,l)∈x∗} (γ/Γ) τkl(n)
  ≥ ∏_{(k,l)∈x∗} γc / (2Γ log(n + 1)) = ( γc / (2Γ log(n + 1)) )^{L(x∗)}.   (8)
Obviously, estimate (8) holds as well if the l.h.s. refers to the probability of traversing x∗ conditional on any event in rounds 1 to n − 1. Now, let Bn denote the event that x∗ is traversed in round n by all ants. Evidently, ¬B_{n1} ∧ ¬B_{n1+1} ∧ . . . is equivalent to the statement that no round n ≥ n1 exists such that x∗ is traversed in round n by all ants. We show that

Prob(¬B_{n1} ∧ ¬B_{n1+1} ∧ . . .) = 0.   (9)
This is seen as follows. With n̄ = max(n0, n1), the last probability is equal to

Prob(¬B_{n1}) · Prob(¬B_{n1+1} | ¬B_{n1}) · Prob(¬B_{n1+2} | ¬B_{n1} ∧ ¬B_{n1+1}) · . . .
  ≤ ∏_{n=n̄}^{∞} Prob(¬Bn | ¬B_{n1} ∧ ¬B_{n1+1} ∧ . . . ∧ ¬B_{n−1})
  ≤ ∏_{n=n̄}^{∞} ( 1 − ( γc / (2Γ log(n + 1)) )^{L(x∗)·s} )
because of (8) and the remark thereafter. The logarithm of the last expression is

∑_{n=n̄}^{∞} log( 1 − ( γc / (2Γ log(n + 1)) )^{L(x∗)·s} ) ≤ − ∑_{n=n̄}^{∞} ( γc / (2Γ log(n + 1)) )^{L(x∗)·s} = −∞,
since ∑n (log n)^{−λ} diverges for every λ > 0. It follows that (9) holds, which proves the lemma.

Lemma 2. Conditionally on the event that x̂(n) = x∗ and x(n) ≠ x∗,

Prob(F̃(n) < F̂(n)) ≤ g(n),

and conversely, conditionally on the event that x̂(n) ≠ x∗ and x(n) = x∗,

Prob(F̂(n) < F̃(n)) ≤ g(n),

where g(n) = φ(−C√n), with φ denoting the distribution function of the standard normal distribution, and C is a constant only depending on

σ = max_{x∈S} σ(x)   (10)

(cf. condition (ii)) and on

δ = min{F(x) − F(x∗) | x ≠ x∗} > 0   (11)
(cf. condition (i)). In other words: the probability that x∗ loses a comparison against a suboptimal solution is always smaller than or equal to g(n).

Proof. Because of condition (ii) and by the definition of F̃(n),

F̃(n) = (1/Nn) ∑_{ν=1}^{Nn} f(x, ων)

is normally distributed with mean F(x) and

var(F̃(n)) = (1/Nn²) · Nn · (σ(x))² = (σ(x))²/Nn.

For the same reason, F̂(n) is normally distributed with mean F(x̂) and

var(F̂(n)) = (σ(x̂))²/Nn,

and F̃(n) and F̂(n) are stochastically independent. Hence F̃(n) − F̂(n) is normally distributed with mean F(x) − F(x̂) and variance

(σ(x))²/Nn + (σ(x̂))²/Nn ≤ 2σ²/Nn ≤ 2σ²/(an)
with a > 0 given by condition (v) (i.e., Nn ≥ an), and σ given by (10). For x̂(n) = x∗, this yields:

Prob(F̃(n) − F̂(n) < 0) = φ( −(F(x) − F(x∗)) / √((σ(x))²/Nn + (σ(x∗))²/Nn) ) ≤ φ( −δ / √(2σ²/(an)) ) = φ(−C√n)

with C = δ√a / (√2 σ) > 0. The second part of the assertion follows immediately because of the symmetry in the computation of F̂(n) and F̃(n).

Lemma 3. For the function g(n) defined in Lemma 2,

lim_{n1→∞} ∏_{n=n1}^{∞} (1 − g(n)) = 1   (12)
lim
n1 →∞
∑ (− log(1 − g(n)) = 0,
(13)
n=n1
where each term in the sum is positive. Because of − log(1 − x) ≤ x for all x, a sufficient condition for (13) being satisfied is ∞
lim
n1 →∞
∑ g(n) = 0.
(14)
n1 =1
Let ϕ(x) = φ′(x) denote the density function of the standard normal distribution. By elementary calculations, it is seen that φ(x) ≤ ϕ(x)/(−x) for x < 0. Therefore one obtains

∑_{n=1}^{∞} g(n) = ∑_{n=1}^{∞} φ(−C√n) ≤ ∑_{n=1}^{∞} ϕ(−C√n)/(C√n) ≤ (1/C) ∑_{n=1}^{∞} ϕ(−C√n) = (1/(C√(2π))) ∑_{n=1}^{∞} exp(−C²n/2).

Since exp(−C²n/2) is decreasing in n,

∑_{n=1}^{∞} exp(−C²n/2) ≤ ∫_{0}^{∞} exp(−C²x/2) dx < ∞.
Thus, ∑_{n=1}^{∞} g(n) < ∞. Since g(n) > 0, this proves (14) and therefore also the lemma.
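The summability of g(n) = φ(−C√n) used here can also be checked numerically; φ is expressed through the complementary error function, and C = 0.5 is an arbitrary illustrative value.

```python
import math

def std_normal_cdf(x):
    """phi(x): distribution function of the standard normal."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

C = 0.5  # arbitrary positive constant, for illustration only
# partial sums of g(n) = phi(-C*sqrt(n)) stabilize quickly,
# in line with sum_{n>=1} g(n) < infinity
partials = [sum(std_normal_cdf(-C * math.sqrt(n)) for n in range(1, N + 1))
            for N in (10, 100, 1000)]
print(partials)
```

The partial sums for N = 100 and N = 1000 agree to several decimal places, reflecting the Gaussian-tail decay exploited in the proof.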
Lemma 4. With probability one, there is an n2 = n2(ω) such that x̂(n) = x∗ for all n ≥ n2.

Proof. Let n0 be the index introduced in the proof of Lemma 1, such that (7) holds for all n ≥ n0. We choose n1 ≥ n0 in such a way that

∏_{n=n1}^{∞} (1 − g(n)) ≥ 1 − ε,
(15)
by the consideration above. For n = n2 , on the other hand, the event Dn2 ∧ Gn2 ∧ . . . ∧ Gn−1 reduces to Dn2 , which implies x(n2 ) = x∗ , such that also in this case, (15) holds by the consideration above. Therefore, Prob (Gn2 ∧ Gn2 +1 ∧ . . . | Dn2 ) =
∞
∏ Prob (Gn | Dn2 ∧ Gn2 ∧ Gn2+1 ∧ . . . ∧ Gn−1)
n=n2
≥
∞
∞
n=n2
n=n1
∏ (1 − g(n)) ≥ ∏ (1 − g(n)) ≥ 1 − ε.
The events Dn1 , Dn1 +1 , . . . are mutually exclusive, and by Lemma 1, Prob (Dn1 ) + Prob(Dn1 +1 ) + . . . = 1. Using this, we obtain: The probability that there is a round n2 ≥ n1 , such that round n2 is the first round after round n1 where x∗ is traversed by all ants and x(n) ˆ = x∗ for all n ≥ n2 , is given by ∞
∑
n2 =n1
Prob (Dn2 ∧ Gn2 ∧ Gn2 +1 ∧ . . .)
= ∑_{n2=n1}^{∞} Prob(G_{n2} ∧ G_{n2+1} ∧ . . . | D_{n2}) · Prob(D_{n2}) ≥ (1 − ε) ∑_{n2=n1}^{∞} Prob(D_{n2}) = 1 − ε.   (16)
Since the l.h.s. of (16) does not depend on ε and ε > 0 is arbitrary, the considered probability must be exactly 1, which proves the assertion.

Lemma 5. With probability one, τkl(n) → 1/L(x∗) for (k, l) ∈ x∗ and τkl(n) → 0 for (k, l) ∉ x∗, as n → ∞.

Proof. By Lemma 4, there is with probability one an integer n2 such that x̂(n) = x∗ for all n ≥ n2. (i) Let (k, l) ∈ x∗. In round n2 and all subsequent rounds, (k, l) is always reinforced. Set L = L(x∗) for abbreviation. A lower bound for τkl(n) is obtained by omitting the rule that τkl(n + 1) is set equal to τmin(n) if it would otherwise decrease below τmin(n) by evaporation. Based on this lower bound estimate, we get by induction w.r.t. t = 1, 2, . . . that

τkl(n2 + t) ≥ (1 − ρ)^t τkl(n2) + (ρ/L) ∑_{i=0}^{t−1} (1 − ρ)^i.   (17)
As t → ∞, the expression on the r.h.s. of (17) tends to (ρ/L) ∑_{i=0}^{∞} (1 − ρ)^i = 1/L. Therefore, for sufficiently large t, τkl(n2 + t) > 1/(2L). On the other hand, τmin(n) → 0, hence τmin(n2 + t) < 1/(2L) for sufficiently large t, which means that updates setting τkl(n + 1) equal to τmin(n) do not happen anymore for large values of t. Thus, for some t0 and integers t′ ≥ 1, we find in analogy to (17) (but now with equality instead of inequality) that

τkl(n2 + t0 + t′) = (1 − ρ)^{t′} τkl(n2 + t0) + (ρ/L) ∑_{i=0}^{t′−1} (1 − ρ)^i,
and the expression on the r.h.s. tends to 1/L as t′ → ∞. (ii) Let (k, l) ∉ x∗. Then (k, l) is never reinforced in round n2 or any subsequent round. Thus the pheromone on (k, l) decreases geometrically until the lower bound τmin is reached. Since τmin → 0 as well, we have τkl(n) → 0 as n → ∞.

Proof of Theorem 1. The first two assertions of the theorem, eqs. (4) and (5), are the assertions of Lemma 4 and Lemma 5, respectively. The third assertion, eq. (6), is seen as follows: from (5), we obtain for (k, l) ∈ x∗ and a prefix u of x∗ that, with δ_{rl} = 1 if r = l and δ_{rl} = 0 otherwise,

lim_{n→∞} pkl(n, u) = (1/L) ηkl(u) / ∑_{(k,r)} (1/L) δ_{rl} ηkr(u) = 1.
Therefore also the probability that a fixed ant σ traverses x∗, which is given by

Pσ(n) = ∏_{(k,l)∈x∗} pkl(n, uk(x∗)),

tends to unity as n → ∞.
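The pheromone limits in (5) can be illustrated by iterating the update rule once the incumbent has stabilized at x∗ (Lemma 4); ρ, L, and the starting values below are arbitrary illustrative choices, and the lower bound τmin is ignored since it vanishes in the limit.

```python
rho, L = 0.1, 4              # illustrative evaporation factor and length of x*
tau_on, tau_off = 0.02, 0.3  # arbitrary starting pheromone values

# once x^(n) = x* holds forever, the update iterates toward its fixed points
for _ in range(500):
    tau_on = (1 - rho) * tau_on + rho / L  # arc on x*: reinforced every round
    tau_off = (1 - rho) * tau_off          # arc off x*: evaporation only

print(round(tau_on, 6), round(tau_off, 6))  # prints 0.25 0.0
```

The recursion for a reinforced arc has the unique fixed point τ = 1/L, approached geometrically at rate (1 − ρ), exactly as in the closed form following (17).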
Remark 1. In Gutjahr and Pflug [15], a similar convergence result has been shown for a modification of the Simulated Annealing metaheuristic designed for application to stochastic optimization problems. There, however, a growth of the sample size Nn of order Ω(n^{2γ}) with γ > 1 was required for obtaining the convergence property. The growth of order Ω(n) required in Theorem 1 is much more favorable. While runtime limits are reached soon when the sample size is increased faster than quadratically, a linear increment usually does not impose severe practical restrictions.

Remark 2. For the solution of deterministic combinatorial optimization problems by metaheuristic approaches, some convergence results exist. For Simulated Annealing, e.g., it has been demonstrated by Gelfand and Mitter [11] and by Hajek [17] that by applying a suitable "cooling schedule", one can achieve that the distribution of the current solution tends to the uniform distribution on the set of globally optimal solutions. A related result for two particular ACO algorithms (both of the GBAS type outlined above) has been obtained in [13]. Nevertheless, it is clear that when applied to NP-hard problems, these algorithms cannot overcome the general limitations demonstrated by NP-completeness theory: if an algorithm is designed in such a way that it is guaranteed to find the optimal solution of any (or even: some) NP-hard problem, a price must be paid: runtime will get prohibitive for larger problem instances. For example, the theoretical cooling schedule assumed in the convergence results for Simulated Annealing is too slow to be well-suited for practical applications; it has to be modified towards faster cooling, which, on the other hand, introduces the risk of premature convergence to suboptimal solutions. (This dilemma has sometimes been formulated under the term "No-Free-Lunch Theorems".)
For this reason, algorithms with a theoretical guarantee of convergence to optimality are sometimes considered impracticable. It is interesting to see that this restriction need not hold for the algorithm S-ACO presented here. Of course, when applied to large instances of problems that are NP-hard even in the deterministic boundary case, S-ACO is subject to the same limitations as the deterministic-problem algorithms converging to optimality. Very large problem instances, however, are not typical for stochastic combinatorial problems in current practice. As argued at the end of Section 2, such problems are already nontrivial in the case of small feasible sets, say with a few hundred elements or even less. For such problem instances, the algorithm S-ACO can be implemented without any modification; moreover, the linear increase of the sample size will not lead to prohibitive runtime behavior.
6 Modifications

The algorithm S-ACO can be modified in several different ways. Let us indicate only one possible line of extension:
Walter J. Gutjahr
Our procedure S-pheromone-update follows a "global-best" reinforcement strategy (see Gambardella and Dorigo [10]): the arcs on the walk that is considered the best found up to now (in any of the previous rounds) are reinforced. An alternative strategy is the classical pheromone update of Ant System [7], where the amount of reinforcement is chosen proportional to the "fitness" of the solution, or the rank-based pheromone update introduced by Bullnheimer, Hartl and Strauss [5]: the arcs on the k best walks found in the current round are reinforced by a pheromone increment proportional to (k − j + 1)/k for the walk with rank j (j = 1, . . . , k). We briefly outline the rank-based case; the classical case can be treated analogously. In the stochastic context, one cannot determine the absolute ranks of the walks, but, as indicated in the Comments in Section 4, one can evaluate the walks at a random scenario, or at a small sample of random scenarios, drawn specifically for this round. In this way, ranks relative to the current scenario(s) can be computed. Now, one can choose between two alternatives:

(i) Perform S-ACO in two phases. In the first phase, replace the global-best update rule in S-pheromone-update by the rank-based pheromone update w.r.t. the currently drawn scenario(s); for this kind of update, sampling for obtaining the estimates F˜ and Fˆ is not required. In the second phase, start with the pheromone values obtained in the first phase and, from then on, perform in S-pheromone-update the (global-best) update rule described in Section 4.

(ii) Instead of working in two phases, perform the pheromone update in each round by a weighted mix of the global-best update described in Section 4 and the rank-based update w.r.t. the current scenario(s).

It is likely that the convergence result of Section 5 can be generalized to alternative (i) above.
A generalization to alternative (ii) is much more difficult; presumably, convergence to the optimal solution can only be obtained if the weight of the rank-based update scheme is gradually reduced. Both alternatives may be advantageous in practice compared with the basic algorithm, since they allow a broad initial exploration of the solution space (the results of these "learning" rounds are stored in the pheromone values), which can possibly speed up convergence by guiding the search in later rounds.
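As an illustration of the rank-based rule recalled above, here is a hedged Python sketch. Only the rank weight (k − j + 1)/k comes from the text; the function name, the evaporation step, and the base increment are our assumptions for the sake of a runnable example, not part of the paper:

```python
# Sketch of a rank-based pheromone update in the style of Bullnheimer,
# Hartl and Strauss: the arcs of the k best walks of the current round
# are reinforced proportionally to (k - j + 1) / k for the walk of rank j.
# Evaporation rate and base increment are illustrative assumptions.

def rank_based_update(tau, ranked_walks, k, increment=1.0, evaporation=0.1):
    """tau: dict mapping arcs (u, v) to pheromone values.
    ranked_walks: the k best walks of the current round, best first."""
    for arc in tau:
        tau[arc] *= 1.0 - evaporation             # evaporate on every arc
    for j, walk in enumerate(ranked_walks[:k], start=1):
        weight = (k - j + 1) / k                  # weight of the walk of rank j
        for arc in zip(walk, walk[1:]):           # consecutive nodes form arcs
            tau[arc] = tau.get(arc, 0.0) + increment * weight
    return tau
```

In a two-phase scheme such as alternative (i), a rule of this shape would simply replace S-pheromone-update during the first phase.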
7 Conclusion

We have presented a general-purpose algorithm S-ACO applicable to all problems of one of the most frequent problem types in stochastic combinatorial optimization, namely expected-value optimization under deterministic constraints, and have shown that under rather mild conditions, S-ACO converges with probability one to the globally optimal solution of the given stochastic optimization problem. Since the algorithm can usually be applied without the need to tune parameters from "theoretical" to "practical" schemes, and still keeps the property of convergence to optimality, it might be a promising candidate for computational experiments in diverse areas of application of stochastic combinatorial optimization. Of course, experimental comparisons with other metaheuristic algorithms for this problem field, either ACO-based or derived from other concepts, would be very interesting and could be of considerable practical value.
A Converging ACO Algorithm for Stochastic Combinatorial Optimization
References

1. Arnold, D.V., "Evolution strategies in noisy environments – a survey of existing work", in: Theoretical Aspects of Evolutionary Computing, Kallel, L., Naudts, B., Rogers, A. (eds.), Springer (2001), pp. 239–250.
2. Bakuli, D.L., MacGregor Smith, J., "Resource allocation in state-dependent emergency evacuation networks", European J. of Op. Res. 89 (1996), pp. 543–555.
3. Bertsimas, D., Simchi-Levi, D., "A new generation of vehicle routing research: robust algorithms, addressing uncertainty", Operations Research 44 (1996), pp. 286–304.
4. Bianchi, L., Gambardella, L.M., Dorigo, M., "Solving the homogeneous probabilistic travelling salesman problem by the ACO metaheuristic", Proc. ANTS '02, 3rd Int. Workshop on Ant Algorithms (2002), pp. 177–187.
5. Bullnheimer, B., Hartl, R.F., Strauss, C., "A new rank-based version of the Ant System: A computational study", Central European Journal for Operations Research 7 (1) (1999), pp. 25–38.
6. Dorigo, M., Di Caro, G., "The Ant Colony Optimization metaheuristic", in: New Ideas in Optimization, Corne, D., Dorigo, M., Glover, F. (eds.), McGraw-Hill (1999), pp. 11–32.
7. Dorigo, M., Maniezzo, V., Colorni, A., "The Ant System: An autocatalytic optimization process", Technical Report 91-016, Dept. of Electronics, Politecnico di Milano, Italy (1991).
8. Du, J., Leung, J.Y.T., "Minimizing total tardiness on one machine is NP-hard", Mathematics of Operations Research 15 (1990), pp. 483–495.
9. Futschik, A., Pflug, Ch., "Confidence sets for discrete stochastic optimization", Annals of Operations Research 56 (1995), pp. 95–108.
10. Gambardella, L.M., Dorigo, M., "Ant-Q: A Reinforcement Learning approach to the traveling salesman problem", Proc. of ML-95, Twelfth Intern. Conf. on Machine Learning (1995), pp. 252–260.
11. Gelfand, S.B., Mitter, S.K., "Analysis of Simulated Annealing for Optimization", Proc. 24th IEEE Conf. on Decision and Control (1985), pp. 779–786.
12. Gutjahr, W.J., "A graph-based Ant System and its convergence", Future Generation Computer Systems 16 (2000), pp. 873–888.
13. Gutjahr, W.J., "ACO algorithms with guaranteed convergence to the optimal solution", Information Processing Letters 82 (2002), pp. 145–153.
14. Gutjahr, W.J., "A generalized convergence result for the Graph-based Ant System", accepted for publication in: Probability in the Engineering and Informational Sciences.
15. Gutjahr, W.J., Pflug, G., "Simulated annealing for noisy cost functions", J. of Global Optimization 8 (1996), pp. 1–13.
16. Gutjahr, W.J., Strauss, Ch., Wagner, E., "A stochastic branch-and-bound approach to activity crashing in project management", INFORMS J. on Computing 12 (2000), pp. 125–135.
17. Hajek, B., "Cooling schedules for optimal annealing", Mathematics of OR 13 (1988), pp. 311–329.
18. Marianov, V., Serra, D., "Probabilistic maximal covering location-allocation models for congested systems", J. of Regional Science 38 (1998), pp. 401–424.
19. Norkin, V.I., Ermoliev, Y.M., Ruszczynski, A., "On optimal allocation of indivisibles under uncertainty", Operations Research 46 (1998), pp. 381–395.
20. Stützle, T., Hoos, H.H., "The MAX-MIN Ant System and local search for the travelling salesman problem", in: Baeck, T., Michalewicz, Z., Yao, X. (eds.), Proc. ICEC '97 (Int. Conf. on Evolutionary Computation) (1997), pp. 309–314.
21. Stützle, T., Hoos, H.H., "MAX-MIN Ant System", Future Generation Computer Systems 16 (2000), pp. 889–914.
Optimality of Randomized Algorithms for the Intersection Problem

Jérémy Barbay

Department of Computer Science, University of British Columbia,
201-2366 Main Mall, Vancouver, B.C. V6T 1Z4, Canada
[email protected]
Abstract. The “Intersection of sorted arrays” problem has applications in indexed search engines such as Google. Previous works propose and compare deterministic algorithms for this problem, and offer lower bounds on the randomized complexity in different models (cost model, alternation model). We refine the alternation model into the redundancy model to prove that randomized algorithms perform better than deterministic ones on the intersection problem. We present a randomized and simplified version of a previous algorithm, optimal in this model. Keywords: Randomized algorithm, intersection of sorted arrays.
1 Introduction

We consider search engines where queries are composed of several keywords, each one being associated with a sorted array of references to entries in a database. The answer to a conjunctive query is the intersection of the sorted arrays corresponding to each keyword. Most search engines implement these queries. The algorithms work in the comparison model, where comparisons are the only operations permitted on references. The intersection problem has been studied before [1, 4, 5], but the known lower bounds apply to randomized algorithms, while some deterministic algorithms are optimal. Does this mean that no randomized algorithm can do better than a deterministic one on the intersection problem?

In this paper we present a new analysis of the intersection problem, called the redundancy analysis, which is finer and permits us to prove that, for the intersection problem, randomized algorithms perform better than deterministic algorithms in terms of the number of comparisons. The redundancy analysis also makes more natural assumptions about the instances: the worst case in the alternation analysis is such that an element considered by the algorithm is matched by almost all of the keywords, while in the redundancy analysis the maximum number of keywords matching such an element is parametrized by the measure of difficulty.

We define formally the intersection problem and the redundancy model in Section 2. We give in Section 3 a randomized algorithm inspired by the small adaptive algorithm, and give its complexity in the redundancy model, for which we prove it optimal in Section 4. We answer the question of the utility of randomized algorithms for the intersection problem in Section 5: no deterministic algorithm is optimal in the redundancy model. We list in Section 6 several points on which we will try to extend this work.

A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 26–38, 2003.
© Springer-Verlag Berlin Heidelberg 2003
2 Definitions

In the search engines we consider, queries are composed of several keywords, and each keyword is associated with a sorted array of references. The references can be, for instance, addresses of web pages, but the only important point is that there is a total order on them, i.e. any two distinct references can be ordered. To study the problem of intersection, we hence consider any set of sorted arrays over a totally ordered space to form an instance [1]. To perform any complexity analysis on such instances, we need to define a measure representing the size of the instance. For this we define the signature of an instance.

Definition 1 (Instance and Signature). We consider U to be a totally ordered space. An instance is composed of k sorted arrays A1, . . . , Ak of positive sizes n1, . . . , nk, composed of elements from U. The signature of such an instance is (k, n1, . . . , nk). An instance is "of signature at most" (k, n1, . . . , nk) if it can be completed by adding arrays and elements to form an instance of signature exactly (k, n1, . . . , nk).

Example 1. Consider the instance of Figure 1, where the ordered space is the set of positive integers: it has signature (7, 1, 4, 4, 4, 4, 4, 4).

A = (9)
B = (1, 2, 9, 11)
C = (3, 9, 12, 13)
D = (9, 14, 15, 16)
E = (4, 10, 17, 18)
F = (5, 6, 7, 10)
G = (8, 10, 19, 20)
Fig. 1. An instance of the intersection problem: above is the array representation of the instance; the original figure also shows a second, value-aligned representation, which expresses the structure of the instance better, where the abscissa of each element equals its value.
Definition 2 (Intersection). The Intersection of an instance is the set A1 ∩ . . . ∩ Ak composed of the elements that are present in k distinct arrays. Example 2. The intersection A ∩ B ∩ . . . ∩ G of the instance of Figure 1 is empty as no element is present in more than 4 arrays. Any algorithm (even non-deterministic) computing the intersection must certify the correctness of the output: first, it must certify that all the elements of the output are indeed elements of the k arrays; second, it must certify that no element of the intersection has been omitted by exhibiting some proof that there can be no other elements in the intersection than those output. We define the partition-certificate as a proof of the intersection.
28
J´er´emy Barbay
Definition 3 (Partition-Certificate). A partition-certificate is a partition (I_j)_{j≤δ} of U into intervals such that any singleton {x} corresponds to an element x of ∩_i A_i, and each other interval I has an empty intersection I ∩ A_i with at least one array A_i.

Imagine a function which indicates, for each element x ∈ U, the name of an array not containing x if x is not in the intersection, and "all" if x is in the intersection. The minimal number of times such a function alternates names, for x scanning U in increasing order, is also the minimal size of a partition-certificate of the instance (minus one), which is called the alternation.

Definition 4 (Alternation). The alternation δ(A_1, . . . , A_k) of an instance (A_1, . . . , A_k) is the minimal number of intervals forming a partition-certificate of this instance.

Example 3. The alternation of the instance in Figure 1 is δ = 3, as we can see in the value-aligned representation of Figure 1 that the partition (−∞, 9), [9, 10), [10, +∞) is a partition-certificate of size 3, and that none can be smaller.

The alternation measure was used as a measure of the difficulty of the instance [1], as it is the non-deterministic complexity of the instance, and as there is a lower bound, increasing with δ, on the complexity of any randomized algorithm. By definition of the partition-certificate:

– for each singleton {x} of the partition, any algorithm must find the position of x in all arrays A_i, which takes k searches;
– for each interval I_j of the partition, any algorithm must find an array, or a set of arrays, such that the intersection of I_j with this array, or with the intersection of those arrays, is empty.

The cost of finding such a set of arrays can vary and depends on the choices performed by the algorithm. In general it requires fewer searches if there are many possible answers. To take this into account, for each interval I_j of the partition-certificate we will count the number r_j of arrays whose intersection with I_j is empty.
The smaller r_j is, the harder the instance: 1/r_j measures the contribution of this interval to the difficulty of the instance.

Example 4. Consider for instance the interval [10, 11) in the instance of Figure 1: r_j = 4. A randomized algorithm choosing an array uniformly has probability r_j/k of finding an array that does not intersect [10, 11), and will do so within k/r_j trials on average, even if it does not memorize which arrays it tried before. With k fixed, 1/r_j measures the difficulty of proving that no element of [10, 11) is in the intersection of the instance.

We name the sum of those contributions the redundancy of the instance, and it forms our new measure of difficulty:

Definition 5 (Redundancy). Let A_1, . . . , A_k be k sorted arrays, and let (I_j)_{j≤δ} be a partition-certificate for this instance.

– The redundancy ρ(I) of an interval or singleton I is defined to be equal to 1 if I is a singleton, and to 1/#{i : A_i ∩ I = ∅} otherwise.
– The redundancy ρ((I_j)_{j≤δ}) of a partition-certificate (I_j)_{j≤δ} is the sum ∑_j ρ(I_j) of the redundancies of the intervals composing it.
– The redundancy ρ((A_i)_{i≤k}) of an instance of the intersection problem is the minimal redundancy of a partition-certificate of the instance, the minimum of ρ((I_j)_{j≤δ}) over all partition-certificates (I_j)_{j≤δ}.

Note that the redundancy is always well defined and finite: if I is not a singleton then by definition there is at least one array A_i whose intersection with I is empty, and hence #{i : A_i ∩ I = ∅} > 0.

Example 5. The partition-certificate (−∞, 9), [9, 10), [10, 11), [11, +∞) has redundancy at most 1/2 + 1/3 + 1/4 + 1/2, and no other partition-certificate has a smaller redundancy, hence our instance has redundancy 7/6.

The redundancy analysis permits us to measure the difficulty of the instance in a finer way than the alternation does: for fixed k, n_1, . . . , n_k, δ, several instances of signature (k, n_1, . . . , n_k) and alternation δ may present different difficulties for any algorithm, and different redundancies.

Example 6. In the instance of Figure 1, the only way to prove that the intersection of those arrays is empty is to compute the intersection of one of the arrays from {A, B, C, D} with one of the arrays from {E, F, G}. For simplicity, and without loss of generality, we suppose the algorithm searches to intersect A with another array in {B, C, D, E, F, G}, and we focus for this example on the number of unbounded searches performed, instead of the number of comparisons: the randomized algorithm looking for the element of A in an array from {B, C, D, E, F, G} chosen at random performs on average only 2 searches in the first instance, as the probability of finding an array whose intersection with A is empty is then 1/2.
On the other hand, consider a subtle variant of the instance of Figure 1, where the element 9 is present in all the arrays but one, for instance E (only two elements need to change: F[4] and G[2], which were equal to 10 and are now equal to 9). As the two instances have the same signature and alternation, the alternation analysis yields the same lower bound for both instances. But the randomized algorithm described above now performs on average k/2 searches, as opposed to 2 searches on the original instance. This difference of performance is not expressed by a difference of alternation, but is expressed by a difference of redundancy: the new instance has a redundancy of 1/2 + 1 + 1/2 = 2, larger than the redundancy 7/6 of the original instance.¹
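Definition 5 can be made concrete with a small script. The following is our illustrative sketch, using a toy three-array instance over the integers (the helper names and the instance are ours, not those of Figure 1); it computes the redundancy of a given partition-certificate:

```python
from fractions import Fraction

# Sketch of Definition 5: the redundancy of a partition-certificate sums
# 1 for each singleton {x} (x in the intersection) and
# 1 / #{i : A_i empty on I} for each other interval I.

def interval_redundancy(part, arrays):
    lo, hi, is_singleton = part               # interval [lo, hi) over the integers
    if is_singleton:
        return Fraction(1)
    empty = sum(1 for a in arrays if all(not (lo <= x < hi) for x in a))
    return Fraction(1, empty)                 # a certificate guarantees empty >= 1

def redundancy(partition, arrays):
    return sum(interval_redundancy(p, arrays) for p in partition)

arrays = [[1, 5], [2, 5], [5, 9]]             # intersection is {5}
partition = [(float("-inf"), 5, False),       # empty in the third array  -> 1/1
             (5, 6, True),                    # singleton {5}             -> 1
             (6, float("inf"), False)]        # empty in the first two    -> 1/2
```

Here redundancy(partition, arrays) evaluates to 1 + 1 + 1/2 = 5/2 for this toy certificate.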
3 Randomized Algorithm

The algorithm we propose here is a randomized and simplified version of the small adaptive algorithm [5]. It uses the unbounded search algorithm, which looks for an element x in a sorted array A of unknown size, starting at position init, with complexity 2 log2(p − init), where p is the insertion position of the element in the array; it returns a value p such that A[p−1] < x ≤ A[p]. This algorithm has already been studied before;
¹ This is just a particular case given as an example; see Section 5 for the general proof.
it can be implemented using the doubling search and binary search algorithms [1, 4–6], or directly, improving the complexity by a constant factor [3].

Given t ∈ {1, . . . , k} and k non-empty sorted sets A_1, . . . , A_k of sizes n_1, . . . , n_k, the rand intersection algorithm (Algorithm 1) computes the intersection I = A_1 ∩ . . . ∩ A_k. For simplicity, we assume that all arrays contain the element −∞ at position 0 and the element +∞ at position n_i + 1. The algorithm is composed of two nested loops. The outer loop iterates through potential elements of the intersection, held in variable m and considered in increasing order, and the inner loop checks for each value of m whether it is in the intersection. In each pass of the inner loop, the algorithm searches for m in one array A_s which potentially contains it. The invariant of the inner loop is that, at the start of each pass and for each array A_i, p_i denotes the first potential position for m in A_i: A_i[p_i − 1] < m. The variables #YES and #NO count how many arrays are known to contain m or not, and are updated depending on the result of each search. A new value for m is chosen every time we enter the outer loop, at which time the current subproblem is to compute the intersection of the sub-arrays A_i[p_i, . . . , n_i] for all values of i. Any first element A_i[p_i] of a sub-array could be a candidate, but a better candidate is one which is larger than the last value of m: the algorithm chooses A_s[p_s], which is by definition larger than m. At that point only one array, A_s, is known to contain m, hence #YES ← 1, and no array is known not to contain it, hence #NO ← 0. The algorithm terminates when all the values of the current array have been considered and m has taken the last value +∞.
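As a concrete illustration, the unbounded search can be sketched in Python as follows. This is our own sketch: the function name, the 0-based indexing, and the use of Python's bisect for the final binary search are our choices, not the paper's:

```python
from bisect import bisect_left

# Unbounded (doubling) search sketch: starting from position init in the
# sorted array a, gallop in steps 1, 2, 4, ... past x, then binary-search
# the bracketed range. Returns the smallest p >= init with a[p] >= x
# (p == len(a) if there is none), so that a[p - 1] < x <= a[p].

def unbounded_search(x, a, init):
    n = len(a)
    if init >= n or a[init] >= x:
        return init
    step = 1                                    # from here on, a[init] < x
    while init + step < n and a[init + step] < x:
        step *= 2                               # gallop past the target
    lo = init + step // 2                       # invariant: a[lo] < x
    hi = min(init + step, n)                    # a[hi] >= x unless hi == n
    return bisect_left(a, x, lo + 1, hi)
```

For example, unbounded_search(8, [1, 3, 5, 7, 9, 11, 13], 0) returns 4, the position of 9, the first element not smaller than 8.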
Algorithm 1 Rand Intersection(A_1, . . . , A_k)
Given k non-empty sorted sets A_1, . . . , A_k of sizes n_1, . . . , n_k, the algorithm computes in variable I the intersection A_1 ∩ . . . ∩ A_k. Note that the only random instruction is the choice of the array in the inner loop.

for all i do p_i ← 1 end for
I ← ∅; s ← 1
repeat
    m ← A_s[p_s]
    #NO ← 0; #YES ← 1
    while #YES < k and #NO = 0 do
        Let A_s be a random array not yet searched for m, i.e. such that A_s[p_s] ≠ m
        p_s ← Unbounded Search(m, A_s, p_s)
        if A_s[p_s] = m then #YES ← #YES + 1 else #NO ← 1 end if
    end while
    if #YES = k then I ← I ∪ {m} end if
    for all i such that A_i[p_i] = m do p_i ← p_i + 1 end for
until m = +∞
return I
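Algorithm 1 can be sketched in Python as follows. This is our hedged translation, not the paper's code: bisect_left stands in for the unbounded search, indices are 0-based, a +inf sentinel replaces the A_i[n_i + 1] convention, and a known set implements the choice of a random array not yet searched for the current candidate:

```python
import random
from bisect import bisect_left

# Sketch of the rand intersection algorithm: the outer loop walks through
# candidates m in increasing order, the inner loop searches for m in
# randomly chosen arrays until m is found everywhere or missing somewhere.

INF = float("inf")

def rand_intersection(arrays):
    a = [list(ai) + [INF] for ai in arrays]    # sentinel ends each array
    k = len(a)
    p = [0] * k             # p[i]: first potential position for m in A_i
    inter, s = [], 0
    while True:
        m = a[s][p[s]]      # next candidate for the intersection
        if m == INF:
            return inter
        n_yes, n_no = 1, 0                     # only A_s is known to contain m
        known = {s}
        while n_yes < k and n_no == 0:
            s = random.choice([i for i in range(k) if i not in known])
            p[s] = bisect_left(a[s], m, p[s])  # A_s[p_s - 1] < m <= A_s[p_s]
            known.add(s)
            if a[s][p[s]] == m:
                n_yes += 1
            else:
                n_no = 1
        if n_yes == k:
            inter.append(m)
        for i in range(k):                     # advance past m where it occurs
            if a[i][p[i]] == m:
                p[i] += 1
```

The output is deterministic even though the array choices are random; for instance rand_intersection([[1, 2, 9, 11], [3, 9, 12, 13], [9, 14, 15, 16]]) returns [9].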
Theorem 1. Algorithm rand intersection (Algorithm 1) performs on average O(ρ ∑_i log(n_i/ρ)) comparisons on an instance of signature (k, n_1, . . . , n_k) and of redundancy ρ.

Proof. Let (I_j)_{j≤δ} be a partition-certificate of minimal redundancy ρ. Each comparison performed by the algorithm is said to be performed in phase j if m ∈ I_j for some interval I_j of the partition. Let C_i^j be the number of unbounded searches performed by the algorithm during phase j in array A_i, let C_i = ∑_j C_i^j be the number of unbounded searches performed by the algorithm in array A_i over the whole execution, and let (r_j)_{j≤δ} be such that r_j is equal to 1 if I_j is a singleton, and to #{i : A_i ∩ I_j = ∅} otherwise.

Consider a fixed phase j ∈ {1, . . . , δ}. If the phase is positive (if m is in the intersection) then C_i^j = 1 for all i = 1, . . . , k; remark that in this case 1/r_j is also equal to 1, so that C_i^j = 1/r_j. If the phase is negative (if m is not in the intersection), C_i^j is a random variable:

– If A_i ∩ I_j = ∅ then C_i^j ∈ {0, 1}, as the algorithm terminates the phase whenever it searches in A_i. The probability of performing such a search is Pr{C_i^j = 1 | A_i ∩ I_j = ∅} = 1/r_j, so the average number of searches is E(C_i^j | A_i ∩ I_j = ∅) = 1 · Pr{C_i^j = 1 | A_i ∩ I_j = ∅} = 1/r_j.

– If A_i ∩ I_j ≠ ∅ then at each new search, either C_i^j is incremented, with probability 1/(k−1), because the search occurred in A_i; or C_i^j is fixed in a final way, with probability r_j/(k−1), because an array of empty intersection with I_j was searched; or it is neither incremented nor fixed, with probability 1 − (1+r_j)/(k−1), in the other cases. This system is equivalent to a system where C_i^j is incremented with probability 1/(1+r_j) and fixed with probability r_j/(1+r_j). From this we can deduce that C_i^j is incremented on average 1/r_j times before it is fixed.

So the algorithm performs on average E(C_i) = ∑_j 1/r_j = ρ unbounded searches in array A_i.

Let g_{i,j}^l be the increment of p_i due to the l-th unbounded search in array A_i during phase j. Notice that ∑_{j,l} g_{i,j}^l ≤ n_i. The algorithm performs 2 log(g_{i,j}^l + 1) comparisons during the l-th search of phase j in array A_i, so it performs 2 ∑_{j,l} log(g_{i,j}^l + 1) comparisons between m and an element of array A_i during the whole execution. Because of the concavity of the function log(x + 1), this is smaller than 2 C_i log((∑_{j,l} g_{i,j}^l)/C_i + 1), and because of the preceding remark ∑_{j,l} g_{i,j}^l ≤ n_i, this is still smaller than 2 C_i log(n_i/C_i + 1).

The functions f_i(x) = 2x log(n_i/x + 1) are concave for x ≤ n_i, so E(f_i(C_i)) ≤ f_i(E(C_i)). As the average complexity of the algorithm in array A_i is E(f_i(C_i)), and as E(C_i) = ρ, on average the algorithm performs fewer than 2ρ log(n_i/ρ + 1) comparisons between m and elements of array A_i. Summing over i we get the final result, which is O(ρ ∑_i log(n_i/ρ)).
4 Randomized Complexity Lower Bound

We now prove that no randomized algorithm can do asymptotically better. The proof is quite similar to the lower bound in the alternation model [1], and differs mostly in Lemma 1, which must be adapted to the redundancy and whose lower bound is improved by a constant multiplicative factor.

In Lemma 1 we prove a lower bound on the average over a distribution of instances of redundancy at most ρ = 4 and of output size at most 1. We use this result in Lemma 2 to define a distribution on instances of redundancy at most ρ ∈ {4, . . . , 4n_1} by combining p = ⌊ρ/4⌋ sub-instances. In Lemma 3 we prove that any instance of signature (k, n_1, . . . , n_k) has redundancy ρ at most 2n_1 + 1, so that the result of Lemma 2 holds for any ρ ≥ 4. Finally, applying the Yao-von Neumann principle [7–9] in Theorem 2 gives a lower bound of Ω(ρ ∑_{i=2}^k log(n_i/ρ)) on the complexity of any randomized algorithm for the Intersection problem.

Lemma 1. For any k ≥ 2 and 0 < n_1 ≤ . . . ≤ n_k, there is a distribution on instances of the Intersection problem with signature at most (k, n_1, . . . , n_k) and redundancy at most 4, such that any deterministic algorithm performs at least (1/4) ∑_{i=2}^k log_2(2n_i + 1) + ∑_{i=2}^k 1/(2n_i + 1) − (k−2) comparisons on average.

Proof. Let C be the total number of comparisons performed by the algorithm, and for each array A_i let F_i = log_2(2n_i + 1) and F = ∑_{i=2}^k F_i. Let us draw an index w ∈ {2, . . . , k} equal to i with probability F_i/F, and (k−1) positions (p_i)_{i∈{2,...,k}} such that each p_i is chosen uniformly at random in {1, . . . , n_i}. Let P and N be two instances such that, in both P and N and for each i ∈ {2, . . . , k}, the elements of A_i at positions smaller than p_i are smaller than A_1[1] and the elements at positions larger than p_i are larger than A_1[1]; such that in P the element at position p_i of every array A_i is equal to A_1[1]; and such that in N the element at position p_w of A_w is larger than A_1[1], while the elements at position p_i in all other arrays than A_w are equal to A_1[1].
Fig. 2. Distribution on (P, N): elements are represented by dots whose abscissa equals their value; the full large dots correspond to the elements at position p_i in each array A_i.
Let x = A_1[1] be the first element of the first array, and call x-comparisons the comparisons between any element and x. Because of the special relative positions of the elements, a comparison between two elements b and d in any arrays yields no more information than the two comparisons of x with b and with d: the positions of b and d relative to x permit us to deduce their order. Hence any algorithm performing C comparisons between arbitrary elements can be expressed as an algorithm performing no more than 2C x-comparisons, and any lower bound L on the complexity of algorithms using only x-comparisons yields a lower bound L/2 on the complexity of algorithms using comparisons between arbitrary elements.

The redundancy of such instances is no more than 4: the interval (−∞, A_1[1]) certifies that no element smaller than x is in the intersection and accounts for a redundancy of at most 1; the interval (A_1[n_1], +∞) certifies that no element larger than A_1[n_1] is in the intersection and accounts for a redundancy of at most 1; the interval [A_1[1], A_1[n_1]] suffices in N to complete the partition-certificate and accounts for a redundancy of at most 1; the singleton {x} and the interval (A_1[1], A_1[n_1]] suffice in P to complete the partition-certificate and account for a redundancy of at most 1 + 1/(k−1).

The only difference between instances P and N is the position of element A_w[p_w] relative to the other elements composing the instance, as described in Figure 2. Any algorithm computing the intersection of P has to find the (k−1) positions {p_2, . . . , p_k}. Any algorithm computing the intersection of N has to find w and the corresponding position p_w. Any algorithm distinguishing between P and N has to find p_w: we will prove that it needs on average almost F/2 = (1/2) ∑_{i=2}^k log_2(2n_i + 1) comparisons to do so.

Let A be a deterministic algorithm using only x-comparisons to compute the intersection. As the algorithm A cannot distinguish between P and N before it finds w, let X_i denote the number of x-comparisons performed by A in array A_i on both P and N; let Y_i denote the number of x-comparisons performed by A in array A_i on N; and let ξ_i be the indicator variable which equals 1 exactly if p_i has been determined by A on instance P. The number of comparisons performed by A is C = ∑_{i=2}^k X_i. Restricting ourselves to arrays in which the position p_i has been determined, we can write C ≥ ∑_{i=2}^k X_i ξ_i = ∑_{i=2}^k Y_i ξ_i.

Consider E(Y_i ξ_i): the expectation decomposes as a sum of probabilities, E(Y_i ξ_i) = ∑_h Pr{Y_i ξ_i ≥ h}, and in particular E(Y_i ξ_i) ≥ ∑_{h=1}^{F_i} Pr{Y_i ξ_i ≥ h}. Those terms can be bounded using the property Pr{a ∨ b} ≤ Pr{a} + Pr{b}:

Pr{Y_i ξ_i ≥ h} = Pr{Y_i ≥ h ∧ ξ_i = 1} = 1 − Pr{Y_i < h ∨ ξ_i = 0} ≥ 1 − Pr{Y_i < h} − Pr{ξ_i = 0} = Pr{ξ_i = 1} − Pr{Y_i < h}.   (1)

The probability Pr{Y_i < h} is bounded by the usual decision-tree lower bound: considering the binary x-comparisons performed by algorithm A in array A_i, there are at most 2^h leaves at depth less than h. Since the insertion position of x in A_i is uniformly chosen, these leaves are equiprobable and have total probability at most Pr{Y_i < h} ≤ 2^h/(2n_i + 1) = 2^{h−F_i}. Those terms for h ∈ {1, . . . , F_i} form a geometric sequence
whose sum is equal to 2(1 − 2^{−F_i}), so E(Y_i ξ_i) ≥ F_i Pr{ξ_i = 1} − 2(1 − 2^{−F_i}). Then

E(C) ≥ ∑_{i=2}^k E(Y_i ξ_i) ≥ ∑_{i=2}^k F_i Pr{ξ_i = 1} − ∑_{i=2}^k 2(1 − 2^{−F_i})
     = ∑_{i=2}^k F_i Pr{ξ_i = 1} + 2 ∑_{i=2}^k 2^{−F_i} − 2(k − 2).   (2)
Let us fix p = (p_2, . . . , p_k). There are only k − 1 possible choices for w, and algorithm A can only differentiate between P and N when it finds w. Let σ denote the order in which these candidates are dealt with by A for fixed p. Then ξ_i = 1 if and only if σ_i ≤ σ_w, and so Pr{ξ_i = 1 | p} = ∑_{j: σ_j ≥ σ_i} F_j/F. Summing over p, and then over i, we get an expression for the first term of Equation (2):

Pr{ξ_i = 1} = ∑_p Pr{ξ_i = 1 | p} Pr{p} = ∑_p ∑_{j: σ_j ≥ σ_i} (F_j/F) Pr{p},

∑_{i=2}^k F_i Pr{ξ_i = 1} = ∑_p ∑_{i=2}^k ∑_{j: σ_j ≥ σ_i} (F_i F_j/F) Pr{p} = (1/F) ∑_p Pr{p} ∑_{i=2}^k ∑_{j: σ_j ≥ σ_i} F_i F_j.

In the inner sum, each term F_i F_j appears exactly once, and since

(∑_i F_i)^2 = 2 ∑_{i≤j} F_i F_j − ∑_i F_i^2,

we have

∑_{i=2}^k ∑_{j: σ_j ≥ σ_i} F_i F_j = (1/2) ((∑_{i=2}^k F_i)^2 + ∑_{i=2}^k F_i^2),

which is independent of p. Then we can conclude:

∑_{i=2}^k F_i Pr{ξ_i = 1} = (1/(2F)) ((∑_{i=2}^k F_i)^2 + ∑_{i=2}^k F_i^2) ∑_p Pr{p} ≥ (1/2) ∑_{i=2}^k F_i.

Plugging this into Equation (2), we obtain a lower bound of (1/2) ∑_{i=2}^k F_i + 2 ∑_{i=2}^k 2^{−F_i} − 2(k−2), which is (1/2) ∑_{i=2}^k log_2(2n_i + 1) + 2 ∑_{i=2}^k 1/(2n_i + 1) − 2(k−2), on the average number of x-comparisons E(C) performed by any deterministic algorithm restricted to x-comparisons. This in turn implies a lower bound of (1/4) ∑_{i=2}^k log_2(2n_i + 1) + ∑_{i=2}^k 1/(2n_i + 1) − (k−2) on the average number of comparisons performed by any deterministic algorithm, hence the result.

Lemma 2. For any k ≥ 2, 0 < n_1 ≤ . . . ≤ n_k and ρ ∈ {4, . . . , 4n_1}, there is a distribution on instances of the Intersection problem of signature at most (k, n_1, . . . , n_k) and redundancy at most ρ, such that any deterministic algorithm performs on average Ω(ρ ∑_{i=1}^k log(n_i/ρ)) comparisons.
Proof. Let us draw p = ⌊ρ/4⌋ pairs (P_j, N_j)_{j∈{1,...,p}} of sub-instances of signature (k, ⌊n_1/p⌋, . . . , ⌊n_k/p⌋) from the distribution of Lemma 1. As ρ ≤ 4n_1, we have p ≤ n_1 and ⌊n_1/p⌋ > 0, hence the sizes of all the arrays are positive. Let us choose uniformly at random each sub-instance I_j between the positive sub-instance P_j and the negative sub-instance N_j, and form a larger instance I by unifying the arrays of the same index from each sub-instance, such that the elements from two different sub-instances never interleave, as in Figure 3.
Fig. 3. p elementary instances unified to form a single large instance.
This defines a distribution on instances of redundancy at most 4p, hence at most ρ, and of signature at most (k, n_1, ..., n_k). Solving such an instance requires solving all p sub-instances. Lemma 1 gives a lower bound of $\frac{1}{4}\sum_{i=2}^{k}\log(2n_i/p+1) + \sum_{i=2}^{k}\frac{1}{2n_i/p+1} - (k-2)$ comparisons on average for each of the p sub-problems, hence a lower bound of $p\bigl(\frac{1}{4}\sum_{i=2}^{k}\log(2n_i/p+1) + \sum_{i=2}^{k}\frac{1}{2n_i/p+1} - (k-2)\bigr)$, which is $\Omega(\rho\sum_{i=1}^{k}\log(n_i/\rho))$.

Lemma 3. For any k ≥ 2 and 0 < n_1 ≤ ... ≤ n_k, any instance of signature (k, n_1, ..., n_k) has redundancy ρ at most 2n_1 + 1.

Proof. First observe that there is always a partition-certificate of size 2n_1 + 1, and then that the redundancy of any partition-certificate is by definition smaller than the size of the partition. Hence the result.

Theorem 2. For any k ≥ 2, 0 < n_1 ≤ ... ≤ n_k and ρ ∈ {4, ..., 4n_1}, the complexity of any randomized algorithm for the Intersection problem on instances of signature at most (k, n_1, ..., n_k) and redundancy at most ρ is Ω(ρ Σ_{i=1}^{k} log(n_i/ρ)).

Proof. This is a simple application of Lemma 2, Lemma 3 and the Yao-von Neumann principle [7-9]:
– Lemma 2 gives, for any ρ ∈ {4, ..., 4n_1}, a distribution on instances of redundancy at most ρ;
– Lemma 3 proves that there are no instances of redundancy more than 2n_1 + 1, hence the result of Lemma 2 holds for any ρ ≥ 4;
– the Yao-von Neumann principle then allows us to deduce from this distribution a lower bound on the complexity of randomized algorithms.
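For reference, the Yao-von Neumann principle invoked here can be stated as follows; the notation below is a standard textbook formulation rather than the one used in [7-9]:

```latex
% Yao's minimax principle.  Let cost(A, I) denote the number of
% comparisons algorithm A performs on instance I.  For any distribution
% D on instances and any randomized algorithm R (i.e., a distribution
% on deterministic algorithms):
\[
  \max_{I}\ \mathbb{E}\bigl[\mathrm{cost}(R,I)\bigr]
  \;\ge\;
  \min_{A\ \mathrm{deterministic}}\ \mathbb{E}_{I\sim D}\bigl[\mathrm{cost}(A,I)\bigr].
\]
% Hence the average-case lower bound of Lemma 2, which holds for every
% deterministic algorithm against the constructed distribution, applies
% to the worst case of every randomized algorithm.
```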
Jérémy Barbay
This analysis is much finer than the previous lower bound presented in [1], where the additive term in −k was ignored, even though it makes the lower bound trivially negative for large values of the difficulty δ. Here the additive term is suppressed for min_i n_i ≥ 128, and the multiplicative factor between the lower bound and the upper bound is reduced from 64 to 16. This technique can be applied to the alternation analysis of the intersection with the same result. Note also that a multiplicative factor of 2 in the gap comes from the unbounded searches in the algorithm and can be reduced using a more complicated algorithm for unbounded search [3].
5 Comparisons with the Alternation Model

The redundancy model is strictly finer than the alternation model: some algorithms that are optimal in the alternation model are no longer optimal in the redundancy model (Theorem 3), while any algorithm optimal in the redundancy model is optimal in the alternation model (Theorem 4). So the rand intersection algorithm is theoretically better than its deterministic variant, and the redundancy model permits a better analysis than the alternation model.

Theorem 3. Any deterministic algorithm performs Ω(kρ Σ_i log(n_i/(kρ))) comparisons in the worst case over all instances of signature at most (k, n_1, ..., n_k) and redundancy at most ρ.

Proof. The proof uses the same decomposition as the proof of Theorem 2, but uses an adversary argument to obtain a deterministic lower bound. Build δ = kρ/3 sub-instances of signature (k, n_1/δ, ..., n_k/δ) and redundancy at most 3, such that x = A_1[1] is present in half of the other arrays, as in Figure 4. On each sub-instance an adversary can force any deterministic algorithm to perform a search in each of the arrays containing x, and in a single array which does not contain x. The deterministic algorithm then performs $\frac{1}{2}\sum_{i=2}^{k}\log\frac{n_i}{\delta}$ comparisons. In total over all sub-instances, the adversary can force any deterministic algorithm to perform $\frac{\delta}{2}\sum_{i=2}^{k}\log\frac{n_i}{\delta}$ comparisons, which is $\Omega(k\rho\sum_{i=2}^{k}\log\frac{n_i}{k\rho})$. As x log(n/x) is an increasing function of x, kρ Σ_i log(n_i/(kρ)) > ρ Σ_i log(n_i/ρ), and no deterministic algorithm is optimal in the redundancy model.

Theorem 4. Any algorithm optimal in the redundancy model is optimal in the alternation model.

Proof. By definition of the redundancy ρ and of the alternation δ of an instance, δ/k ≤ ρ ≤ δ. So if an algorithm performs O(ρ Σ log(n_i/ρ)) comparisons, it also performs O(δ Σ log(n_i/δ)) comparisons. Hence the result, as this is the lower bound in the alternation model.
This also proves that the measure of difficulty of Demaine, López-Ortiz and Munro [4] is not comparable with the measure of redundancy, as it is not comparable with the measure of alternation [2, Section 2.3]. This means that the two measures are complementary, neither one subsuming the other, unlike the case of the alternation. All those measures describe the difficulty of the instance:
Fig. 4. Element x is present in half of the arrays of the sub-instance.
Fig. 5. The adversary performs several strategies in parallel, one for each sub-instance.
– the alternation describes the number of key blocks of consecutive elements in the instance;
– the cost describes the distribution of the sizes of those blocks;
– the redundancy describes the difficulty of finding each block.
But only the cost and the redundancy matter, because the alternation analysis reduces to the redundancy analysis.
6 Perspectives

The t-threshold set and opt-threshold set problems [2] are natural generalizations of the intersection problem which could be useful in indexed search engines. The redundancy seems to be important in the complexity of these problems as well, but we have so far failed to define the proper measure there. Once the proper definitions of a certificate and of the difficulty measure are found, the results of this paper should generalize to the t-threshold set and opt-threshold set problems.

Deterministic algorithms for the intersection have been studied on practical data [5]. The performance of the randomized versions of those algorithms, in terms of the number of comparisons and in terms of running time, will be studied. We expect, for instance,
the average number of comparisons to decrease, as the randomized algorithm depends less on the structure of the instance than the deterministic one does.
Acknowledgments

This work was done at UBC, Vancouver, Canada, during a post-doctoral internship funded by the Institut National de Recherche en Informatique et en Automatique (INRIA), France, under the mentorship of Joël Friedman. The redundancy analysis is a development of joint work with Claire Kenyon on the alternation analysis. The author wishes to thank all these people.
References
1. Jérémy Barbay and Claire Kenyon. Adaptive intersection and t-threshold problems. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 390-399. ACM-SIAM, ACM, January 2002.
2. Jérémy Barbay and Claire Kenyon. Randomized lower bound and deterministic algorithms for intersection, t-threshold and opt-threshold. Submitted, May 2003.
3. Jon Louis Bentley and Andrew Chi-Chih Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3):82-87, 1976.
4. Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Adaptive set intersections, unions, and differences. In Proceedings of the 11th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 743-752, 2000.
5. Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Experiments on adaptive set intersections for text retrieval systems. In Proceedings of the 3rd Workshop on Algorithm Engineering and Experiments, Lecture Notes in Computer Science, pages 5-6, Washington DC, January 2001.
6. Kurt Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching, chapter 4.2, Nearly Optimal Binary Search Trees, pages 184-185. Springer-Verlag, 1984.
7. J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. 1st ed. Princeton University Press, 1944.
8. M. Sion. On general minimax theorems. Pacific Journal of Mathematics, pages 171-176, 1958.
9. A. C. Yao. Probabilistic computations: Toward a unified measure of complexity. In Proc. 18th IEEE Symposium on Foundations of Computer Science (FOCS), pages 222-227, 1977.
Stochastic Algorithms for Gene Expression Analysis

Lucila Ohno-Machado 1 and Winston Patrick Kuo 1,2

1 Decision Systems Group, Division of Health Science and Technology, Harvard-MIT, Cambridge, USA
[email protected]
2 Department of Oral Medicine, Infection, and Immunity, Harvard School of Dental Medicine, Boston, USA
[email protected]
http://dsg.harvard.edu
Abstract. Recent advances in the measurement of gene expression have allowed large data sets to become available for different types of analyses. In these data sets, the number of variables exceeds the number of observations by at least one order of magnitude. Substantial variable reduction is usually necessary before learning algorithms can be utilized in practice. Commonly used greedy variable selection strategies preclude the discovery of potentially important variable combinations if the variables in the combination are not sufficiently informative in isolation. Given the high dimensionality, artifacts are frequent, and evaluation techniques to prevent model overfitting must be employed. In this article, we describe the factors that make the analysis of high-throughput gene expression data especially challenging, and indicate why properly evaluated stochastic algorithms can play a particularly important role in this process.

Keywords: Gene expression, supervised learning, unsupervised learning, microarrays.
1 Introduction

Understanding the mechanisms that determine cell development is essential to the discovery of new therapies. It involves deciphering genomic, transcriptomic, and proteomic data generated by high-throughput experimental technologies and organizing information gathered from traditional biology. Recent technological advances in the biological sciences have opened a whole new domain for the development and application of machine learning algorithms to the study of gene expression profiles and other sources of genomic data (Figure 1). In this domain, usually called functional genomics, the analysis is not based on differences in gene structure (which is the object of study in structural genomics), but rather on how genes are differentially expressed in particular tissues and respective environments. The current technology allows for the simultaneous measurement of the expression of several thousands of genes. This leads to the large m, small n problem (where m is the number of variables and n is the number of observations), also known as the "curse of dimensionality". Improper use of machine learning algorithms may yield models that do not generalize to new cases. This is particularly problematic for unsupervised algorithms, where erroneous conclusions may be

A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 39-49, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Correlations between genomic research data using supervised and unsupervised learning approaches to facilitate gene discovery for genomic medicine applications.
derived from artifacts in the data and no objective validation is possible. In this article, we briefly review high-throughput gene expression measurement technologies and the most popular methods for analyzing the data obtained from microarrays (also known as "gene chips") [1, 2]. Other high-throughput methods used to measure gene expression include serial analysis of gene expression (SAGE) [3], massively parallel signature sequencing (MPSS) [4], and array-based comparative genomic hybridization (CGH) [5]. Methods such as Northern blot and real-time PCR (RT-PCR) do not measure thousands of genes simultaneously, and therefore their analysis is not challenged by the same problems as high-throughput technologies. These latter methods, however, are often used to confirm the results of the less specific high-throughput methods. In this article, we describe the fundamental problems with the supervised and unsupervised learning methods that are commonly applied to these types of data sets.
1.1 Gene Expression Measurements

Genes are made of DNA, which is composed of a string of four types of nucleotides: A, C, T, and G. By examining the sequences, researchers hope to implicate regions of DNA (genes) that control normal and disease processes. While proteins are responsible for the activities in the cell, they are ultimately controlled by the instructions contained in the sequences, through an intermediate step involving messenger RNA (mRNA). According to the "central dogma" of molecular biology, DNA is transcribed into mRNA, which is composed of A, C, G, and U nucleotides and is transported from the nucleus of the cell to the cytoplasm, where it is translated into protein. Proteins are responsible for the phenotype (i.e., characteristics) of an individual. Therefore, by knowing which mRNAs are present in the cell, we can identify which genes are being expressed, and begin to understand some of the basic elements controlling the structure and function of the cell.

Microarrays

Microarrays are an emerging technology that has made it possible to measure several thousands of pre-specified genes or expressed sequence tags (ESTs, which are segments of DNA that have been shown to be transcribed but have not been characterized or annotated) on a glass slide in a single experiment. Currently there are three common types of DNA microarray platform: cDNA arrays, and short and long oligonucleotide arrays. They differ in their probe design, probe reagents, choice of solid support, and production method.

cDNA microarrays consist of long (100-1,000 basepairs) double-stranded probes. Inserts from cloned libraries are amplified, purified, and then robotically spotted onto a solid substrate. cDNA arrays have lower specificity due to cross-hybridization (non-specific binding). PCR products of sequences of nucleotides corresponding to genes or ESTs, obtained from sequence databases, are spotted on the array.
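The central dogma steps just described (transcription of DNA into mRNA, and the complementarity of nucleotides that underlies hybridization) can be made concrete in a few lines of code. The sketch below is purely illustrative; the function names are our own invention, not part of any microarray toolkit:

```python
# Watson-Crick complementarity: A pairs with T (U in RNA), C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence, i.e. the
    strand that would hybridize to it (read in the opposite direction)."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))

def transcribe(dna: str) -> str:
    """Transcribe a DNA coding strand into mRNA (T becomes U)."""
    return dna.replace("T", "U")

if __name__ == "__main__":
    probe = "ACGTTGCA"
    print(reverse_complement(probe))  # TGCAACGT
    print(transcribe(probe))          # ACGUUGCA
```

A probe on an array captures exactly those transcripts whose sequence is complementary to it, which is the basis of the hybridization step described below.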
Spotting of ESTs offers the potential for the discovery of new genes and the definition of their roles in a given environment. The strength of this approach is its flexibility: the ability to change the layout of probes from a specified (focused) experiment to a more comprehensive global approach. In addition to their low cost, the capability to compare two samples simultaneously, such as normal cells with cancer cells, offers an enormous advantage for pairwise analysis.

Oligonucleotide arrays, such as the Affymetrix GeneChip, are synthesized in situ and currently offer the highest reproducibility and specificity. Probes are short (25 basepairs) single-stranded DNA molecules tethered to a solid support, using a modification of semiconductor photolithography technology. GeneChips are designed with 16-20 probes representing each gene on an array. Each oligonucleotide (probe) on the array is matched with an identical one, except at the 13th position, where there is a single-base mismatch. This mismatch sequence serves as an internal control for non-specific hybridization. This technology allows for applications involving the analysis of large amounts of known sequence content. Compared with short in situ synthesized probes, oligonucleotide arrays of longer probe length offer higher signal intensity while still maintaining specificity [6]. These
Fig. 2. Schematic overview of a microarray experiment. RNA is extracted from the tissue of interest. The RNA undergoes a quality control test, either by running a gel or through an Agilent BioAnalyzer. The RNA is then prepared for hybridization onto a glass slide, washed, scanned, and quantified for data analysis.
arrays generally use oligonucleotides of 30-80 basepairs in length and a robotic deposition printing process. An advantage of this technology is that specific oligonucleotides can be purchased and then printed into custom arrays by an in-house facility.

Samples that are collected for microarray experiments are generally prepared as specified in the vendor's protocol. Eventually each sample is labeled by incorporating a multicolor fluorescence, a single-color fluorescence, or an isotope. Fluorescently labeled samples (from the tissue of interest) containing mRNA are hybridized onto the array. Hybridization is the process of binding two complementary sequences (that is, A is complementary to T, C is complementary to G, and vice versa). The signal from an array that has been hybridized is extracted using a scanner. The image file is then converted to numerical data for further analysis. The level of expression is positively associated with a greater binding of the RNA transcripts to the probe on the array, giving a brighter intensity after hybridization. Due to variation, either biological or technology-based, it is recommended that experiments be conducted in triplicate. Figure 2 illustrates a schematic overview.

Obtaining informative gene lists of positive candidates from microarray studies, while reducing the number of false positives and false negatives and still producing meaningful results, requires a well thought-out study design. Before expression data from different cDNA or oligonucleotide microarrays can be compared to each other and machine learning algorithms applied (discussed later), the data require some form of normalization. Normalization attempts to isolate the biological information by removing the impact of non-biological influences on the data, and by correcting for systematic bias in expression data.
Systematic bias can be caused by differences in labeling efficiencies, scanner malfunction, differences in the initial quantity of mRNA, different concentrations of DNA on the arrays (reporter bias), printing and robotic-tip problems and other microarray batch biases, uneven hybridization, as well as experimenter-related issues. There are several techniques that are widely used to normalize gene-expression data [7]. Every normalization strategy relies on a set of assumptions, and it is important to understand one's data in order to know whether these assumptions are appropriate for the data set at hand. In general, all of the strategies assume that the average gene does not change, either looking at the entire data set or at a user-defined subset of genes. Presently, cross-platform comparisons [8] are hindered by the fact that individual microarray platforms interrogate different sets of genes, by the lack of standardized RNA processing and labeling procedures and of appropriate software and annotation tools, and by the lack of a common reference RNA that would allow for cross-experimental normalization.

Fig. 3. Microarray data analysis flowchart. There are four major areas involved in a microarray experiment: (a) experimental design, (b) data capture, (c) data analyses, and (d) biological validations.

After a thorough data analysis and the identification of a list or cluster of genes that have been associated with a particular event or process under investigation, one has to ask whether all or only a few of these genes have biological meaning. It is imperative that researchers assess the false positive rate and conduct independent biological validations to confirm those findings. Generally, estimates of false positives can be made from statistical analyses of replicated experiments. The techniques used to verify expression data will vary depending on the process studied. Real-time PCR, RNase protection, and Northern blot are methods that can be used to confirm relative gene expression quantitatively. Other methods of validation include in situ hybridization and immunohistochemistry. Figure 3 illustrates a flowchart of the essential steps involved in a microarray experiment.
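As one concrete illustration of the normalization step discussed above, the sketch below implements quantile normalization, one widely used strategy, in plain Python. The function name and the toy matrix are our own; a real pipeline would also handle ties, missing values, and log-scale intensities:

```python
def quantile_normalize(matrix):
    """Quantile-normalize a gene-expression matrix.

    `matrix` is a list of samples, each a list of intensities for the
    same genes.  After normalization every sample has the same
    empirical distribution: each sample's k-th smallest value is
    replaced by the mean of the k-th smallest values across samples.
    """
    n_genes = len(matrix[0])
    # Rank the genes within each sample (indices sorted by intensity).
    order = [sorted(range(n_genes), key=lambda g: sample[g])
             for sample in matrix]
    # Reference distribution: mean over samples of each rank's value.
    reference = [
        sum(matrix[s][order[s][k]] for s in range(len(matrix))) / len(matrix)
        for k in range(n_genes)
    ]
    # Substitute each value by the reference value of its rank.
    result = [[0.0] * n_genes for _ in matrix]
    for s in range(len(matrix)):
        for k, g in enumerate(order[s]):
            result[s][g] = reference[k]
    return result

if __name__ == "__main__":
    raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
    print(quantile_normalize(raw))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Note how this encodes the assumption mentioned in the text: it forces all samples to share one distribution, which is only appropriate when most genes are expected not to change between samples.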
2 Learning Methods

In the early microarray experiments the number of observations was usually very limited. There was no classification task at hand, but rather a need for simple exploration of the data. Researchers then started to uncover regular patterns in the data and to demonstrate that these patterns could potentially be used to classify the observations into known categories. Current methodologies in functional genomics that use larger RNA expression data sets can be roughly divided into two categories: supervised learning (analysis to determine ways to accurately split into or predict groups
of samples or diseases), and unsupervised learning (analysis looking for a characterization of the components of a data set, without a priori input on cases or genes). The most commonly used unsupervised learning technique has been hierarchical agglomerative cluster analysis [9]. Commonly used supervised learning techniques have included simple nearest-neighbor algorithms [10], statistical regression [11], linear discriminant analysis [12], artificial neural networks [13], classification trees [14], and support vector machines [15].

2.1 Unsupervised Learning

Several clustering algorithms utilized for gene expression data take as input a half-matrix of pairwise similarities or distances. Hierarchical clustering became a very popular method in this domain, partially because freeware, pioneered by Michael Eisen and colleagues [9], was made available early on. K-means and related algorithms such as self-organizing maps have also been utilized [16]. Since this type of analysis usually relies on a very subjective assessment of the resulting clusters, proxy measures for cluster quality have been suggested, such as the occurrence of common regulatory sequences and common categorization under the same functional category.

The popularity of hierarchical clustering has resulted in its inappropriate utilization in some cases. As the cost of experiments has dropped significantly, more recent experiments include a larger number of observations than was the case just a few years ago. Researchers have started to use microarray gene expression results to classify cases into known diagnostic or prognostic categories. As they were familiar with hierarchical clustering, this technique was used in the hope that the discovered clusters would correspond to the categories of interest. An illustration of hierarchical and k-means clustering, using an oral cancer microarray data set, is shown in Figure 4.
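The k-means procedure illustrated in Figure 4 can be sketched in a few lines of Python. The toy "expression profiles" below are invented for illustration (two well-separated groups), not taken from the oral cancer data set:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal k-means: returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels

if __name__ == "__main__":
    # Two fake per-sample expression profiles: low vs. high expression.
    samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),   # "normal"-like
               (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]     # "cancer"-like
    print(kmeans(samples, k=2))
```

With well-separated groups any reasonable initialization converges to the two group means; on real expression data the result depends on the initialization, which is one reason the text calls the evaluation of such clusterings subjective.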
This approach was widely used in tumor classification, for instance in leukemia [16], lung cancer [17], colon cancer [18], prostate cancer [19], oral cancer [20], and breast cancer [21], even though it does not take the category of interest into account and is not guaranteed to result in clusters that match the categories. As a consequence, supervised learning techniques were explored. Interestingly, one study advocated the use of unsupervised clustering to determine gene "centers" to be used in supervised learning [22]. Another determined clusters using the classification label as a criterion for entry into or exit from a particular cluster assignment [23].

Another common application of clustering algorithms in this domain is the identification of profiles in data collected over a number of time points. In this type of "time series" data, there are usually only a few time points, and generally few samples. Clustering data collected over time is an area of current investigation. Algorithms that do not assume independence of the data at the different time points are more likely to produce better clusters. As mentioned before, the evaluation of cluster quality is for the most part subjective, so it is difficult to demonstrate the superiority of one technique over another.

2.2 Supervised Learning

As opposed to other biomedical problems, in which the first methods to be considered are those derived from classical statistics, in the domain of gene expression profiles obtained from microarray experiments the first methods to be used were those originating
Fig. 4. (a) Hierarchical and (b) k-means clustering results using oral cancer data. The hierarchical clustering algorithm was applied to normal and oral cancer samples using the average-linkage heuristic and the Euclidean distance metric (x-axis: samples; y-axis: distances); the algorithm was able to classify the samples into two distinct groups, normal and cancer. The k-means algorithm was also applied to the normal and oral cancer samples (k=50) (x-axis: samples; y-axis: genes); green and red represent low and highly expressed genes, respectively. In this case the cancer samples were highly expressed when compared to the normal samples.
from the artificial intelligence and machine learning communities. For example, neural network and support vector machine applications were published before simple regression models were used. The reason for this may be simple: regression software will simply not run if the number of variables is equal to or larger than the number of observations, which is essentially always the case in microarray data. Most neural network and support vector machine implementations do work in this case, and this may have been the reason why they were used first. Another reason is that some researchers believe that the several profiles or "fingerprints" involved in gene expression analysis can only be characterized by several genes.
3 Dimensionality Reduction

The purpose of developing models for the classification of observations into categories of interest is two-fold: (1) to allow the prediction of class membership when a new observation is presented, and (2) to determine which variables contribute the most to the model and in which way they interact to determine membership. It is well known that certain phenotypes (e.g., diseases or complex traits) are determined by a combination of several genes, rather than one particular gene. These combinations may involve pairs, triples, or multi-gene patterns that occur together or in a particular temporal series. These gene networks are very difficult to characterize, and gene expression data from a microarray experiment, which represent a single point in time, might have the potential to help researchers discover or verify a hypothesized gene association. This will be particularly true as microarray technology evolves to eliminate problems with noise and reliability. As it is unlikely that thousands of genes are involved in defining an observation's membership in a particular set of categories of interest, it is sensible to use a dimensionality reduction technique. Another reason to do this is merely practical: supervised learning software may not run efficiently on a large number of variables.

3.1 Variable Compression

The need to reduce dimensionality in gene expression data has motivated the use of techniques such as principal components analysis and partial least squares to create linear combinations of variables (the so-called components) to serve as inputs to a variety of algorithms. An arbitrary number of components is selected, which accounts for "most" of the variability in the data.
The use of principal components is particularly inappropriate in the context of supervised learning, since the categories of interest are not taken into account when defining the principal components, and therefore a potential component that would maximize the separation of the categories of interest may be left unconsidered. This is solved by utilizing partial least squares, a related technique in which the categories of interest are taken into account when defining the main components. Although it is possible (if cumbersome) to determine which genes have the most influence on the model by looking at the respective coefficients for each of the selected components, this is seldom done. Therefore, no advantage in terms of gene discovery or the development of small models (i.e., models that utilize few genes as biomarkers for a particular category
of interest) is achieved by using these variable compression techniques. The only advantage is that the number of inputs can be reduced arbitrarily, so that several different supervised learning techniques can produce results quickly. For the purposes of interpreting the supervised learning models, that is, extracting useful biological information from them, it is more important to select the genes that contribute the most to a given classification model. This can be done directly using variable selection techniques.

3.2 Variable Selection

Direct variable selection is a good alternative to variable compression in cases where the majority of variables are not believed to be important for the classification model. By selecting a small number of variables, the model becomes easier to interpret, harder to overfit to the data, and therefore more generalizable to new cases. The microarray analysis literature contains several examples in which variable selection is performed in a univariate manner, usually with a simple but often inappropriate t-test. If there is reason to believe that genes can act synergistically or antagonistically to determine a certain phenotype, then the univariate approach is likely to miss important genes that, in isolation, do not account for a large difference in classification. A multivariate approach is therefore recommended. Since multivariate approaches such as those based on genetic algorithms become prohibitive given the large number of variables, conditional multivariate algorithms such as forward or stepwise variable selection can be applied. It is interesting to note that, for some applications, these greedy algorithms can generate good results. We, however, do not know how much better the results could have been if more complex selection algorithms had been used.
Furthermore, given the curse of dimensionality and the small number of observations, it is difficult to anticipate whether we will be able to demonstrate the superiority of complex selection algorithms over simple ones. In small data sets, there may be several variables that, used in isolation, can near-perfectly predict the test sets. There are also several tuples of variables that can perfectly predict the test set. Reporting on all of those does not make sense. Selecting a particular variable or tuple is possible, but since they all produce perfect classification in the test sets, it is an arbitrary decision. It is rare to see researchers performing a large number of cross-validations and selecting variables based on their frequency in the resulting models. It is possible to end up with variables that have a good association with the classification outcome by chance alone.
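A conditional (forward) selection loop of the kind discussed above might look like the following sketch. The scoring function, a leave-one-out nearest-centroid accuracy, and all names are our own simplifications chosen to keep the example self-contained, not a recommendation for real studies:

```python
def loo_accuracy(data, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier
    restricted to the given feature subset."""
    correct = 0
    for i in range(len(data)):
        # Build per-class centroids from all samples except the i-th.
        centroids = {}
        for cls in set(labels):
            rows = [data[j] for j in range(len(data))
                    if j != i and labels[j] == cls]
            centroids[cls] = [sum(r[f] for r in rows) / len(rows)
                              for f in features]
        # Classify the held-out sample by squared Euclidean distance.
        x = [data[i][f] for f in features]
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))
        correct += (pred == labels[i])
    return correct / len(data)

def forward_select(data, labels, n_features, max_selected=2):
    """Greedily add the feature that most improves LOO accuracy."""
    selected = []
    while len(selected) < max_selected:
        candidates = [f for f in range(n_features) if f not in selected]
        best = max(candidates,
                   key=lambda f: loo_accuracy(data, labels, selected + [f]))
        selected.append(best)
    return selected
```

Note that this greedy loop exhibits exactly the weakness described in the text: a pair of features that is only informative jointly would never be entered, because neither member improves the score on its own.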
4 Stochastic Processes, Stochastic Algorithms

We have presented the reasons why we believe variable selection is more meaningful than variable compression in gene expression analysis. We have also indicated that biomedical researchers are often interested in genes that may have synergistic or antagonistic effects, and whose effect in isolation may not be large enough for them to be selected using univariate or simple hill-climbing algorithms. The search is further complicated by the fact that there are thousands of genes being considered, and no more than a couple hundred observations. In the optimization function, local maxima are expected to occur frequently. There is therefore a tremendous need for efficient
stochastic algorithms for variable selection that can cover a large portion of the search space. Routine genetic algorithms that use optimization functions based on classification performance and the number of variables can be used for variable selection, but they are often too slow.

Gene expression itself is a stochastic process [24-26]. The reasons for this non-deterministic behavior are not well understood, and deterministic algorithms may not be appropriate for application in this domain. In fact, there have recently been some applications of stochastic algorithms in gene expression modeling [27, 28]. Gene expression data sets from different microarray platforms are readily available and provide an interesting domain for the development and validation of novel stochastic algorithms. Of particular interest are the challenges imposed by the high dimensionality of the data and the paucity of objective validation procedures. Given the latter, although these data sets have been shown to provide useful information for health scientists, it is imperative to keep in mind their limitations in terms of allowing for the discovery of regulatory networks and other equally ambitious goals. Although more accessible than clinical data, the analysis and interpretation of results obtained from gene expression data require extensive biological validation, usually by other methodologies.
Acknowledgements Funding for this work was provided by the John and Virginia Taplin Faculty Fellowship (LOM).
References
1. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 1991;252(5013):1651-6.
2. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996;14(13):1675-80.
3. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science 1995;270(5235):484-7.
4. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 2000;18(6):630-4.
5. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998;20(2):207-11.
6. Tolonen AC, Albeanu DF, Corbett JF, Handley H, Henson C, Malik P. Optimized in situ construction of oligomers on an array surface. Nucleic Acids Res 2002;30(20):e107.
7. Quackenbush J. Computational analysis of microarray data. Nat Rev Genet 2001;2(6):418-27.
8. Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002;18(3):405-12.
Stochastic Algorithms for Gene Expression Analysis
9. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998;95(25):14863-8.
10. Frederiksen CM, Knudsen S, Laurberg S, Ørntoft TF. Classification of Dukes' B and C colorectal cancers using expression arrays. J Cancer Res Clin Oncol 2003;129(5):263-71.
11. Weber G, Vinterbo S, Ohno-Machado L. Building an asynchronous web-based tool for machine learning classification. Proc AMIA Symp 2002:869-73.
12. Stephanopoulos G, Hwang D, Schmitt WA, Misra J. Mapping physiological states from microarray expression measurements. Bioinformatics 2002;18(8):1054-63.
13. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7(6):673-9.
14. Zhang H, Yu CY, Singer B. Cell and tumor classification using gene expression data: construction of forests. Proc Natl Acad Sci U S A 2003;100(7):4168-72.
15. Lee Y, Lee CK. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 2003;19(9):1132-9.
16. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531-7.
17. Wigle DA, Jurisica I, Radulovich N, Pintilie M, Rossant J, Liu N, et al. Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res 2002;62(11):3005-8.
18. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A 1999;96(12):6745-50.
19. Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, et al. Delineation of prognostic biomarkers in prostate cancer. Nature 2001;412(6849):822-6.
20. Kuo WP, Hasina R, Ohno-Machado L, Lingen MW. Classification and identification of genes associated with oral cancer based on gene expression profiles. A preliminary study. N Y State Dent J 2003;69(2):23-6.
21. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747-52.
22. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A 2002;99(10):6567-72.
23. Dettling M, Buhlmann P. Supervised clustering of genes. Genome Biol 2002;3(12):RESEARCH0069.
24. Elowitz MB, Levine AJ, Siggia ED, Swain PS. Stochastic gene expression in a single cell. Science 2002;297(5584):1183-6.
25. Swain PS, Elowitz MB, Siggia ED. Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci U S A 2002;99(20):12795-800.
26. Ozbudak EM, Thattai M, Kurtser I, Grossman AD, van Oudenaarden A. Regulation of noise in the expression of a single gene. Nat Genet 2002;31(1):69-73.
27. Kastner J, Solomon J, Fraser S. Modeling a hox gene network in silico using a stochastic simulation algorithm. Dev Biol 2002;246(1):122-31.
28. Blake WJ, Kaern M, Cantor CR, Collins JJ. Noise in eukaryotic gene expression. Nature 2003;422(6932):633-7.
Analysis of a Randomized Local Search Algorithm for LDPCC Decoding Problem Osamu Watanabe, Takeshi Sawai, and Hayato Takahashi Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology [email protected]
Abstract. We propose an approach for analyzing the average performance of a given (randomized) local search algorithm for a constraint satisfaction problem. Our approach consists of two approximations. Using a randomized algorithm for LDPCC decoding, we experimentally investigate the reliability of these approximations and show that they could be used as a tool for analyzing the average performance of randomized local search algorithms. Keywords: Local search, LDPCC decoding, constraint satisfaction, randomized algorithms.
1 Introduction

Many problems can be formulated as a "constraint satisfaction problem", the problem of searching for a solution that satisfies a given set of constraints. The well-known 3SAT problem is a typical example. "Local search" is a simple yet important algorithmic approach for solving such constraint satisfaction problems. Here we focus on "randomized" local search algorithms, i.e., local search algorithms that make some algorithmic decisions randomly. For example, for the 3SAT problem, the algorithm given in Figure 1 is a typical example of such algorithms. (This algorithm is one of the WalkSAT family [5]; the idea of the algorithm can be found in [7].) Though very simple, this randomized local search algorithm is known to work well. More precisely, when the parameters S and t are chosen appropriately, RandSAT succeeds in finding a satisfying assignment (if any exists) on average under a certain probabilistic scenario. Unfortunately, though, most such positive results have been obtained through computer experiments, and existing rigorous mathematical analyses are far from verifying these computer experiments. For example, the following facts have been proved formally, but they are far from justifying the performance of RandSAT observed in computer experiments. (See the cited papers for the choice of the constants cδ and dδ.)

Theorem 1. [8] For any δ < 1, set S = cδ(4/3)^n ≈ 1.33^n and t = 3n. Then RandSAT gives a satisfying assignment with probability δ for every satisfiable formula.
Supported in part by a Grant-in-Aid for Scientific Research on Priority Areas "Statistical-Mechanical Approach to Probabilistic Information Processing (SMAPIP)" 2002 and Scientific Research (C) 2001-2002 from the Ministry of Education of Japan.
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 50-60, 2003. © Springer-Verlag Berlin Heidelberg 2003
program RandSAT(F = C1 ∧ C2 ∧ ··· ∧ Cm);
  repeat S times
    X1, ..., Xn ← a randomly chosen assignment;
    repeat t steps
      if F = 1 then output the current assignment and halt;
      choose one unsatisfied clause Ck of F randomly, and one variable Xi in Ck randomly;
      flip the value of Xi;
program end.

Fig. 1. Local search algorithm for 3SAT
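The RandSAT procedure of Fig. 1 can be transcribed into a short runnable sketch; the clause encoding (signed 1-based literals) and the parameter choices are our own assumptions:

```python
import random

def rand_sat(clauses, n, S, t, seed=None):
    # WalkSAT-style local search as in Fig. 1. Each clause is a list of
    # literals: +i stands for X_i, -i for NOT X_i (1-based indices).
    rng = random.Random(seed)
    def satisfied(clause, assign):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)
    for _ in range(S):                       # outer loop: S random restarts
        # index 0 is a dummy so that variable i sits at assign[i]
        assign = [None] + [rng.random() < 0.5 for _ in range(n)]
        for _ in range(t):                   # inner loop: t local steps
            unsat = [c for c in clauses if not satisfied(c, assign)]
            if not unsat:                    # F = 1: output and halt
                return assign[1:]
            clause = rng.choice(unsat)       # random unsatisfied clause
            var = abs(rng.choice(clause))    # random variable in it
            assign[var] = not assign[var]    # flip
    return None                              # nothing found within S*t steps
```

The two nested loops correspond exactly to the `repeat S times` and `repeat t steps` lines of the figure; the success check sits at the top of the inner loop, as in Fig. 1.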
Theorem 2.¹ [3] Set S = 1 and t = n. For any δ < 1, if a satisfiable formula with m ≥ dδ n² clauses is given uniformly at random, then RandSAT gives a satisfying assignment with probability δ.

It may be difficult to give a rigorous mathematical analysis of the average performance of a randomized local search algorithm. But there may still be some "semi-formal" analysis that is not rigorous yet is reasonable for investigating the average behavior of the algorithm. Furthermore, through such a semi-formal analysis, we may be able to understand when and how the algorithm works, which in turn can lead us to design better algorithms. In this paper we propose² such a "semi-formal" approach. For a given randomized local search algorithm, our approach consists of the following two approximations.

I. Approximate the average, randomized execution of the algorithm by a simple random process.
II. Approximate this process by simple probabilistic recurrence formulas.

For our technical discussion we will consider one constraint satisfaction problem, the LDPCC Decoding Problem, and a randomized local search algorithm for this problem. Using the above approach, we analyze the average performance of this algorithm and give some experimental evidence that these approximations are reasonable. Their theoretical analysis remains an open problem for future work.
2 LDPCC(3, 4) Decoding and Our Goal

We consider the following problem, which we will call (3, 4)-Low Density Parity Check Code Decoding (in short, LDPCC Decoding).
¹ Gent [2] pointed out that an almost trivial randomized algorithm finds a satisfying assignment with high probability when a satisfiable formula with Ω(n log n) clauses is given uniformly at random. Thus a better average-case upper bound is known; the theorem is nevertheless still of interest as a mathematical analysis of a local search algorithm.
² A similar approach has been proposed recently by R. Monasson et al. [6].
LDPCC(3, 4) Decode — Low Density Parity Check Code Decoding —
Input: A set of equations of the following type on Boolean variables x1, ..., xn:

  x_{i(1,1)} + x_{i(1,2)} + x_{i(1,3)} + x_{i(1,4)} = c1
  x_{i(2,1)} + x_{i(2,2)} + x_{i(2,3)} + x_{i(2,4)} = c2
  ...                                                    (3)
  x_{i(m,1)} + x_{i(m,2)} + x_{i(m,3)} + x_{i(m,4)} = cm

where
• c1, ..., cm are 0/1 constants,
• + is mod 2 addition (i.e., the exclusive-or operation),
• i(j, k) ∈ {1, ..., n} is an index, where i(j, 1), ..., i(j, 4) are all distinct for each j, 1 ≤ j ≤ m, and
• each xi appears exactly three times (which implies m = 3n/4).
Output: An assignment to x1, ..., xn satisfying (3) with the smallest number of 1's. (See also the explanation in the next subsection.)

Gallager [1] introduced the notion of LDPCC and proposed a decoding algorithm, now called the BP decoding algorithm. (BP = (Loopy) Belief Propagation was proposed and has been studied in the AI community as a basic algorithm for analyzing a given Bayesian network.) Recently several researchers (see, e.g., [4]) rediscovered this code and the BP decoding algorithm. Furthermore, recent detailed analyses of the LDPCC family (see, e.g., [4]) showed that some members of the family have good performance, close to the Shannon limit. The randomized local search algorithm that we study is a variation of the randomized decoders that have been studied as a simplified model of the BP decoding algorithm³.
For a parity check matrix H, a vector a = (a1 , ...., an ) satisfying Ha = 0 is called a code word (w.r.t. H), which is supposed to be sent for communication. We consider a binary symmetric channel for our noise model; when a code word is transmitted, some of its bits are flipped independently at random with noise probability p. That is, when sending a cord word a, a message b received through the channel is computed as b = a + v, where v is a 0,1-vector whose each bit takes 1 with probability p. We call v a noise vector. (Here by + we mean the bit wise addition under modulo 2.) 3
³ It should be mentioned, however, that the BP decoding algorithm is not a randomized algorithm; it is a deterministic algorithm calculating the probability — the likelihood — of xi = 1 for each i under the condition that the given syndrome is observed.
Let c = (c1, ..., cm) be the vector computed from the received message b as c = Hb; it is called a syndrome. Then from the property of the code word a and the linearity of H, it follows that c = Hb = H(a + v) = Ha + Hv = Hv. Thus, for a given c (and a given H), solving (3) amounts to obtaining a noise vector v from the syndrome c computed from the received message b. Usually H is not regular and there is more than one solution. On the other hand, since p is small, say p < 0.2, it is likely that a solution with the smallest number of 1's is the actual noise vector. This is why the problem asks for a solution with the smallest number of 1's. However, under the average-case scenario we will assume (see below), the actual requirement is an assignment that satisfies (3) and has exactly pn 1's. (One can prove that for small p such an assignment is unique for almost all input instances. Throughout this paper, pn is rounded to an integer.) Thus, the problem should not be regarded as an NP-type optimization problem; rather, it should be regarded as an NP-type search problem.

A parity check matrix corresponding to a system of equations of the particular form specified above is very sparse; it has exactly four 1's in each row and exactly three 1's in each column. A parity check matrix of this type is called a (3, 4)-low density parity check matrix (an LDPC(3, 4) matrix in short), and a code defined by such a sparse parity check matrix is called a (3, 4)-low density parity check code (LDPCC(3, 4), or LDPCC in short).

We would like to study the "average performance" (or, more generally, the "probabilistic behavior") of a randomized local search algorithm for the above decoding problem. Precisely speaking, when discussing such average performance the following probabilistic situation is usually considered; see, e.g., [4].

The Average-Case Scenario for the LDPCC Problem: An input instance is given as follows.
(a) A parity check matrix H is chosen uniformly at random from the set of all LDPC(3, 4) matrices. (b) A noise vector v is chosen uniformly at random from the set of all n bit 0,1-vectors with pn 1’s, and then a syndrome c = (c1 , ..., cm ) is computed as c = Hv. Then the goal is to compute an assignment to x = (x1 , ..., xn ) that has pn 1’s and that satisfies (3) for given H and c.
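The chain c = Hb = H(a + v) = Ha + Hv = Hv used above can be checked on a toy example; the matrix and vectors below are illustrative only and do not form an LDPC(3, 4) code:

```python
def mat_vec_mod2(H, x):
    # Parity-check product Hx over GF(2).
    return [sum(h * xi for h, xi in zip(row, x)) % 2 for row in H]

def add_mod2(a, b):
    # Bitwise addition modulo 2 (the channel's bit flips).
    return [(ai + bi) % 2 for ai, bi in zip(a, b)]

# Toy parity-check matrix and a code word a with Ha = 0.
H = [[1, 1, 0, 1],
     [0, 1, 1, 1]]
a = [1, 0, 1, 1]
assert mat_vec_mod2(H, a) == [0, 0]   # a is indeed a code word

v = [0, 1, 0, 0]          # noise vector: one flipped bit
b = add_mod2(a, v)        # received message b = a + v
# The syndrome depends only on the noise, not on the code word sent:
assert mat_vec_mod2(H, b) == mat_vec_mod2(H, v)
```

This is exactly why the decoder can work from the syndrome alone: the transmitted code word cancels out of Hb.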
2.2 Our Randomized Decoder

For solving the LDPCC Decoding Problem, we consider the randomized local search algorithm given in Figure 2. (In this algorithm, the outer loop of the algorithm of Figure 1 is omitted.) Some notions need to be defined to explain the algorithm. Consider any problem instance, i.e., a pair of an LDPC(3, 4) matrix H and a syndrome vector c = (c1, ..., cm). We may assume that the syndrome c is generated as c = Hv from some noise vector v = (v1, ..., vn) with pn 1's. For any variable xi, a check equation on xi is an equation of
program RandDECODE(H, (c1, ..., cm));
  % Denote (c1, ..., cm) by c, and (x1, ..., xn) by x.
  (x1, ..., xn) ← (0, 0, ..., 0);
  repeat t steps
    if Hx = c then output the current assignment and halt;
    choose one variable xi at random from those with the highest "penalty";
    flip the value of xi;
program end.

Fig. 2. Randomized local search algorithm for LDPCC Decoding Problem
(3) having the variable xi. We say that a variable xj is related to xi if it appears in one of the check equations on xi. Note that every variable has three check equations, and each variable has at most nine related variables. (In fact, it is quite likely that most variables have exactly nine related variables.) Consider any point of an execution of the algorithm on H and c. The penalty of a variable xi is the number of check equations on xi that are not satisfied under the current assignment to the variables x1, ..., xn. (Precisely speaking, "penalty" is a property not of a variable but of the current assignment to it. But we simply say, e.g., the penalty of x4.) For example, suppose that a variable x4 has the following check equations, whose three syndrome bits are computed from the assignment, shown as (b), to the corresponding variables.
(a)
  x4 + x2 + x11 + x15 = 0        (b)  0 + 0 + 1 + 1 = 0
  x4 + x21 + x23 + x24 = 1            0 + 1 + 0 + 0 = 1
  x4 + x30 + x39 + x40 = 0            0 + 0 + 0 + 0 = 0
The algorithm's goal is to compute the assignment given as (b). For example, since all variables are initially assigned 0, the penalty of x4 (under the initial assignment) is 1. On the other hand, if only x4 and x21 are assigned 1 and the other eight related variables are assigned 0, then x4 has penalty 3. Note the following simple facts: (i) a penalty is an integer between 0 and 3; (ii) a variable xi having penalty 0 means that all three check equations on xi are satisfied (i.e., consistent with the given syndrome); but (iii) this does not necessarily mean that xi and all of its related variables are assigned correctly. The algorithm terminates if and only if all variables have penalty 0. Although this does not necessarily mean that the correct noise vector v (used to produce the given syndrome c) has been found, it is quite unlikely that the algorithm terminates by finding some other noise vector when a small t (e.g., t = 2pn) is used. (This property is provable for, e.g., t = 2n from the fact that for almost all H, there is no pair of vectors v and v′, both having at most 2pn 1's, that give the same syndrome w.r.t. H.) We conducted computer experiments under our average-case scenario; these experiments show that the algorithm works quite well when p is small, say p ≤ 0.08. Figure 3 (a) shows the relation between the noise probability p and the "success probability", the probability that the algorithm succeeds in finding a correct noise vector when executed on a random parity check matrix and a syndrome computed from a random noise vector with pn 1's, for various p. We used the step bound t = 3pn in this experiment. It shows that the algorithm succeeds with high probability when p < 0.08 and fails with high
[Figure 3: Average-case performance of RandDECODE. (a) p vs. success probability (t = 2pn), for n = 5000, 10000, 20000; (b) p vs. success probability (n = 5000), for t = 2pn, 4pn, 6pn, 8pn, and 10pn.]
probability when p > 0.09. Figure 3 (b) indicates that the success probability is improved by using a larger step bound t. (The three lines in Figure 3 (a) are for sizes n = 5000, 10000, 20000, where the range of the noise probability is p = 0.07 ∼ 0.11. The five lines in Figure 3 (b) are for t = 2pn, 4pn, 6pn, 8pn, and 10pn, where n is fixed to 5000. For each size and each noise probability, the algorithm is executed 20 times for randomly chosen matrices and noise vectors.) As shown in these graphs, the success probability drops sharply at some noise probability, and by using a larger step bound we can postpone this drop, though a limit seems to exist. Let us call this dropping point a "success threshold"; more precisely, a success threshold is the noise probability at which the success probability becomes 0.5. As a technical goal of our analysis, we try to estimate this success threshold for n = 5000 and various step bounds t.

For our analysis, we modify the algorithm RandDECODE slightly. We introduce the notion of "weight": each variable is assigned weight W^{g−1}, where g is the penalty of the variable and W is some number determined by n. (A variable with penalty 0 is assigned weight 0.) Then, instead of choosing the flipped variable among those with the highest penalty, it is chosen from all variables according to their weights. With a large enough W, e.g., W ≈ n, we can expect that a variable with the highest penalty is usually flipped; hence this modification barely changes the performance of the algorithm. In the following experiments we use W = 1000 for n = 5000, and we analyze the algorithm under this modification.
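A sketch of RandDECODE with this weight-based selection, where a variable with penalty g ≥ 1 is flipped with probability proportional to W^(g−1); the instance encoding (`eqs` as lists of 0-based variable indices) is our own convention, and the toy instance at the end is not a true (3, 4) code:

```python
import random

def rand_decode(eqs, syndrome, n, t, W=1000, seed=None):
    rng = random.Random(seed)
    eqs_on = [[] for _ in range(n)]        # check equations on each variable
    for j, eq in enumerate(eqs):
        for i in eq:
            eqs_on[i].append(j)
    x = [0] * n                            # start from the all-zero assignment
    for _ in range(t):
        unsat = {j for j, eq in enumerate(eqs)
                 if sum(x[i] for i in eq) % 2 != syndrome[j]}
        if not unsat:
            return x                       # Hx = c: decoding done
        penalty = [sum(j in unsat for j in eqs_on[i]) for i in range(n)]
        # Weight W^(g-1) for penalty g >= 1, weight 0 for penalty 0, so
        # with large W a highest-penalty variable is almost always chosen.
        weights = [0 if g == 0 else W ** (g - 1) for g in penalty]
        i = rng.choices(range(n), weights=weights, k=1)[0]
        x[i] ^= 1                          # flip
    return None                            # step bound t exhausted

# Toy instance: noise v = (1,0,0,0) yields syndrome (1,1,1) for these
# check equations; the search returns an assignment with Hx = c.
eqs = [[0, 1], [0, 2], [0, 3]]
x = rand_decode(eqs, [1, 1, 1], n=4, t=10, seed=0)
```

Recomputing all penalties each step keeps the sketch short; an efficient implementation would update only the flipped variable and its related variables, as the per-step analysis in Section 3 describes.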
3 Analysis

First let us see what the algorithm does at each step. Suppose that at some step of the execution, a variable x4 is chosen as the flipped variable. The state of x4 is specified by its current value and the values of its nine related variables, i.e., the variables appearing in the three check equations on x4. Here, instead of using the values 0 and 1, we use the symbols ◦ and × to indicate, respectively, that the current value is correct or wrong w.r.t. the target solution. We also add penalty information; that is, the state of each variable is specified by a pair
of its penalty and correctness. Suppose, for example, that flipping the value of x4 changes the states of x4 and its related variables as follows: x4 moves from state (2, ×) to state (1, ◦), while each related variable (e.g., x2, x11, and x15 in the first check equation) has its penalty decremented or incremented by one according to whether its check equation was unsatisfied or satisfied before the flip.
A tuple of such 1 + 9 penalty-correctness pairs is called a variable configuration. By this flip, the variable configuration of x4 is changed; note that the variable configuration of every related variable is also changed. That is, one step of the algorithm changes the variable configurations of the flipped variable and its related variables. Note that for the flipped variable, the new penalty after the flip is determined by its current penalty: as in the above example, if the penalty of the flipped variable is 2, then it becomes 1 after the flip; similarly, penalty 3 becomes 0, and penalty 1 becomes 2. On the other hand, the penalty of each related variable changes depending on the status of the check equation on the flipped variable in which it appears: the penalty of a related variable gets decremented (resp., incremented) if the variable appears in an unsatisfied (resp., satisfied) equation. For example, the penalty of x2 is changed from 2 to 1 because it appears in an unsatisfied equation, i.e., an equation that is not consistent with the given syndrome. On the other hand, since the second equation is satisfied, the penalty of every variable in the second equation gets incremented by the flip.

In our experiments, we focus on the number of variables of each variable configuration type; that is, we trace how these numbers change during the execution. In the following explanation, however, we consider for the sake of simplicity a tuple (n_0^+, n_1^+, n_2^+, n_3^+, n_0^-, n_1^-, n_2^-, n_3^-) of eight numbers, where n_g^+ (resp., n_g^-) denotes the number of variables with penalty g that are correctly (resp., wrongly) assigned. We call this tuple a numerical configuration. From this viewpoint, each step of the algorithm updates the numerical configuration. For example, by flipping x4 as above, n_2^- is decremented and n_1^+ is incremented. Moreover, the penalty of every related variable also changes; for example, due to the penalty change of x2, n_2^+ gets decremented and n_1^+ gets incremented. Similarly, the numerical configuration must be updated according to the penalty changes of the other eight related variables.
3.1 Approximation I

Suppose that we could calculate the expected numerical configuration at a given step with reasonable accuracy. Then we would have enough information about the average behavior of the algorithm; for example, by computing the step at which n_0 = n_0^+ + n_0^- reaches n, the total number of variables, we could estimate the average number of steps until the algorithm terminates. This is our ultimate goal. Towards this goal, we propose in this paper two approximation steps.

The first approximation is to simulate the execution by a simple Markov-type random process. Note that, at each step, the penalty and the correctness of the flipped variable
is randomly determined; more specifically, a correctly (resp., wrongly) assigned variable with penalty g is chosen as the flipped variable with probability P_g^+ = n_g^+ W^{g−1} / W_total (resp., P_g^- = n_g^- W^{g−1} / W_total), where W_total = (n_1^+ + n_1^-) + (n_2^+ + n_2^-)W + (n_3^+ + n_3^-)W². On the other hand, the penalty and the correctness of the related variables depend on the structure of the check equations on the flipped variable, i.e., on the sets of variables appearing in these equations. These sets are determined by H, which is randomly chosen but fixed prior to the execution. Here, for the first approximation, we assume that these related variables are chosen randomly at each step from all variables with a "legitimate" state. In other words, in our approximation we select penalty-correctness pairs for the related variables randomly at each step.

Suppose, for example, that some wrongly assigned variable with penalty 2 is chosen as the flipped variable. Since its penalty is 2, two of its check equations are not satisfied (i.e., inconsistent with the syndrome), and one check equation is satisfied (i.e., consistent with the syndrome). That is, the status of the flipped variable, in state (2, ×), and of the nine related variables in its three check equations might be, for example:

  equation 1 (unsat.): (1, ◦), (0, ◦), (0, ◦)
  equation 2 (unsat.): (2, ×), (1, ◦), (2, ×)
  equation 3 (sat.):   (0, ◦), (1, ×), (1, ◦)

Then we choose a state (i.e., a penalty-correctness pair) for each related variable. Consider, for example, the first equation. Since the flipped variable is wrongly assigned and the first equation is unsatisfied, the "legitimate" correct/wrong patterns for the three variables in the equation are ◦◦◦, ◦××, ×◦×, ××◦. Hence we choose, e.g., (1, ◦), (1, ◦), (2, ◦) with probability P_{1,1,2}^{+++}|unsat = n_1^+ · n_1^+ · n_2^+ / N_unsat, where N_unsat is the total number of legitimate triples of penalty-correctness pairs for an unsatisfied equation, calculated as follows:

  N_unsat = n^+ n^+ n^+ + n^+ n^- n^- + n^- n^+ n^- + ···,

where n^+ = n_0^+ + n_1^+ + n_2^+ + n_3^+ and n^- = n_0^- + n_1^- + n_2^- + n_3^-. Once the states of the nine related variables are chosen, we can update the numerical configuration according to the rules explained above. In summary, we simulate the execution of the algorithm by repeating the following simple random process t times (or until n_0 = n_0^+ + n_0^- reaches n).

(1) Choose one penalty-correctness pair (g, s) for the flipped variable with probability P_g^s.
(2) For each check equation with status state ∈ {sat., unsat.}, choose three penalty-correctness pairs (g1, s1), (g2, s2), (g3, s3) with probability P_{g1,g2,g3}^{s1 s2 s3}|state.
(3) Then update the numerical configuration as the algorithm does at each step.

The important point here is that this update of the numerical configuration (n_0^+, ..., n_3^+, n_0^-, ..., n_3^-) is completely determined by the current numerical configuration; that is, this is a simple Markov process.

Our computer experiments show that this simple random process simulates the actual execution well (to a certain extent, more precisely speaking). For example, Figure 4 shows how the number of penalty-0 variables (i.e., n_0 = n_0^+ + n_0^-) grows during the execution. The simulation matches the real execution quite well for p = 0.08. On the other hand, for p = 0.1, some difference can be seen towards the end, though the
[Figure 4: Real execution vs. simulation by a simple random process. (a) p = 0.08; (b) p = 0.1. Bars indicate the max and min numbers at every 25th step of the real execution; a solid line is for the real execution, and a dashed line is for the simulation.]
approximation is still reasonable as a whole. (The experiments were conducted using one fixed pair of a parity check matrix and a noise vector of size n = 5000. The initial numerical configuration is computed from this actual input⁴; that is, it is a real one. The results are the averages of 20 executions (for the algorithm) and 10 executions (for the simulation). These averages are taken until the fastest execution terminates, in order to keep a sufficient number of trials. All experiments shown below follow this style.) For all the other numbers describing aspects of the state, our experiments show that the simulation reasonably approximates the real execution, in particular in the case where the execution succeeds in decoding. Thus, we conjecture that our first approximation captures the average behavior of the algorithm, at least in the case where the algorithm works on average. On the other hand, we think that by investigating why the simulation diverges from the real execution, we may be able to find out why it fails to reach the solution. For such an investigation, we believe our next approximation will be a useful tool.
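The flip probabilities P_g^± of step (1) can be computed directly from a numerical configuration; a sketch with made-up counts, where the tuple ordering (n_0^+, ..., n_3^+, n_0^-, ..., n_3^-) follows the text:

```python
def flip_probabilities(conf, W):
    # conf = (n0p, n1p, n2p, n3p, n0m, n1m, n2m, n3m): the numerical
    # configuration. Penalty-0 variables have weight 0, so they are
    # never chosen as the flipped variable.
    n_plus, n_minus = conf[:4], conf[4:]
    w_total = sum((n_plus[g] + n_minus[g]) * W ** (g - 1) for g in (1, 2, 3))
    p_plus = {g: n_plus[g] * W ** (g - 1) / w_total for g in (1, 2, 3)}
    p_minus = {g: n_minus[g] * W ** (g - 1) / w_total for g in (1, 2, 3)}
    return p_plus, p_minus

# Made-up configuration: W_total = (4+4) + (0+2)*10 = 28 for W = 10,
# so P_1^+ = 4/28 and P_2^- = 20/28; the six probabilities sum to 1.
p_plus, p_minus = flip_probabilities((10, 4, 0, 0, 0, 4, 2, 0), W=10)
```

This mirrors the formula P_g^± = n_g^± W^{g−1}/W_total from the text, with higher-penalty variables dominating once W is large.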
3.2 Approximation II

Though simple, our simulation of the execution is still a random process on a huge number of states. Here we propose one more approximation for analyzing this random process analytically, or deterministically. The idea is simple: instead of randomly choosing ten penalty-correctness pairs and updating the numerical configuration (n_0^+, ..., n_3^-), we simply update the numbers in the configuration according to the probabilities with which they get incremented or decremented by the flip. That is, we define a recurrence formula for updating the numerical configuration.
⁴ It is possible to estimate the initial numerical configuration analytically from p; but for simplicity and comparison, we used the real value.
[Figure 5: Prediction of success thresholds. (a) Estimation of the average number of steps, for t = 2pn, 4pn, 6pn, 8pn, and 10pn; (b) Fig. 3 (b) with predicted success thresholds.]
Although it is not so difficult to define such a recurrence formula, we omit stating it here⁵ because it becomes very long (even for our simpler numerical configuration). Instead we explain the idea by considering some example cases. Our random process chooses states (i.e., penalty-correctness pairs) for a flipped variable and its related variables, and then updates the numerical configuration. For example, with probability P₂⁻, a pair (2, ×) is chosen for a flipped variable. If (2, ×) were chosen, then n₂⁻ would get decreased by 1 and n₁⁺ would get increased by 1 by the flip. Here we simply decrease n₂⁻ by P₂⁻ and increase n₁⁺ by P₂⁻. Similarly, the probability that (1, ◦), (1, ◦), (2, ◦) are chosen as penalty-correctness pairs for the three related variables in one unsat. check equation on the flipped variable (when (2, ×) is chosen for the flipped variable) is P_{1,1,2}^{+++|unsat}; if they were chosen, then n₁⁺ would get decreased by 2 and n₂⁺ by 1, whereas n₀⁺ would get increased by 2 and n₁⁺ by 1. (In total, both n₁⁺ and n₂⁺ would get decreased by 1, and n₀⁺ would get increased by 2.) Hence, for this effect, we decrease n₁⁺ and n₂⁺ by P₂⁻ P_{1,1,2}^{+++|unsat} and increase n₀⁺ by 2 P₂⁻ P_{1,1,2}^{+++|unsat}. This is the idea behind our updating formula. Our computer experiments show that the various quantities computed by a recurrence formula based on this idea match the simulation results. Such a good fit can be observed even when the noise probability is large and there are some differences between our simulations and actual executions. In fact, for our case we can rigorously prove that the recurrence formula for, e.g., n₀⁺ indeed computes its expectation; and we believe that some strong concentration result can also be proved rigorously, which is future work. We would also like to use this approximation for investigating the relation between the real execution and the simulation.
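The fractional update just described can be sketched as follows. This is a hypothetical fragment: the probabilities and the two transitions shown are only the example cases discussed above, not the full recurrence of footnote 5.

```python
def update_config(n_minus, n_plus, p2_minus, p112_unsat):
    """Move expected counts fractionally, weighted by the probability of
    each transition (only the two example transitions from the text)."""
    n_minus, n_plus = list(n_minus), list(n_plus)
    # Pair (2, x) chosen for the flipped variable with probability p2_minus:
    n_minus[2] -= p2_minus
    n_plus[1] += p2_minus
    # Pairs (1,o),(1,o),(2,o) for the three related variables, conditional
    # on (2, x); net effect: n1+, n2+ down by 1 each, n0+ up by 2.
    p = p2_minus * p112_unsat
    n_plus[1] -= p
    n_plus[2] -= p
    n_plus[0] += 2 * p
    return n_minus, n_plus
```

Iterating such fractional updates over all state combinations yields the expected configuration at each step.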
3.3 Prediction of Success Thresholds

To illustrate that our two-step approximation is reasonable, we use the recurrence formula for a numerical configuration and estimate a success threshold for several step bounds t = 2pn, 4pn, 6pn, 8pn, and 10pn.
⁵ The recurrence formula is given as a C program that can be found on our web page [9].
60
Osamu Watanabe, Takeshi Sawai, and Hayato Takahashi
Suppose that our approximations are reasonably accurate. Then we have a recurrence formula by which we can compute an average numerical configuration (n₀⁺, n₁⁺, n₂⁺, n₃⁺, n₀⁻, n₁⁻, n₂⁻, n₃⁻) at each step. Thus, for a given p, we can compute the step at which n₀ (= n₀⁺ + n₀⁻) reaches n, i.e., the step at which the execution terminates; we can regard this number as the average number of steps needed for computing a noise vector with pn 1's. Figure 4 (a) shows how this number grows as the noise probability p increases. Five straight lines indicate the step bounds t = 2pn, 4pn, 6pn, 8pn, and 10pn. If the average number of steps exceeds a step bound, then we may consider that the algorithm fails to terminate within that bound. That is, the intersection of each step-bound line with the average-step curve gives us an estimate of the success threshold for the corresponding step bound. In Figure 4 (b), we put these estimated success thresholds into the graph of Figure 3 (b). This illustrates that our prediction of success thresholds is reasonable. (It would be perfect if all black dots were on the success probability 0.5 line.)
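The intersection just described can be located numerically. The sketch below uses a placeholder avg_steps(p, n) standing in for the recurrence-based average step count (the real recurrence is the C program of [9]) and finds where it crosses the bound t = c·p·n by bisection:

```python
def success_threshold(avg_steps, c, n, lo, hi, iters=60):
    """Find the noise probability p at which avg_steps(p, n) crosses the
    step bound c*p*n, assuming avg_steps overtakes the bound as p grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if avg_steps(mid, n) > c * mid * n:
            hi = mid  # bound already exceeded: threshold lies below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Toy stand-in whose average step count blows up near p = 0.4:
toy = lambda p, n: n * p / (0.4 - p)
```

For the toy model the crossing with t = 4pn is at p = 0.4 − 1/4 = 0.15, which the bisection recovers.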
References
1. R.G. Gallager, Low density parity check codes, IRE Trans. Inform. Theory, IT-8(1), 21–28, 1962.
2. I. Gent, On the stupid algorithm for satisfiability, Report APES-03-1998, http://www.cs.strath.ac.uk/~apes/reports/apes-03-1998.ps.gz, 1998.
3. E. Koutsoupias and C.H. Papadimitriou, On the greedy algorithm for satisfiability, Information Processing Letters, 43(1), 53–55, 1992.
4. D. MacKay, Good error-correcting codes based on very sparse matrices, IEEE Trans. Inform. Theory, IT-45(2), 399–431, 1999.
5. D. McAllester, B. Selman, and H. Kautz, Evidence for invariants in local search, in Proc. AAAI'97, MIT Press, 321–326, 1997.
6. R. Monasson, personal communication.
7. C.H. Papadimitriou, On selecting a satisfying truth assignment, in Proc. 32nd IEEE Sympos. on Foundations of Computer Science, IEEE, 566–574, 1991.
8. U. Schöning, A probabilistic algorithm for k-SAT and constraint satisfaction problems, in Proc. of the 40th Ann. IEEE Sympos. on Foundations of Comp. Sci. (FOCS'99), IEEE, 410–414, 1999.
9. O. Watanabe, http://www.is.titech.ac.jp/~smapip/.
Testing a Simulated Annealing Algorithm in a Classification Problem Karsten Luebke and Claus Weihs University of Dortmund, Department of Statistics, 44221 Dortmund, Germany [email protected] http://www.statistik.uni-dortmund.de
Abstract. In this work we develop a new classification algorithm based on simulated annealing. The new method is evaluated and tested in a variety of situations which are generated and simulated by a design of experiments. This way it is possible to find data characteristics that influence the relative classification performance of different classification methods. It turns out that the new method significantly improves the classification performance of classical Linear Discriminant Analysis (LDA) in some situations. Moreover, in a real-life example the new algorithm appears to be better than LDA.
Keywords: Simulated annealing, classification, design of experiments, latent factors.
1 Introduction

Classification (or supervised learning) is a ubiquitous challenge which is tackled by many different methods, for example Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), or decision tree classifiers. Surprisingly, the simple classification method LDA performs well even in situations where the underlying premises are not met. Our new method is closely related to LDA (so that we keep the good classification performance) but does not make use of the premises. It first projects the original observations onto some latent factors and then transforms these latent factors in order to discriminate the classes. We call the new method Classification Pursuit Projection (ClPP). It is a computer-intensive method as it uses a stochastic algorithm, namely simulated annealing, for function minimization. In order to compare the performance of the new method to LDA, an experimental design is used. To achieve the most general results, 7 characteristics of a classification problem are varied in the design. The paper is organized as follows: In the next section we introduce the underlying scoring function of the classification problem, which is minimized by simulated annealing. In Section 3 the implementation of the simulated annealing algorithm is described. The characteristics of the classification problem which are varied are introduced in Section 4. The results of the simulations are reported in Section 5. The new method is also tested on a real-world example in Section 6. After that, a brief outlook on future work is given as well as some concluding remarks.
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 61–70, 2003.
© Springer-Verlag Berlin Heidelberg 2003
2 Optimal Scoring with Latent Factors

Linear Discriminant Analysis is a statistical method for classification. In LDA the classification is based on the calculation of the posterior probabilities of a trial point. The class with the highest posterior probability is chosen. To calculate the posterior probabilities it is assumed that the data come from a multivariate normal distribution where the classes share a common covariance matrix but have different mean vectors. Hastie et al. show in [4] that LDA is equivalent to canonical correlation analysis and optimal scoring. This is possible as LDA can be seen as a special linear regression. One of the problems in LDA is that the estimated covariance matrix of the data points has to be inverted for the classification. Especially in a high-dimensional problem this can cause numerical problems. Hastie et al. try to overcome possible numerical problems by using a penalizing term. This may not be optimal as the covariance matrix is moved away from singularity without using information in the data. In the new method the data is first projected on (few) latent factors so that the original covariance matrix need not be inverted, and so there are no problems with singular matrices. Assume that there are n observations with p variables in the predictor space and k classes. Let
– X ∈ IR^{n×p}: predictor variables,
– Y ∈ IR^{n×k}: indicator matrix of the classes.
The basic idea is as follows: assign l ≤ k − 1 scores to the classes and regress these scores on X. We are looking for scores (of the k classes) and a suitable regression of these scores on the predictor variables so that the residuals are small for the true class and large for the wrong ones. The average squared residual function is (see Hastie et al. [5], p. 392):

ASR(H, M) = (1/n) ||YH − XM||²,    (1)
where
– H ∈ IR^{k×l} is the score matrix of the classes, and
– M ∈ IR^{p×l} is the regression parameter matrix.
To avoid trivial solutions the constraint

H′(Y′Y/n)H = I_l    (2)
is used. To tackle the numerical problems in calculating M, so-called latent factors are derived. Latent factors are linear combinations of the original predictor variables, Z = XG with G ∈ IR^{p×r}, r < p. These latent factors must fulfill the side condition that they are orthonormal, i.e.,

Z′Z = (XG)′(XG) = I_r.    (3)
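The objective (1) is straightforward to evaluate; a minimal numpy sketch (the argument names are illustrative):

```python
import numpy as np

def asr(Y, H, X, M):
    """Average squared residual ASR(H, M) = ||Y H - X M||^2 / n, eq. (1)."""
    n = X.shape[0]
    R = Y @ H - X @ M
    return float(np.sum(R * R)) / n
```

With M the least-squares fit of YH on X, this value is minimal; for a perfect fit it is zero.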
In the regression context latent factors (or Reduced Rank Regression) turned out to be quite an improvement over ordinary least squares regression (see for example [2]). Latent factors can also be used to interpret the data. Sometimes they are calculated because of model assumptions, e.g. that the class (or the response variable) depends on some underlying latent factors. With the latent factors Z instead of the original X, the ordinary least squares estimator of the regression coefficient M̃ is:

M̂̃ = (Z′Z)⁻¹Z′YH = Z′YH.    (4)
Note that we made use of side condition (3). The ASR with latent factors is:

ASR(H, Z) = (1/n) ||YH − ZZ′YH||²,    (5)
subject to (2) and (3). With Z = XG, equation (5) becomes:

ASR(H, G) = (1/n) ||YH − XG(XG)′YH||²    (6)
          = (1/n) ||(I_n − XG(XG)′)YH||²,    (7)
subject to (2) and (3). As in general ||AB|| ≠ ||A|| ||B||, it is necessary in minimizing (6) to optimize G and H together. After the calculation of G and H the classification can then take place in the linear map of the data X:

η(X) = XM,  η(X) ∈ IR^{n×l}.    (8)
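A quick numerical check of the collapse in equation (4): once Z is orthonormalized (here via a QR decomposition of XG, matching side condition (3)), the normal-equations solution coincides with Z′YH. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, k, l = 20, 6, 3, 3, 2
X = rng.normal(size=(n, p))
Z, _ = np.linalg.qr(X @ rng.normal(size=(p, r)))  # Z'Z = I_r, eq. (3)
Y = np.eye(k)[rng.integers(0, k, size=n)]         # class indicator matrix
H = rng.normal(size=(k, l))                       # some score matrix

M_ols = np.linalg.solve(Z.T @ Z, Z.T @ Y @ H)     # (Z'Z)^{-1} Z'YH
M_short = Z.T @ Y @ H                             # eq. (4)
assert np.allclose(M_ols, M_short)
```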
Let η̄_k be the mean of the linear map of the observations from class k. Then observations are assigned by

k̂ = argmin_k Σ_{i=1}^{l} w_i (η(x)_i − η̄_{ki})²,    (9)
where η(x)_i is the i-th component of η(x) and w_i is the weight corresponding to the i-th dimension of the linear map space. If different a-priori probabilities of the classes are given, equation (9) is adapted, for example by adding −2 log π_k, with π_k the a-priori class probability. Hastie et al. show in [4] that if the weight is calculated as

w_i = 1 / (r_i² (1 − r_i²)),    (10)
with r_i² being the mean squared residual of the i-th optimally scored fit, then (9) is proportional to the Mahalanobis distance in the original feature space X. As this equivalence is based on the way they calculate the scoring and regression matrices, it may not
be usable here. Another problem is that the weight is symmetric about 1/2. That means that if the squared residual in dimension a is very good, for example r_a² = 0.95, it gets the same weight as a dimension b in which the prediction is very bad, r_b² = 0.05. To avoid these difficulties we tried three different strategies:
– Estimate the weights stepwise by a line search so that the training data is optimally separated, i.e., so that the apparent misclassification rate on the training data is minimal. The problem is that the misclassification rate is constant on intervals of weights, especially when there are only few observations.
– Use w_i = 1/r_i². In this way good predictions get a high weight, and low weight is assigned to dimensions where the predictions are bad.
– Let

w_i = n ( Σ_{g=1}^{k} Σ_{x_j ∈ class g} (η(x_j)_i − η̄_{gi})² )⁻¹.    (11)
Here the weight is the reciprocal of the sum of squared distances of the observations from their class mean (in the projected space). In our studies it turned out that (11) is best in most situations, so we used it in the allocation rule for the comparison of classification performance.
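A minimal sketch of the chosen weighting (11) and the allocation rule (9); eta is the projected training data and labels the integer class memberships (illustrative names, not the authors' implementation):

```python
import numpy as np

def class_weights(eta, labels):
    """Eq. (11): n divided by the within-class sum of squared distances
    to the class means, computed per dimension of the projected space."""
    n, classes = eta.shape[0], np.unique(labels)
    means = np.array([eta[labels == g].mean(axis=0) for g in classes])
    ss = sum(((eta[labels == g] - m) ** 2).sum(axis=0)
             for g, m in zip(classes, means))
    return n / ss, means

def assign(x_eta, w, means):
    """Eq. (9): class with the smallest weighted squared distance."""
    return int(np.argmin(((x_eta - means) ** 2 * w).sum(axis=1)))
```

Dimensions along which the classes are tightly concentrated around their means receive large weights and dominate the distance.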
3 Optimizing the ASR in the Latent Factor Model

As described in the previous section, the cost function ASR (6) can be described as a function of G and H, with G fulfilling side condition (3) and H side condition (2). A simulated annealing (SA) algorithm is applied to the vectorized projection and scoring matrices (vec(G)′, vec(H)′)′ to minimize the ASR (6). Simulated annealing was already applied successfully to a Reduced Rank Regression [14] and to a classification problem [11]. To fulfill the side conditions, new trial points are adapted to the requirements of the side condition. This is done by a QR decomposition (see for example [3], p. 66) of the appropriate matrix product. (In a QR decomposition a matrix A is decomposed into A = QR with Q orthonormal and R a triangular matrix.) As the QR decomposition leaves the image space unchanged and merely constructs an orthonormal basis (Q), it is a suitable tool for the given problem. Thus a new trial point H̃ (generated by the transition function in the SA algorithm) is updated to a trial point H by

(Y′Y/n)^{1/2} H̃ = Q_Y R_Y,    (12)
H = H̃ R_Y⁻¹,    (13)
with Q_Y orthonormal and R_Y an upper triangular matrix. Similarly,

X G̃ = Q_X R_X,    (14)
G = G̃ R_X⁻¹.    (15)
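Equations (14)–(15) amount to rescaling a trial point so that the side condition holds; a sketch (it assumes X G̃ has full column rank so that R_X is invertible):

```python
import numpy as np

def repair_G(X, G_tilde):
    """Rescale G_tilde via X G~ = Q R, G = G~ R^{-1}, so that Z = X G
    satisfies Z'Z = I (side condition (3))."""
    _, R = np.linalg.qr(X @ G_tilde)
    return G_tilde @ np.linalg.inv(R)

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
G = repair_G(X, rng.normal(size=(5, 3)))
assert np.allclose((X @ G).T @ (X @ G), np.eye(3))
```

Since X G = X G̃ R⁻¹ = Q R R⁻¹ = Q, the repaired point is orthonormal by construction.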
These matrices G, H obviously fulfill the side conditions (2) and (3).
The implementation of the SA algorithm is based on the one described in [10], so the algorithm is a stochastic version of the well-known Nelder-Mead simplex method. The Nelder-Mead (or downhill simplex) method transforms a simplex of m + 1 points in an m-dimensional problem. The function values are calculated and the worst point is reflected through the opposite face of the simplex. If this trial point is the best, the new simplex is expanded further out. If the function value is worse than that of the second-highest point, the simplex is contracted. If no improvement at all is found, the simplex is shrunk towards the best point. This procedure terminates when the differences in the function values between the best and worst points are small. The implemented simulated annealing algorithm can be summarized as follows:
1. Build a random start simplex on (vec(G)′, vec(H)′)′. Set the "start temperature" t₀.
2. Add to the function values f = ASR (6) of the points in the simplex a random number, so that f_temp = f + t|log(u)|, where u is uniformly distributed over (0, 1).
3. According to the Nelder-Mead transition function, generate a trial point using the temporary function values f_temp.
4. Adapt the trial point to the side conditions by a QR decomposition.
5. Accept the new trial point according to Nelder-Mead with the function value f_temp(trial) = f(trial) − t|log(u)| of the trial point. Thus a better trial point is always accepted, and a worse trial point is accepted with a certain probability.
6. Repeat steps 2–5 sufficiently often. Reduce the temperature according to the cooling scheme.
7. Repeat steps 2–6 sufficiently often.
For the QR decomposition and the matrix operations the C++ math library newmat10 [1] was used. The data is generated and the comparison with LDA is done with the statistical software R [8]. The parameters of the SA are set as follows:
– Random start simplex.
– Start temperature: t₀ = (k − 1)/2. This is a reasonable choice as the ASR can be compared with a multidimensional R², which in a regression context takes values between 0 and 1.
– Geometric cooling scheme: t ← 0.8 t.
– At most 200 iterations at each temperature.
– At most 100 temperature reductions.
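The thermally perturbed acceptance of steps 2 and 5 can be sketched as follows. This is a simplified, unconstrained version (no QR repair step, and only reflection and contraction moves); the schedule constants are the ones listed above:

```python
import math, random

def sa_simplex_step(simplex, f, t, rng=random):
    """One move: stored values are raised and the trial value lowered by
    t*|log(u)|, so uphill moves can be accepted while t is large."""
    vals = [f(x) + t * abs(math.log(rng.random())) for x in simplex]
    worst = max(range(len(simplex)), key=vals.__getitem__)
    rest = [x for i, x in enumerate(simplex) if i != worst]
    centroid = [sum(c) / len(rest) for c in zip(*rest)]
    trial = [2 * c - w for c, w in zip(centroid, simplex[worst])]
    if f(trial) - t * abs(math.log(rng.random())) < vals[worst]:
        simplex[worst] = trial                    # accept the reflection
    else:                                         # otherwise contract worst
        simplex[worst] = [(w + c) / 2 for w, c in zip(simplex[worst], centroid)]

def sa_minimize(f, simplex, t0, cool=0.8, n_temps=100, iters=200):
    t = t0
    for _ in range(n_temps):       # at most 100 temperature reductions
        for _ in range(iters):     # at most 200 iterations per temperature
            sa_simplex_step(simplex, f, t)
        t *= cool                  # geometric cooling t <- 0.8 t
    return min(simplex, key=f)
```

At t = 0 the perturbations vanish and the step reduces to a plain greedy simplex move.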
4 Experimental Design of the Simulation Study

Let us now construct the simulation data set. We start with the basic relationship between the predictor variables X and Y, represented by the classes C1, C2, C3. All predictor variables have variance 1, but they differ in their expectations as well as in kurtosis and skewness. Before the transformations described in the following, they are drawn from a normal distribution. Table 1 shows the expectation of each predictor variable in the classes (±1.64 is the upper/lower 5% quantile of the standard normal distribution):
Table 1. Expectation of Predictors in Classes

Var   C1     C2     C3      Var   C1  C2  C3
X1    0     -1.64   1.64    X7     0  -1   1
X2    1.64   0     -1.64    X8     1   0  -1
X3   -1.64   1.64   0       X9    -1   1   0
X4    0      1.64   1.64    X10    0   0   0
X5    1.64   0      1.64    X11    0   0   0
X6    1.64   1.64   0       X12    0   0   0
The predictor variables X4–X6 are useful to separate one class from the others, and X10–X12 do not contain any information for separating the classes. In order to obtain the most general results an experimental design is used for transforming the predictors X. If the new Classification Pursuit Projection method is compared to LDA in only one or two specific examples, it may happen that it performs well in one and poorly in the other, and the user cannot understand why. With a proper experimental design it is possible to find potential factors in the data that influence the performance of the method. We tried to include design factors in the experiment that could have an influence on the goodness of the results of the methods according to the statistical background of a classification problem. The experiments are based on the experiments in [12]. In that study LDA turned out to be very stable against violations of the underlying assumptions, and in quite a lot of situations it outperforms Quadratic Discriminant Analysis, decision trees, Naïve Bayes, k-Nearest Neighbors as well as a neural network. Moreover, Linear Discriminant Analysis was almost optimal – calculated relative to the optimal Bayes classification. Influencing factors in a classification problem are of course the number of classes and the number of predictor variables. We fix these by using three classes and 12 predictor variables. Also, only balanced a-priori class probabilities are used, so all classes have an a-priori probability of 1/3. The factors listed below were investigated for their influence on the classification performance of LDA and the new Classification Pursuit Projection. They are varied in a fractional factorial design (see for example [7]) where the 7-factor interaction was used to construct a 2^{7−1} design. With this alias structure it is possible to estimate all effects up to three-factor interactions, which are confounded with the four-factor interactions, but those should be negligible here.
The low level was assigned −1 and the high level +1. In general the low level (−1) is used for situations which should be of low(er) difficulty for LDA, whereas the high level (+1) is used for situations where the premises of LDA are not met or the learning problem is more complicated (e.g. fewer training observations, see below). The variables X_i, i = 1, …, 12 are transformed according to the experimental design. In this design the number of observations, the deviation from the normal distribution, and the dependency among the predictor variables are varied. We also added some deflection, i.e., some predictor variables were transformed in the space either towards the center of the true class or towards the center of the nearest different class. We varied the direction as well as the number of observations that are deflected and the number of deflected variables. The influencing factors are summarized as follows:
– Training observations (obs): With more training data more information is available to learn the classification rule. At the low level there are n = 1000 and at the high level n = 100 observations in the training data.
– Skewness (skew) and Kurtosis (kurt): With different skewness and kurtosis the deviation from the normal distribution assumption is simulated. The Johnson system [9] was used to generate a wide range of values in the skewness-kurtosis plane. Low skewness is set to 0.12 whereas high skewness is 1.152. Low and high kurtosis values are 2.7 and 5.
– Dependency (dep): The predictor variables are either independent (low level) or dependent (high level), i.e., X̃_j = Σ_{i=1}^{12} X_i − X_j, j = 1, …, 12.
– Deflection (defl): As the deflection factor we used the percentage of observations that were transformed: at the high level 40% of the observations are shifted, at the low level only 10%.
– Deflected variables (devar): At the high level 8 of the 9 relevant predictor variables are affected, whereas at the low level only X1 is deflected.
– Direction of deflection (dir): The direction of the deflection is either halfway towards the mean of the true class (low level) or towards the mean of the nearest (different) class (high level).
In [12] the interpretation of the results was difficult as they used a Plackett-Burman design in which the main effects are confounded with two-factor interactions. But some two-factor interactions seem to have a possible effect, for example the interaction between the direction of the deflection and the number of deflected variables.
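The 2^{7−1} construction described above (seventh factor generated as the product of the other six, i.e. defining relation I = ABCDEFG) can be sketched as:

```python
from itertools import product

def fractional_factorial_2_7_1():
    """64-run 2^(7-1) design: the seventh factor is the product of the
    first six, so the 7-factor interaction is confounded with the mean
    and effects up to 3-factor interactions alias only 4-factor ones."""
    runs = []
    for levels in product((-1, 1), repeat=6):
        g = 1
        for v in levels:
            g *= v
        runs.append(levels + (g,))
    return runs
```

With 40 replications of the 64 runs one obtains the 2560 runs reported in Section 5.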
5 Results of the Simulation Study

The results shown are from 40 replications of the 2^{7−1} fractional factorial design, so there were 2560 runs altogether. The comparison between LDA and ClPP was carried out on a validation data set of 1000 observations which follow the same (post-transformation) distribution as the training data. In order to analyze the results of the experiment we study the relative misclassification error rate, defined as

relErr = ( Σ_{i=1}^{1000} 1[k̂_{i,ClPP} ≠ k_i] ) / ( Σ_{i=1}^{1000} 1[k̂_{i,LDA} ≠ k_i] )    (16)
on the 1000 observations of the validation data. So if in a situation relErr < 1, then in terms of misclassification error ClPP is better than LDA. In 7 of the 2560 runs the simulated annealing did not converge, i.e., after the maximum number of iterations the function values of the points in the simplex differed by more than 1.0E−8. These runs were restarted with the same training and validation data; this time the SA converged fine. The average runtime of the algorithm was 15 sec on a 1600 MHz computer. The overall mean of relErr is 0.9991. So over all situations together the new ClPP method is slightly better than LDA (the "winner" in [12]). But interpretation of the overall mean is very difficult. As the experimental design is built to test the effects of situations which are more difficult for LDA, it is not surprising that ClPP is better than LDA. The overall mean does not indicate that ClPP is outperforming LDA on average –
as we used simulated data. So it is much more interesting to investigate the effects of the characteristics of the data. This is done by a regression of log(relErr) on the coded factors. relErr is logarithmized to transform the range of the response variable to the whole of IR instead of only IR⁺. The estimated regression coefficients of the influencing factors are shown together with their p-values in Table 2. The p-value indicates the probability that, under the hypothesis that the true regression coefficient equals zero, a value larger than or equal to the observed one occurs.

Table 2. Estimated Coefficients and p-values of Main Effects

Factor  Coefficient  p-value
obs       −0.0089     0.000
kurt      −0.0046     0.008
skew      −0.0026     0.132
dep       −0.0022     0.192
defl      −0.0003     0.873
devar      0.0033     0.052
dir        0.0100     0.000
Table 2 shows that ClPP is significantly better than LDA if there are fewer observations in the training data. ClPP also performs better than LDA if there are departures from the normal-distribution assumption (i.e., if the distribution has higher kurtosis and/or skewness). The improvement with higher kurtosis is significant. Concerning misclassification error, ClPP performs worse relative to LDA in the case of deflected variables. It has difficulties handling situations in which the deflected variables are transformed towards the nearest different class (dir). An interesting effect is that the two classification methods do not differ in the way they react to the number of deflected observations (defl, p-value > 0.5). The estimated coefficients (and p-values) of the (at the 25% significance level) important interactions are shown in Table 3.

Table 3. Estimated Coefficients and p-values of Significant Interactions

Interaction  Coefficient  p-value
obs:skew       −0.0024     0.166
obs:dep        −0.0031     0.075
kurt:skew       0.0022     0.201
kurt:devar      0.0020     0.239
skew:devar      0.0023     0.176
defl:dir        0.0061     0.000
devar:dir       0.0062     0.000
Obviously the result that the performance differences between ClPP and LDA depend mainly on the number of observations and the direction of the deflection is confirmed.
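The relative error (16) used throughout this section is simple to compute; a sketch with illustrative label vectors:

```python
def rel_err(pred_clpp, pred_lda, truth):
    """Eq. (16): ratio of the misclassification counts of ClPP and LDA
    on the same validation observations (< 1 means ClPP is better)."""
    err_clpp = sum(p != t for p, t in zip(pred_clpp, truth))
    err_lda = sum(p != t for p, t in zip(pred_lda, truth))
    return err_clpp / err_lda
```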
6 Real World Example

In the following, the classification performance of ClPP is compared to LDA in a real-world problem. The data set consists of 13 economic variables of the German business cycle, with 157 quarterly observations from 1955/4 to 1994/4 (see [6]). The German business cycle is classified in a four-phase scheme: upswing, upper turning point, downswing, and lower turning point. There were 6 complete cycles in the time period. The prediction ability was tested by leave-one-cycle-out validation: one cycle was left out as a validation set, the other 5 cycles were used to train the method, and then the misclassification rate was estimated on the validation set. It is shown in [13] that in general LDA is among the best classifiers for this classification task. Despite the fact that the observed group sizes vary, the a-priori group probabilities are set equal. As it turned out that 'unit labor costs' (LC) and 'wage and salary earners' (L) are the most stable economic indicators for business-cycle classification, LDA and ClPP were also compared using only these two variables. The results are shown in Table 4.
Table 4. Estimated Error Rates in German Business-Cycle Classification

Var   LDA   ClPP
all   0.49  0.45
L,LC  0.36  0.35
Table 4 confirms the results of the simulation study: ClPP performs slightly better than LDA.
7 Conclusion and Outlook

The new Classification Pursuit Projection method based on a simulated annealing algorithm slightly outperforms classical LDA (and therefore a lot of other classification methods) in many situations. Within a simulation study constructed using statistical Design of Experiments it was possible to find factors that influence the relative misclassification error. On data coming from a real-world example ClPP was also slightly better than LDA. In future work ClPP will be compared with the SVM. Also a new Design of Experiments with further (possible) influencing factors, like the number of predictor variables or the number of classes, will be constructed. It should be noted that the superiority of ClPP should be high in situations with high collinearity in the data, which has not been tested so far. The flexibility of simulated annealing can be used to construct classifiers that are optimized to classify future observations.
Acknowledgment This work has been supported by the Collaborative Research Center ‘Reduction of Complexity in Multivariate Data Structures’ (SFB 475) of the German Research Foundation (DFG).
References
1. Dirk Eddelbüttel. Object-oriented econometrics: Matrix programming in C++ using gcc and newmat. Journal of Applied Econometrics, 11(2):299–314, 1996.
2. Ildiko E. Frank and Jerome H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):199–209, 1993.
3. David A. Harville. Matrix Algebra From a Statistician's Perspective. Springer, 1997.
4. Trevor Hastie, Andreas Buja, and Robert Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.
5. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001.
6. Ulrich Heilemann and H.J. Münch. West German business cycles 1963–1994: A multivariate discriminant analysis. In: CIRET-Conference in Singapore, CIRET-Studien 50, 1996.
7. Klaus Hinkelmann and Oscar Kempthorne. Design and Analysis of Experiments, Volume I: Introduction to Experimental Design. Wiley, 1994.
8. Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
9. N.L. Johnson. Bivariate distributions based on simple translation systems. Biometrika, 36:149–176, 1949.
10. W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C. Cambridge University Press, second edition, 1992.
11. Michael C. Röhl, Claus Weihs, and Winfried Theis. Direct minimization of error rates in multivariate classification. Computational Statistics, 17:29–46, 2002.
12. Ursula Sondhauß and Claus Weihs. Standardized partition spaces. In Wolfgang Härdle and Bernd Rönz, editors, Proceedings in Computational Statistics, pages 539–544, 2002.
13. Claus Weihs and Ursula Garczarek. Stability of multivariate representation of business cycles over time. Technical Report 20, Sonderforschungsbereich 475, Universität Dortmund, 2002.
14. Claus Weihs and Torsten Hothorn. Determination of optimal prediction oriented multivariate latent factor models using loss functions. Technical Report 15, Sonderforschungsbereich 475, Universität Dortmund, 2002.
Global Search through Sampling Using a PDF Benny Raphael and Ian F.C. Smith IMAC, EPFL-Federal Institute of Technology CH-1015 Lausanne Switzerland {Benny.Raphael,Ian.Smith}@epfl.ch
Abstract. This paper presents a direct search algorithm called PGSL - Probabilistic Global Search Lausanne. PGSL performs global search by sampling the solution space using a probability density function (PDF). The PDF is updated in four nested cycles such that the search focuses on regions containing good solutions without avoiding other regions altogether. Tests on benchmark problems having multi-parameter non-linear objective functions revealed that PGSL performs better than genetic algorithms in most cases that were studied. Furthermore, as problem sizes increase, PGSL performs increasingly better than other approaches. Finally, PGSL has already proved to be valuable for engineering tasks in areas of design, diagnosis and control.
Keywords: Global search, optimization, PGSL, stochastic search, genetic algorithms.
1 Introduction

A direct method for numerical optimisation is any algorithm that depends on the objective function only through ranking a countable set of function values [1]. Direct methods do not compute or approximate values of derivatives. They use the value of the objective function only to determine whether a point ranks higher than other points. According to this definition, genetic algorithms and simulated annealing are direct search techniques, whereas techniques such as conjugate gradient are not. Direct search methods are important in engineering applications since in most practical situations the objective function cannot be written as a closed-form mathematical expression and derivatives cannot be easily computed. This paper presents a direct search method called Probabilistic Global Search Lausanne (PGSL). PGSL results are compared with genetic algorithms for difficult benchmark problems. Trends with increasing problem size are also studied.
2 Existing Search Techniques

Global search techniques [2, 3, 4] and particularly stochastic search techniques [5, 6, 7] have recently attracted much attention.
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 71–82, 2003.
© Springer-Verlag Berlin Heidelberg 2003
In spite of the advances in other techniques, the most widely used global search techniques for practical applications are still simulated annealing [8] and genetic algorithms [9]. The following paragraphs contain brief summaries of selected search methods.
Adaptive Random Search: Pure random search procedures have been used for optimization problems as early as 1958 [10]. These techniques are attractive due to their simplicity. However, they converge extremely slowly to a global optimum in parameter spaces of many dimensions. In order to improve convergence, "random creep" procedures are used in which exploratory steps are limited to a hyper-sphere centred about the latest successful point. Masri and Bekey [11] have proposed an algorithm called Adaptive Random Search in which the step size of the random search procedure is optimized periodically throughout the search process. Controlled Random Search (CRS) [12, 13] is another search method that samples points in the neighbourhood of the current point through the use of a probability density function.
Local Search with Multiple Random Starts: Local search techniques involve iteratively improving a solution point by searching its neighbourhood for better solutions. If better solutions are not found, the process terminates; the current point is taken as a locally optimal solution. Since local search performs poorly when there are multiple local optima, a modification of this technique has been suggested in which local search is repeated several times from randomly selected starting points. This process is computationally expensive because after every iteration the search re-starts from a point that is possibly far away from the optimum. Also, the search might converge to the same point obtained in a previous iteration. Furthermore, no information obtained from previous iterations is reused.
Random Bit Climbing (RBC) [14] is a form of local search in which neighbouring points are evaluated in random order and the first move producing an improvement is accepted for the next stage.

Partition Methods: Partition methods iteratively divide the search space into a finite number of partitions (sub-regions) [15, 16]. A quantity known as the "promising index" is computed for each partition. An example of a promising index is the best value of the objective function among a fixed number of sample points generated randomly within the partition. The partition with the maximum promising index is chosen for further exploration. This method is similar to branch and bound. In the branch and bound strategy, the search space is represented as a tree structure and branches where good solutions are unlikely to be found are eliminated from the search; branching is usually performed by estimating the upper and lower bounds of the function in each branch. In partition methods, the branching strategy is based on the promising index, so there is no need to estimate the bounds of the objective function.
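A one-dimensional sketch of a partition method with a "best sampled value" promising index may clarify the idea; all names and constants here are illustrative assumptions, not taken from [15] or [16]:

```python
import random

def partitioned_search(f, lo, hi, n_partitions=4, n_samples=10, depth=8, seed=0):
    """Split the current region into partitions, estimate a promising
    index per partition from random samples (best objective value, for
    minimisation), and recurse into the most promising partition."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(depth):
        width = (hi - lo) / n_partitions
        scores = []
        for k in range(n_partitions):
            a = lo + k * width
            xs = [rng.uniform(a, a + width) for _ in range(n_samples)]
            fs = [f(x) for x in xs]
            i = min(range(n_samples), key=fs.__getitem__)
            scores.append((fs[i], xs[i], a))       # (index value, point, left edge)
        fk, xk, a = min(scores)                    # partition with best index
        if fk < best_f:
            best_x, best_f = xk, fk
        lo, hi = a, a + width                      # recurse into that partition
    return best_x, best_f

x_best, f_best = partitioned_search(lambda x: (x - 2.0) ** 2, -10.0, 10.0)
```

Unlike branch and bound, no bounds on the objective are ever computed; only the sampled promising index drives the choice of sub-region.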
Global Search through Sampling Using a PDF

3 Probabilistic Global Search Lausanne

The Probabilistic Global Search Lausanne (PGSL) algorithm was developed from the observation that optimally directed solutions can be obtained efficiently through careful sampling of the search space, without using special operators. The principal assumption is that better points are likely to be found in the neighbourhood
of families of good points. Hence, search is intensified in regions containing good solutions. The search space is sampled by means of a probability density function (PDF) defined over the entire search space. Each axis is divided into a fixed number of intervals and a uniform probability distribution is initially assumed. As search progresses, intervals and probabilities are dynamically updated so that sets of points are generated with higher probability in regions containing good solutions. The search space is gradually narrowed down so that convergence is achieved.

    Sub-domain cycle (repeat NSDC times)
      Focusing cycle (repeat NFC times)
        Probability updating cycle (repeat NPUC times)
          Sampling cycle: evaluate NS samples

Fig. 1. Nested cycles in PGSL
The algorithm includes four nested cycles (Fig. 1):
• Sampling cycle
• Probability updating cycle
• Focusing cycle
• Subdomain cycle

In the sampling cycle (the innermost cycle), a certain number of samples, NS, are generated randomly according to the current PDF. Each point is evaluated by the user-defined objective function and the best point is selected. In the next cycle, probabilities of regions containing good solutions are increased and probabilities of regions containing less attractive solutions are decreased (Fig. 2). In the third cycle, search is focused on the interval containing the best solution after a number of probability updating cycles, by further subdivision of that interval (Fig. 3). In the subdomain cycle, the search space is progressively narrowed by selecting a subdomain of smaller size centred on the best point after each focusing cycle. This is done by multiplying the current width of each axis by a scale factor.

Each cycle serves a different purpose in the search for a global optimum. The sampling cycle permits a more uniform and exhaustive search over the entire search space than the other cycles. Probability updating and focusing cycles refine search in the neighbourhood of good solutions. Convergence is achieved by means of the subdomain cycle. The complete flowchart for the algorithm is shown in Fig. 4, Fig. 5 and Fig. 6. The terminating condition for all cycles, except the sampling cycle, is the
completion of the specified number of iterations or the value of the objective function becoming smaller than a user-defined threshold.
Fig. 2. Evolution of the PDF of a variable after several probability updating cycles
Fig. 3. Evolution of the PDF of a variable after several focusing cycles
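The four nested cycles can be summarized in a single-variable Python sketch. Interval subdivision and the exponential redistribution of probabilities are simplified here to weight updates on a fixed histogram, so this is a structural illustration with assumed parameter values, not the full PGSL algorithm:

```python
import random

def pgsl_sketch(f, bounds, NS=20, NPUC=5, NFC=5, NSDC=10, n_int=20, seed=0):
    """Minimal single-variable illustration of PGSL's four nested cycles."""
    rng = random.Random(seed)
    lo, hi = bounds
    best_x, best_f = None, float("inf")
    for _ in range(NSDC):                          # sub-domain cycle
        weights = [1.0] * n_int                    # uniform histogram PDF
        for _ in range(NFC):                       # focusing cycle
            for _ in range(NPUC):                  # probability updating cycle
                for _ in range(NS):                # sampling cycle
                    k = rng.choices(range(n_int), weights)[0]
                    x = lo + (k + rng.random()) * (hi - lo) / n_int
                    fx = f(x)
                    if fx < best_f:
                        best_x, best_f = x, fx
                k = int((best_x - lo) / (hi - lo) * n_int)
                if 0 <= k < n_int:
                    weights[k] += 1.0              # reward the best interval
        half = (hi - lo) / 4.0                     # halve the sub-domain width,
        lo, hi = best_x - half, best_x + half      # centred on the best point
    return best_x, best_f

x_best, f_best = pgsl_sketch(lambda t: (t - 1.234) ** 2, (-10.0, 10.0))
```

The outermost loop is what drives convergence: the sub-domain shrinks around the best point after every focusing cycle.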
    Start
      Choose the complete domain as the current sub-domain;
      set the current best solution, CBEST, to NULL
      Sub-domain cycle:
        Complete NFC iterations of the focusing cycle (Fig. 5);
        select the best solution, SUBDOMAIN-BEST
        If SUBDOMAIN-BEST is better than CBEST, update CBEST
        Choose a smaller sub-domain centred around CBEST
        as the current sub-domain
        Repeat until the terminating condition is satisfied
    End

Fig. 4. Flow chart for the PGSL algorithm
    Beginning of focusing cycle
      Assume a uniform PDF throughout the current sub-domain;
      set SUBDOMAIN-BEST to NULL
      Complete NPUC iterations of the probability updating cycle (Fig. 6);
      select the best solution, PUC-BEST
      If PUC-BEST is better than SUBDOMAIN-BEST, update SUBDOMAIN-BEST
      Sub-divide the interval containing PUC-BEST and redistribute
      probabilities according to an exponentially decaying function
      Repeat until the terminating condition is satisfied
    End of focusing cycle

Fig. 5. Flow chart (continued). The focusing cycle
Important parameters of the algorithm are listed below:
• Number of samples, NS: the number of samples evaluated in the sampling cycle.
• Iterations in the probability updating cycle, NPUC: the number of times the sampling cycle is repeated within a probability updating cycle.
• Iterations in the focusing cycle, NFC: the number of times the probability updating cycle is repeated within a focusing cycle.
• Iterations in the subdomain cycle, NSDC: the number of times the focusing cycle is repeated within a subdomain cycle.
• Subdomain scale factors, SDSF1 and SDSF2: the default factors for scaling down the axis widths in the subdomain cycle. SDSF1 is used when there is an improvement and SDSF2 when there is none.

3.1 Similarities with Existing Random Search Methods

A common feature that PGSL shares with other random search methods such as adaptive random search (ARS) and controlled random search (CRS) is the use of a PDF
(Probability Density Function). However, this similarity is only superficial. The following is a list of important differences between PGSL and other random methods.

1. Most random methods follow a "creep" procedure similar to simulated annealing. They aim for point-to-point improvement by restricting search to a small region around the current point; the PDF is used to search within this neighbourhood. PGSL, on the other hand, works by global sampling. There is no point-to-point movement.
2. The four nested cycles in PGSL have no counterpart in other algorithms.
3. The representation of probabilities is different. Other methods use a mathematical function with a single peak (e.g. a Gaussian) for the PDF. PGSL uses a histogram, a discontinuous function with multiple peaks. This allows fine control over probabilities in small regions by subdividing intervals.
4. Probabilities are updated differently. The primary mechanism for updating probabilities in other methods is to change the standard deviation. In PGSL, the entire shape of the PDF can be changed by subdividing intervals as well as by directly increasing the probabilities of intervals.

    Start of probability updating cycle
      Set PUC-BEST to NULL
      Sampling cycle: evaluate NS samples; select the best, BEST-SAMPLE
      If BEST-SAMPLE is better than PUC-BEST, update PUC-BEST
      Increment the probability of the interval containing PUC-BEST
      Repeat until the terminating condition is satisfied
    End of probability updating cycle

Fig. 6. Flow chart (continued). The probability updating cycle
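The sampling and probability-updating mechanics on a histogram PDF can be sketched as follows. The update factor is an assumed constant, and the paper's interval subdivision and exponential redistribution are omitted; this only illustrates how probability mass accumulates on the interval containing the best point:

```python
import random

def sample_from_histogram(rng, lo, hi, weights):
    """Draw one point from a piecewise-uniform (histogram) PDF."""
    k = rng.choices(range(len(weights)), weights)[0]
    width = (hi - lo) / len(weights)
    return lo + (k + rng.random()) * width

def reward_best_interval(weights, lo, hi, best_x, factor=1.1):
    """Increase the probability of the interval containing the best point
    (`factor` is an assumed constant, not taken from the paper)."""
    k = min(len(weights) - 1, int((best_x - lo) / (hi - lo) * len(weights)))
    weights[k] *= factor

# drive the updating cycle on f(x) = (x - 3.5)^2 over [0, 10]
rng = random.Random(1)
lo, hi = 0.0, 10.0
w = [1.0] * 10
best_x, best_f = None, float("inf")
for _ in range(50):                      # probability updating cycles
    for _ in range(20):                  # sampling cycle
        x = sample_from_histogram(rng, lo, hi, w)
        fx = (x - 3.5) ** 2
        if fx < best_f:
            best_x, best_f = x, fx
    reward_best_interval(w, lo, hi, best_x)
```

After a few cycles the interval [3, 4), which contains the minimizer 3.5, carries far more weight than its neighbours, so subsequent samples concentrate there.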
In spite of the apparent similarities with other random search methods, there are significant differences in the performance of PGSL. There is no evidence that random search methods such as ARS and CRS perform as well as genetic algorithms or simulated annealing on large problems. PGSL, however, performs as well as or better than these algorithms; results from the benchmark tests are presented in the next section. The subdomain cycle in PGSL is similar to partition methods (Section 2). However, the two methodologies differ in the following aspects:
• In partition methods, a sub-region is contained entirely within the parent region. In PGSL, the next sub-domain is chosen such that the current best point is at its centre (except when the point is located near the boundary). Therefore, the next sub-domain may contain regions that were not contained within the previous sub-domain; that is, sub-domains do not have a hierarchical structure.
• The choice of the next sub-domain in PGSL is not based on any "promising index" as in partition methods. In partition methods, the set of sub-partitions is defined a priori, and these sub-partitions are sampled uniformly in order to compute promising indices.
• Partition methods are oriented towards discrete optimization problems in which a graph representation of the solution space is natural. PGSL performs best when the variables are continuous and when discrete variables are reasonably ordered.
4 Comparison with Other Algorithms

The performance of PGSL is evaluated on several benchmark problems by comparing it with three versions of genetic algorithms: a simple genetic algorithm (ESGAT), a steady state genetic algorithm [17] (Genitor) and CHC [18]. CHC stands for Cross generational elitist selection, Heterogeneous recombination (by incest prevention) and Cataclysmic mutation. De Jong [19] first proposed common test functions (F1-F5) with multiple optima for evaluating genetic algorithms. However, it has been shown that local search can identify the global optima of some of these functions [14]. More difficult test functions have since been proposed [20]. These have higher degrees of non-linearity than F1-F5 and can be scaled to a large number of variables. Some of these functions are used for testing the performance of the PGSL algorithm. A short description of the test functions used in the benchmark studies is given below.

F8 (Griewank's function): a scalable, nonlinear, non-separable function given by
    f(x_i \mid i=1,\ldots,N) = 1 + \sum_{i=1}^{N} \frac{x_i^2}{4000} - \prod_{i=1}^{N} \cos\!\left(\frac{x_i}{\sqrt{i}}\right)
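A direct implementation of F8 (using `math.prod`, available from Python 3.8):

```python
import math

def griewank(x):
    """Griewank's function F8: 1 + sum(x_i^2/4000) - prod(cos(x_i/sqrt(i))).
    The global minimum is 0 at the origin."""
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, 1))
    return 1.0 + s - p
```

The cosine product introduces a large number of regularly spaced local optima on top of a slowly growing quadratic bowl, which is what makes the function a useful scalable benchmark.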
Expanded functions: Expanded functions [20] are constructed by starting with a primitive nonlinear function in two variables, F(x,y), and scaling to multiple variables using the formula,
    EF(x_i \mid i=1,\ldots,N) = \sum_{j=1}^{N} \sum_{i=1}^{N} F(x_i, x_j)
The expanded functions are no longer separable and introduce non-linear interactions across multiple variables. An example is the function EF10 which is created using the primitive function F10 shown below:
    F10(x, y) = (x^2 + y^2)^{0.25} \left[ \sin^2\!\left( 50 \, (x^2 + y^2)^{0.1} \right) + 1 \right]
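F10 and its expansion can be implemented directly. The double-sum form below follows the expansion formula as stated in this paper; note that some presentations of EF10 instead sum over consecutive variable pairs only, so treat the expansion as an assumption taken from the text:

```python
import math

def f10(x, y):
    """Primitive F10(x, y) = (x^2+y^2)^0.25 * [sin^2(50 (x^2+y^2)^0.1) + 1]."""
    r = x * x + y * y
    return r ** 0.25 * (math.sin(50.0 * r ** 0.1) ** 2 + 1.0)

def ef10(x):
    """Expanded function EF10: double sum of F10 over all variable pairs,
    following the double-sum formula given above."""
    return sum(f10(xi, xj) for xj in x for xi in x)
```

Because F10(x, y) vanishes only when both arguments are zero, the expansion couples every pair of variables and is no longer separable.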
Composite functions: A composite function can be constructed from a primitive function F(x1, x2) and a transformation function T(x,y) using the formula
    EF(x_i \mid i=1,\ldots,N) = F(T(x_N, x_1)) + \sum_{i=1}^{N-1} F(T(x_i, x_{i+1}))
The composite function EF8avg is created from Griewank's function, F8, using the transformation function T(x, y) = (x + y)/2. The composite test function EF8F2 is created from Griewank's function, F8, using the De Jong function F2 as the transformation function. F2 is defined as
    F2(x, y) = 100 \, (x^2 - y)^2 + (1 - x)^2

The composite functions are known to be much harder than the primitive functions and are resistant to hill climbing [20].

4.1 Description of the Tests

Four test functions are used for comparison: F8, EF10, EF8AVG and EF8F2. All of these test functions have a known optimum (minimum) of zero. It is known that local search techniques perform poorly on these problems [20]. Results from PGSL are compared with those reported for three programs based on genetic algorithms, namely ESGAT, CHC and Genitor [20]. All versions of genetic algorithms used 22-bit Gray-coded encoding. For EF10, variables are in the range [-100, 100]. For F8, EF8AVG and EF8F2 the variable range is [-512, 511]. Results are summarised in Tables 1-4. Thirty trial runs were performed for each problem using different seed values for the random numbers. In each trial, a maximum of 500,000 evaluations of the objective function is allowed. Performance is compared using three criteria:
1. Succ, the success rate: the number of trials in which the global optimum was found;
2. the mean solution obtained over all trials; the closer the mean solution is to zero (the global optimum), the better the algorithm's performance;
3. the mean number of evaluations of the objective function required to obtain the global optimum (only for trials in which the optimum was found).

4.1.1 Simple F8 Test Function

Results for the simple F8 test function are given in Table 1. Thirty trial runs were performed on problems with 10, 20, 50 and 100 variables. PGSL has a success rate of 100% for 50 and 100 variables; no version of GA is able to match this. (Surprisingly, the success rate is slightly lower for fewer variables.) However, the mean number of evaluations needed to obtain the optimum is higher than for CHC and Genitor on this problem.
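The composite construction can be sketched as follows, assuming that F8 is applied in its one-variable form to the scalar output of the transformation function T; this interpretation is our assumption, chosen because it reproduces the known global minimum of zero for both composites:

```python
import math

def f8_1d(z):
    """One-variable Griewank term, applied to the scalar output of T
    (an assumed composition; see the lead-in above)."""
    return 1.0 + z * z / 4000.0 - math.cos(z)

def t_avg(x, y):                     # transformation for EF8avg
    return (x + y) / 2.0

def f2(x, y):                        # De Jong's F2 as transformation for EF8F2
    return 100.0 * (x * x - y) ** 2 + (1.0 - x) ** 2

def composite(T, x):
    """EF(x) = F8(T(x_N, x_1)) + sum_{i=1}^{N-1} F8(T(x_i, x_{i+1}))."""
    total = f8_1d(T(x[-1], x[0]))
    for i in range(len(x) - 1):
        total += f8_1d(T(x[i], x[i + 1]))
    return total

ef8avg = lambda x: composite(t_avg, x)
ef8f2 = lambda x: composite(f2, x)
```

EF8avg vanishes at the origin (every average is 0), while EF8F2 vanishes at the all-ones vector (F2(1, 1) = 0), which is part of what makes EF8F2 resistant to hill climbing.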
4.1.2 EF10 Test Function

Results for the extended function EF10 are summarised in Table 2. PGSL has a success rate of 27 out of 30 runs even for 50 variables. For all criteria, PGSL performs better than all versions of GAs.

4.1.3 EF8AVG Test Function

Results for the composite function EF8AVG are summarised in Table 3. For 20 and 50 variables, none of the algorithms is able to find the exact global optimum. For 10 variables the performance of CHC is comparable with that of PGSL. In terms of the mean value of the optimum, PGSL outperforms all other algorithms.

4.1.4 EF8F2 Test Function

Results for the composite function EF8F2 are given in Table 4. None of the algorithms is able to find the global optimum for this problem. However, in terms of the quality of the mean solution, PGSL fares better than the rest.

Table 1. Results for the simple F8 test function. PGSL is compared with results reported in [20]. The global minimum is 0.0 for all instances

                                   Number of variables
                                   10       20       50       100
    Successes        ESGAT         6        5        0        0
                     CHC           30       30       29       20
                     Genitor       25       17       21       21
                     PGSL          28       29       30       30
    Mean             ESGAT         0.0515   0.0622   0.0990   0.262
    solution         CHC           0.0      0.0      0.00104  0.0145
                     Genitor       0.00496  0.0240   0.0170   0.0195
                     PGSL          0.0007   0.0002   0.0      0.0
    Mean number of   ESGAT         354422   405068   -        -
    evaluations      CHC           51015    50509    182943   242633
                     Genitor       92239    104975   219919   428321
                     PGSL          283532   123641   243610   455961
Table 2. Results for the extended function EF10. The global minimum is 0.0 for all instances

                                   Number of variables
                                   10       20       50
    Successes        ESGAT         25       2        0
                     CHC           30       30       3
                     Genitor       30       4        0
                     PGSL          30       30       27
    Mean             ESGAT         0.572    1.617    770.576
    solution         CHC           0.0      0.0      7.463
                     Genitor       0.0      3.349    294.519
                     PGSL          0.0      0.0      0.509639
    Mean number of   ESGAT         282299   465875   -
    evaluations      CHC           51946    139242   488966
                     Genitor       136950   339727   -
                     PGSL          61970    119058   348095
Table 3. Results for the EF8AVG test function. The global minimum is 0.0 for all instances

                                   Number of variables
                                   10       20       50
    Successes        ESGAT         0        0        0
                     CHC           10       0        0
                     Genitor       5        0        0
                     PGSL          9        0        0
    Mean             ESGAT         3.131    8.880    212.737
    solution         CHC           1.283    8.157    83.737
                     Genitor       1.292    12.161   145.362
                     PGSL          0.0151   0.1400   1.4438
    Mean number of   ESGAT         -        -        -
    evaluations      CHC           222933   -        -
                     Genitor       151369   -        -
                     PGSL          212311   -        -
Table 4. Results for the EF8F2 test function. The global minimum is 0.0 for all instances

                         Number of variables
                         10         20       50
    Mean      ESGAT      4.077      47.998   527.1
    solution  CHC        1.344      5.63     75.0995
              Genitor    4.365      21.452   398.12
              PGSL       0.123441   0.4139   1.6836
4.2 Summary of Comparisons

For the test functions F8 and EF10, PGSL enjoys a near 100% success rate in locating the global optimum even for large instances with more than 50 variables (Tables 1 and 2). The other algorithms are not able to match this performance. For the other two test functions (Tables 3 and 4), none of the algorithms is able to locate the global optima for instances larger than 20 variables. However, the mean value of the minima identified by PGSL is much lower than those found by the other algorithms. Among the three implementations of GAs considered in this section, CHC performs better than the rest. In most cases, the quality of results produced by PGSL is better than that of CHC in terms of success rate and mean value of the optima. However, PGSL requires a greater number of evaluations than CHC, especially for small problems. When the number of variables is increased, PGSL performs better than the other algorithms.

4.3 Other Comparisons

PGSL has been compared with algorithms such as ASA (adaptive simulated annealing), GAS (a version of genetic algorithms) and INTGLOB (integral global optimization) using benchmark tests described in [22]. PGSL performs better than genetic algorithms and simulated annealing in 19 out of 23 cases that were studied. Details can be found in [23]. The capability of PGSL to identify the global optimum in the case of complex mathematical functions containing an exponential number of local optima was tested
using the Lennard-Jones cluster optimization problem. The global minima were located without the use of gradients or problem-specific heuristics for all instances up to 25 atoms [23]. This is remarkable considering that there are about 10^10 local optima for a cluster of 25 atoms. Beyond 25 atoms, PGSL was combined with local search using gradients in order to speed up the search. Reported global minima for all instances up to 74 atoms were found. The instance with 74 atoms (219 variables) required only 1269815 evaluations of the objective function, whereas the objective function contains more than 10^20 local minima. PGSL has also been applied to practical engineering tasks such as design, diagnosis and control. Its application to the design of timber shear wall structures resulted in savings of about 10% in factory production costs [24]. PGSL has also been used in the control of tensegrity structures [26] and for model identification and calibration in the diagnosis of bridges [25].
5 Conclusions

Although probabilistic methods for global search have been in use for about half a century, only recently have they attracted widespread attention in the engineering and scientific communities. Considerable progress has been made in the area of direct search during the last decades. For example, the development of genetic algorithms and simulated annealing has spawned much activity. Genetic algorithms and simulated annealing are direct search methods and are well suited for practical applications where objective functions cannot be formulated as closed-form mathematical expressions. PGSL is a new direct search algorithm. Its performance is comparable with, and can be better than, that of existing techniques. Benchmark tests indicate that it performs well even when objective functions are highly non-linear. Results are always better than those of the simple genetic algorithm and the steady state genetic algorithm for the expanded test functions considered in this paper.
Acknowledgements This research is funded by the Swiss National Science Foundation (NSF) and the Commission for Technology and Innovation (CTI). We would like to thank K. De Jong and K. Shea for valuable comments and suggestions. We would also like to thank Logitech SA and Silicon Graphics Incorporated for supporting this research.
References

1. M. W. Trosset, I Know It When I See It: Toward a Definition of Direct Search Methods, SIAG/OPT Views-and-News, No. 9, pp. 7-10 (Fall 1997).
2. S. Ratschan, Search Heuristics for Box Decomposition Methods, Journal of Global Optimization, 24(3), pp. 35-49 (2002).
3. T. V. Voorhis, A Global Optimization Algorithm using Lagrangian Underestimates and the Interval Newton Method, Journal of Global Optimization, 24(3), pp. 349-370 (2002).
4. R. M. Lewis, V. Torczon and M. W. Trosset, Direct search methods: then and now, Journal of Computational and Applied Mathematics, 124, pp. 191-207 (2000).
5. D. J. Reaume, H. E. Romeijn and R. L. Smith, Implementing pure adaptive search for global optimisation using Markov chain sampling, Journal of Global Optimization, 20, pp. 33-47 (2001).
6. E. M. T. Hendrix and O. Klepper, On uniform covering, adaptive random search and raspberries, Journal of Global Optimization, 18, pp. 143-163 (2000).
7. P. Brachetti, M. F. Ciccoli, G. Pillo and S. Lucidi, A new version of Price's algorithm for global optimisation, Journal of Global Optimisation, 10, pp. 165-184 (1997).
8. S. Kirkpatrick, C. Gelatt and M. Vecchi, Optimization by simulated annealing, Science, 220, pp. 671-680 (1983).
9. J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press (1975).
10. S. H. Brooks, Discussion of random methods for locating surface maxima, Operations Research, 6, pp. 244-251 (1958).
11. S. F. Masri and G. A. Bekey, A global optimization algorithm using adaptive random search, Applied Mathematics and Computation, Vol. 7, pp. 353-375 (1980).
12. W. L. Price, A controlled random search procedure for global optimization, in Towards Global Optimization 2, L. C. W. Dixon and G. P. Szego (eds.), North-Holland, Amsterdam (1978).
13. P. Brachetti, M. F. Ciccoli, G. Pillo and S. Lucidi, A new version of Price's algorithm for global optimisation, Journal of Global Optimisation, 10, pp. 165-184 (1997).
14. L. Davis, Bit-climbing, representational bias and test suite design, in Proceedings of the 4th International Conference on GAs (L. Booker and R. Belew, eds.), Morgan Kaufmann (1991).
15. Z. B. Tang, Adaptive Partitioned Random Search to Global Optimization, IEEE Transactions on Automatic Control, Vol. 39, pp. 2235-2244 (1994).
16. L. Shi and S. Olafsson, Nested Partitions method for global optimization, Operations Research, Vol. 48, pp. 390-407 (2000).
17. G. Syswerda, A study of reproduction in generational and steady-state genetic algorithms, Foundations of Genetic Algorithms (G. Rawlins, ed.), Morgan Kaufmann, pp. 94-101 (1991).
18. L. Eshelman, The CHC adaptive search algorithm, Foundations of Genetic Algorithms (G. Rawlins, ed.), Morgan Kaufmann, pp. 256-283 (1991).
19. K. De Jong, Analysis of the behaviour of a class of genetic adaptive systems, Ph.D. thesis, University of Michigan, Ann Arbor (1975).
20. D. Whitley, Building better test functions, in Proceedings of the 6th International Conference on GAs (L. Eshelman, ed.), Morgan Kaufmann (1995).
21. O. Martin, Combining simulated annealing with local search heuristics, Metaheuristics in Combinatorial Optimization (G. Laporte and I. Osman, eds.) (1995).
22. M. Mongeau, H. Karsenty, V. Rouzé and J.-B. Hiriart-Urruty, Comparison of public-domain software for black box global optimization, Optimization Methods & Software, 13(3), pp. 203-226 (2000).
23. B. Raphael and I. F. C. Smith, A direct stochastic algorithm for global search, Applied Mathematics and Computation, in press (2003).
24. P. Svarerudh, B. Raphael and I. F. C. Smith, Lowering costs of timber shear-wall design using global search, Engineering with Computers, Vol. 18, No. 2, pp. 93-108 (2002).
25. Y. Robert-Nicoud, B. Raphael and I. F. C. Smith, Decision support through multiple models and probabilistic search, in Proceedings of Construction Information Technology 2000, Icelandic Building Research Institute (2000).
26. B. Domer, B. Raphael, K. Shea and I. F. C. Smith, Comparing two stochastic search techniques for structural control, Journal of Structural Engineering, Vol. 129, No. 5 (2003).
Simulated Annealing for Optimal Pivot Selection in Jacobian Accumulation

Uwe Naumann and Peter Gottschling
Mathematics and Computer Science Division, Argonne National Laboratory
[email protected]

Abstract. We report on new results in logarithmic simulated annealing applied to the optimal Jacobian accumulation problem. This is a continuation of work that was presented at the SAGA'01 conference in Berlin, Germany [16]. We discuss the optimal edge elimination problem in linearized computational graphs [15] in the context of linear algebra. We introduce row and column pivoting on the extended Jacobian as analogs to front and back edge elimination in linearized computational graphs. Neighborhood relations for simulated annealing are defined on a metagraph that is derived from the computational graph. All prerequisites for logarithmic simulated annealing are fulfilled for dyadic pivoting, which is equivalent to vertex elimination in linearized computational graphs [7]. For row and column pivoting we cannot yet give a proof that the corresponding elimination sequences are polynomial in size. In practice, however, the likelihood for an exponential elimination sequence to occur is negligible. Numerical results are presented for algorithms based on both homogeneous and inhomogeneous Markov chains for all pivoting techniques. The superiority of row and column pivoting over dyadic pivoting can be observed when applying these techniques to Roe's numerical flux [17].

Keywords: Jacobian matrices, dyadic pivoting, row and column pivoting, (logarithmic) simulated annealing
1 Background

Derivatives of vector functions that represent mathematical models of scientific, engineering, or economic problems play an increasingly important role in modern numerical computing. They can be regarded as the enabling key factor allowing for a transition from the pure simulation of a real-world process to the optimization of some specific objective with respect to a set of model parameters. For a given computer program that implements an underlying numerical model, automatic differentiation (AD) [3-6] provides a set of techniques for transforming the program into one that computes not only the function value for a set of inputs but also the corresponding first and higher derivatives. A large portion of the ongoing research in this field is aimed at improving the efficiency of the generated derivative code. Successful methods are often built on a combination of classical compiler algorithms and the exploitation of mathematical properties of the code. In this paper we consider the problem of minimizing the number of floating-point operations performed by those parts of the derivative code that do not depend on the flow of control, for example, basic blocks [1]. Approximate solutions of the corresponding combinatorial optimization problem are obtained by exploiting the associativity of the chain rule.

A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 83–97, 2003. © Springer-Verlag Berlin Heidelberg 2003

Before stating the problem formally as an elimination method in a system of linear equations, we present a brief introduction to the principles of AD and provide background information that we consider essential for the understanding of the material covered by subsequent sections. Consider the following Fortran subroutine:

    SUBROUTINE EX (n,x)
    INTEGER :: n
    REAL, DIMENSION(n), INTENT(INOUT) :: x
    DO i=1,n-1
      x(i)=x(i)*sin(x(i)*x(i+1))
      x(i+1)=x(i)*x(i+1)
      x(i)=cos(x(i))
    END DO
    END SUBROUTINE EX

The final value of x = (x(1), ..., x(n))^T is computed from its start value by n-1 executions of the statements that form the body of the loop. The code implements a vector function F : IR^n → IR^n of the form y = F(x). After running the example code, the values of y happen to be stored in x. An example similar to the body of the loop was used in [16] to introduce vertex elimination techniques in linearized computational graphs and the resulting combinatorial optimization problem. Suppose that one is interested in the Jacobian matrix F' of F, that is, the matrix of the partial derivatives of the outputs with respect to the inputs. It is defined as

    F' = F'(x) = ( ∂y_j / ∂x_i )_{j,i=1,...,n}

where x_i denotes the input value of x(i) and y_j the corresponding output value. The forward mode of AD [6, Chapter 3] transforms the code semantically such that, in addition to the function value, the product of the Jacobian with some direction ẋ in the input space is computed. The result of this transformation is an implementation of the function

    (y, ẏ) = Ḟ(x, ẋ) ≡ (F(x), F'(x) · ẋ) .

Given such a program, the Jacobian itself can be computed by letting the directions ẋ range over the Cartesian basis vectors in IR^n since, obviously, F' = F' · I_n, where I_n denotes the identity matrix in IR^n. How is this transformation performed?
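Before walking through the mechanical source transformation, the effect of the forward mode can be illustrated with a minimal dual-number sketch in Python. This is illustrative only (operator overloading rather than the source transformation performed by AD tools), applied to one execution of the loop body of EX:

```python
import math

class Dual:
    """A value paired with one directional derivative component."""
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d
    def __mul__(self, o):                       # product rule
        return Dual(self.v * o.v, self.v * o.d + self.d * o.v)

def dsin(u): return Dual(math.sin(u.v), math.cos(u.v) * u.d)
def dcos(u): return Dual(math.cos(u.v), -math.sin(u.v) * u.d)

def body(x0, x1):
    """The loop body of EX for a single i, executed on dual numbers."""
    x0 = x0 * dsin(x0 * x1)
    x1 = x0 * x1
    x0 = dcos(x0)
    return x0, x1

# directional derivative along the first Cartesian basis vector at (0.5, 0.25)
y0, y1 = body(Dual(0.5, 1.0), Dual(0.25, 0.0))
```

Seeding the derivative components with the Cartesian basis vectors, one direction at a time, recovers the columns of the local Jacobian, exactly as described for F' = F' · I_n above.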
The statements of the loop body can be decomposed into a sequence of scalar assignments of the results of all elemental arithmetic operators and intrinsic functions to locally unique intermediate variables z1, z2, z3. This code list can be augmented by statements that compute for all elemental assignments the local partial derivatives a, ..., h of the left-hand side with respect to the arguments on the right-hand side.¹

¹ Software tools for AD usually avoid the generation of trivial assignments of the form a=x(i+1). We decided to leave them in the code to facilitate an easier understanding of the concepts that forward mode AD is built on.
    DO i=1,n-1
      a=x(i+1); b=x(i)
      z1=x(i)*x(i+1)
      c=cos(z1)
      z2=sin(z1)
      d=z2; e=x(i)
      z3=x(i)*z2
      f=x(i+1); g=z3
      x(i+1)=z3*x(i+1)
      h=-sin(z3)
      x(i)=cos(z3)
    END DO

The resulting code is often referred to as the linearized code list. The next transformation step associates directional derivative components with all input, intermediate, and output variables, and it combines the well-known differentiation rules with the chain rule to generate a tangent-linear version of the code.

    SUBROUTINE D_EX (n,x,l,d_x)
    INTEGER :: n,l
    REAL, DIMENSION(n), INTENT(INOUT) :: x
    REAL, DIMENSION(n,l), INTENT(INOUT) :: d_x
    REAL :: z1,z2,z3,a,b,c,d,e,f,g,h
    REAL, DIMENSION(l) :: d_z1, d_z2, d_z3
    DO i=1,n-1
      a=x(i+1); b=x(i)
      d_z1=a*d_x(i,:)+b*d_x(i+1,:)
      z1=x(i)*x(i+1)
      c=cos(z1)
      d_z2=c*d_z1
      z2=sin(z1)
      d=z2; e=x(i)
      d_z3=d*d_x(i,:)+e*d_z2
      z3=x(i)*z2
      f=x(i+1); g=z3
      d_x(i+1,:)=f*d_z3+g*d_x(i+1,:)
      x(i+1)=z3*x(i+1)
      h=-sin(z3)
      d_x(i,:)=h*d_z3
      x(i)=cos(z3)
    END DO
    END SUBROUTINE D_EX

The directional derivative components are denoted by the d_ prefix. They are vectors in IR^l as in the forward vector mode of AD [6, Chapter 3]. All derivative propagation
statements must be interpreted as vector operations of the form z˙ = a · x˙ + b · y˙ where z˙ , x˙ , y˙ ∈ IRn and a, b ∈ IR. The above code uses Fortran array operation syntax. Our measure of complexity of the tangent-linear code is the number of scalar floating-point multiplications performed in addition to the arithmetic operations that are required to built the linearized code list. Hence, we count only multiplications that are performed during the propagation of the directional derivatives. It is straight-forward to verify that the corresponding value for our example is equal to 8n(n − 1). The derivative propagation part of a single execution of the loop body performs 8n multiplications, n per local partial derivative a, . . . , h. The number of iterations is n − 1. Preaccumulation techniques are based on the observation that the local partial derivatives can be combined according to the chain rule [7] which may result in a decreased number of potential factors in the directional derivative propagation part of the corresponding tangent-linear code. This transformation is equivalent to a transformation of the linearized computational graph into a bipartite form as shown, for example, in [16]. For our example the preaccumulation of the local Jacobian reduces the number of multiplications in the directional derivative propagation to four2. SUBROUTINE D_EX_BB (n,x,l,d_x) INTEGER :: n,l REAL, DIMENSION(n), INTENT(INOUT) :: x REAL, DIMENSION(n,l), INTENT(INOUT) :: d_x REAL :: z1,z2,z3,a,b,c,d,e,f,g,h,ec,eca,ecb,j11,j12,j21,j22 REAL, DIMENSION(l) :: d_tmp DO i=1,n-1 a=x(i+1); b=x(i) z1=x(i)*x(i+1) c=cos(z1) z2=sin(z1) d=z2; e=x(i) z3=x(i)*z2 f=x(i+1); g=z3 x(i+1)=z3*x(i+1) h=-sin(z3) x(i)=cos(z3) ec=e*c eca=d+ec*a; ecb=ec*b j11=h*eca j12=h*ecb j21=f*eca j22=g+f*ecb 2
² There are two inputs and two outputs, and the local Jacobian is dense. By inspection, we know that x(i) and x(i+1) are always distinct. In general, this information needs to be provided by alias / array section analysis [1].
    d_tmp=j11*d_x(i,:)+j12*d_x(i+1,:)
    d_x(i+1,:)=j21*d_x(i,:)+j22*d_x(i+1,:)
    d_x(i,:)=d_tmp
  END DO
END SUBROUTINE D_EX_BB

The preaccumulation of the Jacobian entries j11, j12, j21, j22 can be performed using vertex elimination techniques at a cost of seven multiplications as shown in [16]. The corresponding code computes the intermediate local partial derivatives ec, eca, and ecb. The following derivative propagation requires 4n multiplications.³ Consequently, the number of multiplications performed by the tangent-linear code that uses preaccumulation is (7 + 4n)(n − 1). For large n this represents an improvement by a factor that is close to two. Moreover, the preaccumulation itself can be done in different ways by exploiting the associativity of the chain rule. The improvements are not very large for small codes. However, they can become more significant if large parts of the program are subject to preaccumulation. An example is discussed in Section 5.2.

The paper is structured as follows: In Section 2 we state the preaccumulation problem as a combinatorial optimization problem on the tangent-linear system of equations. The elimination techniques that are used to solve this system are introduced in Section 3 together with an example. The elimination problem is discussed in the context of simulated annealing in Section 4. Numerical tests are presented in Section 5. The paper closes with conclusions drawn in Section 6.
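The preaccumulated variant can be checked numerically in the same way as the naive one. The Python sketch below (our own transcription; the names `ex` and `d_ex_bb` are ours, not from the paper) implements the preaccumulated loop body for a single direction and validates it against central finite differences of the primal code.

```python
import math

def ex(x):
    # primal example loop
    x = list(x)
    for i in range(len(x) - 1):
        z1 = x[i] * x[i + 1]
        z2 = math.sin(z1)
        z3 = x[i] * z2
        x[i + 1] = z3 * x[i + 1]
        x[i] = math.cos(z3)
    return x

def d_ex_bb(x, dx):
    # preaccumulated tangent-linear version for one direction (l = 1)
    x, dx = list(x), list(dx)
    for i in range(len(x) - 1):
        a, b = x[i + 1], x[i]
        z1 = x[i] * x[i + 1]
        c = math.cos(z1)
        z2 = math.sin(z1)
        d, e = z2, x[i]
        z3 = x[i] * z2
        f, g = x[i + 1], z3
        x[i + 1] = z3 * x[i + 1]
        h = -math.sin(z3)
        x[i] = math.cos(z3)
        # preaccumulate the dense 2x2 local Jacobian (7 multiplications)
        ec = e * c
        eca, ecb = d + ec * a, ec * b
        j11, j12 = h * eca, h * ecb
        j21, j22 = f * eca, g + f * ecb
        # propagate the direction (4 multiplications per iteration)
        d_tmp = j11 * dx[i] + j12 * dx[i + 1]
        dx[i + 1] = j21 * dx[i] + j22 * dx[i + 1]
        dx[i] = d_tmp
    return x, dx

x0 = [0.5, 1.0, 1.5]
direction = [0.0, 1.0, 0.0]
y, dy = d_ex_bb(x0, direction)
eps = 1e-6
fd = [(p - m) / (2 * eps)
      for p, m in zip(ex([a + eps * b for a, b in zip(x0, direction)]),
                      ex([a - eps * b for a, b in zip(x0, direction)]))]
```

Note that the temporary `d_tmp` is needed for exactly the reason given in the footnote: `dx[i]` is both read and overwritten within one iteration.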
2 Problem Description

The optimal preaccumulation of local Jacobians of basic blocks in tangent-linear and adjoint models of numerical simulation programs is a highly desirable feature of modern software tools for automatic differentiation [6]. Hard combinatorial optimization problems must be solved to determine near-optimal elimination sequences on the underlying linearized computational graphs [14] or, equivalently, on the extended Jacobian matrix (see below). Local heuristics have been developed to obtain good approximations to the solutions of these problems [2, 7, 12]. If the resulting derivative code is to be used extensively over a long period of time, then more expensive techniques, such as simulated annealing, can be employed to, possibly, achieve further improvements for crucial parts of the computation, for example, for CFD kernels [17].

We consider numerical simulation programs that implement non-linear vector functions

F : IR^n ⊇ D → IR^m : x ↦ y = F(x) .

The Jacobian matrix of F is denoted by

F′ = F′(x₀) = ( ∂y_i / ∂x_j (x₀) ) ,  i = 1, …, m,  j = 1, …, n .
³ The assignment to d_tmp is required to assure correctness of the derivative code. This is due to x being both input and output of the basic block that represents the loop body.
Uwe Naumann and Peter Gottschling
AD provides a set of techniques for generating derivative code for F such that, for example, F′ can be computed with machine accuracy. The code list of F ensures that each assignment creates a different variable name. A given input x determines the flow of control uniquely, and the corresponding code list becomes a sequence of, in general, non-linear assignments

v_j = φ_j( (v_i)_{i≺j} ) ,   j = 1, …, q,  q = p + m,

where the number of intermediate variables is denoted by p. Following the notation in [6], the set of arguments of φ_j is denoted by {v_i : i ≺ j}, that is, i ≺ j if v_i is an argument of φ_j. In AD the elemental functions φ_j are assumed to be the elementary arithmetic operators and intrinsic functions provided by the programming language that is used to implement F. Furthermore, we set x_i ≡ v_{i−n}, i = 1, …, n, z_k ≡ v_k, k = 1, …, p, and y_j ≡ v_{p+j}, j = 1, …, m. For clarity of the following argument we assume that the dependent variables are mutually independent. Obviously, this must be the case for all independent variables too.

Source transformation tools for AD can be used to generate tangent-linear and adjoint models automatically for a given program for F. As before, a given argument x of F determines the flow of control uniquely. Tangent-linear models associate directional derivatives v̇_i with every code list variable for i = 1 − n, …, q, and they compute

v̇_j = Σ_{i≺j} c_{j,i} · v̇_i ,   j = 1, …, q,

where

c_{j,i} ≡ ∂φ_j / ∂v_i ( (v_k)_{k≺j} ) .

For the purpose of this paper it is sufficient to introduce the formalism for tangent-linear models. Analogous results hold for adjoint models. Both tangent-linear and adjoint models rely on the existence of jointly continuous partial derivatives for all elemental functions φ_j, j = 1, …, q, on open neighborhoods D_j ⊂ IR^|{i : i≺j}| of their respective domains.

A tangent-linear model represents a system of linear equations that can be written in matrix form as

[ ż ]         [ ẋ ]
[   ]  = C ·  [   ]  ,    (1)
[ ẏ ]         [ ż ]

where C ∈ IR^{q×(n+p)} is the extended Jacobian that is defined as

C = (c_{j,i}) ,  j = 1, …, q,  i = 1 − n, …, p,    (2)

with local partial derivatives c_{j,i}. The computation of ẏ = F′ · ẋ can be interpreted as the solution of Equation (1) for ẏ in terms of ẋ. To do so we consider elimination techniques on C. Following the notation in [6] the extended Jacobian C can be partitioned as

     [ B  L ]
C =  [      ]  ,    (3)
     [ R  T ]
where B ∈ IR^{p×n}, L ∈ IR^{p×p}, R ∈ IR^{m×n}, and T ∈ IR^{m×p}. Since the structure of C is induced by a code list, the matrix L = (l_{j,i}), j, i = 1, …, p, must be strictly lower triangular, that is, l_{j,i} = 0 if i ≥ j. Solving Equation (1) for ẏ in terms of ẋ is regarded as the elimination of all non-zero elements in B, L, and T.
3 Elimination Techniques

3.1 Dyadic Pivoting

Dyadic pivoting (DP) is equivalent to vertex elimination in linearized computational graphs as introduced in [7]. In C some j ∈ {1, …, p} is picked, and the outer product of the jth row with the jth column is added to the corresponding submatrix of C that is spanned by the (j+1)th, …, qth rows and by the (1−n)th, …, (j−1)th columns. All elements in both the jth row and the jth column are set to zero. This simultaneous row and column elimination is expressed by the following equation:

C_j = C + C e′_j e_j^T C − e_j e_j^T C − C e′_j e′_j^T ,    (4)
where e_j is the jth Cartesian basis vector in IR^q and e′_j is the (n+j)th Cartesian basis vector in IR^{n+p}. The proof of the correctness of Equation (4) can be found in [7]. It comes as an immediate consequence of the chain rule. The problem of finding an order of the pivots that minimizes the fill-in has been shown to be NP-complete in [9]. The closely related optimal Jacobian accumulation problem (OJA) for vertex elimination in linearized computational graphs is also conjectured to be intractable. If this conjecture is true, then the same applies to the dyadic pivoting problem in extended Jacobians. The two special orderings that pick j according to j = 1, …, p and j = p, …, 1 are essentially equivalent to the sparse forward and reverse modes of AD, respectively, as pointed out in [6]. Logarithmic simulated annealing has been applied successfully to the vertex elimination problem in linearized computational graphs in [16]. The two pivoting techniques to be introduced next represent refinements of dyadic pivoting. Their use in simulated annealing algorithms and the definition of appropriate neighborhood relations are discussed in Section 4.

3.2 Row Pivoting

Row pivoting is equivalent to front edge elimination in linearized computational graphs as introduced in [14]. The name is motivated by the choice of a particular element inside a given row. In C a pair of pivots (j, i), such that j ∈ {1, …, p}, i ∈ {1−n, …, p}, and j > i, is picked, that is, the ith element in the jth row of C, and the product of c_{j,i} and the jth column is added to the ith column of C. The entry c_{j,i} is set to zero. If the elimination of c_{j,i} results in the jth row becoming equal to the zero vector in IR^{n+p}, then the entire jth column is also set to zero. This procedure is expressed by the following equation:

C_{(j,i)} = C + (e_j^T C e′_i) C e′_j e′_i^T − (e_j^T C e′_i) e_j e′_i^T .    (5)
In addition,

C_{(j,i)} = C_{(j,i)} − C_{(j,i)} e′_j e′_j^T   if e_j^T C_{(j,i)} = 0 .
The latter is required to avoid unnecessary floating-point operations during the elimination procedure. Otherwise, the solution of Equation (1) for ẏ would also give a solution for ż in terms of ẋ. Although this might be useful derivative information in some cases, we are not interested in its computation in the current context. The proof of the correctness of Equation (5) is an immediate consequence of a front edge elimination sequence being a special face elimination sequence as described in detail in [15]. The proof is based on the ideas presented in [7].

3.3 Column Pivoting

In column pivoting we choose an element inside a given column. This technique is equivalent to back edge elimination in linearized computational graphs [14]. In C a pair of pivots (i, j), such that i ∈ {1, …, p}, j ∈ {1, …, q}, and j > i, is picked, that is, the jth element in the ith column of C, and the product of c_{j,i} and the ith row is added to the jth row of C. The entry c_{j,i} is set to zero. If the elimination of c_{j,i} results in the ith column becoming equal to the zero vector in IR^q, then the entire ith row is also set to zero. Formally,

C_{(i,j)} = C + (e_j^T C e′_i) e_j e_i^T C − (e_j^T C e′_i) e_j e′_i^T ,    (6)
where the notation is the same as in Section 3.2. Again,

C_{(i,j)} = C_{(i,j)} − e_i e_i^T C_{(i,j)}   if C_{(i,j)} e′_i = 0 .
Remark 1. Dyadic pivoting is a special case of both row and column pivoting. Picking j as a dyadic pivot is equivalent to choosing row pivots (j, i) such that i = 1 − n, …, j − 1. Similarly, the elimination resulting from making j the pivot can be performed by choosing column pivots (j, i) such that i = j + 1, …, q.

Notation 1. Mixed sequences of row and column pivots are denoted by RCP. The extended Jacobian that is obtained by applying some RCP [(i₁, j₁), …, (i_k, j_k)] to C is denoted by C_{[(i₁,j₁),…,(i_k,j_k)]}. An RC-pivot (i, j) is a row pivot if i > j, and it is a column pivot if i < j. The application of some dyadic pivot sequence (DP) [i₁, …, i_k] to C is denoted by C_{[i₁,…,i_k]}. RCP's and DP's can be mixed. For example, C_{[(i,j),i]}, where i < j, is the extended Jacobian that is derived from C by choosing (i, j) as a column pivot followed by the elimination resulting from making i the dyadic pivot.

Remark 2. The elemental arithmetic operations in Equations (4)–(6) are fused multiply-adds (fma) of the form c_{k,i} = c_{k,i} + c_{k,j} c_{j,i}. Our objective is to compute a sequence of pivots that minimizes the number of fma's required to compute the Jacobian starting from a given extended Jacobian. For DP's this problem is equivalent to the optimal vertex elimination problem in linearized computational graphs [7]. When considering RCP's one is solving the optimal edge elimination problem in linearized computational graphs [14].
Simulated Annealing for Optimal Pivot Selection in Jacobian Accumulation
3.4 Example
We consider the extended Jacobian C of the lion graph [14] representing the linearized computational graph of a vector function F : IR2 → IR4 . C is given as
    [ c1,−1   c1,0   0      0    ]
    [ 0       0      c2,1   0    ]
C = [ 0       0      0      c3,2 ]
    [ 0       0      0      c4,2 ]
    [ 0       0      0      c5,2 ]
    [ 0       0      c6,1   c6,2 ]
The Jacobian F′ can be accumulated based on the dyadic pivot sequence [1, 2] by applying Equation (4) to get

          [ 0                     0                    0  0 ]
          [ 0                     0                    0  0 ]
C[1,2] =  [ c3,2 c2,−1            c3,2 c2,0            0  0 ]
          [ c4,2 c2,−1            c4,2 c2,0            0  0 ]
          [ c5,2 c2,−1            c5,2 c2,0            0  0 ]
          [ c6,−1 + c6,2 c2,−1    c6,0 + c6,2 c2,0     0  0 ] ,
where c2,−1 = c2,1 c1,−1, c2,0 = c2,1 c1,0, c6,−1 = c6,1 c1,−1, and c6,0 = c6,1 c1,0. The Jacobian F′ appears as the (4 × 2)-matrix in the lower left corner of C[1,2]. Its accumulation based on the given dyadic pivot sequence requires twelve fma's. Alternatively, F′ can be computed as C[2,1] at exactly the same cost.⁴ Choosing (j, i) = (1, 0) as a row pivot and performing the corresponding elimination transforms C into

           [ c1,−1   0            0      0    ]
           [ 0       c2,1 c1,0    c2,1   0    ]
C[(2,0)] = [ 0       0            0      c3,2 ]
           [ 0       0            0      c4,2 ]
           [ 0       0            0      c5,2 ]
           [ 0       c6,1 c1,0    c6,1   c6,2 ]
⁴ Note that the number of non-zero entries in both C and in C[1,2] is equal to eight. The preaccumulation of this local Jacobian would therefore not lead to a decrease in the number of fma's performed by the directional derivative propagation. Consequently, preaccumulation would not be a good idea if the lion graph represented some basic block in a larger program. The accumulation of the local Jacobian itself is, of course, cheaper when using elimination techniques. Classical AD requires at least 16 fma's to compute F′ as F′ · I₂ in forward mode. The advantages of elimination techniques result from the implicit exploitation of the structural sparsity of the extended Jacobian and the ability to exploit the associativity of the chain rule.
at the cost of two fma’s. With (i, j) = (2, 6) as a column pivot we get c1,−1 c1,0 0 0 0 0 c2,1 0 0 0 0 c3,2 C[(2,6)] = 0 0 0 c4,2 0 0 0 c5,2 0 0 c6,1 + c6,2c2,1 0 by performing one fma. While any of the two possible dyadic pivoting sequences results in an overall cost of twelve fma’s the accumulation of F as C[(2,6),1,2] requires only eleven fma’s. This example was used in [14] to prove the superiority of mixed row and column pivoting over dyadic pivoting. Little is known about the theoretical discrepancy that may result from a comparison of the optimal RCP sequence with the optimal DP sequence. Practical tests suggest that this discrepancy is likely to be small. One of the motivations for the research presented in this paper is the desired ability to compare RCP and DP strategies for various test problems.
4 Simulated Annealing

Simulated annealing was first applied to the computation of optimal dyadic pivot sequences in [13]. In [16] we proposed new ideas on how to make logarithmic cooling schedules for inhomogeneous Markov chains work on this problem. As in [16], neighborhood relations on the configuration space V are defined by rearrangements transforming a sequence of pivots σ ∈ V into σ′ ∈ N_σ ⊆ V, where N_σ denotes the neighborhood of σ. These transitions are denoted by [σ, σ′]. Acceptance probabilities are associated with all feasible transitions [σ, σ′]. They are defined by

A[σ, σ′] = { 1,                         if C(σ′) ≤ C(σ),
           { e^{−(C(σ′) − C(σ))/T},     otherwise,          (7)

where C(σ) denotes the cost, that is, the number of fma's, associated with the sequence of pivots σ. The control parameter T can be interpreted as the temperature in the annealing process [10, 11]. We consider two neighborhoods, one for DP and one for RCP.

Neighborhood 1. Let [j₁, …, j_k] be a DP. Do one of the following, each with probability 0.5:

1. Choose a feasible j_{k′} and remove it from the sequence (backward step). Feasibility of j_{k′} with respect to this backward step is guaranteed if [j_{k′+1}, …, j_k] can be applied to C_{[j₁,…,j_{k′−1}]}.
2. Select the next dyadic pivot j_{k+1} in C_{[j₁,…,j_k]} at random (forward step).

Formally,

[j₁, …, j_k] → [j₁, …, j_{k′−1}, j_{k′+1}, …, j_k]   or   [j₁, …, j_k, j_{k+1}] .

The neighborhood for RCP is defined analogously.
Fig. 1. Evolution of Cost function for Lion Graph
Neighborhood 2.

[(i₁, j₁), …, (i_k, j_k)] → [(i₁, j₁), …, (i_{k′−1}, j_{k′−1}), (i_{k′+1}, j_{k′+1}), …, (i_k, j_k)]   or   [(i₁, j₁), …, (i_k, j_k), (i_{k+1}, j_{k+1})] .

The cooling schedule is defined by

T = T(k) = Γ / ln(k + 2) ,   for k = 0, 1, …, n̄ ,
in Equation (7) for a maximal number of n̄ cooling steps. The choice of Γ is based on experimental results. It is discussed in Section 5.2. For Neighborhood 1 the conditions for asymptotic convergence of the logarithmic simulated annealing algorithm [8] can be proven by taking an approach similar to the one in [16]. For RCP the necessary polynomiality of the computation of the objective function could not be verified so far. Restrictions on the feasibility of RC-pivot choices that preserve overall optimality are the subject of ongoing research. In practice, the design of the simulated annealing algorithm based on the logarithmic cooling schedule makes the occurrence of exponentially long sequences of pivots highly unlikely.
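The acceptance rule of Equation (7) and the logarithmic cooling schedule are easy to state in code. The sketch below (our own illustration; the toy integer cost function merely stands in for the fma count C(σ), and all function names are ours) applies them to a generic neighborhood sampler.

```python
import math
import random

def temperature(k, gamma):
    # logarithmic cooling schedule T(k) = Gamma / ln(k + 2)
    return gamma / math.log(k + 2)

def accept(cost_old, cost_new, t, rng):
    # acceptance rule of Equation (7)
    if cost_new <= cost_old:
        return True
    return rng.random() < math.exp(-(cost_new - cost_old) / t)

def anneal(cost, neighbor, sigma0, gamma, steps, seed=0):
    # simulated annealing with logarithmic cooling over a generic
    # configuration space; `neighbor` draws a random neighbor of sigma
    rng = random.Random(seed)
    sigma = best = sigma0
    for k in range(steps):
        candidate = neighbor(sigma, rng)
        if accept(cost(sigma), cost(candidate), temperature(k, gamma), rng):
            sigma = candidate
        if cost(sigma) < cost(best):
            best = sigma
    return best

# toy stand-in for the pivot-sequence cost: minimize (x - 7)^2 over integers
best = anneal(cost=lambda x: (x - 7) ** 2,
              neighbor=lambda x, rng: x + rng.choice([-1, 1]),
              sigma0=0, gamma=2.0, steps=5000)
```

In the actual algorithm, σ is a pivot sequence, the neighbor sampler realizes Neighborhood 1 or 2, and C(σ) is the number of fma's of the resulting elimination.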
5 Computational Experiments

The test results presented in this section were obtained using our template library ANGEL (angellib.sourceforge.net). It provides various strategies for computing DP's and RCP's, including heuristics and several simulated annealing algorithms.

5.1 Lion

The lion graph is used in [14] to prove the superiority of RC-pivoting over dyadic pivoting. As shown in Section 3.4, both sequences of dyadic pivots result in an overall cost of twelve fma's. The use of RCP's can reduce this cost to eleven fma's. Although being an
academic example, the lion graph and its extended Jacobian represent an obvious test case for methods that claim to compute Jacobians at an optimal or near-optimal cost. So far, no technique besides exhaustive search is known that automatically finds an optimal RC-pivot sequence. Moreover, there is little experience with algorithms for computing near-optimal RC-pivot sequences. The extension of Markowitz-type heuristics for DP to RCP in [2] resulted in an observable superiority of RCP over DP in only a few examples. To find an optimal pivot sequence, simulated annealing combined with a logarithmic cooling schedule (Γ = 5) is applied. Figure 1 shows the development of the objective function. Optimal RCP's are found repeatedly even in the high temperature phase. Obviously, straightforward random search would probably have found the minimum too. Nevertheless, any method for optimizing RCP sequences must be able to solve the lion problem as a basic feasibility test.

5.2 Roe's Flux

In this application from computational fluid dynamics, fluxes between two cells of a finite-volume flow solver are calculated using Roe's flux difference scheme [17]. The extended Jacobian is derived from the computational graph that was generated by the software tool EliAD [18]. The best known sequence of pivots so far results in an overall cost of 962 fma's for accumulating the (10 × 5)-Jacobian.

Table 1. Values of the objective function for different Γ (simulated annealing – SA) and T (fixed-temperature Metropolis – FT)

T/Γ    FT DP   FT RCP   SA DP   SA RCP
0.1     904     885      899     881
0.2     904     887      904     887
0.5     908     887      908     887
1       894     886      879     885
2       897     984      880     878
5       936    1204      888     882
10     1039    1258      887     902
20     1148    1286      912     993
Table 1 compares the values of the objective function for a Metropolis algorithm with fixed temperature (value in left column should be interpreted as T ) and simulated annealing (value in left column should be interpreted as Γ). Both algorithms were applied to DP and RCP, and 10,000 iterations were performed. Both algorithms are strongly influenced by the choice of Γ/T . Low values reduce the ability to escape from local minima. High values lead to a random-search-like behavior. Values for Γ/T in the range between 1 and 2 yield the lowest costs. The evolution of the objective function over 20,000 iterations is displayed in Figure 2. Further asymptotic improvements can be observed for the logarithmic cooling
schedule. DP's converge faster initially, and their cost varies more strongly during the runtime of the algorithm. This is the result of one dyadic pivot constituting multiple row or column pivots. After 1,000,000 iterations of our simulated annealing algorithm with logarithmic cooling (Γ = 5), the best RCP was able to compute the Jacobian using 861 fma's, which represents an improvement by over ten per cent.

Fig. 2. Evolution of Cost Function for Roe's flux
6 Conclusion

The use of RCP in simulated annealing results in a considerably widened search space compared to DP. Hence, even for graphs where the optimal RCP has a lower cost than the optimal DP, it is not clear whether running a simulated annealing algorithm for RCP improves on the best elimination sequence known so far. Furthermore, switching from DP to RCP may increase the runtime of the simulated annealing algorithm significantly. Nevertheless, the use of such algorithms is justified if the corresponding Jacobian code is likely to account for a large portion of the computational effort of a given numerical algorithm. Moreover, the generation of optimal Jacobian code is expected to be a trade-off between a low number of arithmetic operations and the efficiency of the corresponding memory access pattern. Further investigations are required to build optimal Jacobian code in general.

From the theoretical point of view, the difficulty of reducing accumulation costs by switching from DP to RCP can be interpreted as an indication that the cost discrepancy between optimal DP and optimal RCP is not very large. This conjecture is supported by our numerical results.
Acknowledgments

This work was supported by the U.K. Engineering and Physical Sciences Research Council under Grant GR/R38101/01. This work was also supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38.
References

1. A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
2. A. Albrecht, P. Gottschling, and U. Naumann. Markowitz-type heuristics for computing Jacobian matrices efficiently. In Proceedings of the International Conference on Computational Science. Springer, LNCS, 2003. To appear.
3. M. Berz, C. Bischof, G. Corliss, and A. Griewank, editors. Computational Differentiation: Techniques, Applications, and Tools. Proceedings Series, SIAM, Philadelphia, 1996.
4. G. Corliss, C. Faure, A. Griewank, L. Hascoët, and U. Naumann, editors. Automatic Differentiation of Algorithms – From Simulation to Optimization. Springer, New York, 2002.
5. G. Corliss and A. Griewank, editors. Automatic Differentiation: Theory, Implementation, and Application. Proceedings Series, SIAM, Philadelphia, 1991.
6. A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Number 19 in Frontiers in Applied Mathematics. SIAM, Philadelphia, 2000.
7. A. Griewank and S. Reese. On the calculation of Jacobian matrices by the Markovitz rule. In [5], pages 126–135, 1991.
8. B. Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13:311–329, 1988.
9. K. Herley. A note on the NP-completeness of optimum Jacobian accumulation by vertex elimination. Presentation at: Theory Institute on Combinatorial Challenges in Computational Differentiation, 1993.
10. P. Van Laarhoven and E. Aarts. Simulated Annealing: Theory and Applications. Reidel, 1988.
11. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092, 1953.
12. U. Naumann. An enhanced Markowitz rule for accumulating Jacobians efficiently. In K. Mikula, editor, ALGORITHMY'2000 Conference on Scientific Computing, pages 320–329. Slovak University of Technology, Bratislava, Slovakia, September 2000.
13. U. Naumann. Cheaper Jacobians by simulated annealing. SIAM J. Opt., 13(3):660–674, 2002.
14. U. Naumann. Elimination techniques for cheap Jacobians. In [4], pages 247–253, 2002.
15. U. Naumann. Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph. Mathematical Programming, 2003. To appear.
16. U. Naumann and P. Gottschling. Prospects for simulated annealing in automatic differentiation. In K. Steinhöfel, editor, SAGA 2001 – Stochastic Algorithms: Foundations and Applications, volume 2264 of LNCS. Springer, Berlin, 2001.
17. P.L. Roe. Approximate Riemann solvers, parameter vectors, and difference schemes. Journal of Computational Physics, 43:357–372, 1981.
18. M. Tadjouddine, S. Forth, J. Pryce, and J. Reid. Performance issues for vertex elimination methods in computing Jacobians using automatic differentiation. In Proceedings of the ICCS 2002 Conference, volume 2330 of Springer LNCS, pages 1077–1086, 2002.
Quantum Data Compression

John A. Vaccaro¹, Yasuyoshi Mitsumori²,³, Stephen M. Barnett⁴, Erika Andersson⁴, Atsushi Hasegawa²,³, Masahiro Takeoka²,³, and Masahide Sasaki²,³

¹ Quantum Physics Group, STRC, University of Hertfordshire, College Lane, Hatfield AL10 9AB, UK
² Communications Research Laboratory, Koganei, 4-2-1 Nukuikita, Koganei, Tokyo 184-8795, Japan
³ CREST, Japan Science and Technology Corporation, 3-13-3 Shibuya, Tokyo 150-0002, Japan
⁴ Department of Physics and Applied Physics, University of Strathclyde, Glasgow G4 0NG, Scotland
Abstract. The last two decades have witnessed the emergence of a new paradigm in information theory based on quantum theory. We review a fundamental element of quantum information theory, source coding, which entails the compression of quantum data. We also briefly outline a recent experiment that demonstrates this fundamental principle.

Keywords: Quantum information theory, quantum data compression, quantum source coding, linear optics.
1 Introduction

A new paradigm in information theory based on the principles of quantum theory has emerged in the last two decades. The most widely known goal of this new research area is to produce a quantum computer. This is a device which has been shown to have the ability, in principle, to solve some difficult problems efficiently [1]. A great deal of effort is underway to build the fundamental gates (quantum gates) needed for its operation and to develop the theory. Already there have been significant milestones, such as the factoring of the number 15 by a quantum computer based on nuclear magnetic resonance [2] and a quantum gate between two ions in an ion trap [3], to name but two. This new theory has also had an impact on data security, where techniques have been developed for the secure distribution of a random key (quantum key distribution) [4]. The security is guaranteed, essentially, by the physical impossibility of cloning quantum information, and so eavesdroppers can be detected, in principle.

The efficient storage and transmission of information lies right at the heart of information theory. Indeed, the information content of a message and the minimum requirements to represent that information are central to Shannon's seminal paper [5] and form his noiseless coding theorem. A few years ago Schumacher [6] and Jozsa [7] developed the quantum version of Shannon's theorem. In this paper we briefly review quantum noiseless coding and describe some possible experimental implementations as well as some recent experimental results. We begin in Section 2 with a review of some of the basic principles of quantum information theory applicable here, and we discuss

A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 98–107, 2003.
© Springer-Verlag Berlin Heidelberg 2003
the Schumacher quantum noiseless coding theorem and coding schemes in Section 3. In Section 4 we discuss experimental implementations and conclude in Section 5.
2 Review of Essentials of Quantum Information Theory

Qubits. In classical information theory, each letter in a binary message can be a 0 or a 1. The defining property of classical messages is that a letter can only be one of these numbers. Quantum messages, in contrast, are not so restricted. We now review how a quantum memory element, the qubit, can be in a state representing a superposition of 0 and 1.

One way to appreciate the meaning of a superposition is to consider a physical system in this state. There are many examples available, but for our purposes the state of a photon is the most useful. A photon is a particle of light. The granularity of light is not familiar from our daily experience due to the high rate of photons arriving in our eyes each second under normal circumstances. Nevertheless, devices called photodetectors can detect individual photons. Photodetectors can be operated in a way to give a voltage spike (or an audible "click" by a speaker) whenever a photon is detected.

The devices we consider for manipulating photons consist of mirrors, beam splitters, and beam combiners. A typical mirror consists of a reflective coating (usually a metal layer) on a glass slide. A perfectly reflecting mirror has a sufficiently thick coating to reflect all the light incident on it. In the absence of a coating, i.e. just the glass, none (or a negligible amount) of the light is reflected; the light is simply transmitted through the glass. In the intermediate regime, a coating can be made sufficiently thin that, for example, 50% of an intense beam of light is reflected and 50% is transmitted. Such a semi-silvered mirror is called a 50-50 beam splitter, since it splits the beam equally into two. Imagine a single photon directed at a 50-50 beam splitter. To determine which path the photon takes, we can place a photodetector in each of the reflected and the transmitted paths as shown in Fig. 1a.
The photon will be found to be either transmitted or reflected with equal probability. One, and only one, of the photodetectors will "click". We cannot predict a priori which path the photon will take, and each photon encountering the beam splitter does so independently of all other photons. We label the reflected path as a logical 0 and the transmitted path as a logical 1. This simple demonstration illustrates how a photon can represent a bit of information.
Fig. 1. A single photon representing 1 qubit. (a) A photodetector in each of the reflected (0) and transmitted (1) paths of a beam splitter. (b) The two paths recombined at a beam combiner using mirrors, with photodetectors at the outputs.
On the other hand, we can recombine the reflected and transmitted beams using mirrors and a further semi-silvered mirror (a beam combiner) as shown in Fig. 1b. Light has a wave-like property in addition to its granularity. In fact, light is electromagnetic radiation, and the waves are due to the oscillations in the electric and magnetic fields. The crucial feature deciding how the two beams recombine is given by the relative position of the crests and troughs of the waves. If the lengths of the two possible paths 0 and 1 are such that the crests of both waves arrive simultaneously at the beam combiner, the photons will emerge in the downwards direction as shown in the figure. Alternatively, if the crests of one wave coincide with the troughs of the other at the beam combiner, the photons will emerge in the upwards direction indicated by the dotted line in the figure. By slightly changing the relative path lengths, the photons can be switched from the downwards to the upwards path. This implies that when the photons arrive at the beam combiner, they carry information about both path lengths. This phenomenon occurs for single photons, indicating that a single photon exists simultaneously in both paths. We are driven to conclude that, when the photon is between the beam splitter and beam combiner, the photon represents both logical values 0 and 1 simultaneously. This state of the photon is called a superposition state. A bit with this added quantum feature is called a quantum bit, or qubit.

The crucial point to be made here is that if we measure (i.e. examine) the path a photon is travelling along, we get a definite (but stochastic) answer pertaining to a single path. In other words, if we ask which logical value is stored in a qubit, we find a 0 or 1. If we don't ask which path, but allow the photon to proceed and undergo some manipulation (e.g. be recombined), we find evidence that the photon has represented both logical values 0 and 1 simultaneously.
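The beam splitter/combiner arrangement of Fig. 1b can be modeled numerically. In the sketch below (our own illustration, not from the paper) the 50-50 beam splitter is represented by a real Hadamard-like matrix; the standard optical convention attaches complex phases to reflection, which we omit here for simplicity.

```python
import math

BS = [[1 / math.sqrt(2),  1 / math.sqrt(2)],
      [1 / math.sqrt(2), -1 / math.sqrt(2)]]   # simplified 50-50 beam splitter

def apply(m, state):
    # multiply a 2x2 matrix into a two-path amplitude vector
    return [m[0][0] * state[0] + m[0][1] * state[1],
            m[1][0] * state[0] + m[1][1] * state[1]]

photon = [1.0, 0.0]            # photon enters along path 0
inside = apply(BS, photon)     # between splitter and combiner: superposition
out = apply(BS, inside)        # recombined at the beam combiner

p_inside = [a * a for a in inside]   # 50/50 if we measure the path inside
p_out = [a * a for a in out]         # interference: one output port only
```

Measuring between the optics gives each path with probability 0.5, yet after recombination the photon exits one port with certainty: the interference described in the text.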
We mention in passing that it is this basic superposition property of qubits that gives rise to the enormous potential of quantum computers. In contrast to an n-bit memory, which can store one of 2^n numbers, an n-qubit device can represent all 2^n numbers simultaneously. Moreover, it is possible to perform operations on the 2^n numbers in parallel. However, a full discussion of these ideas is beyond the scope of this work.

Formal Framework. The logical elements 0 and 1 are represented in quantum information theory by the symbols |0⟩ and |1⟩ for two orthonormal vectors. A superposition state is represented as

|ψ⟩ = u|0⟩ + v|1⟩    (1)
The coefficients u and v can be complex. Fig. 2a illustrates the state |ψ⟩ for real coefficients. In the photon example, the complex arguments represent phase changes (or time delays) of the associated optical waves. The set of all superposition states forms a complex vector space. Indeed, the representation of |ψ⟩ as a column vector is simply ψ⃗ = (u, v)^T. An inner product ⟨ψ|φ⟩ on this space is defined for two arbitrary vectors |ψ⟩ and |φ⟩ as follows. First we note that, since the states |0⟩ and |1⟩ are orthonormal, their inner products are ⟨0|0⟩ = ⟨1|1⟩ = 1 and ⟨0|1⟩ = ⟨1|0⟩ = 0.
Quantum Data Compression
If |φ⟩ = a|0⟩ + b|1⟩ and |ψ⟩ is given by Eq. (1), then the inner product between |ψ⟩ and |φ⟩ is defined to be ⟨ψ|φ⟩ = u*a + v*b, where * indicates complex conjugation. In the column vector representation, ⟨ψ|φ⟩ ≡ ψ⃗* · φ⃗, where φ⃗ = (a, b)^T and ψ⃗* = (u*, v*)^T.
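As a quick concrete check, the inner-product rule can be exercised in plain Python complex arithmetic. This is a numerical sketch, not part of the paper; the coefficients below are illustrative choices.

```python
# <psi|phi> = u* a + v* b : the qubit inner product, with the bra's
# coefficients conjugated (illustrative sketch, plain Python complex math).
def inner_product(psi, phi):
    """psi, phi are (coeff_of_|0>, coeff_of_|1>) pairs."""
    return sum(p.conjugate() * q for p, q in zip(psi, phi))

psi = (0.6, 0.8j)                 # |psi> = 0.6|0> + 0.8i|1>, unit norm
phi = (2 ** -0.5, 2 ** -0.5)      # |phi> = (|0> + |1>)/sqrt(2)

print(inner_product(psi, phi))    # equals u* a + v* b
assert abs(inner_product(psi, psi) - 1.0) < 1e-12   # <psi|psi> = 1
```

With NumPy the same check is `np.vdot(psi, phi)`, since `vdot` conjugates its first argument.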
Fig. 2. A vectorial representation of the state vectors |ψ⟩ = u|0⟩ + v|1⟩ and |L±⟩ = α|0⟩ ± β|1⟩.
It is usual to insist that the sum of the squares of the moduli of the coefficients is equal to unity, |u|² + |v|² = 1; that is, all state vectors have a unit norm. This restricts the set of vectors to a subset of the vector space. A measurement to determine which logical value is stored in the qubit yields the results 0 and 1 with probabilities P(0) = |⟨0|ψ⟩|² = |u|² and P(1) = |⟨1|ψ⟩|² = |v|², respectively. The operations allowed by quantum theory are those which take a state vector of unit norm into another state vector of unit norm. This brief review of the formal framework is sufficient for our purposes of describing quantum source coding. The reader is referred to standard texts on the subject for more details [8].
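The measurement rule P(0) = |u|², P(1) = |v|² can be simulated directly. The following sketch (with illustrative coefficients, not taken from the text) samples outcomes and recovers the probability empirically:

```python
import random

# Simulating qubit measurement: u|0> + v|1> yields outcome 0 with
# probability |u|^2 and outcome 1 with probability |v|^2.
u, v = 0.6, 0.8j                                      # |u|^2 = 0.36, |v|^2 = 0.64
assert abs(abs(u) ** 2 + abs(v) ** 2 - 1.0) < 1e-12  # unit norm

random.seed(1)
n_shots = 100_000
zeros = sum(random.random() < abs(u) ** 2 for _ in range(n_shots))
print(zeros / n_shots)   # close to P(0) = 0.36
```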
3 Quantum Source Coding

Classical and Quantum Messages. Messages comprise a sequence of letters L1, L2, L3, ···. Each classical letter Ln belongs to an alphabet (or set) of N letters, Ln ∈ {a, b, c, ···}. The task of source coding is to represent the message with the shortest sequence of symbols, such as the binary digits 0 and 1. Essentially, common letters are represented as short sequences of symbols and uncommon ones as longer sequences. Shannon's noiseless coding theorem shows that the length of the shortest coded message is bounded below by MH bits, where M is the number of letters in the message and H is the Shannon entropy, given by

H = − ∑n P(n) log₂ P(n)
for a source that produces messages with the letter probabilities P(a), P(b), P(c), ···. H is the average information content per letter. It takes its maximum value of log₂ N when all letters are equally likely, in which case no compression is possible. For example, an alphabet of N = 256 equally-likely letters would require 8 bits, or 1 byte, per letter to encode; this is equivalent to the standard ASCII coding. Quantum coding [6] applies to quantum messages comprising a sequence of quantum letter states |Ln⟩ belonging to an alphabet {|L1⟩, |L2⟩, |L3⟩, ···} with corresponding probabilities P1, P2, P3, ···. If the letter states form an orthogonal set, the analysis of the quantum message is precisely the same as for a classical message. In particular, if the letter states are equally likely, P1 = P2 = P3 = ···, no compression is possible. New quantum features arise, however, when the letter states are not orthogonal. Significantly, compression is possible even in the case of equally-likely letter states. We restrict our discussion to qubit letters, |Ln⟩ = αn|0⟩ + βn|1⟩, for brevity. The state of each letter can be represented as a matrix, called a density matrix, in the following way:

|Ln⟩⟨Ln| = |αn|² |0⟩⟨0| + αn βn* |0⟩⟨1| + αn* βn |1⟩⟨0| + |βn|² |1⟩⟨1|

or, using the 2 × 2 matrix representation, the matrix Rn = (αn, βn)^T (αn*, βn*). The average letter state is given by the weighted sum of the density matrices of the letters:

ρ̂ = ∑n Pn |Ln⟩⟨Ln|
= ∑n Pn |αn|² |0⟩⟨0| + ∑n Pn αn βn* |0⟩⟨1| + ∑n Pn αn* βn |1⟩⟨0| + ∑n Pn |βn|² |1⟩⟨1|
In terms of 2 × 2 matrices, the average letter state is given by the positive semi-definite matrix Rav = ∑n Pn Rn. The von Neumann entropy S(ρ̂) of the density operator ρ̂ represents a measure of the amount of statistical uncertainty in a quantum state. It is given by

S(ρ̂) = − ∑i λi log₂ λi
where λi are the (positive) eigenvalues of the matrix Rav. The quantum noiseless coding theorem [6, 7] implies that, by coding the quantum message in blocks of K letters, KS(ρ̂) qubits are enough to encode each block in the limit K → ∞. Jozsa and Schumacher considered letter states of the form [7]:

|L±⟩ = α|0⟩ + β±|1⟩    (2)

where β± = ±β and α² + β² = 1. For clarity we assume α and β are real numbers. These states are illustrated in Fig. 2b. Let the letter states occur with equal likelihood, so that the density operator representing the average letter state is ρ̂ = α² |0⟩⟨0| + β² |1⟩⟨1|. The corresponding von Neumann entropy is S(ρ̂) = −α² log₂ α² − β² log₂ β². If the letter states are orthogonal, ⟨L−|L+⟩ = α² − β² = 0, which gives α² = β² = 1/2. In this case
S(ρ̂) = 1, and so 1 qubit is needed to encode each letter faithfully. Clearly a message of equally-likely, orthogonal letter states cannot be compressed to a smaller code. However, the von Neumann entropy of ρ̂ is 0.4690 bits for the case α² = 0.9 [7]. According to the quantum noiseless coding theorem, in the limit of large block sizes Alice needs approximately 1/2 qubit per letter state to faithfully transmit the message to Bob. If the coding is restricted to finite-length blocks, the encoding and subsequent decoding will introduce errors. The exercise then is to consider the fidelity of the coding-decoding operation.

2 Qubit Blocks. Let us examine the compression of a block of 2 qubits. In an analogous manner to the possible numbers, 00, 01, 10, 11, able to be stored in 2 bits, the basis states of a 2-qubit system are |00⟩, |01⟩, |10⟩ and |11⟩. Here we have written the tensor (outer) product of the states of two qubits, |n⟩ ⊗ |m⟩, as |nm⟩. Thus the 2-letter states can be written as

|Ls1⟩ ⊗ |Ls2⟩ = (α|0⟩ + βs1|1⟩) ⊗ (α|0⟩ + βs2|1⟩) = α²|00⟩ + αβs2|01⟩ + αβs1|10⟩ + βs1βs2|11⟩    (3)
where sn ∈ {+, −}. For β < α, the most likely 3-dimensional subspace Λ2, spanned by the eigenvectors belonging to the largest 3 eigenvalues, can be shown to be the subspace spanned by |00⟩, |01⟩ and |10⟩. The 2-qubit block can be encoded approximately onto Λ2 as follows. The procedure which gives the highest fidelity is to perform a measurement on the 2-qubit system to see if the state lies in Λ2 or in the complementary subspace Λ̄2 spanned by the remaining basis states, which in this case is just |11⟩. The probabilities for these two results are PΛ2 = 1 − β⁴ and PΛ̄2 = β⁴, respectively. If the state is found to lie in Λ̄2, the state |00⟩ is used as the coded block state; otherwise the (normalized) letter state after projection onto Λ2,

|C2⟩ = (1/√(1 − β⁴)) (α²|00⟩ + αβs2|01⟩ + αβs1|10⟩),

is used. The prefactor 1/√(1 − β⁴) in the last expression is required for a unit norm. The coded block state lies in a 3-dimensional space, which is effectively a log₂ 3 ≈ 1.58 qubit system. The decoding for this scheme is simple: no action is taken. The component of the 2-qubit letter state |Ls1⟩ ⊗ |Ls2⟩ in Λ̄2 cannot be recovered, and this incurs errors. The fidelity of the coding-decoding operation is given by the average of the square of the inner products of the original 2-qubit letter state, Eq. (3), with all possible decoded states. Formally the expression is

F2 = (1/4) ∑s1,s2 [ PΛ2 |(⟨Ls1| ⊗ ⟨Ls2|) |C2⟩|² + PΛ̄2 |(⟨Ls1| ⊗ ⟨Ls2|) |00⟩|² ].

We find the value F2 = 1 − β⁴ = 0.99 for α² = 0.90. This scheme uses 0.79 qubits per letter state.

3 Qubit Blocks. We now consider blocks of 3 letters. This is the case considered by Jozsa and Schumacher [7], and we review it briefly.
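The figures quoted for α² = 0.9 (the entropy of 0.4690, the fidelity F2 = 1 − β⁴ = 0.99, and the rate log₂ 3 / 2 ≈ 0.79 qubits per letter) can be reproduced in a few lines. This is a numerical check, not the authors' code:

```python
from math import log2

alpha2 = 0.9                      # the running example: alpha^2 = 0.9
beta2 = 1 - alpha2

# Von Neumann entropy of rho = alpha^2 |0><0| + beta^2 |1><1| (the matrix
# is diagonal, so its eigenvalues are simply alpha^2 and beta^2):
S = -alpha2 * log2(alpha2) - beta2 * log2(beta2)

F2 = 1 - beta2 ** 2               # fidelity 1 - beta^4 of the 2-qubit scheme
rate = log2(3) / 2                # qubits per letter: a 3-dim code for 2 letters

print(round(S, 4), round(F2, 2), round(rate, 2))   # 0.469 0.99 0.79
```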
The 3-letter states can be written as

|Ls1⟩ ⊗ |Ls2⟩ ⊗ |Ls3⟩ = α³|000⟩ + α²βs3|001⟩ + α²βs2|010⟩ + αβs2βs3|011⟩ + α²βs1|100⟩ + αβs1βs3|101⟩ + αβs1βs2|110⟩ + βs1βs2βs3|111⟩    (4)
In this case the 3-qubit block state can be encoded, approximately, onto the state of 2 qubits in the following way. First we identify the most likely 4-dimensional subspace Λ3; this is spanned by |000⟩, |001⟩, |010⟩ and |100⟩. The encoding is carried out by performing a measurement to determine if the block state, Eq. (4), lies in Λ3 or in the complementary subspace Λ̄3 spanned by the remaining basis states |011⟩, |101⟩, |110⟩ and |111⟩. If it is found in the former subspace, the (normalized) state after the measurement is

|C3⟩ = (1/√(1 + 2β²)) (α|000⟩ + βs3|001⟩ + βs2|010⟩ + βs1|100⟩).

This state is transformed into a 2-qubit state using the unitary transform Û which maps |100⟩ → |011⟩, |011⟩ → |100⟩ and leaves all other basis states unchanged. Ignoring the first qubit (which is in the state |0⟩) then leaves the remaining 2-qubit system in the state

(1/√(1 + 2β²)) (α|00⟩ + βs3|01⟩ + βs2|10⟩ + βs1|11⟩).

If the measurement yields the result that the block state, Eq. (4), lies in the subspace Λ̄3, the state |00⟩ is used as the 2-qubit coded block state. The probabilities for the outcomes of the measurement are PΛ3 = α⁴(1 + 2β²) and PΛ̄3 = 1 − PΛ3 for the subspaces Λ3 and Λ̄3, respectively. The decoding is performed by preparing a new qubit in the state |0⟩ and performing the inverse unitary operation Û⁻¹ on the system of 3 qubits (i.e. on the new qubit and the 2-qubit coded block). The result is the state |C3⟩ with probability PΛ3 and the state |000⟩ with probability PΛ̄3. The fidelity of the decoded block is found to be F3 = α⁸(1 + 2β²)² + α⁶β⁴(1 + 2α²) = 0.97 for α² = 0.90. This scheme uses 0.67 qubits per letter state, and so it is slightly more efficient than the 2-qubit scheme, although it also has lower fidelity. This method can be taken further by encoding blocks of K letters for larger values of K.
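The 3-qubit numbers follow from direct evaluation of the formulas above (again a check of the quoted values, not a simulation of the scheme):

```python
alpha2, beta2 = 0.9, 0.1    # alpha^2 and beta^2 from the running example

P_L3 = alpha2 ** 2 * (1 + 2 * beta2)                  # alpha^4 (1 + 2 beta^2)
F3 = (alpha2 ** 4 * (1 + 2 * beta2) ** 2              # alpha^8 (1 + 2 beta^2)^2
      + alpha2 ** 3 * beta2 ** 2 * (1 + 2 * alpha2))  # + alpha^6 beta^4 (1 + 2 alpha^2)

print(round(P_L3, 3), round(F3, 2), round(2 / 3, 2))  # 0.972 0.97 0.67
```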
4 Physical Implementations

We now illustrate how the previous schemes could be implemented using a linear optical scheme, that is, using single photons, beam splitters (and combiners) and mirrors. As shown in section 2, a qubit will be represented by a single photon travelling in a superposition of two possible paths. To produce a qubit with the coefficients α and β, we use beam splitters for which the transmission and reflection probabilities are |α|² and |β|² (or vice versa, depending on their position in the optical circuit). The sign of the coefficient β− = −β is produced using a piece of glass to delay the wave in the 1 path by half a wavelength, as shown in Fig. 3a.
Fig. 3. Implementing (a) a single letter state and (b) the 2 qubit scheme.
2 Qubit Scheme. A 2-qubit system requires the photon to be in a superposition of 4 paths, as suggested by Eq. (3). This is produced by splitting the 0 and 1 paths into a further two paths each and using glass delays as appropriate, as shown in Fig. 3b. The compression of the 4-state system into a 3-state system is performed using a photodetector to determine the presence or absence of a photon in path 11. This corresponds to the measurement to see if the block state is in the subspace Λ2 or Λ̄2. The detection of a photon indicates the latter case, and an extra photon is then switched into path 00 to represent the coding of the state |00⟩, as indicated by the dotted line in Fig. 3b. The decoding is produced simply by making available an empty path representing the state |11⟩.

3 Qubit Scheme. The 3-qubit block can be produced by further splitting each of the 4 paths into two. This allows a single photon to be in a superposition of 8 paths. The unitary transformation Û is produced simply by interchanging the two paths 100 and 011, as shown in Fig. 4. The projection onto Λ3 or Λ̄3 is produced using the 4 photodetectors. If one of the photodetectors records a photon (indicating that the block state lies in Λ̄3), an extra photon is switched into the optical path 00. The encoded state lies in a 4-dimensional space, which is the state-space of a 2-qubit system. The decoding is carried out in a reverse manner: the inverse unitary operation Û⁻¹ is again produced by interchanging two paths, and 4 (dark) paths are added to reconstruct an 8-path system. We have recently performed an experimental implementation of the 3-qubit scheme that, while different from the one described here, is based on essentially the same principles. The experiment makes use of an extra degree of freedom available to photons that we have ignored up to now. The electromagnetic waves comprising the photons oscillate transversely to the direction of travel.
The orientation of the oscillations is called the polarization of the photon and gives rise to an intrinsic polarization qubit. Thus a 3-qubit system can be constructed from 2 path qubits (i.e. 4 paths) and 1 polarization qubit (2 orthogonal polarization directions). Details can be found in [9]. The extension of our 2- and 3-qubit optical schemes to blocks of larger K is straightforward, requiring 2^K optical paths. However, the scheme quickly becomes experimentally difficult to perform owing to the degree of control required over the many optical paths.
Fig. 4. Implementing the 3 qubit scheme.
Rather than using a single photon to represent a multiple-qubit block, a more scalable system comprises one photon for each qubit. However, the various operations needed for coding and decoding then require nonlinear interactions between the photons. Such nonlinear interactions between individual photons have not yet been produced experimentally, although there are proposals for engineering them using available technology.
5 Conclusion

The emergence of quantum information theory presents many exciting challenges, both experimental and theoretical. It opens up a new paradigm in information theory, forcing us to completely review what we understand to be information. Source coding is an essential element of information theory, for which it is known that a classical message of equally-likely letters is not compressible. In contrast, the Schumacher quantum noiseless source coding theorem shows that a quantum message of equally-likely but non-orthogonal letter states is indeed compressible. We have reviewed this issue beginning with an intuitive description of qubits, and we have given examples of experimental implementations using single photons.

This work was supported by the British Council, the Royal Society of Edinburgh, the Scottish Executive Education and Lifelong Learning Department and the EU Marie Curie Fellowship program.
References

1. P.W. Shor, SIAM Journal on Computing, 26, 1484-1509 (1997).
2. L.M.K. Vandersypen, M. Steffen, G. Breyta, C.S. Yannoni, M.H. Sherwood, and I.L. Chuang, Nature 414, 883 (2001).
3. F. Schmidt-Kaler, H. Häffner, M. Riebe, S. Gulde, G.P.T. Lancaster, T. Deuschle, C. Becher, C.F. Roos, J. Eschner, R. Blatt, Nature 422, 408 (2003).
4. N. Gisin, G.G. Ribordy, W. Tittel, H. Zbinden, Rev. Mod. Phys. 74, 145 (2002).
5. C.E. Shannon, Bell System Technical Journal, 27, 379 (1948).
6. B. Schumacher, Phys. Rev. A 51, 2738 (1995).
7. R. Jozsa and B. Schumacher, J. Mod. Opt. 41, 2343 (1994).
8. See e.g. M.A. Nielsen and I.L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, Cambridge, 2000).
9. Y. Mitsumori, J.A. Vaccaro, S.M. Barnett, E. Andersson, A. Hasegawa, M. Takeoka and M. Sasaki, "Experimental demonstration of quantum source coding", quant-ph/0304036.
Who's The Weakest Link?

Nikhil Devanur, Richard J. Lipton, and Nisheeth Vishnoi

Georgia Institute of Technology, Atlanta, GA 30332, USA
{nikhil,rjl,nkv}@cc.gatech.edu
Abstract. In this paper we consider the following problem: given a network G, determine if there is an edge in G through which at least c shortest paths pass. This problem arises naturally in various practical situations where there is a massive network (telephone, internet), routing of data is done via shortest paths, and one wants to identify the most congested edges in the network. The problem can be easily solved by one all-pairs shortest-path computation, which takes time O(mn), where n is the number of nodes and m the number of edges in the network. But for massive networks, can we do better? It seems hard to improve this bound by a deterministic algorithm, and hence we naturally turn to randomization. The main contribution of this paper is to give a practical solution (in time significantly less than O(mn)) to a problem of great importance in industry.

Keywords: shortest path, massive networks, approximation, random sampling.
1 Introduction

Recent times have seen the emergence of huge networks, for instance the internet, telephone networks, peer-to-peer networks, the World Wide Web, etc. Of particular interest are the internet and the telephone networks ([16]), which present a host of practical problems. Routing is a key element in making these huge networks work efficiently. The most common routing scheme used is shortest-path routing, i.e. the route between two nodes is the shortest path joining them. The advantage of using shortest paths¹ is that all the routes can be computed in O(mn) time (where m is the number of edges and n the number of nodes) (see [4, 7, 8, 18]). Maintenance of such huge networks is an immense challenge. The problem that arises in this scenario is that of identifying congested edges/nodes, or hot-spots, or bottlenecks. This is because these are the places most prone to breakdown. Moreover, any problem at these spots causes maximum damage. Another objective is load-balancing in order to reduce delays. This is done by identifying the congested edges/nodes and rerouting some routes. So it is only natural that identifying such edges/nodes is of top priority.
¹ Even though we consider only shortest-path routing in this paper, we do not use any particular property of shortest paths. Our techniques will work for most routing schemes, but the interesting ones are those in which a route can be computed in O(m) time. For any such routing scheme, all the routes can be computed in O(mn²) time. Notice that one cannot hope to do better than Ω(mn) time in general, since for a tree, at least Ω(n²) = Ω(mn) time is required.
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 108-116, 2003. © Springer-Verlag Berlin Heidelberg 2003
Congestion of an edge (a node) is defined as the number of routes that pass through the edge (the node). Assume that there is a unique route between any two nodes. For the sake of simplicity of presentation, we consider only simple, undirected graphs with unweighted edges. One can consider many problems related to congestion in a network. This paper is centered around the following problems:

1. Is there an edge with congestion at least c?
2. Find the most congested edge.
3. More generally, find the top t most congested edges/nodes.
4. Given an edge, estimate its congestion.
Note that if we can solve 1, then we can solve 2 by binary search. In fact, 1 seems to be the most useful and general problem one might want to solve, and it is exactly the problem we address in this paper. This problem arises in industry (in particular, it is of interest to Telecordia [16]), where one wants to identify the most congested edges in the network. One may construct simple examples where deterministic algorithms cannot do much better than Ω(mn) time. So the natural approach is to use randomness. Although these techniques are very simple, they are very powerful and have been used to give important results: [5, 9, 12, 13, 15] and [14]. Similar ideas using randomization were used by Karger et al. in [10] and [11] to estimate the value of a max-flow. Using randomization naturally leads us to consider a slight relaxation of the problem: given parameters c and ε, if there exists an edge with congestion ≥ c(1 + ε), then output YES with probability ≥ 1 − 1/poly(n); if for all edges the congestion is ≤ c(1 − ε), then output NO with probability ≥ 1 − 1/poly(n). Note that Problem 2 can also be approximately solved using the above relaxation. We also consider those graphs where there is a "clear" separation between the set of most congested edges and the rest, and we are required to output the most congested edges. This and similar problems are omnipresent in industry and, to our surprise, have not been systematically studied in the Computer Science literature, to the best of our knowledge. In this paper we provide a formal setting for some of these problems and give extremely practical algorithms using elementary ideas. We make no attempt to optimize the constants in this paper.

1.1 Formal Setting

Let G = (V, E) be an undirected, unweighted, simple graph with n := |V|, m := |E|.

Definition 1. Given an edge e ∈ E, define σ(e), the congestion of an edge, to be the number of shortest paths passing through e.
To make this well defined, we label the vertices and use the lexicographically least shortest path for each pair of vertices. One can also define the congestion of a node similarly:

Definition 2. Given a node v ∈ V, define σ(v), the congestion of a node, to be the number of shortest paths containing v as an interior point.
As before, we label the nodes and consider the lexicographically least shortest path. Also note that

∑_{e: e∼v} σ(e) = 2σ(v) + n − 1.
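For reference, the exact O(mn) computation of σ(e) that the paper wants to beat can be sketched as follows. The tie-breaking rule here (BFS with neighbors in sorted order, so the smallest predecessor label wins) is one concrete way to fix a unique path per pair, in the spirit of the lexicographic convention; it is our own choice, not necessarily the authors':

```python
from collections import deque

# Exact baseline: count, for every edge, how many of the C(n,2) canonical
# shortest paths pass through it (one BFS per source => O(mn) overall).
def edge_congestion(adj):
    """adj: dict vertex -> sorted list of neighbours (undirected, connected)."""
    sigma = {}
    for s in sorted(adj):
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:          # sorted => smallest predecessor wins
                if w not in parent:
                    parent[w] = u
                    q.append(w)
        for t in sorted(adj):
            if t <= s:
                continue              # count each unordered pair once
            v = t
            while parent[v] is not None:
                e = tuple(sorted((v, parent[v])))
                sigma[e] = sigma.get(e, 0) + 1
                v = parent[v]
    return sigma

# On the path 0-1-2-3, the middle edge carries 4 of the 6 shortest paths:
assert edge_congestion({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})[(1, 2)] == 4
```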
Definition 3. Let P be any set of shortest paths. Define σP(e) to be the number of shortest paths in the set P that pass through e.

Definition 4. Let g(G) := max_{e∈E} σ(e) be the maximum congestion in the graph G.

Definition 5. Suppose that there exist δ > 0 and a function f(n) such that E is partitioned into two classes,

S1 := { e ∈ E : σ(e) ≥ n^{1+δ} (1 + f(n)) },    (1)
S2 := { e ∈ E : σ(e) ≤ n^{1+δ} / (1 + f(n)) }.    (2)

Then we say that there is an f-separation at δ in G. That is, we assume that there is a separation between the congestion of the edges in S1 and that of the others, S2.

The problems that we want to solve are²:

Problem 1. Given c, is there an edge e ∈ E with σ(e) ≥ c?

Problem 2. Find an edge e ∈ E with the maximum congestion, i.e., σ(e) = g(G). Or, more generally, find the t most congested edges.

Problem 3. Given that there is an f-separation at δ in G, find the set S1, as defined in (1) (and hence S2).

Problem 4. Given an edge e ∈ E, estimate σ(e), its congestion.

1.2 Our Results

We outline the main results³ obtained in this paper. We give an algorithm for solving Problem 1. The following theorem proves its correctness and establishes its running time:

Theorem 1. For all ε > 0, there exists an algorithm A running in time O(km), where k = (log n / ε²) · (n² / c), such that
– if ∃e ∈ E, σ(e) ≥ c(1 + ε), then Pr[A outputs YES] ≥ 1 − n⁻²;
– if ∀e ∈ E, σ(e) ≤ c(1 − ε), then Pr[A outputs NO] ≥ 1 − n⁻².

An immediate corollary of the above theorem gives a solution to Problem 2.
² We define all the problems w.r.t. the congestion on edges. However, they can also be defined w.r.t. the congestion on vertices.
³ As with the problems, the results are stated w.r.t. the congestion on edges; the corresponding results for congestion on vertices follow immediately.
Corollary 1. Given a graph G such that g(G) ≫ n log³ n, there exists⁴ an algorithm running in time o(mn) which finds an edge e such that

Pr[ σ(e) < (1 − 1/log n) g(G) ] < 1/n.

A variant of the previous algorithm solves Problem 3.

Theorem 2. Given a graph G with an f-separation at δ, there is an algorithm which outputs S1 correctly with probability at least 1 − n⁻², and runs in time O(km), where k = (log n / ε²) · n^{1−δ} and ε = 1/f(n) − 1/f²(n).

Again, we get as a corollary that Problem 4 can be solved.

Corollary 2. Given G = (V, E) and e ∈ E with σ(e) ≫ n log³ n, there exists an algorithm which runs in time o(mn) and finds σ̃(e) such that

Pr[ |σ(e) − σ̃(e)| ≥ σ(e)/log n ] ≤ n⁻⁴.

1.3 Organization of the Paper

In section 2 we give an algorithm to solve Problem 1 and prove Theorem 1. We give a slight variation of the previous algorithm in section 3 that proves Theorem 2. In the same section we compare our algorithm with the deterministic O(mn) algorithm. In section 4, using results from the theory of random graphs, we show how to further improve the running time. Section 5 considers a coupon-collection approach to the problem. In section 6 we present related open problems.
2 Main Result

2.1 Algorithm

Given an undirected graph G = (V, E) on n vertices and m edges, the number of shortest paths is C(n, 2) = n(n − 1)/2. The main idea of the algorithm is random sampling, followed by checking whether there exists an edge whose observed congestion is greater than some cut-off. The algorithm picks k paths at random, uniformly and independently⁵; k will be determined later. Given k, define the "cut-off" to be

c(k, n, c) := kc / C(n, 2).
The algorithm is as follows:
⁴ For functions f(n), g(n), we write f(n) ≫ g(n) to mean g(n) = o(f(n)).
⁵ One way to do this is to pick two vertices at random without replacement and consider the path with those as endpoints.
Algorithm 2. Algorithm for detecting an edge with high congestion

Input: the graph G = (V, E), c
Choose k paths at random, uniformly and independently. Let the set of paths chosen be P.
Compute all the shortest paths in P and σP(e) for each e ∈ E.
if for some edge e ∈ E, σP(e) ≥ c(k, n, c) then
  Output: YES
else
  Output: NO
end
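The algorithm above can be sketched concretely. BFS supplies the O(m)-per-path step; the dictionary-of-lists graph encoding and the BFS tie-breaking are our own choices for illustration:

```python
import random
from collections import deque
from math import comb

# Sketch of the sampling algorithm: draw k shortest paths uniformly at
# random and report YES if any edge's sampled count reaches k*c/C(n,2).
# Assumes a connected, undirected, simple graph.
def has_congested_edge(adj, c, k, seed=0):
    rng = random.Random(seed)
    nodes = sorted(adj)
    cutoff = k * c / comb(len(nodes), 2)
    counts = {}
    for _ in range(k):
        s, t = rng.sample(nodes, 2)     # endpoints without replacement
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:    # BFS until t is discovered
            u = q.popleft()
            for w in adj[u]:
                if w not in parent:
                    parent[w] = u
                    q.append(w)
        v = t
        while parent[v] is not None:    # walk the path, tallying its edges
            e = tuple(sorted((v, parent[v])))
            counts[e] = counts.get(e, 0) + 1
            v = parent[v]
    return any(x >= cutoff for x in counts.values())

# In a 10-vertex star, each center edge carries 9 of the 45 paths, so the
# answer for c = 5 is YES:
star = {0: list(range(1, 10)), **{i: [0] for i in range(1, 10)}}
print(has_congested_edge(star, c=5, k=2000))   # True
```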
Each of the shortest paths in P can be computed in O(m) time, so it takes O(km) time to compute all the shortest paths in P. Note that in the same amount of time, σP(e) can be computed for all e ∈ E. So the total running time of the algorithm is O(mk).

2.2 Analysis

We need to determine k, the number of paths to be picked, and analyze the probability of success. In fact, k will depend on the probability of success desired, besides c and ε. Let Xe := σP(e), for all e ∈ E, be random variables, where P is the set of paths picked by the (randomized) algorithm. Each Xe can be written as a sum of k independent and identically distributed Bernoulli trials, Xe = Xe1 + Xe2 + ··· + Xek, where Xei is 1 if the ith path chosen by the algorithm passes through e and 0 otherwise. Thus

Pr[Xei = 1] = σ(e) / C(n, 2)
µ := E[Xe] = kσ(e) / C(n, 2).
Lemma 1. For all ε > 0, if

k ≥ (8 log n / ε²) · (n² / c)

and ∃e ∈ E with σ(e) ≥ c(1 + 2ε), then

Pr[e is not picked by the algorithm] < n⁻⁴.

Proof. All probabilities are taken over the choice of the set of paths chosen, P. We have

Pr[e is not picked] = Pr[Xe < c(k, n, c)].

The version of the Chernoff bound we use is [1]:

Pr[Xe < (1 − ε)µ] < exp(−µε²/2).
For this edge e,

µ ≥ kc(1 + 2ε) / C(n, 2)
⇒ µ(1 − ε) ≥ kc / C(n, 2) = c(k, n, c)
⇒ Pr[Xe < c(k, n, c)] ≤ Pr[Xe < (1 − ε)µ] ≤ exp(−µε²/2).

Moreover,

k ≥ (8 log n / ε²) · (n² / c) ≥ 4 log n · (2/ε²) · (C(n, 2) / c)
⇒ ε²µ/2 ≥ ε²kc / (2 C(n, 2)) ≥ 4 log n
⇒ Pr[Xe < c(k, n, c)] ≤ exp(−µε²/2) ≤ exp(−4 log n) = n⁻⁴.

Similarly, the probability that some edge with low congestion is picked by the algorithm can also be bounded (we skip the proof):

Lemma 2. For all ε > 0, if

k ≥ (12 log n / ε²) · (n² / c)

and ∀e ∈ E, σ(e) ≤ c(1 − 2ε), then

Pr[e is picked by the algorithm] < n⁻⁴.
n n · c . For each edge e Proof ( of Theorem 1). Consider Algorithm 2.1 with k = 12 εlog 2 with σ(e) ≥ c(1 + ε) or σ(e) ≤ c(1 − ε), the probability that it ends up resulting in the wrong answer is n−4 . Since there are at most n2 edges, the probability that one of them ends up resulting in the wrong answer is at most n−2 . So with probability at least 1 − n−2 the algorithm outputs correctly.
2.3 Comparison

To get a running time better than O(mn), we need c to be asymptotically bigger than n log n. (n is large, so any reasonably growing function like log n will do.) Another important fact about these massive graphs is that they are sparse; thus we may assume that m = O(n). Hence the maximum congestion is Ω(n). In particular, if c ∼ n^{1+δ}, then the running time would be O((m n^{1−δ} / ε²) log n). For δ = 1/2 and n = 10⁶, our algorithm is guaranteed to be roughly 1000 times faster than the deterministic algorithm. This is a significant saving in the running time. Moreover, simulations suggest that the running time is much less than our guarantee. Also, if the maximum congestion is O(n), then no edge is congested too much, and the graph is "nice". Hence our algorithm can be used to detect that too!
3 Finding Separation

Suppose that we are given that there is an f-separation at δ in G. Using essentially the same technique as in the previous section, one can identify the set S1 of most congested edges. Define the "cut-off" to be

c(k, n, δ) := k n^{1+δ} / C(n, 2).
Algorithm 3. Algorithm for finding separation

Input: the graph G = (V, E), δ, f(·)
Choose k paths at random, uniformly and independently. Let the set of paths chosen be P.
Compute all the shortest paths in P and σP(e) for each e ∈ E.
Output: the edges with σP(e) ≥ c(k, n, δ).
We state, without proof, the corresponding parameters that give the required result:

Lemma 3. If

k ≥ (12 log n / ε²) · n^{1−δ} and ε = 1/f(n) − 1/f²(n),

then

Pr[e ∈ S1 is not picked by the algorithm] < n⁻⁴,
Pr[e ∈ S2 is picked by the algorithm] < n⁻⁴.
4 Further Optimization Using Random Graphs A (further) reduction in running time might be obtained by the fact that the time taken to compute single pair shortest path is the same as that for single source shortest path computation. So one might choose an appropriate set of vertices and run single source shortest path computations only on those vertices. Consider a graph H on the same set of vertices as G, with an edge between u and v if and only if the algorithm chose the shortest path between u and v. Clearly, it is enough to run single-source shortest path algorithms on any vertex cover of H. Since finding a minimum vertex cover is NP-Hard, we find a 2-approximation to the minimum vertex cover by finding a maximal matching in H, [17].
Erdős and Rényi in [6] define the random graph model G(n, p), which consists of all graphs on n vertices in which the edges are chosen independently with probability p. Since we pick the shortest paths uniformly at random, H belongs to G(n, p), where p = k/C(n, 2) and k is the number of samples. For the parameters in our case, where p = o(1/n), the size of the minimum vertex cover is Ω(k) (see [3]). Unfortunately, it turns out that this does not give an order-of-magnitude improvement. It does, however, give an improvement in the constants.
5 A Heuristic Based on Coupon Collection

In this section we give a heuristic for detecting the edge with maximum congestion, based on the idea of coupon collection. The main idea is explained via the following toy problem: given a bin with b balls of k different colors, where bi is the number of balls of color i, what is argmax_i bi? The coupon-collection approach for this problem is the following: for a parameter t, which will depend on the bi's, sample with replacement from the bin, and output the first color to repeat t times. Two remarks are in order:

1. This technique will in general not give good estimates for the values bi.
2. The probability of success and the running time of this algorithm are functions of the bi's.

Of course, the random sampling technique used in the previous sections applies here, but the present technique might give faster estimates in cases where the bi's are extremely nonuniform. For a full technical discussion of this problem and its variants, refer to the book by Blom, Holst and Sandell [2]. The problem of identifying the edge e such that σ(e) = g := max_{e′∈E} σ(e′) is a generalization of the coupon-collection scheme spelled out above. In our setting the balls are the paths and the colors are the edges; each ball can have multiple colors. Hence, for a given parameter t, we sample paths until some edge repeats t times, and output that edge. There are simple graphs, like Kn, where this heuristic behaves very badly. Another case where this heuristic will fail is when the edges with congestion slightly less than g far outnumber the edges with congestion g. But for many graphs, like the one in Fig. 1, this heuristic is very quick. We leave as an open problem to fully analyze the running time and success probability of this heuristic.
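The heuristic can be sketched generically. Here `sample` stands in for the O(m) shortest-path sampler (each draw yields the items, i.e. edges, on one path); the usage example below is the colored-balls toy problem, with illustrative weights:

```python
import random

# Coupon-collection heuristic: draw samples until some item has appeared
# t times, and output that item. Each draw may contain several items
# (a path contains several edges).
def first_to_t(sample, t, max_draws=100_000):
    counts = {}
    for _ in range(max_draws):
        for item in sample():
            counts[item] = counts.get(item, 0) + 1
            if counts[item] == t:
                return item
    return None

# Toy bin of colored balls: color 'a' is far more frequent, so it is the
# first to repeat t = 20 times.
rng = random.Random(0)
draw = lambda: [rng.choices("abc", weights=[20, 1, 1])[0]]
print(first_to_t(draw, 20))    # prints "a"
```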
6 Conclusion and Open Problems In this paper we considered the problem of determining whether there exists an edge with congestion ≥ c, for given c. We gave fast and practical algorithms which are also simple to implement. One of the main contributions of this paper is to formalize this extremely important problem and put it in a theoretical framework. Any improvement in the running time would be of significance. Another interesting question is to find σ(e), given e, in time better than O(mn); we partially answered this question in Section 4. A deterministic solution to this problem would be very interesting.
116
Nikhil Devanur, Richard J. Lipton, and Nisheeth Vishnoi
Fig. 1. A Good Graph
References 1. N. Alon, J. Spencer, The Probabilistic Method, Wiley Interscience, 2000. 2. G. Blom, L. Holst, D. Sandell, Problems and Snapshots from the World of Probability, Springer-Verlag, 1994. 3. B. Bollobás, Random Graphs, Cambridge University Press, 2001. 4. E.W. Dijkstra, A note on two problems in connexion with graphs, Numerische Mathematik, 1, (1959), 269-271. 5. M. Dyer, A. Frieze, R. Kannan, A random polynomial-time algorithm for approximating the volume of convex bodies, Journal of the ACM, (1991), 1-17. 6. P. Erdős, A. Rényi, On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci., 5, (1960), 17-61. 7. R.W. Floyd, Algorithm 97: Shortest Path, Communications of the ACM, 5, (1962), 345. 8. D.B. Johnson, Efficient algorithms for shortest paths in sparse networks, Journal of the ACM, 24, (1977), 1-13. 9. N.L. Johnson, S. Kotz, Urn Models and Their Applications, John Wiley, New York, 1977. 10. D.R. Karger, Better random sampling algorithms for flows in undirected graphs, Proc. SODA, 1998. 11. D.R. Karger, M.S. Levine, Random sampling in residual graphs, Proc. ACM STOC, 2002. 12. R.M. Karp, M. Luby, Monte Carlo algorithms for the planar multi-terminal network reliability problem, Journal of Complexity, 1, (1985), 45-64. 13. R.M. Karp, M. Luby, N. Madras, Monte Carlo approximation algorithms for enumeration problems, Journal of Algorithms, 10, (1989), 429-448. 14. R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995. 15. K. Mulmuley, Computational Geometry: An Introduction Through Randomized Algorithms, Prentice Hall, 1994. 16. Telcordia. Private communication, 2002. 17. V.V. Vazirani, Approximation Algorithms, Springer-Verlag, 2001. 18. S. Warshall, A theorem on Boolean matrices, Journal of the ACM, 9, (1962), 11-21.
On the Stochastic Open Shop Problem Roman A. Koryakin Sobolev Institute of Mathematics of SB RAS, Acad. Koptyug pr. 4, 630090 Novosibirsk, Russia [email protected]
Abstract. We consider the open shop problem with m machines and n jobs, with random operation processing times having distributions F_ij. There exists a polynomial time algorithm A (based on the compact vector summation technique) constructing, for every instance of the problem, a schedule (generally speaking, infeasible) of length equal to the maximum machine load. We show that for fixed m and increasing n, and for a wide class of distributions F_ij, the schedule constructed by A is almost always feasible (and therefore optimal). Thus, we have an efficient algorithm almost always solving the problem to the optimum. Keywords: Open shop, polynomial time algorithm, statistically optimal algorithm, random processing times.
1 Introduction and Results We consider the open shop problem with m machines and n jobs. A polynomial time algorithm A with running time O(n²m²), producing a schedule of length equal to the maximum machine load M for every instance of the problem, was suggested by Sevastianov in [2] (1995). We give a short description of the algorithm below. Let p_ij denote the processing time of job j on machine i;

M_i = ∑_{j=1}^{n} p_ij ;  M = max_{i=1,...,m} M_i ;  p_max = max{p_ij | i = 1, ..., m; j = 1, ..., n} .
Without loss of generality, we assume p_max = 1 and M_i = M (i = 1, 2, ..., m). We associate with each job j = 1, 2, ..., n the vector d_j = (d_j^(1), ..., d_j^(m−1)) ∈ R^{m−1}, where d_j^(i) = p_ij − p_mj. Obviously, ∑_j d_j = 0. Thus, Lemma 2 (see Sect. 3) can be applied to the vector family {d_j | j = 1, ..., n}, which implies that there is a permutation π = (π_1, ..., π_n) satisfying ‖∑_{j=1}^{k} d_{π_j}‖_s ≤ C_{m−1} for each k = 1, ..., n, where C_{m−1} and the norm ‖·‖_s are defined in Lemma 2. Then

| ∑_{j=1}^{k} p_{i₁π_j} − ∑_{j=1}^{k} p_{i₂π_j} | ≤ C_{m−1} ;  i₁, i₂ = 1, ..., m ;  k = 1, ..., n .  (1)
We define Δ_0 = ∑_{j=1}^{k_0} p_{1,π_j} and

Δ_i = max_{k=1,...,n} ( ∑_{j=1}^{k} p_{i,π_j} − ∑_{j=1}^{k−1} p_{i+1,π_j} ) ,  i = 1, ..., m,  (2)

where p_{m+1,j} means p_{1,j} and k_0 is the value of k that delivers the maximum on the right-hand side of (2) for i = 1. It follows from (1) and Lemma 2 that

0 ≤ Δ_i ≤ C_{m−1} + 1 ≤ m − 1 + 1/(m−1) ,  i = 1, ..., m .  (3)

The algorithm A consists of five steps. Let s_ij denote the starting time of the operation of job j on machine i.

A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 117–124, 2003. © Springer-Verlag Berlin Heidelberg 2003
ALGORITHM A
Step 0. First we construct on each machine a schedule in which the n jobs are processed with no idle time in the interval [0, M] in the order π: s_{iπ_1} := 0; s_{iπ_j} := s_{iπ_{j−1}} + p_{iπ_{j−1}}, i = 1, ..., m; j = 2, ..., n.
Step 1. The i-th machine schedule (i = 1, ..., m) is shifted forward by the value (Δ_0 + ··· + Δ_{i−1}): s_{iπ_j} := s_{iπ_j} + (Δ_0 + ··· + Δ_{i−1}), i = 1, ..., m; j = 1, ..., n.
Step 2. The i-th machine schedule (i = 3, ..., m) is shifted to the right by the value δ_3 + ··· + δ_i, where δ_i is the minimum shift of the i-th machine schedule to the right making the beginning of some operation equal to k_i M (where k_i = min{k = 1, 2, ... | s_{iπ_n} + p_{iπ_n} ≤ kM} − 1): s_{iπ_j} := s_{iπ_j} + (δ_3 + ··· + δ_i), i = 3, ..., m; j = 1, ..., n.
Step 3. The i-th machine schedule (i = 2, ..., m) is shifted to the left by the value (k_i − 1)M: s_{iπ_j} := s_{iπ_j} − (k_i − 1)M, i = 2, ..., m; j = 1, ..., n.
Step 4. The i-th machine schedule (i = 1, ..., m) is cut at the moment M and the rest is placed at the moment 0: s_{iπ_j} := s_{iπ_j} mod M, i = 1, ..., m; j = 1, ..., n.
After steps 0–4, we have on each machine a schedule of length M (generally speaking, infeasible). By [2], a sufficient condition for the schedule constructed by algorithm A to be feasible is M ≥ Δ_1 + ··· + Δ_m + δ_3 + ··· + δ_m, or, due to δ_i < 1 (i = 3, ..., m) and (3):

M ≥ m² p_max .  (4)
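As a rough illustration, the no-idle construction of Step 0, the Δ-shifts of Step 1 and the wrap-around of Step 4 can be sketched as below. This is a simplified sketch only: the δ-shifts of Steps 2–3 are omitted, equal machine loads M_i = M are assumed (as in the paper), and all names are ours.

```python
def wraparound_schedule(p, pi, deltas):
    """Simplified sketch of algorithm A (Steps 0, 1 and 4 only).

    p[i][j]  -- processing time of job j on machine i,
    pi       -- the job permutation from Lemma 2,
    deltas   -- the shifts [Delta_0, ..., Delta_{m-1}].
    Returns start[i][j], the start time of the j-th job in the order
    pi on machine i, wrapped modulo M.
    """
    m, n = len(p), len(p[0])
    M = sum(p[0])  # maximum machine load; equal on all machines by assumption
    start = [[0.0] * n for _ in range(m)]
    for i in range(m):
        # Step 1: machine i (1-based i+1) is shifted by Delta_0+...+Delta_i
        t = sum(deltas[:i + 1])
        for j in range(n):
            start[i][j] = t % M          # Step 4: wrap at the moment M
            t += p[i][pi[j]]             # Step 0: no idle time, order pi
    return start
```

On a toy instance with two unit-time jobs per machine the second machine's schedule is simply offset by its Δ-shift.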
When formulating the stochastic open shop problem with n jobs and m machines, we assume all p_ij to be independent (and, in general, differently distributed) random variables. Let μ_ij and σ²_ij denote the mean and the variance of p_ij, respectively, and let

B²_in = ∑_{j=1}^{n} σ²_ij ,  A_in = ∑_{j=1}^{n} μ_ij  (i = 1, ..., m).
Let us consider several conditions on the distributions of p_ij.

(B1) All random variables {p_ij | i = 1, 2, ..., m; j = 1, 2, ..., n} (except possibly finitely many) have nondegenerate distributions (i.e., σ²_ij > 0), and σ²_ij < ∞ for i = 1, 2, ..., m; j = 1, 2, ..., n.

(B2) For any τ > 0 and i = 1, 2, ..., m, the Lindeberg condition holds:

lim_{n→∞} (1/B²_in) ∑_{j=1}^{n} E( (p_ij − μ_ij)² ; |p_ij − μ_ij| > τB_in ) = 0.

(B3) For every i = 1, 2, ..., m, the following asymptotic relations hold for some ε ∈ (0, 1) as n → ∞:

A_in → ∞ ,  B_in = o(A_in) ,  n^{1+ε} E p²_in = o(A²_in) .

It turns out that, as n → ∞, inequality (4) holds almost surely, i.e.,

lim_{n→∞} P( M ≥ m² p_max ) = 1 ,  (5)

whenever the p_ij satisfy (B1)–(B3). It should be noted that M and p_max increase as n → ∞, and therefore depend on n. Now we formulate the main result.

Theorem 1. For the stochastic open shop problem with an increasing number of jobs n and a fixed number of machines m, with processing times satisfying (B1)–(B3), the algorithm A almost always constructs an optimal schedule, the makespan being equal to the maximum machine load.

Let us take a closer look at the conditions (B1)–(B3). First, note that they are imposed on each machine independently; therefore, adding or removing a finite number of machines changes nothing in the formulation of Theorem 1. Although condition (B1) is not tight, it is satisfied by all commonly used distributions (Poisson, Gaussian, exponential, uniform, and others). Condition (B3) is not very burdensome either. For example, it is easy to see that it holds for identically distributed random variables with mean μ > 0 and variance σ² < ∞. Indeed, in this case B_in ∼ √n, A_in ∼ n and n^{1+ε} E p²_in ∼ n^{1+ε} (n → ∞); therefore condition (B3) turns into

√n = o(n) ,  n^{1+ε} = o(n²) ,  n → ∞,

which always holds. The Lindeberg condition (B2) is necessary: we can give an example of distributions of p_ij not satisfying (B2) (but satisfying (B1), (B3)) for which equation (5) does not hold.
Example 1. For each i = 1, ..., m, let the random variables p_ij (j = 1, 2, ...) be distributed as follows:

p_ij = 1/j²  with probability 1 − 1/√j ;  p_ij = 1/√j  with probability 1/√j .  (6)

In this case,

E p_ij = j^{−2} − j^{−5/2} + j^{−1} ,  E p²_ij = j^{−4} − j^{−9/2} + j^{−3/2} ,  (7)

and condition (B1) obviously holds. It follows from (7) that, as n → ∞,

A_in ∼ ln n ,  B²_in ∼ n^{−1/2} ,  E p²_in ∼ n^{−3/2} ,  (8)

i.e., condition (B3) holds. Relations (7) and (8) imply that the distribution (6) does not meet the Lindeberg condition (B2). Let us show that Theorem 1 cannot be applied to the distribution (6). Obviously, p_max = 1, and for each i = 1, ..., m, by Lemma 4 (see Sect. 3) we have:

P( ∑_{j=1}^{n} p_ij ≥ m² ) ≤ (1/m⁴) E( ∑_{j=1}^{n} p_ij )²
  = (1/m⁴) ∑_{j=1}^{n} E p²_ij + (1/m⁴) ∑_{j=1}^{n} E p_ij ( ∑_{k=1}^{n} E p_ik − E p_ij )
  ∼ (1/m⁴) ( 1 + n^{−1/2} + n^{−1} ) ≤ 3/m⁴ ,

and relation (5) does not hold. □

Since the conditions (B1)–(B3) of Theorem 1 are not always easy to check, we suggest instead a condition (B4) which is stronger but easier to check.

(B4) On the i-th machine (i = 1, ..., m), the operation processing times are identically distributed random variables with mean μ_i and variance σ²_i < ∞.

It can easily be shown that (B4) implies (B1)–(B3).

Corollary 1. For the stochastic open shop problem with an increasing number of jobs n and a fixed number of machines m, with processing times satisfying (B4), the algorithm A almost always constructs an optimal schedule, the makespan being equal to the maximum machine load.
2 Proofs

Proof of Theorem 1. First, note that relation (5) holds as soon as, for each i = 1, ..., m, the relation

lim_{n→∞} P( M_i ≥ m² p^(i)_max(n) ) = 1  (9)

holds, where p^(i)_max(n) = max{p_ij | j = 1, ..., n}. Thus, from now on we fix i and prove that, for some increasing sequence of real numbers {C_in}_{n=1}^∞ and n → ∞, the events {M_i ≥ C_in} and {m² p^(i)_max(n) ≤ C_in} occur almost always:

lim_{n→∞} P( M_i ≥ C_in ) = 1 ;  (10)

lim_{n→∞} P( m² p^(i)_max(n) ≤ C_in ) = 1 .  (11)

From (10), (11) and (9), relation (5) follows immediately. The sequence {C_in}_{n=1}^∞ is defined by the following lemma.

Lemma 1. There exists a sequence {C_in}_{n=1}^∞ of real numbers such that the following asymptotic relations hold for n → ∞:

C_in → ∞ ,  C_in = o(A_in) ,  (12)
n^{1+ε} E p²_in = o(C²_in) .  (13)

Proof. Let C_in = ( n^{1+ε} E p²_in / A²_in )^α A_in, where α ∈ (0, 1/3). The sequence {C_in}_{n=1}^∞ obviously satisfies (12) and (13). □

Due to the independence of the operation processing times p_ij (j = 1, ..., n), we have:

P( m² p^(i)_max(n) ≤ C_in ) = ∏_{j=1}^{n} P( m² p_ij ≤ C_in ) .  (14)
Using Lemma 4 (see Sect. 3), we get the following inequality:

P( m² p_ij > C_in ) ≤ m² E p²_ij / C²_in ,  (j = 1, ..., n),

which together with (14) implies the asymptotic bound for n → ∞:

P( m² p^(i)_max(n) ≤ C_in ) ≥ ∏_{j=1}^{n} ( 1 − m² E p²_ij / C²_in ) .  (15)

By the definition of the sequence {C_in}, each factor on the right-hand side of (15) is nonnegative for n large enough. Let us show that

lim_{n→∞} ∏_{j=1}^{n} ( 1 − m² E p²_ij / C²_in ) = 1 ,  (16)

which is equivalent to

lim_{n→∞} ∑_{j=1}^{n} ln( 1 − m² E p²_ij / C²_in ) = 0 .  (17)
Using the bound ln(1 − δ) ≥ δ/(δ − 1) (for δ ∈ (0, 1)), we get

| ∑_{j=1}^{n} ln( 1 − m² E p²_ij / C²_in ) | ≤ ∑_{j=1}^{n} δ_j(n) / (1 − δ_j(n)) ,

where δ_j(n) = m² E p²_ij / C²_in. Since δ_j(n) → 0 (n → ∞), there exists a number j_0 such that 1 − δ_j(n) ≥ 1/2 for any j ≥ j_0. Then

| ∑_{j=1}^{n} ln( 1 − m² E p²_ij / C²_in ) | ≤ ∑_{j=1}^{j_0−1} δ_j(n) / (1 − δ_j(n)) + 2 ∑_{j=j_0}^{n} δ_j(n) .

The first sum on the right-hand side of the last relation obviously tends to 0 (n → ∞). By Lemma 3 (see Sect. 3) the second sum converges to 0 as well:

lim_{n→∞} ∑_{j=j_0}^{n} δ_j(n) = 0 .

Lemma 3 is applicable here, since by (13) we have E p²_in / C²_in = o(n^{−1−ε}), so that

∑_{n=1}^{∞} E p²_in / C²_in < ∞ .

Thus, relation (17) is proved, and therefore relation (11) holds.

Let us now show that, for n → ∞, the probability P(M_i ≥ C_in) tends to 1, or equivalently, that P(M_i < C_in) tends to 0. Note that

P( M_i < C_in ) = P( ( ∑_{j=1}^{n} p_ij − A_in ) / B_in < (C_in − A_in) / B_in ) .

Then by Theorem 2 (see Sect. 3) and the uniformity of the convergence there,

P( M_i < C_in ) − Φ( (C_in − A_in) / B_in ) → 0 ,

and the right-hand term tends to 0 as soon as

lim_{n→∞} (C_in − A_in) / B_in = −∞ ,

which holds due to condition (B3) and the definition of the sequence {C_in} (from Lemma 1). Thus, for n → ∞ the probability P(M_i < C_in) tends to 0, which means that P(M_i ≥ C_in) → 1, and therefore relation (10) holds. From (10), (11) and (9), relation (4) holds almost always; hence, the algorithm A almost always constructs a feasible schedule with makespan equal to the maximum machine load. Theorem 1 is proved. □
Proof of Corollary 1. To prove the corollary we need to show that condition (B4) implies conditions (B1)–(B3). The implication (B4) ⇒ (B1) is obvious. Let us check that the Lindeberg condition (B2) holds for each i = 1, ..., m:

(1/(nσ²_i)) ∑_{j=1}^{n} E( (p_ij − μ_i)² ; |p_ij − μ_i| > τσ_i √n )
  = (1/σ²_i) E( (p_i1 − μ_i)² ; |p_i1 − μ_i| > τσ_i √n ) → 0  (n → ∞) .

To prove the implication (B4) ⇒ (B3), it is sufficient to note that under the conditions of the corollary, A²_in ∼ n², B²_in ∼ n, n^{1+ε} E p²_in ∼ n^{1+ε} (for each i = 1, ..., m). Corollary 1 is proved. □
3 Appendix

Compact Vector Summation Problem. Let X = {x_1, ..., x_n} ⊂ R^m be a finite family of vectors satisfying

∑_{i=1}^{n} x_i = 0 ;  ‖x_i‖_s ≤ 1 ,  i = 1, ..., n,

where the norm ‖·‖_s is specified by its m-dimensional unit ball B^m:

B^m = { x = (x^(1), ..., x^(m)) ∈ R^m : |x^(i)| ≤ 1, |x^(i) − x^(j)| ≤ 1; i, j = 1, ..., m } .

The problem of Compact Vector Summation (CVS) is to find a permutation π = (π_1, ..., π_n) of {1, ..., n} minimizing the function

f_X(π) = max_{k=1,...,n} ‖ x_{π_1} + x_{π_2} + ··· + x_{π_k} ‖_s .
Lemma 2 ([3]). An algorithm with running time O(m²n²) can be designed that, for the vector family X and the norm ‖·‖_s defined above, computes a permutation π for which the vectors from X can be summed within the ball of radius C_m = m − 1 + 1/m (i.e., f_X(π) ≤ C_m).

See [3] and [4] for the proof of Lemma 2 and more details about the CVS problem.

Kronecker Lemma. We omit the proof of the following lemma.

Lemma 3 (Kronecker). Let {c_k}_{k=1}^∞ be a sequence of real numbers and {b_k}_{k=1}^∞ an increasing sequence of positive real numbers such that b_k → ∞ (k → ∞). If

∑_{k=1}^{∞} c_k / b_k < ∞ ,

then

lim_{n→∞} (1/b_n) ∑_{k=1}^{n} c_k = 0 .
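A quick numerical illustration of Lemma 3 (our own example, not from the paper): with c_k = (−1)^k and b_k = k, the series ∑ c_k/b_k is the alternating harmonic series, which converges, so the averaged partial sums (1/b_n) ∑_{k≤n} c_k must shrink to 0.

```python
n = 100000
c = [(-1) ** k for k in range(1, n + 1)]   # c_k = (-1)^k
b = list(range(1, n + 1))                  # b_k = k, increasing to infinity
# sum(c_k / b_k) converges (alternating harmonic series), so by the
# Kronecker lemma (1/b_n) * sum_{k<=n} c_k tends to 0.
avg = sum(c) / b[-1]
print(abs(avg))
```

Here the cancellation is exact for even n, so the average is already 0; for a sequence with slower cancellation the decay to 0 is gradual, as the lemma predicts.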
Chebyshev Inequality. One form of the well-known Chebyshev inequality from probability theory is as follows. Let ξ be a random variable such that P(ξ ≥ 0) = 1, and let g(x) ≥ 0 be a monotonically increasing function.

Lemma 4 ([1], p. 75). For any ε > 0,

P(ξ ≥ ε) ≤ E g(ξ) / g(ε) .

Central Limit Theorem. Let {ξ_n}_{n=1}^∞ be a sequence of independent random variables such that each ξ_k has mean Eξ_k = a_k and variance Dξ_k = σ²_k < ∞. Let Φ(x) be the Gaussian distribution function with parameters (0, 1).

Theorem 2 ([1], p. 153). Let {ξ_n}_{n=1}^∞ satisfy the Lindeberg condition, i.e., for any τ > 0,

lim_{n→∞} ( 1/(σ²_1 + ... + σ²_n) ) ∑_{k=1}^{n} E( (ξ_k − a_k)² ; |ξ_k − a_k| > τ √(σ²_1 + ... + σ²_n) ) = 0 .

Then, as n → ∞,

P( ( ξ_1 + ... + ξ_n − (a_1 + ... + a_n) ) / √(σ²_1 + ... + σ²_n) < x ) → Φ(x)

uniformly in x.
Acknowledgements I express my sincere thanks to Sergey Sevastianov and Alexander Bystrov for very useful conversations concerning the above problem and for their contribution to improving the style of the paper.
References 1. Borovkov, A.A.: Probability Theory. Gordon and Breach, Amsterdam, 1998. 2. Sevastianov, S.V.: Vector Summation in Banach Space and Polynomial Time Algorithms for Flow Shops and Open Shops. Mathematics of Operations Research 20 (1995), 90-103. 3. Sevastianov, S.V.: On a Compact Vector Summation (Russian). Diskretnaya Matematika (Moscow) 3 (3), 66-72. 4. Sevastianov, S.V.: On Some Geometric Methods in Scheduling Theory: a Survey. Discrete Appl. Math. 55 (1994), 59-82.
Global Optimization – Stochastic or Deterministic? Mike C. Bartholomew-Biggs, Steven C. Parkhurst, and Simon P. Wilson Numerical Optimisation Centre, University of Hertfordshire, Hatfield AL10 9AB, England [email protected]
Abstract. Using an aircraft routing problem as a case-study, this paper reports on some practical experience with stochastic and deterministic methods for global optimization. The deterministic method DIRECT [12] is found to be more reliable than the competing techniques THJ [1] and ECTS [2]. Keywords: Global optimization, direct search techniques, deterministic and random search methods, aircraft routing.
1 Introduction This paper deals with the Aircraft Routing Problem which involves finding a flight path from a given origin to a given destination, taking account of obstacles such as geographical features and “no-fly zones” separating incoming and outgoing traffic near an airport. In military terms, the obstacles might also include regions around a threat, such as a radar or missile site. In practice, routing problems will usually include manoeuvrability limits together with constraints on rendezvous times. Further refinements are possible: for instance, in the military context, an optimum route could take account of “visibility”, exploiting the terrain to hide the aircraft as much as possible. A more sophisticated form of routing problem could be posed for multiple aircraft, possibly of various types, flying different missions in the same geographical area. More discussion about practical aspects of aircraft routing is given in [10] and [11] in which a heuristic routing algorithm is described. This algorithm seeks to minimise a “route cost” made up of elements such as distance, fuel usage and measures of exposure to threats and proximity to obstacles. The approach in [10] can broadly be described as a “genetic” algorithm which builds up routes in a step-by-step fashion. New trial branches are formed which “fan-out” from the current position and those which seem more promising are held in a list of candidate routes. These candidates are also subjected to randomly chosen modifications such as the addition or deletion of extra turning points. If such modifications lead to an improved route they are retained; otherwise they are discarded. Convergence of this approach has not been studied; and in practice it is expected to provide good feasible routes rather than optimal ones. This, however, is regarded as sufficient for many situations.
This work has been supported by BAESYSTEMS, Rochester, England.
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 125–137, 2003. © Springer-Verlag Berlin Heidelberg 2003
In this paper we shall consider approaches which are more closely related to classical nonlinear optimization. We consider a problem formulation leading to an optimization calculation which can be tackled by general-purpose methods. This problem has an objective function that may not be (easily) differentiable and which turns out to admit several local minima. Therefore we shall need to employ a direct-search global optimization method. A number of such techniques have been proposed, and the main purpose of this paper is to assess their suitability for use in software for real-life representations of the routing problem.
2 A Simplified Route Model (SRM) We shall consider first the problem of finding the the ground-plan (flat earth) of an optimum route which avoids regions that we shall simply refer to as “threats”. A route will be defined by its (given) start and end points and by a number of intermediate waypoints. The co-ordinates of these waypoints will be our optimization variables; and we assume that the flight path follows straight lines between waypoints. A simple definition of an optimal route is one in which the distance flown is as short as possible, subject to suitable avoidance of the threats. Therefore, for any choice of waypoints, we must calculate first the overall route-length L and then determine how much of the route lies within any of the threats. If the j-th leg of a route passes through the i-th threat, remaining inside it for a distance l ji , say, then the “cost” of that route C can be expressed as C = L + ∑ ρi ∑ l 3ji i
(1)
j
where ρ_i is a penalty parameter associated with the i-th threat. The choice of 3 as the exponent in the penalty term ensures that C has continuous first derivatives at the boundary of circular threats (as explained in [15]): this is not the case with the more familiar squared penalty term. Our aim will be to choose the waypoints so as to minimize C, hence taking into account both the need to reduce the flight path length and to respect the threats. The balance between these factors will depend on the choice of the parameters ρ_i. If the i-th threat is a geographical feature then ρ_i should be large; but if threat i represents some risky but not impossible region then a more moderate value of ρ_i would allow a shorter route to make an acceptably brief incursion into some danger area.

In some of the examples considered later, we will represent the threats as circles. This means that the calculation of path lengths l_ji lying inside each threat could be done analytically. However, in order to be able to deal with more realistic threats with non-smooth boundaries we shall, in practice, calculate the lengths l_ji by a sampling and interpolation method. When threats have irregular boundaries (e.g., when they are part of the terrain) then it will not be possible to compute derivatives of l_ji with respect to the waypoints.

In what follows, we augment (1) by considering two more factors which increase the realism of our examples. In practice we cannot permit routes to make arbitrarily sharp turns; and furthermore we might want to impose some minimum separation on the waypoints. Hence, if φ_j denotes the angle between stages j and j+1 and if l_j is the length of stage j, we can include extra penalty terms in C, to give

C = ∑_{j=1}^{n+1} ( l_j + ∑_{i=1}^{m} ρ_i l³_ji ) + ∑_{j=1}^{n−1} ν(φ_max − φ_j)²₋ + ∑_{j=1}^{n+1} μ(l_j − l_min)²₋ .  (2)
Here µ and ν are penalty parameters and the subscript “− ” indicates that the expression in brackets is regarded as having the value zero unless it is negative. (In contrast to the threat-violation term, we do retain continuity of derivatives of C by handling the stage-length and turn angle constraints with a standard quadratic-loss term.) Numerical experiments have shown that (2) can admit several local minima. This is not unexpected: in practical terms it means that, once we have found a “good” route which passes on one side of a threat there may be no continuously improving sequence of perturbations of that route which result in a “better” one passing on the other side of the same threat. It follows that we need to use optimization methods which actively seek global, rather than local, optima. Moreover, since (2) may not be differentiable when threats are non-circular we must consider direct-search optimization techniques. The two-dimensional SRM just outlined can easily be extended to three dimensions. Threats can be modelled by cylinders or by sections of spheres or ellipsoids and climb angle constraints can be included in the same way as limits on turning angle (see [15]). Time considerations can also be included so that waypoints are of the form (x, y, z,t).
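A minimal sketch of the cost (1) for circular threats, estimating each l_ji by sampling along the leg in the spirit of the paper's sampling-and-interpolation approach. The function name, argument layout and the uniform sampling rule are our assumptions; the turn-angle and separation penalties of (2) are omitted.

```python
import math

def route_cost(waypoints, threats, rho, samples=200):
    """Total route length plus cubic penalties for distance flown
    inside each circular threat (a sketch of cost (1) only).

    waypoints -- [(x, y), ...] including the start and end points,
    threats   -- [(cx, cy, r), ...] circular threats,
    rho       -- one penalty parameter rho_i per threat.
    """
    cost = 0.0
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        leg = math.hypot(x1 - x0, y1 - y0)
        cost += leg
        for (cx, cy, r), rho_i in zip(threats, rho):
            # fraction of sampled points on this leg lying inside the threat
            inside = sum(
                math.hypot(x0 + (x1 - x0) * t / samples - cx,
                           y0 + (y1 - y0) * t / samples - cy) < r
                for t in range(samples + 1))
            l_ji = leg * inside / (samples + 1)
            cost += rho_i * l_ji ** 3
    return cost
```

With no threats (or threats far from the path) the cost reduces to the route length, and a large rho_i makes any incursion into threat i dominate the cost, as the discussion above intends.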
3 A Realistic Route Model (RRM) As mentioned in the introduction, practical aircraft routing problems will involve more constraints than those appearing in the SRM described above. Moreover, optimality will probably involve more than just the minimization of distance travelled. Later in this paper we shall consider some experience with a comprehensive route cost function [16] which includes many more factors than (2). These relate, for instance, to exposure to military threats and constraints on fuel usage. Threats and no-fly zones are still modelled by simple shapes like circles and squares, but terrain information from a database is used to calculate penalties for flying too close to physical obstacles (and sometimes for being too visible). Precise details of the RRM cost function (such as relative weights on different components and how these components are evaluated) cannot be given here, but it can be assumed that it reflects current operational practice in the defence industry.
4 Direct Search Optimization Methods In this section we consider some direct search methods for the general unconstrained minimization problem Min f (x)
where x ∈ R^n. (In practice the search will usually be confined to some region defined by simple bounds on the elements of x.) The methods we discuss are quite recent proposals from the literature, representing alternatives to the popular genetic algorithm approaches (such as [10]). One is obtained by adding randomization and tabu search ideas to an existing deterministic local search method; another is based on a (controlled) random exploration; and the third is a deterministic algorithm which systematically searches the region of interest in a quite efficient way, on the basis of information gathered about the objective.

4.1 Tabu Search Hooke and Jeeves (THJ) This algorithm is described in [1] and is obtained by adding tabu search ideas to the Hooke and Jeeves algorithm for unconstrained optimization. These basic ideas can be described individually.

The Hooke and Jeeves Method. The Hooke and Jeeves algorithm [9] consists of two major sections – exploratory moves and pattern moves. Exploration in the neighbourhood of the current solution estimate x^(k), say, uses moves with fixed step length κ along coordinate axis directions. If a move finds a better point x⁺ (i.e. one where f(x⁺) < f(x^(k))) then it is termed successful and further exploratory moves are based upon x⁺. If a move is unsuccessful then a step of length −κ is tried. If some explorations are successful and yield the overall result x⁺⁺ then the new solution estimate for the next exploration cycle involves calculation of a pattern step x_pat = x⁺⁺ + (x⁺⁺ − x^(k)). The new base point is given by

x^(k+1) = x_pat if f(x_pat) < f(x⁺⁺) ;  x^(k+1) = x⁺⁺ otherwise.
If all the explorations fail then a new exploration cycle is started about x^(k) with κ halved. The algorithm terminates when κ becomes less than a pre-set tolerance δ.

Tabu Search. Tabu searches can be used to guide local techniques (like the Hooke and Jeeves method) away from local minima and hence assist in the search for the global optimum. The basic approach is due to Glover [7], [8] and is based on keeping a tabu list of forbidden search directions. Each iteration involves a move to a new point which lies in the restricted region which can be reached only by non-tabu steps. The tabu list could, for instance, be based on the most recent t moves in order to prevent recently explored regions from being revisited. Glover gives a comprehensive discussion of techniques that can be used to create and update tabu lists. He considers refinements such as the use of "aspiration levels" to allow tabu lists to be sometimes over-ridden – e.g., if a tabu move turns out to yield an appreciable reduction in function value. Moreover, he notes that the careful use of probabilistic ideas in the choice of moves can introduce a useful element of randomness which provides what he calls an "escape hatch" from a too systematic set of exploration rules. Tabu search ideas have mainly been used in optimization problems involving discrete-valued variables; but they may also be applied to global optimization of functions of continuous variables.

Tabu Search Hooke and Jeeves (THJ). The algorithm in [1] – like that in [9] – uses both exploration and pattern phases. In order to seek global, rather than local, optima it (a) makes use of exploratory search directions that are randomly chosen rather than being parallel to the axes; (b) performs a one-dimensional global minimization along each such direction; and (c) uses tabu search ideas to prevent iterates returning to a neighbourhood of a local optimum that has already been sampled. Specifically, the k-th iteration of the algorithm performs m cycles, each involving r global minimizations along randomly generated directions. The first cycle explores random directions away from x^(k); and if z₁ is the point with lowest function value after r global searches then the second cycle is centred upon z₁ and yields a further point z₂, and so on. At the end of m cycles the algorithm does a pattern move to give x^(k+1) = z_m + λ(z_m − x^(k)), where λ is found by another global line search. The generation and use of random search directions is modified by use of a "tabu list". This records the t most recently used exploration directions; and a new random direction is rejected if it is the negative of a member of the current tabu list. This is done to prevent the exploration returning to a region it has just visited. (In fact, this simple tabu list idea can be over-ridden if it seems beneficial to do so; see [1].)

4.2 ECTS The acronym ECTS denotes Enhanced Continuous Tabu Search. The algorithm is described in [2] and [14]; and it resembles THJ both in its use of tabu search ideas and its reliance on randomly generated points.
ECTS begins with a randomly generated initial guess for a solution x^(1). In general, iteration k starts with a diversification phase. This involves a set of hyperspheres with radii h₁, h₂, ..., h_η centred upon x^(k) (as in [14]). Neighbouring points x₁, x₂, ..., x_η are then generated in the "shells" between the hyperspheres. This choice of neighbours is largely random, but also involves a tabu list containing the last t solution estimates. No neighbouring point of x^(k) can lie within a distance d_t of a point in the tabu list. The neighbour x_ĵ, say, which has the lowest function value becomes the next iterate x^(k+1), even if f(x_ĵ) > f(x^(k)). If in fact f(x_ĵ) ≫ f(x^(k)) then x^(k) is considered to be "promising", since our sampling of the neighbourhood suggests it could be close to a local minimum. It is therefore added to a "promising list" – provided it is not within a certain distance d_p of an existing member of the list. If a specified number of diversification stages fails to yield a new promising point, ECTS proceeds to an intensification stage centred upon the element in the promising list with least function value. Intensification is essentially the same process as diversification except that it is carried out within smaller hyperspheres.

4.3 DIRECT DIRECT is a deterministic approach which, at first sight, resembles an exhaustive search. It is described in [12], and a noteworthy feature is the way it uses Lipschitz constant arguments to decide, at each iteration, which regions of the solution space are worth exploring. DIRECT begins with a given "hyperbox" defined by its centre point, c₀, where the objective function is f₀ = f(c₀), and by an n-vector of displacements s₀. Thus the initial hyperbox covers the range (c₀ᵢ ± s₀ᵢ) for i = 1, ..., n. This search region is then split into smaller hyperboxes by repeated subdivision, as described below. For each hyperbox j (= 1, ..., J) we have a centre c_j (with function value f_j) and a vector of semi-sidelengths s_j. Hyperboxes are grouped according to size, δ_j, measured as the distance from centre to any corner. We shall suppose that among the J hyperboxes there are only K_J ≤ J different size values. When the subdivision procedure is applied to an existing hyperbox characterized by (c_j, f_j, s_j, δ_j) it only shrinks the longest edges. If there is a unique longest edge then DIRECT replaces the existing box j by three new ones, constructed by trisecting the appropriate side. (If several edges of hyperbox j all have the same "longest" length, then the trisection process is repeated for each of them.) At each iteration of DIRECT, some of the current hyperboxes j = 1, ..., J are selected for further subdivision. Note that we need only examine K_J of the hyperboxes – i.e. for each of the different δ_j-sized candidates we just consider the one with the smallest f-value at the centre. The aim is to explore the search region efficiently by computing extra function values only in regions which seem "potentially optimal".
Hence DIRECT checks, for each hyperbox j, whether there exists any Lipschitz constant such that box j could contain a lower function value than any other box. If this is not the case then box j is not judged to be worth further subdivision (on the current iteration). This selection process in DIRECT can be likened to the use of a tabu list, except that it works to predict regions to be avoided rather than working on the basis of not re-visiting previously explored regions.

It is worth mentioning that we have found it beneficial to run DIRECT in restart mode (see [3], [4]). This means that, instead of performing I consecutive iterations, we stop the process after I/2 iterations and begin again in a new hyperbox centred on the best point found so far. It often happens that performing 2 × I/2 (or even 4 × I/4) iterations in this way will lead to a better point (and at lower computational cost) than I straight iterations. There are two factors that may explain the advantages of the restart approach. Firstly, it reduces the number of candidate boxes that have to be considered over the whole I iterations; and secondly, by re-centering the search on a good estimate of the solution, it is likely to reduce the number of potentially optimal boxes that are identified after the first (and subsequent) restarts.

Variants of DIRECT. The algorithm outlined above will be referred to as the basic version of DIRECT. A number of modifications have been proposed. Gablonsky [6]
Global Optimization – Stochastic or Deterministic?
131
suggests that it is beneficial on some iterations to use aggressive searches in which the potential optimality test is omitted and all KJ candidate hyperboxes are subdivided. DIRECT-1 is a version of the algorithm that uses an aggressive search if 50n hyperbox subdivisions have taken place without a significant reduction in the best value of f found so far. DIRECT-1 also puts a lower limit on box size, so that boxes with δ j < 10−3 are not subdivided any further. Both DIRECT and DIRECT-1 may subdivide potentially optimal boxes by trisecting along several edges. DIRECT-2 follows a suggestion by Jones [13] that it is more efficient to trisect along only one edge and is therefore less expensive per iteration (although it may need more iterations).
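The restart mode discussed above can be pictured as a thin wrapper around a DIRECT run. The `direct_search` interface below is a hypothetical stand-in for one run of the algorithm, not the authors' code:

```python
# Hedged sketch of restart mode: run short DIRECT bursts, each re-centred on
# the best point found so far, instead of one long run.
def direct_with_restarts(direct_search, f, centre, widths, iters, bursts):
    """direct_search(f, centre, widths, iters) -> (x, fx) is an assumed
    interface for a single DIRECT run over the given hyperbox."""
    best_x, best_f = list(centre), f(centre)
    for _ in range(bursts):
        x, fx = direct_search(f, best_x, widths, iters)
        if fx < best_f:                 # keep the incumbent if no improvement
            best_x, best_f = x, fx
    return best_x, best_f
```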
5 Numerical Experience

In this section we report some comparative trials between THJ, ECTS and the variants of DIRECT. The first group of tests involves problems with artificial circular threats, based on the SRM cost function; the second set deals with the RRM cost function and involves genuine terrain data and realistic operating constraints. Full details of these (and other) numerical experiments are given in [15]; here we summarise the main conclusions.

5.1 SRM Examples

In this section we consider two scenarios, in both of which the same set of ten circular threats has to be avoided. Five waypoints are allowed, which means that our optimization problem has ten variables. There are three locally optimal solutions, illustrated in Figure 1. Note that the best and second-best routes have nearly the same length (about 37.4 and 37.8 km, respectively). The initial guessed solution for scenario 1 has the waypoints equally spaced along the straight line from the departure point to the destination. (Routes like this are easily constructed automatically and could be useful, in practice, for providing initial guessed solutions.) The results quoted in Table 1 are for a classical sequential unconstrained minimization approach to (2) in which C is (approximately) minimized for a sequence of increasing values of the penalty parameters ρ, µ and ν. The tables show the number of minimization cycles (nc) needed to obtain a solution where the total route-length inside the constraints is less than 0.1 km. They also give the corresponding value of C and the total number of function evaluations needed (nf). In Case 1 no constraints are imposed on turn angle and stage-length; Case 2 involves a lower limit on stage length of 1 km. The introduction of the stage-length limit does not have a very substantial effect on results, and in both cases DIRECT-1 and DIRECT come closest to the global optimum. The other methods have done no better than find the second-best local solution.
Scenario 2 also has three local solutions, shown in Figure 2. Comparative results for the various global optimization methods are shown in Table 2. Case 1 involves no constraints on stage-length or turn angle; Case 2 prohibits turn angles of more than 31° and stage-lengths of less than 1 km. The starting guessed values for the waypoints are equi-spaced along the straight line between the prescribed end-points of the mission.
Fig. 1. Global and local optimum routes for SRM scenario 1.

Table 1. Algorithm performance on SRM scenario 1.

         THJ              ECTS             DIRECT           DIRECT-1         DIRECT-2
         nc  C*    nf     nc  C*    nf     nc  C*    nf     nc  C*    nf     nc  C*    nf
Case 1   4   37.7  13692  5   37.8  23218  5   37.5  25082  6   37.4  22356  7   37.7  13843
Case 2   4   37.7  13692  4   37.7  15645  4   37.5  23137  6   37.4  22364  5   37.7  12260
Table 2 shows that, in Case 1, THJ is trapped in the locally optimal Route 2 while the best route found by ECTS does not seem to be optimal at all. DIRECT appears to be homing in on Route 2 but has not yet found it. Only DIRECT-1 and DIRECT-2 are close to the global solution. In Case 2, THJ obtains a route which is feasible with respect to the threats but does not satisfy the turn angle constraint. ECTS does not even find a feasible path through the threats. The solutions returned by DIRECT and DIRECT-1 are feasible but they are in the vicinity of the locally optimal Route 2. Only DIRECT-2 provides the correct global solution. Broad conclusions from these and other more extensive results in [4] are that THJ often outperforms ECTS and is sometimes competitive with DIRECT. However the deterministic approach in DIRECT proves to be the most reliable in that it regularly provides a feasible and (at least) locally optimal route while the stochastic methods can sometimes fail on both counts.
Fig. 2. Global and local optimum routes for SRM scenario 2.

Table 2. Algorithm performance on SRM scenario 2.

         THJ              ECTS               DIRECT           DIRECT-1         DIRECT-2
         nc  C*    nf     nc  C*    nf      nc  C*    nf     nc  C*    nf     nc  C*    nf
Case 1   4   41.8  13712  5   49.8   24181  4   42.3  19819  6   40.9  23338  7   40.3  16443
Case 2   4   44.4  13707  fails - see text  4   45.2  20717  5   45.2  16599  5   41.7  15377
5.2 RRM Examples

A second set of problems involves the calculation of more realistic routes using authentic terrain data in addition to circular threats and no-fly zones. A large number of trials of THJ, ECTS and DIRECT have been carried out using this data with four different scenarios, and a full discussion can be found in [15]. To give an initial idea of the problems we show, in Figures 3, 4 and 5, the best routes obtained by DIRECT, THJ and ECTS for RRM scenario 1, in which eight waypoints are to be optimized. Clearly DIRECT is able to find a much better solution for RRM scenario 1 than either THJ or ECTS. This particular result was obtained after several trial runs of DIRECT, using different search hyperbox sizes and different numbers of iterations between re-starts. However, while it did not take many attempts before we had "tuned" DIRECT to obtain a good route, we were never able to adjust parameters in THJ or ECTS to make them give better answers than the ones we have illustrated.
Fig. 3. Best route found by DIRECT for RRM scenario 1.

Table 3. Algorithm performance on RRM scenarios 1–4.

             THJ            ECTS           DIRECT         OR
             C*     nf      C*     nf      C*     nf      C*     nf
Scenario 1   4260   23198   5200   33198   578    9292    570    -
Scenario 2   10775  52960   5473   33118   1114   17399   1120   -
Scenario 3   8065   39107   3441   65118   576    48762   1555   -
Scenario 4   863    21901   931    17718   814    12430   878    -
Table 3 shows the optimal values of the route cost function and the numbers of function evaluations needed for scenarios 1-4. In this table we also show the best value of route cost function obtained by the genetic algorithm of Hewitt et al [10], [11] – denoted by OR. (Unfortunately the number of function calls needed by this algorithm
has not been reported.)

Fig. 4. Best route found by THJ for RRM scenario 1.

In each case we can see that – even though many tests of THJ and ECTS were carried out [15] – DIRECT and OR consistently return the best results. Furthermore, DIRECT is usually more economical. Performance of DIRECT on scenario 3 is particularly noteworthy, since it finds a solution considerably better than anything obtained over many trials of the OR algorithm [16]. In this case, DIRECT has been able to position the waypoints so that the route flies along a narrow valley between the threats which was missed by all the other route-finding approaches.

5.3 Discussion of Results

On the basis of the numerical results reported above we can make a number of observations. At least one of the variants of DIRECT has found – or at least come close to – the global solution of all the problems. Unfortunately there is no particular version that regularly outperforms the others. Overall, however, the purely deterministic approach which underlies DIRECT and its variants seems to provide more consistent behaviour than does the purely random sampling method used in ECTS. On the SRM problems, at least, THJ seems more reliable than ECTS; and it is interesting to observe that THJ can be
viewed as a "hybrid" which uses randomly generated search directions along with a deterministic method for one-dimensional global minimization.

Fig. 5. Best route found by ECTS for RRM scenario 1.

From a user's point of view, THJ and ECTS are handicapped by having a large number of user-defined parameters, while DIRECT has relatively few. This may partly explain why, on the RRM examples, it was reasonably easy to obtain good results from DIRECT while many trial runs of THJ and ECTS failed to yield comparable solutions. (Of course, we can make the obvious remark that all the methods are likely to be sensitive to the values selected for maximum numbers of iterations or function evaluations. The chances of success for a search for a global minimum are bound to increase if the exploration is made more exhaustive.)
6 Further Comments and Conclusions

In this paper we have given a summary of evidence that led to the choice of DIRECT as the preferred global optimization method in a study of methods for aircraft routing [15]. Further experience, reported in [3], [5], [15], shows that DIRECT performs well when applied to routing problems that are more complex than those considered here. In particular, DIRECT has successfully been used to find routes for groups of aircraft in both two and three dimensions. Such problems are, of course, subject to constraints on separation distance between aircraft: hence different members of the group
have to make turns at different points and sometimes take paths on different sides of some obstacles. Multi-ship problems can be formulated in a number of ways. One possibility is to seek optimal routes for all aircraft simultaneously: but this means that, for a ten-waypoint problem for three aircraft in three-dimensional space, DIRECT would have to deal with 90 variables. Jones [13] does not recommend the use of DIRECT for such a large problem; we have therefore found that it is more efficient to compute routes for each aircraft separately. Of course, this is done in such a way that the computation of each route takes account of the positions of all previous aircraft [5], [15].
References

1. K.S. Al-Sultan and M.A. Al-Fawzan, A Tabu Search Hooke and Jeeves Algorithm for Unconstrained Optimization, European Journal of OR, 103, 198-208, 1997.
2. R. Chelouah and P. Siarry, Tabu Search Applied to Global Optimization, European Journal of OR, 123, 256-270, 2000.
3. M.C. Bartholomew-Biggs, S.C. Parkhurst and S.P. Wilson, Using DIRECT to solve an aircraft routing problem, Computational Optimization and Applications, 21, 311-323, 2002.
4. M.C. Bartholomew-Biggs, S.C. Parkhurst and S.P. Wilson, Global optimization approaches to an aircraft routing problem, European Journal of OR, 144, 417-431, 2003.
5. M.C. Bartholomew-Biggs, S.C. Parkhurst and S.P. Wilson, A routing problem involving multiple aircraft, Numerical Optimization Centre Technical Report, University of Hertfordshire, 2003.
6. J.M. Gablonsky, An Implementation of the DIRECT Algorithm, Tech. Report CRSC-TR98-29, Center for Research in Scientific Computation, North Carolina State University, Raleigh, NC, 1998.
7. F. Glover, Tabu Search - Part I, ORSA Journal on Computing, 1, 190-206, 1989.
8. F. Glover, Tabu Search - Part II, ORSA Journal on Computing, 2, 1, 4-31, 1990.
9. R. Hooke and T.A. Jeeves, Direct Search Solution of Numerical and Statistical Problems, J.A.C.M., 8, 212-229, 1961.
10. C. Hewitt and S.A. Broatch, A Tactical Navigation and Routeing System for Low-level Flight, Technical Report, GEC-Marconi Avionics, Rochester, Kent, U.K. (AGARD, Italy, 1992).
11. C. Hewitt and P. Martin, Advanced Mission Management, Technical Report, GEC-Marconi Avionics, Rochester, Kent, U.K. (IEE - FITEC, 1998).
12. D.R. Jones, C.D. Perttunen and B.E. Stuckman, Lipschitzian Optimization without the Lipschitz Constant, Journal of Optimization Theory and Applications, 79, 157-181, 1993.
13. D.R. Jones, The DIRECT Global Optimization Algorithm, Encyclopedia of Optimization, Kluwer Academic Publishers, Boston, 2002.
14. P. Siarry and G. Berthau, Fitting of Tabu Search to Optimize Functions of Continuous Variables, Int. J. Num. Meth. Eng., 40, 2449-2457, 1997.
15. S.P. Wilson, Aircraft routing using global nonlinear optimization, PhD Thesis, University of Hertfordshire, May 2003.
16. BAE SYSTEMS Rochester, Private communication.
Two-Component Traffic Modelled by Cellular Automata: Imposing Passing Restrictions on Slow Vehicles Increases the Flow Paul Baalham and Ole Steuernagel Dept. of Physical Sciences, University of Hertfordshire, College Lane, Hatfield, AL10 9AB, UK [email protected]
Abstract. We use a computer-based cellular automaton to study two-component flow, mimicking (fast) passenger and (slow) cargo vehicles, on a circular unidirectional two-lane highway without on-ramps and exits. The global flow rates for different overall densities and mixing ratios between fast and slow cars are determined. We study two main scenarios: two-component traffic without passing restriction (uncontrolled flow) and traffic in which slow vehicles are prohibited from passing (controlled flow). We find that controlling the flow should considerably increase multi-lane highway capacity. Keywords: Traffic flow, cellular automata, environmental regulation.
1 Introduction

Modelling road traffic behaviour using cellular automata has become a well-established method to model [1–3], analyze [1, 3–5], understand [1–6] and even forecast [3, 7] the behaviour of real road traffic [7, 8]. A well-established and popular cellular automaton model is due to Nagel and Schreckenberg [1, 3]. It is known to be 'minimal' [6] in the sense of containing just the necessary rules to simulate realistic phenomena such as the spontaneous formation of jams on busy roads [6], throttling of traffic flow on busy roads [8], and the spontaneous emergence of density waves [7, 9]. Although exact analytical results for this [1] and related systems [7] are typically not available [3, 4, 7, 10], the automata's evolution rules are simple, straightforward to understand, computationally efficient and sufficient to emulate much of the behaviour of observed traffic flow [8]. Cellular automaton traffic simulations of the Nagel-Schreckenberg type have thus proven useful and popular [1–7, 11]. Here, we present results of such a simulation for two-component traffic on a two-lane highway which is closed into a loop without on-ramps and exits [11]: our computational model is defined by a two-dimensional array (one row per lane) of L sites (position on a road of length L). This setup was chosen for its simplicity. Each site is either occupied by one vehicle or empty. We assume the two-component traffic to consist of faster cars and slower (transport) vehicles with different

A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 138–145, 2003. © Springer-Verlag Berlin Heidelberg 2003
attainable maximal speeds vmax,F > vmax,S. This is the only characteristic that distinguishes the two types of vehicles. For simplicity, we do not include other realistic effects, such as effects due to different vehicle lengths, widths, and their different acceleration rates, for instance. After a presentation of our automaton's evolution rules in section 2 we reproduce the fundamental diagrams known from previous works [1–7, 11] in section 3. In section 4 we study traffic flow under a passing restriction for slow vehicles, which yields our main result: prohibiting slow vehicles from passing leads to increased flow and hence increased capacity of multi-lane highways. In section 5 we present an outlook on future work and our conclusions.
2 The Evolution Rules

For simplicity, and for the sake of easy comparison with published work, we employ only the four necessary evolution rules laid down by Nagel and Schreckenberg, and simply cite from their work [1]: "Each vehicle has an integer velocity with values between zero and vmax. For an arbitrary configuration, one update of the system consists of the following four consecutive steps, which are performed in parallel for all vehicles:
1) Acceleration: if the velocity v of a vehicle is lower than vmax and if the distance to the next car ahead is larger than v + 1, the speed is advanced by one [v → v + 1].
2) Slowing down (due to other cars): if a vehicle at a site i sees the next vehicle at site i + j (with j ≤ v), it reduces its speed to j − 1 [v → j − 1].
3) Randomization: with probability p, the velocity of each vehicle (if greater than zero) is decreased by one [v → v − 1].
4) Car motion: each vehicle is advanced v sites."
In modelling a highway, we treat unidirectional two-lane traffic; rule '2)' therefore has to be modified by a suitable rule that allows cars to avoid a slower car ahead of them by changing lanes. Accordingly [2], we devise the alternative passing rule:
2) Lane change or slowing down (due to other cars): if a vehicle at a site i sees the next vehicle at site i + j (with j ≤ v), it changes lanes or, if blocked by a third car at site k in the other lane whose speed exceeds its distance (vk > i − k), reduces its speed to j − 1 [v → j − 1].
Note that this passing rule describes considerate drivers: the lane changing step is only executed if cars do not force approaching traffic to slow down. We chose this implementation of the passing rule in order to avoid unrealistically confrontational lane changing behaviour. An inconsiderate lane changing behaviour would, moreover, unduly exaggerate the possible flow improvements due to imposing passing restrictions on slow traffic that we are studying here.
The above rules imply equal probabilities for lane changes; we therefore arrive at a symmetric distribution of traffic across both lanes if no flow control is imposed (for lane-population inverting mechanisms see [2, 3]).
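For concreteness, the quoted single-lane rules 1)–4) can be sketched in a few lines of Python (our own illustration; the two-lane passing rule and its considerate-driver condition are omitted for brevity):

```python
import random

# Single-lane Nagel-Schreckenberg update sketch for cars on a ring of L cells.
# Assumes at least two cars; `pos` is sorted and positions are distinct.
def nasch_step(pos, vel, L, vmax, p, rng=random):
    n = len(pos)
    new_vel = []
    for i in range(n):
        j = (pos[(i + 1) % n] - pos[i]) % L  # distance to the next car ahead
        v = vel[i]
        if v < vmax and j > v + 1:           # 1) acceleration
            v += 1
        if j <= v:                           # 2) slowing down: v -> j - 1
            v = j - 1
        if v > 0 and rng.random() < p:       # 3) randomization
            v -= 1
        new_vel.append(v)
    new_pos = [(x + v) % L for x, v in zip(pos, new_vel)]  # 4) car motion
    return new_pos, new_vel
```

With p = 0 the update is deterministic, which makes the free-flow behaviour easy to check by hand.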
3 Two-Component Traffic without Passing Restriction

In order to quantify our results we consider the following global quantities:

average speed      V = (1/N) ∑_{i=1}^{N} ∑_{j=S,F} (v_{i,j,near} + v_{i,j,far}) ,    (1)

overall density    ρ = ρS + ρF = (NS + NF)/(2L) ≤ 1 ,                                (2)

and total flow     J = ρV ,                                                          (3)
where the indices S and F refer to slow and fast vehicles, and near and far refer to the two lanes, which also gives rise to the normalization factor 1/(2L) for the flow. The counting index i over the total number of cars N = NS + NF on the modelled road of length L provides us with a global average along the entire road. This circumvents some subtle problems associated with the biases of various local flux measures [7]. Note that our definition of the flow J does not discriminate between fast and slow cars. One could argue that slow cars should have a larger weighting because they tend to transport more material. One could equally argue they should count less since they effectively only move material as opposed to people; we therefore decided to give them equal weight to the fast cars.

3.1 Slow Vehicles Only

It is well known [1, 6, 7] that traffic flow shows two main phases. The first is the homogeneous flow phase for low traffic density ρ, in which the overall flow JH is proportional to traffic density and effective maximum speed. Since the effective maximum speed is given by v̂ = vmax − p, with p the random deceleration probability introduced in evolution rule '3)', we find [6]

JH = ρ(vmax − p) .    (4)
The other phase corresponds to the jammed state with the flow [6]

JJ = (1 − ρ)(1 − p) .    (5)
This expression can be understood as the product of the remaining free road (1 − ρ) [with the tacit assumption that for a very congested road most traffic is trapped in a jam, i.e. ρflowing ≈ (1 − ρ)] and the probability for a vehicle to emerge from the front of a jam, i.e. the drive-off probability 1 − p. The jam thus acts as a continuous reservoir determining the vehicle flux; expression (5) is consequently independent of the average maximal velocity v̂ = vmax − p. Fig. 1 confirms the above flow expressions for JH and JJ; see dotted lines in the plot. It, moreover, shows a comparison between single and two lane highways. In this context, let us emphasize that the normalization constant of the density for the single-lane case is, of course, given by 1/L rather than 1/(2L) as in the two-lane expression (2) and the entire rest of this paper. We kept all other conditions identical and find that in the two-lane case the global flow rate is higher than for one-lane highways because vehicles can somewhat avoid an impasse by lane changes, see Fig. 1.
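Combining the two branches (4) and (5) by taking their minimum gives the familiar triangular fundamental diagram; the `min` combination and the crossing density below are our reading of the two expressions, not stated explicitly in the text:

```python
# Theoretical one-lane flow branches, Eqs. (4) and (5), and their crossing
# point (our own combination of the two expressions, for illustration).
def flow(rho, vmax, p):
    j_h = rho * (vmax - p)         # homogeneous (free-flow) branch, Eq. (4)
    j_j = (1.0 - rho) * (1.0 - p)  # jammed branch, Eq. (5)
    return min(j_h, j_j)

def critical_density(vmax, p):
    # solve rho*(vmax - p) = (1 - rho)*(1 - p) for rho
    return (1.0 - p) / (vmax + 1.0 - 2.0 * p)
```

For vmax = 5 and p = 0.3, for example, the branches cross at ρ = 0.7/5.4 ≈ 0.13.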
Fig. 1. Fundamental diagram plots of the relative flow J as a function of the global road occupation density ρ for one-component traffic (slow vehicles only) with vmax = 5 and random deceleration probabilities p = 0.3 (higher curves) and p = 0.5 (lower curves). Compared are the cases single-lane road 'o' and two-lane road '+'. The dotted lines represent the theoretical curves derived for free flow JH and jammed flow JJ on one-lane highways, see Eqs. (4) and (5).
3.2 Fast Vehicles as Well
The effect of adding faster cars (with vmax = 10) leads to increased flow in the free-flow regime, but once the traffic starts to jam this difference diminishes. This effect is displayed in Fig. 2. It shows the fundamental diagram for two-component traffic with two different mixing ratios between slow and fast vehicles. Only one random deceleration probability (p = 0.3) is employed in this plot. One finds that the homogeneous flow region now shows a new transition. At very low densities the flow is homogeneous for both slow and fast vehicles

JHH = ρF v̂F + ρS v̂S ;    (6)

here, in accordance with equation (4), v̂ = vmax − p is the average maximal velocity. At slightly higher densities the much lower flow regime JJH = ρ v̂S, dominated by the free flow of slow vehicles only, takes over. This is due to the fact that the faster vehicles jam whenever a bottleneck forms because one slow vehicle passes another [2]. It is this very observation on which our work focusses and from which we derived the idea that slow vehicles have to be controlled (have to be prohibited from passing) in order to increase multi-lane highway capacities. We, therefore, now turn our attention to the case of such 'controlled flow'.
Fig. 2. Fundamental diagram plots of the relative flow J as a function of the road occupation density ρ for two-component traffic on a two-lane road with vmax,S = 5 and vmax,F = 10, and random deceleration probability p = 0.3. Displayed are the cases ρF/ρS = 9/1 (i.e. 90% : 10%, top line with data points) and ρF/ρS = 3/2 (i.e. 60% : 40%, red lower line). The two dotted lines describe the flow rates JHH and JJH, see Eq. (6) and text thereafter.
4 Controlled Flow: Slow Vehicles Must Not Pass

We want to compare the cases studied in the previous section with a scenario where slow (cargo) vehicles are not allowed to pass, i.e., are restricted to the 'near' lane. The corresponding rules of our model are therefore modified as follows.

Firstly, the initial distribution of slow vehicles is confined to the near lane. Note that we always define densities ρ with respect to the entire road surface (2L cells), just as in Eq. (2) above. The respective density distributions are ρS = ρS,near = 0, ..., 1/2 and ρS,far = 0. Consequently, the occupancy σS,near = NS,near/L of slow cars in the near lane is their overall density doubled: σS,near = 2ρS = 0, ..., 1; this obviously leads to the upper limit ρS = 1/2. For the fast cars we chose an unbiased lane occupation ratio proportional to the remaining space, that is ρF,near = ρF · (1 − σS,near)/(2 − σS,near) and ρF,far = ρF · 1/(2 − σS,near); since we define densities with respect to the entire road surface, this implies ρF,near + ρF,far = ρF.

Secondly, and more importantly, slow vehicles are not allowed to pass. The 'no-passing rule' is implemented by applying the modified lane-changing version of evolution rule 2) only to the fast vehicles (which are still allowed to change lanes), while the slow vehicles follow the original slowing-down rule 2), thus confining all slow vehicles to the near lane.

The imposition of the no-passing rule for slow vehicles shows a considerable increase of the global flow rate (see Fig. 3, two top curves) as compared to the case of slow vehicles being allowed to pass other vehicles (see Fig. 3, two bottom curves, or the same two curves in Fig. 2). For further quantification, Fig. 4 displays the relative difference of the increase in flow rate, defined as

∆j = J2/J − 1 .    (7)

One can see from Fig. 4 that, with our choice of parameters, the introduction of the no-passing rule for slow vehicles at a mixing ratio NF : NS = 90% : 10% increases the relative flow by up to 55%.
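The initial lane-density bookkeeping for the controlled-flow scenario can be checked with a small helper (our own sketch; all densities are with respect to the full 2L cells, as in Eq. (2)):

```python
# Controlled-flow initial densities: slow cars confined to the near lane,
# fast cars split in proportion to the remaining space in each lane.
def initial_lane_densities(rho_s, rho_f):
    assert 0.0 <= rho_s <= 0.5, "all slow cars must fit into the near lane"
    sigma_near = 2.0 * rho_s  # near-lane occupancy of the slow cars
    rho_f_near = rho_f * (1.0 - sigma_near) / (2.0 - sigma_near)
    rho_f_far = rho_f / (2.0 - sigma_near)
    return rho_f_near, rho_f_far
```

By construction the two fast-car lane densities sum back to ρF, as stated in the text.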
Fig. 3. Same as Fig. 2 (lower two lines; vmax,S = 5 and vmax,F = 10) plus, additionally, scenario '2' with the same parameters but slow vehicles now obeying the no-passing rule (two top lines), thus considerably increasing the flow. Vehicle mixing ratios ρF/ρS = 9/1 (90% : 10%) for black broken lines and ρF/ρS = 3/2 (60% : 40%) for thin red solid lines.
Fig. 4. Relative difference of the flow rates displayed in Fig. 3: ∆j = (J2/J) − 1. Vehicle mixing ratios ρF/ρS = 9/1 (90% : 10%) for the lower broken line and ρF/ρS = 3/2 (60% : 40%) for the top thin red solid line.
Even in the case of a much smaller maximum-speed differential of vmax,F/vmax,S = 10/8 (rather than the 10/5 considered before) we still witness a significant enhancement of the global flow rate, by more than 10%; see Fig. 5.
5 Outlook and Conclusions We expect that modifications of the model presented here should typically amplify the beneficial effects of confining slow (cargo) traffic. Such modifications to our model
Fig. 5. Relative difference of the flow rates ∆j = (J2/J) − 1, just as displayed in Fig. 4, but with a smaller maximum-speed differential, namely vmax,S = 8 and vmax,F = 10. Vehicle mixing ratios ρF/ρS = 9/1 (90% : 10%) for the lower dotted line and ρF/ρS = 3/2 (60% : 40%) for the top thin red solid line. The residual noise due to finite size effects in our simulation is clearly visible in these plots.
could include slower acceleration for the slow and heavy (cargo) vehicles. These are typically also longer than light and fast (passenger) vehicles, thus reducing the available road space. Since it takes a longer time to pass a long vehicle, particularly if another slow and slowly accelerating long vehicle is passing, the inclusion of these two straightforward modifications should further emphasize the beneficial effects of barring slow vehicles from passing. One could also change the composition of the traffic in order to simulate more realistic multi-component traffic, and modify the evolution rules to include technical and psychological effects [6, 7]. In either case we believe the considerable increase in road capacity of multi-lane highways due to the restriction of slow (cargo) vehicles from passing should persist. We conclude that one should seriously consider performing field trials to establish whether slow (cargo) vehicles should always be restricted to the 'near' lane in order to increase the road capacity of multi-lane highways at no extra cost.
Acknowledgment We acknowledge support by a ‘University of Hertfordshire Student Grant’.
References

1. K. Nagel and M. Schreckenberg, J. Phys. I 2, 2221 (1992).
2. K. Nagel, D. E. Wolf, P. Wagner, and P. Simon, Phys. Rev. E 58, 1425 (1998).
3. D. Chowdhury, L. Santen, and A. Schadschneider, Phys. Rep. 329, 199 (2000).
4. A. Schadschneider and M. Schreckenberg, J. Phys. A 26, L679 (1993).
5. M. Schreckenberg, A. Schadschneider, and K. Nagel, Phys. Bl. 52, 460 (1996).
6. M. Schreckenberg, R. Barlović, W. Knospe, and H. Klüpfel, Statistical Physics of Cellular Automata Models for Traffic Flow, pp. 113-126 in: Computational Statistical Physics, Eds. K. H. Hoffmann, M. Schreiber (Springer, Berlin, 2001).
7. D. Helbing, Rev. Mod. Phys. 73, 1067 (2001).
8. B. Kerner and H. Rehborn, Phys. Rev. Lett. 79, 4030 (1997); L. Neubert, L. Santen, A. Schadschneider, and M. Schreckenberg, Phys. Rev. E 60, 6480 (1999).
9. A. D. Mason and A. W. Woods, Phys. Rev. E 55, 2203 (1997).
10. M. Schreckenberg, A. Schadschneider, K. Nagel, and N. Ito, Phys. Rev. E 51, 2939 (1995).
11. E. G. Campari and G. Levi, Eur. Phys. J. B 17, 159 (2000).
Average-Case Complexity of Partial Boolean Functions Alexander Chashkin Faculty of Mechanics and Mathematics, Moscow State University, Moscow, 119992 Russia [email protected]
Abstract. The average-case complexity of partial Boolean functions is considered. For almost all functions it is shown that, up to a multiplicative constant, the average-case complexity does not depend on the size of the function’s domain but depends only on the number of tuples on which the function is equal to unity. Keywords: Partial Boolean function, average-case complexity, straight-line program, randomized straight-line program.
1 Introduction

Boolean circuits are one of the basic computational models for analyzing the complexity of Boolean functions. An important feature of this model is that every circuit on any tuple of arguments executes the same number of elementary operations, i.e., Boolean circuits are used to analyze the worst-case complexity of functions. In some situations, however, the number of elementary operations executed in the worst case fails to be a realistic measure of complexity. Frequently it is more important to know the number of operations averaged over all possible arguments. This characteristic can differ significantly from the number of operations in the worst case. In this work the average-case complexity of partial Boolean functions is considered under the assumption that the tuples in the domain of the functions are uniformly distributed. Functions are computed by straight-line programs with a conditional stop. These programs are a generalization of Boolean circuits and are a natural model of straight-line computations, i.e., computations that involve no conditional branches or indirect addressing but can be terminated under a certain condition before the program has reached its end. Such computations can be described as follows. They are executed by a processor with a memory consisting of separate cells, which are denoted by xi, yj, and z. The cells xi contain the values of the independent variables xi. The cells yj are used to store the results of intermediate computations and are referred to as internal variables. The cell z is used to store the result of program execution and is referred to as the output variable. The processor operation is controlled by a program that consists of a sequence of computational and stop instructions. In unit time, every computational instruction computes the value of a two-place Boolean function, whose arguments are taken from certain memory cells, referred to as the inputs of this instruction.
The computation result is stored in a memory cell, which is referred to as output of this instruction.
This work was supported by the Russian Foundation for Basic Research (project no. 02-0100985) and the program of supporting scientific schools (project no. SSh-1807.2003.01).
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 146–156, 2003. © Springer-Verlag Berlin Heidelberg 2003
Average-Case Complexity of Partial Boolean Functions

Alexander Chashkin
A stop instruction can terminate program execution. Every such instruction is written as Stop(a), where its argument a is xi, yj, or z. If the argument is equal to unity, then the processor terminates program execution. If the argument is equal to zero, then the next instruction of the program is executed. Let x = (x1, . . . , xn) be a tuple of independent Boolean variables. The execution time and the result of a program P on the tuple x are denoted by TP(x) and P(x), respectively. The quantity T(P) = 2^{-n} ∑ TP(x), where the sum is over all binary tuples x of length n, is called the average execution time of P. If an n-place Boolean function f is such that f(x) = P(x) for any binary n-tuple x, then the program P is said to compute f. The quantity T(f) = min T(P), where the minimum is over all programs computing f, is called the average-case complexity of f. A program P that computes f and is such that T(P) = T(f) is called a minimal program. As an example, consider programs P∨ and P& that compute the disjunction and conjunction of four variables x1, . . . , x4, respectively:

P∨:
z = x1 ∨ x2
Stop(z)
z = x3 ∨ x4

P&:
y = ¬(x1 & x2)
z = 0
Stop(y)
z = x3 & x4
The first instruction in P∨ computes the disjunction of x1 and x2 and declares that the computed value is the result of program execution. The second instruction terminates the computation if the disjunction is equal to unity. If x1 ∨ x2 = 0, then the third instruction computes the disjunction of x3 and x4 and declares its value to be the result of program execution. It is easy to see that the second instruction in P∨ terminates the computations on 12 tuples (each of them contains unity at least at one of the first two places), i.e., P∨ executes two instructions on 12 tuples and three instructions on the other four tuples. Consequently,

T(P∨) = (1/16)(2 · 12 + 3 · 4) = 2 1/4.
The first instruction in P& computes the negation of the conjunction of x1 and x2. The second instruction declares zero to be the result of program execution. The third instruction terminates the computations if the value computed by the first instruction is unity. If this is not the case, then the fourth instruction computes the conjunction of x3 and x4 and declares that the computed value is the result of program execution. It is easy to see that P& executes three instructions on 12 tuples and four instructions on the other four tuples. Consequently, T(P&) = 3 1/4.

Note that a straight-line program containing no stop instructions and computing a function other than an independent variable is a usual Boolean circuit over the basis consisting of all two-place Boolean functions. Therefore, the average-case complexity T(f) of any Boolean function f(x1, . . . , xn) that essentially depends on at least two variables is no less than its circuit size L(f).
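The two example programs and their average execution times can be checked mechanically. Below is a small Python sketch (our own illustration, not part of the paper; the instruction encoding is an assumption of this sketch) that simulates straight-line programs with a conditional stop and recomputes T(P∨) = 2 1/4 and T(P&) = 3 1/4.

```python
from itertools import product

def run_program(instructions, x):
    """Execute a straight-line program with conditional stop instructions on
    the input tuple x; return (result, number of executed instructions)."""
    mem = {f"x{i + 1}": v for i, v in enumerate(x)}
    mem["z"] = 0
    steps = 0
    for op in instructions:
        steps += 1
        if op[0] == "stop":              # Stop(a): terminate if cell a holds 1
            if mem[op[1]] == 1:
                break
        else:                            # ("set", target, function, in1, in2)
            _, tgt, fn, a, b = op
            mem[tgt] = fn(mem[a], mem[b])
    return mem["z"], steps

# P_or computes x1 v x2 v x3 v x4; P_and computes x1 & x2 & x3 & x4
# (the first instruction of P_and is y = NOT(x1 & x2), as in the text).
P_or = [
    ("set", "z", lambda a, b: a | b, "x1", "x2"),
    ("stop", "z"),
    ("set", "z", lambda a, b: a | b, "x3", "x4"),
]
P_and = [
    ("set", "y", lambda a, b: 1 - (a & b), "x1", "x2"),
    ("set", "z", lambda a, b: 0, "x1", "x2"),
    ("stop", "y"),
    ("set", "z", lambda a, b: a & b, "x3", "x4"),
]

def avg_time(program, n=4):
    times = [run_program(program, x)[1] for x in product((0, 1), repeat=n)]
    return sum(times) / len(times)

print(avg_time(P_or))   # 2.25 = 2 1/4
print(avg_time(P_and))  # 3.25 = 3 1/4
```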
2 Previous Results

The average-case complexity of Boolean functions was analyzed in [4]–[6]. It was shown in [4] that the average-case complexity of almost all n-place Boolean functions is equal, up to a multiplicative constant, to their circuit size. It was also shown in [4] that for an n-place Boolean function the ratio of its circuit size to its average-case complexity cannot exceed (up to a multiplicative constant) 2^n/n, and there are functions for which this quantity is attained. In [5] asymptotically tight formulas were derived for the complexity of almost all Boolean vector functions whose components increase in number together with the number of their arguments; in particular, it was shown that the average-case complexity of almost every (n, n)-vector function is asymptotically equal to 2^{n-2}. The average-case complexity of computing Boolean functions by straight-line programs with random-number generators was studied in [6]. Reliable programs, which always compute the true value of the function in question, and programs computing the desired value with some probability were considered. In both cases, it was shown that random-number generators do not noticeably reduce the average execution time for a large fraction of the arguments. More exactly, for any Boolean function of n arguments and of circuit size L, there exists a domain on which the ratio of L to the average execution time required by reliable programs does not exceed n. For randomized programs with a probability different from 1, this ratio may be only slightly greater than n. The average-case complexity of some of the simplest symmetric Boolean functions was considered in [7].
3 Partial Boolean Functions: General Case

Let D ⊆ {0,1}^n, let f : D → {0,1}, and let P be a program computing f, i.e., P(x) = f(x) for all x ∈ D. The average execution time of P on the domain D is defined as

TD(P) = |D|^{-1} ∑_{x∈D} TP(x).

The average-case complexity (average execution time) of f on D is defined as TD(f) = min TD(P), where the minimum is over all programs computing f. It is well known [1, 2] that the circuit size L(f) of a partial Boolean function f : D → {0,1}, where D ⊆ {0,1}^n, satisfies the asymptotic equality (as n → ∞)

L(f) ∼ |D| / log2 |D| + O(n).    (1)
Let us show that a similar result holds (up to a multiplicative constant) for the averagecase complexity.
Theorem 1. Let D ⊂ {0,1}^n, |D| > n log n. Then there exist constants c1 and c2 such that:
(i) for almost every partial Boolean function f : D → {0,1},

TD(f) ≥ c1 |D| / log2 |D|;

(ii) for every partial Boolean function f : D → {0,1},

TD(f) ≤ c2 |D| / log2 |D|.
Proof. (i) Let f be a Boolean function of n variables and P be a program computing f. Every binary tuple x ∈ D, regarded as the binary representation of a positive integer, is assigned its index NP,D(x) such that 1 ≤ NP,D(x) ≤ |D| and, for any x, y ∈ D, NP,D(x) < NP,D(y) if TP(x) < TP(y) or if TP(x) = TP(y) and x < y. Let us estimate the number of Boolean functions such that each of them has an average-case complexity not exceeding

|D| / (16 log2 |D|).    (2)
Suppose that f is such a function and P is a program computing f and having an average execution time on D that does not exceed (2). Let x0 ∈ D with NP,D(x0) = |D|/2, and let P′ be an initial subprogram of P that can compute the values of f on all tuples with indices not exceeding NP,D(x0). Then, by the definition of average-case complexity, we have

T(P) = (1/|D|) ∑_{y∈D} TP(y) > (1/|D|) ∑_{y∈D: N(y)>N(x0)} TP(y) ≥ TP(x0)/2.    (3)
Therefore, TP(x0) < 2T(f). Since T(f) does not exceed (2), it is easy to see that

TP(x0) < 2|D| / (16 log2 |D|) = |D| / (8 log2 |D|).    (4)
Every function is uniquely determined by the first TP(x0) instructions of its minimal program P and by the binary vector of length no greater than |D|/2 that consists of the values of f on those arguments from D for which the execution time of P is greater than the execution time of this program on x0. Denote by N0 the number of distinct programs consisting of no more than TP(x0) instructions. Then the number of functions whose average-case complexity does not exceed (2) is bounded from above by N0 · 2^{|D|/2+1}. Let us estimate N0. Any program P is defined by the list of its instructions, each of which is uniquely defined by the following data:
• the type of the instruction – there are two instruction types: computational instructions and stop instructions;
• the two-place Boolean function fi computed by a computational instruction (for stop instructions this information is omitted) – altogether there are 16 different two-place Boolean functions;
• the index of the variable (output or internal) that is the output of a computational instruction (for stop instructions this information is omitted) – if P consists of L instructions, then the total number of internal and output variables does not exceed L and, without loss of generality, we assume that the internal variables are indexed by numbers from 1 to L − 1, and the output variable is assigned the index L;
• the indices of the variables (independent, internal, or output) that are the inputs of instructions – we assume that the independent variables are indexed by numbers from L + 1 to L + n; hence, the total number of pairs of indices is no greater than (L + n)^2.
Therefore, the number N of distinct programs consisting of L instructions satisfies

N ≤ (2 · 16 · L · (L + n)^2)^L ≤ (4(L + n))^{3L}.    (5)

Substituting TP(x0) for L in (5) and taking into account (4), we find that

N0 ≤ (4(TP(x0) + n))^{3TP(x0)} ≤ (4(|D| / (8 log2 |D|) + n))^{3|D|/(8 log2 |D|)} ≤ 2^{3|D|/8}.
Consequently, the number of functions whose average-case complexity does not exceed |D|/(16 log2 |D|) is no greater than

2^{3|D|/8} · 2^{|D|/2+1} = 2^{7|D|/8+1} = o(2^{|D|}).

Thus, the average-case complexity of almost every partial Boolean function defined on D is no less than |D|/(16 log2 |D|). The first inequality in the theorem is proved. The second inequality easily follows from (1).
4 Partial Boolean Functions of a Given Weight

Let D ⊆ {0,1}^n and f : D → {0,1}. Set N1(f) = ∑_{x∈D} f(x) and N0(f) = |D| − N1(f). The number N1(f) is called the weight of f. It is well known [3] that for log2 n ≤ N1(h) ≤ 2^{n-1} the circuit size of a complete n-place Boolean function h satisfies

L(h) ≤ (log2 C(2^n, N1(h)) / log2 log2 C(2^n, N1(h))) (1 + o(1)),    (6)

where C(a, b) denotes the binomial coefficient. An analogous result holds for partial Boolean functions: it is well known that, if N1(f) ≤ |D|/2, the circuit size of almost every partial Boolean function defined on D satisfies

L(f) ≍ log2 C(|D|, N1(f)) / log2 log2 C(|D|, N1(f)) ≍ N1(f) log2(2|D|/N1(f)) / log2 N1(f).    (7)
A comparison of (7) and Theorem 2 proved below shows that, for almost all complete n-place Boolean functions whose weight is a polynomial in n, the average-case and worst-case complexities can differ (up to a multiplicative constant) by a factor of n. Another curious consequence of Theorem 2 is that the average-case complexity of almost every partial Boolean function depends (up to a multiplicative constant) only on the number of tuples on which it equals unity and does not depend on the size of its domain.

Lemma 1. Let D0, D1 ⊆ {0,1}^n and m = ⌈log2 |D1|⌉ + 2. Then there exists a linear operator L : {0,1}^n → {0,1}^m such that

|{(x, y) | x ∈ D0, y ∈ D1, L(x) = L(y)}| ≤ |D0|/2.

Proof. Denote by F(n, m) the set of all linear operators L : {0,1}^n → {0,1}^m. Obviously, |F(n, m)| = 2^{nm}. It is easy to see that for arbitrary distinct tuples x, y from {0,1}^n there are exactly 2^{n-1} linear functions f (with zero free terms) such that f(x) = f(y). Therefore, for each such pair, F(n, m) contains exactly 2^{nm-m} distinct operators L such that L(x) = L(y). It follows that

2^{-nm} ∑_{x∈D0, y∈D1} 2^{nm-m} = 2^{-m}|D0||D1|
is the average, over F(n, m), of the number of pairs (x, y) on which the values of an operator are identical. Consequently, F(n, m) contains an operator L such that L(x) = L(y) on at most

2^{-m}|D0||D1| < |D0||D1| / (2|D1|) = |D0|/2

pairs (x, y), x ∈ D0, y ∈ D1. The lemma is proved.

Lemma 2. Let D ⊂ {0,1}^n, and let a function f : D → {0,1} be such that n log2 n ≤ N1(f) ≤ |D|/4. Then there exists a function h : D → {0,1} such that h ≥ f,

L(h) ≤ 2 log2 C(4N1(f), N1(f)) / log2 log2 C(4N1(f), N1(f)),    (8)

and B = {x ∈ D | h(x) = f(x) = 0} consists of at least N0(f)/2 tuples.

Proof. Set D0 = {x ∈ D | f(x) = 0} and D1 = {x ∈ D | f(x) = 1}. Let m = ⌈log2 |D1|⌉ + 2. Suppose that L is the linear operator from Lemma 1. Define a function g : {0,1}^m → {0,1} by

g(y) = 1 if there exists x ∈ D1 such that y = L(x), and g(y) = 0 otherwise,

and a function h : D → {0,1} by

h(x) = 1 if there exists y ∈ D1 such that L(y) = L(x), and h(x) = 0 otherwise.

It is obvious that N1(g) ≤ N1(f) and h(x) = g(L(x)). Therefore, L(h) ≤ L(g) + L(L). Since m = ⌈log2 N1(f)⌉ + 2 and N1(f) ≥ n log2 n, it follows from (6) that

L(g) ≤ (log2 C(4N1(f), N1(f)) / log2 log2 C(4N1(f), N1(f))) (1 + o(1)),
and the results of [3] imply

L(L) ≤ 2n log2 |D1| / log2 n.

Since

log2 C(4N1(f), N1(f)) / log2 log2 C(4N1(f), N1(f)) > 2n log2 |D1| / log2 n

for N1(f) ≥ n log2 n, we conclude that (8) is true. By the definition of h, it is easy to see that h(x) ≥ f(x) for any x from D. By Lemma 1,

|{(x, y) | x ∈ D0, y ∈ D1, L(x) = L(y)}| ≤ |D0|/2;

therefore, the values of f and h coincide on at least half of the tuples from D0. The lemma is proved.

Theorem 2. Let D ⊆ {0,1}^n and n log2 n ≤ N1(f) ≤ |D|/2. Then there exist constants c3 and c4 such that:
(i) for almost every partial Boolean function f : D → {0,1},

TD(f) ≥ c3 N1(f) / log2 N1(f);

(ii) for every partial Boolean function f : D → {0,1},

TD(f) ≤ c4 N1(f) / log2 N1(f).
Proof. The proof of (i) is much the same as the proof of (i) in the previous theorem. For this reason we prove only (ii). Let f be an arbitrary Boolean function defined on D ⊆ {0,1}^n. If N1(f) > |D|/4, then the desired assertion follows from Theorem 1. Assume that N1(f) ≤ |D|/4. Let D0 be the domain consisting of the tuples on which f is zero and D1 be the domain consisting of the tuples on which f is equal to unity. Applying Lemma 2, we obtain a function h1 : D → {0,1} and the set B1 consisting of all tuples x ∈ D such that h1(x) = f(x) = 0, i.e., B1 = {x ∈ D | h1(x) = f(x) = 0}. Lemma 2 implies that h1 ≥ f and

|B1| ≥ N0(f)/2,    (9)

L(h1) ≤ 2 log2 C(4N1(f), N1(f)) / log2 log2 C(4N1(f), N1(f)).    (10)

Define the sets C1 = {x ∈ D | h1(x) = 1, f(x) = 0} = D0 \ B1 and R1 = {x ∈ D | h1(x) = 1} = D1 ∪ C1, and consider a function f1 defined on R1 and such that f1(x) = f(x) for all x ∈ R1. It is easy to see that f1(x) = 1 ⟺ f(x) = 1 and f1(x) = 0 ⟺ x ∈ C1. Therefore, N1(f1) = N1(f) and N0(f1) = |C1|, and it follows from (9) that

|C1| < |D0|/2.    (11)
Applying Lemma 2 to f1 yields a new function h2 : R1 → {0,1} and the set B2 consisting of all tuples x ∈ R1 such that h2(x) = f1(x) = 0. By Lemma 2, h2 ≥ f1 and

|B2| ≥ N0(f1)/2 = |C1|/2,    (12)

L(h2) ≤ 2 log2 C(4N1(f), N1(f)) / log2 log2 C(4N1(f), N1(f)).    (13)

Define the new sets C2 = {x ∈ R1 | h2(x) = 1, f1(x) = 0} = C1 \ B2 and R2 = {x ∈ R1 | h2(x) = 1} = D1 ∪ C2, and a new function f2 defined on R2 and such that f2(x) = f(x) for all x ∈ R2. Once again, it is easy to see that f2(x) = 1 ⟺ f(x) = 1 and f2(x) = 0 ⟺ x ∈ C2. Therefore, N1(f2) = N1(f) and N0(f2) = |C2|, and (12) implies that

|C2| = |C1| − |B2| < |C1|/2 < |D0|/4.

Set f0 = f, C0 = D0, and R0 = D. The process of generating functions fi−1, hi and domains Bi, Ci, Ri such that

fi−1 : Ri−1 → {0,1}, fi−1(x) = f(x) for all x ∈ Ri−1,
hi : Ri−1 → {0,1}, hi(x) ≥ fi−1(x) for all x ∈ Ri−1,
Bi = {x ∈ Ri−1 | hi(x) = fi−1(x) = 0}, |Bi| ≥ |Ci−1|/2,
Ci = {x ∈ Ri−1 | hi(x) = 1, fi−1(x) = 0}, |Ci| < |Ci−1|/2,
Ri = {x ∈ Ri−1 | hi(x) = 1} = D1 ∪ Ci,    (14)
is continued until a domain Cs appears such that |Cs| < 3|D1|. It is easy to see that

L(hi) ≤ 2 log2 C(4N1(f), N1(f)) / log2 log2 C(4N1(f), N1(f)) ≤ 8N1(f) / log2 N1(f)    (15)

for i = 1, 2, . . . , s. Now we describe a program P computing f on an arbitrary tuple y. That program consists of Boolean circuits computing the functions hi, and its execution on a tuple y from D can be described as follows. First the value of h1(y) is computed. If h1(y) = 0, then the program halts and f(y) = 0 is set. If h1(y) = 1, then the value of h2(y) is computed. If h2(y) = 0, then the program halts and f(y) = 0 is set. If h2(y) = 1, then the value of h3(y) is computed, etc. If hs−1(y) = 1, then f(y) is computed by a program computing a partial function hs defined on a domain whose size is at most 4|D1| = 4N1(f).
Let us estimate the average execution time of P from above. By the definition of Ci and Ri (see (14)), we have

∑_{i=1}^{s−1} |Ri−1| = ∑_{i=1}^{s−1} (|D1| + |Ci−1|) < ∑_{i=1}^{s−1} (|Cs−1|/3 + |D0|/2^{i−1}) < ((s−1)/(3·2^{s−1})) |D0| + 2|D0| ≤ (13/6)|D0|.

Therefore, using (15), we obtain

TD(P) ≤ (1/|D|) ( ∑_{i=1}^{s−1} L(hi)|Ri−1| + L(hs)·4|D1| )
      ≤ (8N1(f) / (|D| log2 N1(f))) ( ∑_{i=1}^{s−1} |Ri−1| + 4|D1| )
      ≤ (8N1(f) / (|D| log2 N1(f))) ( (13/6)|D0| + 4|D1| ) ≤ 24N1(f) / log2 N1(f).

The theorem is proved with c4 = 24.
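The numerical bookkeeping in the last two displays can be replayed exactly. The sketch below (our own check, with the worst-case set sizes substituted) confirms the 13/6 bound on the coefficient of |D0| and the final constant c4 = 24.

```python
from fractions import Fraction

# Worst-case bookkeeping for the sum of |R_{i-1}|: with |C_{i-1}| < |D0|/2^{i-1}
# and |D1| <= |C_{s-1}|/3 <= |D0|/(3*2^{s-1}), the coefficient of |D0| is
#   (s-1)/(3*2^{s-1}) + sum_{i=1}^{s-1} 1/2^{i-1} < 1/6 + 2 = 13/6.
for s in range(2, 80):
    first = Fraction(s - 1, 3 * 2 ** (s - 1))
    geometric = sum(Fraction(1, 2 ** (i - 1)) for i in range(1, s))
    assert first <= Fraction(1, 6)
    assert geometric < 2
    assert first + geometric < Fraction(13, 6)

# Final constant: (13/6)|D0| + 4|D1| <= 3|D| whenever |D1| <= |D|/4,
# so T_D(P) <= 8 * 3 * N1(f)/log2 N1(f), i.e. c4 = 24.
for d1 in (Fraction(0), Fraction(1, 8), Fraction(1, 4)):   # |D1|/|D|
    d0 = 1 - d1                                            # |D0|/|D|
    assert Fraction(13, 6) * d0 + 4 * d1 <= 3
print("bookkeeping verified")
```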
5 Boolean Functions of a Given Weight and Randomized Programs

An instruction a = ξ, where a is an internal or output variable, is called random. A random instruction generates 0 or 1 with equal probabilities. A program containing random instructions is said to be randomized. Randomized programs were discussed in detail in [6]. The execution time of a randomized program P on a tuple x, denoted by tP(x), is the number of instructions executed before the program terminates. It is easy to see that tP(x) is a random variable. The average execution time of P on a tuple x is the quantity TP(x) = M(tP(x)), where M denotes expectation. The average execution time of a randomized program P on a domain D is

TD(P) = |D|^{-1} ∑_{x∈D} TP(x).
A randomized program P is said to compute a partial Boolean function f : D → {0,1} with a reliability of 1 − ε if Pr(P(x) ≠ f(x)) ≤ ε for any x ∈ D. The average execution time for a function f : D → {0,1} is defined as TD^ε(f) = min TD(P), where the minimum is over all programs computing f with a reliability of 1 − ε.
Below is a simple example of a randomized program, which shows that the use of random instructions may reduce the average-case complexity of Boolean functions. Let f be an n-place complete Boolean function whose average-case complexity increases with n, and let P be a minimal program for f. Consider the randomized program P∗ obtained from P by adding three new instructions, of which two are random and one is a stop instruction:

z = ξ
y = ξ
Stop(y)
P.

It is easy to see that P∗ computes f with a reliability of 3/4, and for any tuple x it holds that

TP∗(x) = (1/2) · 3 + (1/2)(TP(x) + 3) ∼ (1/2) TP(x).

Therefore, we have T^{3/4}(f) ≤ T(P∗) ∼ (1/2)T(f). In [6] it was shown that for any n-place Boolean function f such that L(f) ≥ n^4 and for any ε ≤ 1/4, there exists a domain D ⊆ {0,1}^n such that

TD^ε(f) ≥ L(f) / (q · a^{log* q}),    (16)

where q = log2(2^{n+4}/(L(f) log2 L(f))), log* q = k if 0 < log2 log2 . . . log2 q ≤ 1 (the logarithm being iterated k times), and a is
a constant. Theorem 2 above shows that the right-hand side in (16) is close to the maximum possible value. Let f be an n-place complete Boolean function of weight w whose circuit size is maximal among all functions of this weight. It follows from (7) that

L(f) ≍ w log2(2^n/w) / log2 w.

It is easy to see that TD^ε(f) ≤ TD(f) for any D ⊆ {0,1}^n and any ε > 0. Theorem 2 implies that for any domain D ⊆ {0,1}^n

TD(f) ≤ c4 w / log2 w.

Consequently, for f and the domain D appearing in (16) we have

(w log2(2^n/w) / log2 w) / (q · b^{log* q}) ≤ TD^ε(f) ≤ c4 w / log2 w,

where q = log2(2^{n+4}/w) and b is a constant. Up to a multiplicative constant, the left- and right-hand sides of the last inequality differ by a factor of no more than b^{log* n}. It is easy to see that for any positive integer constant k the function b^{log* n} increases more slowly than log2 log2 . . . log2 n (the logarithm being iterated k times) as n → ∞.
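The reliability-3/4 construction P∗ from the beginning of this section can be verified by enumerating its two random bits exactly. The sketch below (our own, reusing the program P& of Section 1 in place of P) confirms that the reliability is exactly 3/4 on every tuple and that the expected time is 3 + TP(x)/2.

```python
from itertools import product

def run_P(x):
    """Program P_& from Section 1 (computes x1 & x2 & x3 & x4);
    returns (value, number of executed instructions)."""
    y = 1 - (x[0] & x[1])        # instruction 1: y = NOT(x1 & x2)
    z = 0                        # instruction 2
    if y == 1:                   # instruction 3: Stop(y)
        return z, 3
    z = x[2] & x[3]              # instruction 4
    return z, 4

def stats_P_star(x):
    """P* prepends  z = xi; y = xi; Stop(y).  Enumerate the two random bits
    exactly and return (reliability, expected execution time) on x."""
    f_x = x[0] & x[1] & x[2] & x[3]
    correct = time = 0
    for z_bit, y_bit in product((0, 1), repeat=2):
        if y_bit == 1:                       # Stop(y) fires after 3 instructions
            correct += int(z_bit == f_x)
            time += 3
        else:                                # fall through into P
            value, t = run_P(x)
            correct += int(value == f_x)
            time += 3 + t
    return correct / 4, time / 4

for x in product((0, 1), repeat=4):
    reliability, expected = stats_P_star(x)
    assert reliability == 0.75                   # exactly 3/4 on every tuple
    assert expected == 3 + run_P(x)[1] / 2       # E[time] = 3 + T_P(x)/2
```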
References

1. A. E. Andreev, On Circuit Complexity of Partial Boolean Functions, Diskret. Mat. 1 (1989), N 4, 36–45 (in Russian).
2. L. A. Sholomov, On the Implementation of Partial Boolean Functions by Circuits, Problemy Kibernetiki (Moscow: Nauka, 1969) 21, 215–226 (in Russian).
3. O. B. Lupanov, An Approach to the Synthesis of Control Systems – The Principle of Local Coding, Problemy Kibernetiki (Moscow: Nauka, 1965) 14, 31–110 (in Russian).
4. A. V. Chashkin, Average Case Complexity for Finite Boolean Functions, Discrete Applied Mathematics 114 (2001), 43–59.
5. A. V. Chashkin, Average Time of Computing Boolean Operators, Operations Research and Discrete Analysis, Ser. 1, 5 (1998), N 1, 88–103 (in Russian).
6. A. V. Chashkin, Computation of Boolean Functions by Randomized Programs, Operations Research and Discrete Analysis, Ser. 1, 4 (1997), N 3, 49–68 (in Russian).
7. A. V. Chashkin, Average Time of Computing Boolean Operators, Discrete Math. and Appl. (2001) 1, 77–82.
Classes of Binary Rational Distributions Closed under Discrete Transformations

Roman Kolpakov

Department of Computer Science, The University of Liverpool, Liverpool L69 7ZF, UK
[email protected]
Abstract. We study the generation of rational probabilities by Boolean functions. A probability a is generated by a set H of probabilities if a is the probability of f (x1 , . . . , xn ) = 1 for some Boolean function f provided that for any i the probability of xi = 1 belongs to H and all the values of x1 , . . . , xn are independent. The closure of the set H is the set of all numbers generated by H. A set of probabilities is called closed if it coincides with its closure. We give an explicit characterization of closures for all sets of rational probabilities. Using this result, we describe all closed and all finitely generated closed sets of rational probabilities. Moreover, we determine the structure of the lattice formed of these sets. Keywords: Stochastic automata, probabilistic transformations.
1 Introduction

Transformations of finite probabilistic distributions are of great importance for the generation of randomness, which plays a vital role in many areas of computer science (see [11]). The notion of a transformer of probabilistic distributions is basic for investigations in this field. One of the most important types of transformers, from both theoretical and practical points of view, is a transformer that issues the value of a random variable ζ0 simulated by the transformer, proceeding from the values of some available random variables ζ1, . . . , ζn. Such a transformer is naturally presented as a function from Ω1 × . . . × Ωn to Ω0, where Ωi is the set of values of the random variable ζi, i = 0, 1, . . . , n. In particular, if the random variables ζ0, ζ1, . . . , ζn are Boolean, i.e., they have only two values, for example 0 and 1, then this transformer is determined by some Boolean function f : {0,1}^n → {0,1}. In this case, if the random variables ζ1, . . . , ζn are independent and the variable ζi is equal to 1 with probability ρi, i = 1, . . . , n, then the variable ζ0 is equal to 1 with the probability

∑_{(σ1,...,σn)∈{0,1}^n} (ρ1)^{σ1} · · · (ρn)^{σn} f(σ1, . . . , σn)    (1)
This work is partially supported by the EPSRC grant GR/R84917/01. On leave from the French-Russian Institute for Informatics and Applied Mathematics of Lomonosov Moscow State University.
A. Albrecht and K. Steinhöfel (Eds.): SAGA 2003, LNCS 2827, pp. 157–166, 2003. © Springer-Verlag Berlin Heidelberg 2003
where (ρ)^σ = ρ if σ = 1, and (ρ)^σ = 1 − ρ if σ = 0. We denote value (1) by P{f(ρ1, . . . , ρn)}. Investigations of the properties of the value P{f(ρ1, . . . , ρn)} were initiated in the middle of the 19th century by the famous mathematician Boole, who revealed some fundamental relations between the values ρ1, . . . , ρn and P{f(ρ1, . . . , ρn)} (see [2]). Transformations of probabilistic distributions in the asynchronous case, when for some values of the input random variables we admit the absence of output, were considered in [1, 5].

In the paper we use the notation N for the set of all natural numbers and (x1, . . . , xn) for the greatest common divisor of the numbers x1, . . . , xn. For a set T of natural numbers and a natural k we denote by T^{>k} the set of all numbers of T which are greater than k. By I(n) we denote the set of all prime divisors of a natural number n. Let H be a set of numbers of the interval (0; 1). We say that a number a of the interval (0; 1) is generated by the set H if there exists a Boolean function f(x1, . . . , xk) such that a = P{f(ρ1, . . . , ρk)} for some ρ1, . . . , ρk ∈ H. We denote by ⟨H⟩ the set of all numbers generated by H and call this set the closure of the set H. Note that if f(x) = x then P{f(ρ)} = ρ for any ρ ∈ (0; 1), so H ⊆ ⟨H⟩. The set H is called closed if ⟨H⟩ = H. We say also that a set A of numbers is generated by the set H if A ⊆ ⟨H⟩. The study of this generation of numbers is an important direction of investigations for the synthesis of transformers of probabilistic distributions. A number of questions on the approximate generation of numbers by sets of single numbers and by some sets of a special kind were considered in [6, 10]. However, the problem of a complete description of the closures of arbitrary sets of numbers is still open. One of the natural approaches to solving this problem is to consider this generation on closed subsets of numbers everywhere dense in the interval (0; 1).
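Value (1) and the notion of generation can be made concrete with exact rational arithmetic. The sketch below (our own illustration, not from the paper) computes P{f(ρ1, . . . , ρk)} by direct enumeration and checks, for instance, that the single probability 1/2 generates 3/8.

```python
from fractions import Fraction
from itertools import product

def prob_one(f, rhos):
    """Value (1): the probability that f(x1,...,xn) = 1 when each xi equals 1
    independently with probability rhos[i] (exact rational arithmetic)."""
    total = Fraction(0)
    for sigma in product((0, 1), repeat=len(rhos)):
        weight = Fraction(1)
        for rho, s in zip(rhos, sigma):
            weight *= rho if s == 1 else 1 - rho
        total += weight * f(*sigma)
    return total

half = Fraction(1, 2)
# f(x1,x2,x3) = x1 & (x2 | x3) shows that 3/8 is generated by H = {1/2}:
assert prob_one(lambda a, b, c: a & (b | c), [half, half, half]) == Fraction(3, 8)
# the identity function generates every element of H, so H is in its closure:
assert prob_one(lambda a: a, [Fraction(2, 3)]) == Fraction(2, 3)
```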
The set of all rational numbers of the interval (0; 1) is the most appropriate example of such subsets. We denote this set by Q(0; 1). For any set Π of different prime numbers we can choose from Q(0; 1) the subset G[Π] of all fractions such that all prime divisors of the denominators of these fractions belong to Π:

G[Π] = { a = m/n | 0 < a < 1, m ∈ N, n ∈ N^{>1}, I(n) ⊆ Π }.
One can easily verify that the set G[Π] is closed, so it presents a simple example of closed subsets of the set Q(0; 1). To our knowledge, the considered generation of rational numbers was first studied in [9], where it was shown that the sets G[{2}] and G[{3}] are generated by the subsets {1/2} and {1/3, 2/3}, respectively. Thus the sets G[{2}] and G[{3}] are finitely generated, i.e., generated by some of their own finite subsets. Further investigations in this field were made in [7, 8]: it was proved that for any finite Π the set G[Π] is finitely generated, and the lattice formed of the sets G[Π] was also described. In [3] a simple criterion is formulated for verifying whether a set G[Π], for arbitrary Π, is generated by a given finite subset. In [4] an explicit description of the closures of arbitrary finite subsets of the set Q(0; 1) is obtained. Here we generalize this result to the case of arbitrary infinite subsets of the set Q(0; 1). Using this generalization, we give a complete description of all closed and all finitely generated closed subsets of the set Q(0; 1). We also determine the structure of the lattice formed of these subsets.
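The closedness of G[Π] can be observed experimentally for small cases: starting from probabilities in G[{2, 3}], every two-place Boolean function again yields a probability whose denominator has all its prime divisors in {2, 3}. A sketch (our own illustration, not from the paper):

```python
from fractions import Fraction
from itertools import product

def prime_divisors(n):
    """I(n): the set of prime divisors of n."""
    d, primes = 2, set()
    while d * d <= n:
        while n % d == 0:
            primes.add(d)
            n //= d
        d += 1
    if n > 1:
        primes.add(n)
    return primes

rho1, rho2 = Fraction(1, 2), Fraction(2, 3)        # both lie in G[{2, 3}]
generated = set()
for table in product((0, 1), repeat=4):            # all 16 two-place functions
    p = sum((rho1 if s1 else 1 - rho1) * (rho2 if s2 else 1 - rho2)
            * table[2 * s1 + s2]
            for s1, s2 in product((0, 1), repeat=2))
    if 0 < p < 1:
        generated.add(p)
        # every generated probability stays inside G[{2, 3}]:
        assert prime_divisors(p.denominator) <= {2, 3}

assert Fraction(1, 3) in generated     # conjunction: (1/2)(2/3) = 1/3
assert Fraction(1, 6) in generated     # NOT(x1) & NOT(x2): (1/2)(1/3) = 1/6
```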
2 Auxiliary Definitions and Results

Natural numbers a1, a2, . . . , ak are called pairwise relatively prime if each of these numbers is relatively prime to any other of them. We call a set of numbers of N^{>1} divisible if it contains less than two numbers or if all its numbers are pairwise relatively prime. We also call a set of natural numbers relatively prime to a natural number n if each of the numbers of this set is relatively prime to n. An empty set is supposed to be relatively prime to any natural number. If a set A of natural numbers is finite, we denote by ∏A the product of all numbers of A; for the empty set we suppose ∏∅ = 1. Note the following obvious fact.

Proposition 1. The value ∏A is relatively prime to any natural number which is relatively prime to the set A.

Let A, B be non-empty divisible sets of natural numbers. We call the set B a divisor of the set A if for any number b of B the set A has a number divisible by b. An empty set is supposed to be a divisor of any divisible set.

Proposition 2. Divisible sets A and B are divisors of one another if and only if A = B.

Proof. Let A and B be divisors of one another. Consider a number a ∈ A. Then B has a number b divisible by a, and A has a number c divisible by b. Since c is divisible by a > 1, the numbers a and c are not relatively prime. As the set A is divisible, we conclude that a = c = b. So A ⊆ B. In the same way we can prove that B ⊆ A. Thus A = B. On the other hand, if A = B then the proposition is obvious.

We note also that any divisor of a divisor of a divisible set is a divisor of this set. Let A1, . . . , As be finite divisible sets. The greatest common divisor (A1, . . . , As) of these sets is the set

{ a | a = (a1, a2, . . . , as) > 1, ai ∈ Ai, i = 1, 2, . . . , s },

which consists of the numbers of N^{>1} that are greatest common divisors of all possible s-tuples formed by picking out one number of each of the sets A1, . . . , As. If at least one of the sets A1, . . . , As is empty, we suppose (A1, . . . , As) = ∅. Note that (A1, . . . , As) is a divisible set which is a divisor of all the sets A1, . . . , As. Moreover, any other divisor of all the sets A1, . . . , As is a divisor of (A1, . . . , As). Greatest common divisors of divisible sets also satisfy the associative property, i.e., for any divisible sets A, B, C we have ((A, B), C) = (A, (B, C)) = (A, B, C). So the introduced notion of greatest common divisor is a correct generalization of the notion of greatest common divisor of natural numbers to the case of divisible sets.

We can also introduce the notion of greatest common divisor for an infinite number of finite divisible sets. Let A1, A2, . . . be finite divisible sets. We define the greatest common divisor (A1, A2, . . .) of these sets as the set

{ a | a > 1, a = (a1, a2, . . .), ai ∈ Ai, i = 1, 2, . . . },

which consists of all the numbers of N^{>1} that are greatest common divisors of infinite samples of numbers of A1, A2, . . . formed by picking out one number of each of the sets. If at least one of the sets A1, A2, . . . is empty, we suppose (A1, A2, . . .) = ∅. Note that (A1, A2, . . .) is a finite divisible set. Note also that the set (A1, A2, . . .) is a divisor of all the sets A1, A2, . . . and any other divisor of all the sets A1, A2, . . . is a divisor of (A1, A2, . . .). So it is a correct generalization of the notion of greatest common divisor to the case of an infinite number of divisible sets. Below we use the following property of the set (A1, A2, . . .).

Lemma 1. Let A1, A2, . . . be an infinite sequence of finite divisible sets. Then for some large enough j the equalities

(A1, A2, . . .) = (A1, . . . , Aj) = (A1, . . . , Aj+1) = (A1, . . . , Aj+2) = . . .

are valid.

Proof. For the sake of convenience we take T1 = A1, Ti = (A1, . . . , Ai) for i ≥ 2, and T = (A1, A2, . . .). Since Ti+1 = (Ti, Ai+1) for any i, the set Ti+1 is a divisor of Ti, so all the sets T2, T3, . . . are divisors of T1. If for some i we have Ti ≠ Ti+1, then all the sets Ti+1, Ti+2, Ti+3, . . . are divisors of Ti which cannot coincide with Ti by Proposition 2. Thus, since T1 has a finite number of divisors, the inequality Ti ≠ Ti+1 can be true only for a finite number of values of i. Therefore, for some large enough j we have Tj = Tj+1 = Tj+2 = . . .. Assume that Tj ≠ T. Since T is a divisor of Tj, by Proposition 2 we then obtain that Tj is not a divisor of T. So Tj has a number a which is not contained in T. According to the definition of the set Tj, we have a = (a1, . . . , aj) for some numbers a1, . . . , aj of the sets A1, . . . , Aj, respectively. If for every k > j the set Ak contains some number ak divisible by a, then a = (a1, . . . , aj, aj+1, . . .) and, therefore, a ∈ T. So there exists some k > j such that no number of the set Ak is divisible by a. Then the number a cannot be contained in Tk, which contradicts the equality Tj = Tk. Thus Tj = T.
If a divisible set A is relatively prime to a natural number n, then all divisors of A are also relatively prime to n. So we can formulate another property of greatest common divisors of divisible sets.

Proposition 3. The greatest common divisor of divisible sets is relatively prime to any number which is relatively prime to at least one of these sets.

We call a set of natural numbers B = {b1, . . . , bt} a multiplicative partition of a finite set A of natural numbers if one can associate with B a partition of the set A into disjoint (possibly improper) subsets A1, . . . , At such that bi = ∏Ai, i = 1, . . . , t. Proposition 1 implies

Proposition 4. Any multiplicative partition of a divisible set consists of pairwise relatively prime numbers which are relatively prime to all numbers relatively prime to the set.
3 Closed Classes in Q(0; 1)

Let t1, t2 be relatively prime natural numbers, and let Π be a non-empty set of different prime numbers relatively prime to t1 and t2. We denote by G[Π; t1 : t2] the following subset of the set G[Π]:

G[Π; t1 : t2] = { m/n | m/n ∈ G[Π], m ≡ 0 (mod t1), m ≡ n (mod t2) }.
Classes of Binary Rational Distributions Closed under Discrete Transformations
161
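As a concrete illustration of this definition, the following sketch tests membership of a fraction in G[Π; t1 : t2]. It works with the reduced denominator, which suffices because, as the text verifies, membership does not depend on the choice of denominator; the helper names are our own.

```python
from fractions import Fraction

def prime_factors(n):
    # I(n): the set of prime divisors of n.
    fs, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            fs.add(d)
            n //= d
        d += 1
    if n > 1:
        fs.add(n)
    return fs

def in_G(q, Pi, t1, t2):
    # Membership test for G[Pi; t1 : t2], for q in Q(0; 1): q lies in
    # G[Pi] (all prime factors of its denominator belong to Pi), its
    # numerator is divisible by t1, and numerator and denominator are
    # congruent modulo t2.
    m, n = q.numerator, q.denominator
    return prime_factors(n) <= set(Pi) and m % t1 == 0 and (m - n) % t2 == 0

# Example: with Pi = {2}, t1 = 3, t2 = 5, the fraction 6/16 = 3/8 qualifies.
assert in_G(Fraction(6, 16), {2}, 3, 5)
assert not in_G(Fraction(7, 16), {2}, 3, 5)
```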
One can note that the definition of G[Π; t1 : t2] depends formally on the choice of the denominator n of fractions m/n. However, this dependence is actually fictitious. To verify it, take another denominator n′ = cn such that I(n′) ⊆ Π. Then I(c) ⊆ Π, so (t1, c) = (t2, c) = 1. Hence the relations m ≡ 0 (mod t1) and m ≡ n (mod t2) hold if and only if the relations cm ≡ 0 (mod t1) and cm ≡ n′ (mod t2) hold respectively. Thus a fraction m/n of G[Π] belongs to G[Π; t1 : t2] independently of the choice of its denominator. It is shown in [4] that any set G[Π; t1 : t2] is nonempty, and all fractions of G[Π; t1 : t2] with a denominator n form an arithmetic progression with the difference t1t2/n.
Let T be a finite divisible set of numbers relatively prime to the set Π. According to Proposition 4, for any subset T′ of the set T the numbers ∏T′ and ∏(T \ T′) are relatively prime and, moreover, they are relatively prime to Π, so there exists the set G[Π; ∏T′ : ∏(T \ T′)]. Therefore, we can consider the following subset of G[Π]:
⋃_{T′ ⊆ T} G[Π; ∏T′ : ∏(T \ T′)],
where the union is over all (including improper) subsets T′ of the set T. We denote this subset by G[Π; T]. In case of T = ∅ we suppose G[Π; ∅] = G[Π]. One can show that any set G[Π; T] is closed.

Lemma 2. For any set Π of different prime numbers and any finite divisible set T of numbers relatively prime to Π the set G[Π; T] is closed.

Proof. To prove the lemma we need to verify that for any numbers ρ1, . . . , ρs of G[Π; T] and any nonconstant Boolean function f(x1, . . . , xs) the value P{f(ρ1, . . . , ρs)} belongs to G[Π; T]. We prove it by induction on s. For s = 1 there exist only two nonconstant Boolean functions of one variable: the identity function fid(x) = x, for which P{fid(ρ1)} = ρ1 ∈ G[Π; T], and the negation function f¬(x) = x̄, for which P{f¬(ρ1)} = 1 − ρ1. Since ρ1 ∈ G[Π; T], so ρ1 ∈ G[Π; ∏T′ : ∏(T \ T′)] for some subset T′ of the set T. Hence 1 − ρ1 ∈ G[Π; ∏(T \ T′) : ∏T′] ⊆ G[Π; T]. Assume now that P{g(ρ1, . . . , ρs−1)} ∈ G[Π; T] for any numbers ρ1, . . . , ρs−1 of G[Π; T] and any nonconstant Boolean function g(x1, . . . , xs−1). Let µ = P{f(ρ1, . . . , ρs)}. Denote by g0(x1, . . . , xs−1) and g1(x1, . . . , xs−1) the functions f(x1, . . . , xs−1, 0) and f(x1, . . . , xs−1, 1), and by µ0 and µ1 the numbers P{g0(ρ1, . . . , ρs−1)} and P{g1(ρ1, . . . , ρs−1)} respectively. Then

µ = (ρs)_0 ∑_{(σ1, . . . , σs−1) ∈ {0,1}^{s−1}} (ρ1)_{σ1} · · · (ρs−1)_{σs−1} g0(σ1, . . . , σs−1)
  + (ρs)_1 ∑_{(σ1, . . . , σs−1) ∈ {0,1}^{s−1}} (ρ1)_{σ1} · · · (ρs−1)_{σs−1} g1(σ1, . . . , σs−1)
  = (1 − ρs)µ0 + ρs µ1.    (2)
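Equality (2) says that the output probability decomposes over the last input as µ = (1 − ρs)µ0 + ρs µ1. The sketch below computes P{f(ρ1, . . . , ρs)} for independent Bernoulli inputs by direct summation and checks this decomposition numerically; it is an illustration of the identity, not the paper's machinery.

```python
from fractions import Fraction
from itertools import product

def P(f, rhos):
    # P{f(rho_1, ..., rho_s)}: probability that f outputs 1 when its
    # i-th input is an independent Bernoulli(rho_i) bit.
    total = Fraction(0)
    for sigma in product((0, 1), repeat=len(rhos)):
        if f(*sigma):
            p = Fraction(1)
            for r, s in zip(rhos, sigma):
                p *= r if s else 1 - r
            total += p
    return total

# Check of equality (2): mu = (1 - rho_s) * mu0 + rho_s * mu1.
f = lambda x, y, z: (x and y) or z          # an arbitrary example function
rhos = [Fraction(1, 3), Fraction(2, 5), Fraction(3, 7)]
mu0 = P(lambda x, y: f(x, y, 0), rhos[:2])
mu1 = P(lambda x, y: f(x, y, 1), rhos[:2])
assert P(f, rhos) == (1 - rhos[2]) * mu0 + rhos[2] * mu1
```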
For i = 0, 1, according to the induction assumption, either gi ≡ const and so µi ∈ {0, 1}, or µi ∈ G[Π; T]. Therefore, without loss of generality µi can be considered as a fraction mi/n where I(n) ⊆ Π and mi ≡ 0 (mod ∏Ti), mi ≡ n (mod ∏(T \ Ti)) for some subset Ti of the set T (we can take Ti = T in case of µi = 0 and Ti = ∅ in case of µi = 1). In a similar way, the number ρs is equal to some fraction m′/n′ where I(n′) ⊆ Π and m′ ≡ 0
162
Roman Kolpakov
(mod ∏T′), m′ ≡ n′ (mod ∏(T \ T′)) for some subset T′ of the set T. Substituting in (2) for µ0, µ1, ρs the corresponding fractions, we obtain

µ = (1 − m′/n′) · m0/n + (m′/n′) · m1/n = ((n′ − m′)m0 + m′m1) / (nn′).    (3)

Denote by T11, T10, T01, T00 the sets T0 ∩ T1, T0 ∩ (T \ T1), (T \ T0) ∩ T1, (T \ T0) ∩ (T \ T1) respectively. For each i = 0, 1 and each j = 0, 1 we define T′ij = Tij ∩ T′, T″ij = Tij ∩ (T \ T′), tij = ∏T′ij, t′ij = ∏T″ij. Observe that the set of all the numbers tij, t′ij where i, j = 0, 1 is a multiplicative partition of the divisible set T. So, according to Proposition 4, all these numbers are pairwise relatively prime.
Let T̂ = T′11 ∪ T′10 ∪ T″11 ∪ T″01. Then T \ T̂ = T′01 ∪ T′00 ∪ T″10 ∪ T″00. Since T0 = T′11 ∪ T″11 ∪ T′10 ∪ T″10, so ∏T0 = t11 t′11 t10 t′10. Hence m0 ≡ 0 (mod t11 t′11 t10 t′10). Therefore m0 is divisible by t11 t10. In an analogous way, n′ − m′ is divisible by t′11 t′01. Thus, taking into account that the numbers t11 t10 and t′11 t′01 are relatively prime, we obtain that (n′ − m′)m0 is divisible by t11 t10 t′11 t′01 = ∏T̂. In an analogous way we can prove that m′m1 is also divisible by ∏T̂. Therefore

(n′ − m′)m0 + m′m1 ≡ 0 (mod ∏T̂).    (4)

Consider also the number

nn′ − ((n′ − m′)m0 + m′m1) = (n′ − m′)(n − m0) + m′(n − m1).    (5)
Since T \ T0 = T′00 ∪ T″00 ∪ T′01 ∪ T″01, so ∏(T \ T0) = t00 t′00 t01 t′01. Hence m0 ≡ n (mod t00 t′00 t01 t′01). Therefore n − m0 is divisible by t00 t01. In an analogous way we can show that n′ − m′ is divisible by t′00 t′10. Since the numbers t00 t01 and t′00 t′10 are relatively prime, we then obtain that (n′ − m′)(n − m0) is divisible by t00 t01 t′00 t′10 = ∏(T \ T̂). In an analogous way it can be proved that m′(n − m1) is also divisible by ∏(T \ T̂). Thus, using equality (5), we have
(n′ − m′)m0 + m′m1 ≡ nn′ (mod ∏(T \ T̂)).    (6)
From relations (3), (4) and (6), taking into account that I(nn′) ⊆ Π, we conclude that µ ∈ G[Π; ∏T̂ : ∏(T \ T̂)] ⊆ G[Π; T].

According to Lemma 2, all sets G[Π; T] form a collection of closed subsets of the set Q(0; 1). We denote this collection by G. In order to characterize the lattice formed by the sets of G, we need to establish a relationship between the inclusion G[Π1; T1] ⊆ G[Π2; T2] and the parameters Π1, Π2, T1, T2 for any two sets G[Π1; T1], G[Π2; T2] of G. For the parameters Π1, Π2 this relationship is obvious.

Proposition 5. Let G[Π1; T], G[Π2; T] be two sets of G. Then G[Π1; T] ⊆ G[Π2; T] if and only if Π1 ⊆ Π2.

For the parameters T1, T2 the relationship is a little more complicated since for different sets T1, T2 the sets G[Π; T1], G[Π; T2] may coincide. The following statement gives an example of such a coincidence.

Proposition 6. Every set G[Π; T] of G coincides with the set G[Π; T^{>2}].
Proof. It is sufficient to show that if T contains the number 2 then G[Π; T] = G[Π; T \ {2}]. Let m/n be an arbitrary fraction of G[Π; T \ {2}]. Then m/n belongs to the set G[Π; ∏T′ : ∏((T \ {2}) \ T′)] for some subset T′ of the set T \ {2}. It implies that m ≡ 0 (mod ∏T′) and n − m ≡ 0 (mod ∏((T \ {2}) \ T′)). Since 2 ∈ T, all numbers of Π are odd, so n is odd. Hence one of the numbers m, n − m is even. Let m be even. Since T is divisible, the set T′ is relatively prime to 2. Then by Proposition 1 the number 2 is relatively prime to ∏T′. Therefore m ≡ 0 (mod ∏(T′ ∪ {2})). Hence m/n ∈ G[Π; ∏(T′ ∪ {2}) : ∏(T \ (T′ ∪ {2}))] ⊆ G[Π; T]. If n − m is even we can show in a similar way that m/n ∈ G[Π; ∏T′ : ∏(T \ T′)] ⊆ G[Π; T]. Thus, G[Π; T \ {2}] ⊆ G[Π; T]. On the other hand, it follows obviously from the definition of the set G[Π; T] that for any subset T′ of the set T the relation G[Π; T] ⊆ G[Π; T′] is valid. So G[Π; T] ⊆ G[Π; T \ {2}].

We use Proposition 6 for proving the following fact.

Proposition 7. Let G[Π; T1], G[Π; T2] be two sets of G. Then G[Π; T1] ⊆ G[Π; T2] if and only if T2^{>2} is a divisor of T1.

Proof. Let T2^{>2} be a divisor of T1. Consider an arbitrary subset T′1 of the set T1. Since every number of T2^{>2} is a divisor of a number of T1, we can split the set T2^{>2} into two disjoint subsets T′2 and T″2 such that all numbers of T′2 are divisors of numbers of T′1 and all numbers of T″2 are divisors of numbers of T1 \ T′1. Thus, all numbers of T′2 are divisors of the number ∏T′1 and all numbers of T″2 are divisors of the number ∏(T1 \ T′1). Since all numbers of T2^{>2} are pairwise relatively prime, we obtain that the numbers ∏T′2 and ∏T″2 are divisors of the numbers ∏T′1 and ∏(T1 \ T′1) respectively. So G[Π; ∏T′1 : ∏(T1 \ T′1)] ⊆ G[Π; ∏T′2 : ∏T″2] ⊆ G[Π; T2^{>2}]. Since T′1 is arbitrary, we have then G[Π; T1] ⊆ G[Π; T2^{>2}]. Hence, applying Proposition 6, we obtain that G[Π; T1] ⊆ G[Π; T2]. Assume now that T2^{>2} is not a divisor of T1. This implies that T2^{>2} contains a number t which is not a divisor of any number of T1.
We show that G[Π; T1] ⊈ G[Π; {t}]. We consider separately three possible cases.
a) Let (∏T1, t) = 1. We choose a natural n such that I(n) ⊆ Π and n > 2∏T1, and consider the fractions ∏T1/n and 2∏T1/n. Since (∏T1, t) = 1 and t > 2, neither ∏T1 nor 2∏T1 is divisible by t. Moreover, the numbers n − ∏T1 and n − 2∏T1 cannot be congruent modulo t, so at least one of them is not divisible by t either. Therefore, either ∏T1/n or 2∏T1/n is not contained in G[Π; {t}]. On the other hand, both these fractions are contained in the subset G[Π; ∏T1 : 1] of the set G[Π; T1].
b) Let (∏T1, t) = a > 1 and let ∏T1 be not divisible by t. We choose a natural n such that I(n) ⊆ Π and n > ∏T1, and consider the fraction ∏T1/n which belongs to the subset G[Π; ∏T1 : 1] of the set G[Π; T1]. If n − ∏T1 is divisible by t then n is divisible by a and, therefore, (n, t) > 1. This contradicts the definition of the set G[Π; T2] which requires the set Π to be relatively prime to all numbers of T2. Thus neither ∏T1 nor n − ∏T1 is divisible by t. Hence ∏T1/n ∉ G[Π; {t}].
c) Let ∏T1 be divisible by t. On the other hand, no number of T1 is divisible by t. So T1 contains at least two numbers t′ and t″ which are not relatively prime to t. We split the set T1 into two disjoint subsets T′1 and T″1 such that t′ ∈ T′1, t″ ∈ T″1, and consider an arbitrary fraction m/n of the subset G[Π; ∏T′1 : ∏T″1] of the set G[Π; T1]. If either m or n − m is divisible by t then n is divisible by either (t, t″) or (t, t′) respectively, and,
therefore, n is not relatively prime to t. Thus we have the same contradiction as in the previous case. So neither m nor n − m is divisible by t. Hence m/n ∉ G[Π; {t}].
Since G[Π; T2] ⊆ G[Π; {t}], the proved relation G[Π; T1] ⊈ G[Π; {t}] implies G[Π; T1] ⊈ G[Π; T2].
Summing up Propositions 5, 6 and 7, we obtain

Proposition 8. Let G[Π1; T1], G[Π2; T2] be two sets of G. Then
1. G[Π1; T1] ⊆ G[Π2; T2] if and only if Π1 ⊆ Π2 and T2^{>2} is a divisor of T1;
2. G[Π1; T1] = G[Π2; T2] if and only if Π1 = Π2 and T1^{>2} = T2^{>2}.

Proof. If Π1 ⊆ Π2 and T2^{>2} is a divisor of T1 then, according to Propositions 5 and 7, we have G[Π1; T1] ⊆ G[Π2; T1] ⊆ G[Π2; T2]. Let now G[Π1; T1] ⊆ G[Π2; T2]. Then, obviously, Π1 ⊆ Π2. Assume that T2^{>2} is not a divisor of T1. Then, according to Proposition 7, we obtain G[Π1; T1] ⊈ G[Π1; T2]. Since G[Π1; T2] = G[Π2; T2] ∩ G[Π1], so G[Π1; T1] ∩ G[Π2; T2] ⊆ G[Π1; T2]. Hence the relation G[Π1; T1] ⊈ G[Π1; T2] implies G[Π1; T1] ⊈ G[Π2; T2]. Thus item 1 is proved. Assume now G[Π1; T1] = G[Π2; T2]. Then, according to item 1, we have that Π1 = Π2 and T1^{>2} is a divisor of T2. So T1^{>2} is a divisor of T2^{>2}. In an analogous way we obtain that T2^{>2} is a divisor of T1^{>2}. Hence by Proposition 2 we have T1^{>2} = T2^{>2}. On the other hand, the equality G[Π1; T1] = G[Π2; T2] follows obviously from the equalities Π1 = Π2 and T1^{>2} = T2^{>2} in virtue of Proposition 6.
Proposition 8 thus determines the inclusion relation between any two sets of G.
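The inclusion criterion of Proposition 8 can be sketched as a small comparator. We read "T′ is a divisor of T" as: every number of T′ divides some number of T, which is the reading used in the proof of Proposition 7 (the formal definition of divisors of divisible sets appears earlier in the paper); the function names are ours.

```python
def divides_set(Ts, Tb):
    # One reading of "Ts is a divisor of Tb" (from the proof of
    # Proposition 7): every number of Ts divides some number of Tb.
    return all(any(t % d == 0 for t in Tb) for d in Ts)

def G_included(Pi1, T1, Pi2, T2):
    # Item 1 of Proposition 8: G[Pi1; T1] ⊆ G[Pi2; T2] iff Pi1 ⊆ Pi2
    # and T2^{>2} is a divisor of T1.
    return set(Pi1) <= set(Pi2) and divides_set({t for t in T2 if t > 2}, T1)

assert G_included({3}, {15}, {3, 7}, {2, 5})      # {5} divides into {15}
assert not G_included({3}, {15}, {3}, {7})        # 7 divides no number of {15}
```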
4 Main Result

The main goal of our work is a description of all closed classes of numbers of the set Q(0; 1). Any of these classes is the closure of some subset of the class (for example, of the class itself). So, to solve our problem, it is enough to describe the closures of all possible subsets of the set Q(0; 1). Without loss of generality we consider numbers of the set Q(0; 1) as fractions in their lowest terms.
Let H = {m1/n1, . . . , ms/ns} be an arbitrary finite set of fractions of Q(0; 1) in their lowest terms. We denote

T(H) = ({m1, n1 − m1}^{>1}, . . . , {ms, ns − ms}^{>1}) if s ≥ 2, and T(H) = {m1, n1 − m1}^{>1} if s = 1.

In [4] the following description of the set H̄ is obtained.

Theorem 1. H̄ = G[I(n1) ∪ . . . ∪ I(ns); T(H)^{>2}].

If H = {m1/n1, m2/n2, . . .} is an infinite set of fractions of Q(0; 1) in their lowest terms then we denote by T(H) the set ({m1, n1 − m1}^{>1}, {m2, n2 − m2}^{>1}, . . .).
Since the fractions of the set H are in their lowest terms, for any i = 1, 2, . . . the set {mi, ni − mi}^{>1} is divisible and relatively prime to ni. Hence T(H) is a finite divisible set which is relatively prime to all the numbers n1, n2, . . . by Proposition 3. So
T(H) is relatively prime to any number of ⋃_{i=1}^{∞} I(ni). Thus we can consider the set G[⋃_{i=1}^{∞} I(ni); T(H)^{>2}]. The following theorem is a generalization of Theorem 1 to the case of closures of infinite subsets of the set Q(0; 1).

Theorem 2. Let H = {m1/n1, m2/n2, . . .} be an infinite set of fractions of Q(0; 1) in their lowest terms. Then H̄ = G[⋃_{i=1}^{∞} I(ni); T(H)^{>2}].

Proof. For the sake of convenience we denote the set ⋃_{i=1}^{∞} I(ni) by Π. For any fraction mi/ni of H the set T(H)^{>2} is split obviously into two subsets T′i, T″i of numbers which are divisors of the numbers mi and ni − mi respectively. Since all numbers of T(H) are pairwise relatively prime, the numbers mi and ni − mi are divisible by the numbers ∏T′i and ∏T″i respectively. So mi/ni ∈ G[I(ni); ∏T′i : ∏T″i] ⊆ G[Π; T(H)^{>2}]. Thus H ⊆ G[Π; T(H)^{>2}], and by Lemma 2 the set G[Π; T(H)^{>2}] is closed. Therefore, H̄ ⊆ G[Π; T(H)^{>2}]. Consider now an arbitrary fraction m/n of G[Π; T(H)^{>2}]. Denote by Hi, i = 1, 2, . . ., the subset {m1/n1, . . . , mi/ni} of the set H. By Lemma 1 there exists some j such that T(H) = T(Hj) = T(Hj+1) = . . ., and so T(H)^{>2} = T(Hj)^{>2} = T(Hj+1)^{>2} = . . .. Since I(n) ⊆ Π, for any p of I(n) we can find some jp such that p ∈ I(njp). Let j* = max(j, max_{p∈I(n)} jp) and Π* = ⋃_{i=1}^{j*} I(ni). By Theorem 1 we have H̄j* = G[Π*; T(Hj*)^{>2}] = G[Π*; T(H)^{>2}]. Since I(n) ⊆ Π*, so m/n ∈ G[Π*; T(H)^{>2}]. Hence m/n ∈ H̄j* ⊆ H̄. Thus G[Π; T(H)^{>2}] ⊆ H̄.

Since the set Q(0; 1) is countable, Theorems 1 and 2 imply that any closed set of numbers of Q(0; 1) is an element of the set G. So, taking into account Lemma 2, we obtain

Corollary 1. G is the set of all closed subsets of the set Q(0; 1).

Corollary 1, together with Proposition 8, gives us a complete description of the lattice of all closed classes of numbers of the set Q(0; 1).
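Theorem 1 can be exercised numerically. The sketch below computes T(H) for a finite set H of reduced fractions, reading {m, n − m}^{>1} as the elements of {m, n − m} exceeding 1 and reading the parenthesized combination, for s ≥ 2, as the gcd set used in Lemma 1; both readings are our interpretation of the notation, and the helper names are ours.

```python
from functools import reduce
from itertools import product
from math import gcd

def gcd_set(*sets):
    # (A1, ..., As): the set of gcds over all tuples, one element per set;
    # empty if some set is empty, per the convention in the text.
    return {reduce(gcd, t) for t in product(*sets)}

def T_of(H):
    # T(H) for a finite set H of reduced fractions given as (mi, ni) pairs.
    parts = [{x for x in (m, n - m) if x > 1} for m, n in H]
    return parts[0] if len(parts) == 1 else gcd_set(*parts)

# By Theorem 1, the closure of H = {3/8, 9/16} is
# G[I(8) ∪ I(16); T(H)^{>2}] = G[{2}; T_gt2].
H = [(3, 8), (9, 16)]
T = T_of(H)
T_gt2 = {t for t in T if t > 2}
```

For this H the parts are {3, 5} and {9, 7}, whose gcd set is {1, 3}, so T_gt2 = {3}.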
5 Finitely Generated Closed Classes in Q(0; 1)

Among closed classes of numbers, the classes generated by finite sets of numbers are the most important in practice. So the problem of describing all finitely generated closed subsets of the set Q(0; 1) is of great interest. We denote by Gfin the subset of G which consists of all sets G[Π; T] such that Π is finite. It follows from Theorem 1 that all finitely generated closed subsets of the set Q(0; 1) are elements of the set Gfin. We can show that the converse is also valid.

Theorem 3. Any set contained in Gfin is finitely generated.

Proof. Let G[Π; T] be an arbitrary set of Gfin. According to Proposition 6, without loss of generality we can suppose that T = T^{>2}. Since G[Π; T] is a countable set of rational numbers, we can represent G[Π; T] as an infinite sequence {m1/n1, m2/n2, . . . , mi/ni, . . .} of fractions in their lowest terms. Denote this sequence by H. Since, obviously, ⋃_{i=1}^{∞} I(ni) =
Π, by Theorem 2 we have H̄ = G[Π; T(H)^{>2}]. By Lemma 2 the set G[Π; T] is closed, so H̄ = G[Π; T]. Thus G[Π; T] = G[Π; T(H)^{>2}]. This equality implies T = T(H)^{>2} in virtue of item 2 of Proposition 8. Denote by Hi, i = 1, 2, . . ., the set {m1/n1, . . . , mi/ni}. According to Lemma 1, there exists some j such that T(H) = T(Hj) = T(Hj+1) = . . ., and so T = T(H)^{>2} = T(Hj)^{>2} = T(Hj+1)^{>2} = . . .. Since Π = ⋃_{i=1}^{∞} I(ni) and Π is finite, for some large enough s we have Π = ⋃_{i=1}^{s} I(ni). Let k = max(j, s). Note that T = T(Hk)^{>2} and ⋃_{i=1}^{k} I(ni) = Π, so by Theorem 1 we obtain H̄k = G[Π; T].

Corollary 2. Gfin is the set of all finitely generated closed subsets of the set Q(0; 1).
6 Conclusions

The binary probabilistic distributions considered in this paper are a particular case of finite probabilistic distributions. So a natural direction for further investigation is the generalization of our results to the case of arbitrary (including non-binary) finite probabilistic distributions with rational values of probabilities. This would allow us to construct universal transformers of finite probabilistic distributions. The study of closed classes of numbers in sets wider than Q(0; 1), in particular classes of algebraic numbers of the interval (0; 1), is another interesting field of research.
References

1. P. Elias. The efficient construction of an unbiased random sequence. The Annals of Mathematical Statistics, 43(3):865–870, 1972.
2. T. Hailperin. Boole's Logic and Probability: A Critical Exposition from the Standpoint of Contemporary Algebra, Logic and Probability Theory. Studies in Logic and the Foundations of Mathematics, vol. 85. North-Holland, Amsterdam, 1976.
3. R. Kolpakov. Criterion of generativeness of sets of rational probabilities by a class of Boolean functions. English translation in Discrete Applied Mathematics (to appear).
4. R. Kolpakov. On transformations of Boolean random variables with rational-valued distributions. Preprint 7, Liapunov French-Russian Institute of Applied Mathematics and Informatics, Moscow, 2001. Available at http://liapunov.inria.msu.ru/publications.html.
5. J. von Neumann. Various techniques used in connection with random digits. Monte Carlo Method, Applied Mathematics Series, no. 12, pp. 36–38, 1951.
6. N. Nurmeev. On Boolean functions with variables having random values. In Proceedings of the VIII USSR Conference "Problems of Theoretical Cybernetics", 2:59–60, Gorky, 1988.
7. F. Salimov. To the question of modelling Boolean random variables by functions of logic algebra. Verojatnostnye metody i kibernetika, 15:68–89, Kazan: Kazan State University, 1979.
8. F. Salimov. On one family of distribution algebras. Izvestija vuzov. Ser. Matematika, 7:64–72, 1988.
9. R. Skhirtladze. On the synthesis of p-circuits of contacts with random discrete states. Soobschenija AN GrSSR, 26(2):181–186, 1961.
10. R. Skhirtladze. On a method of constructing a Boolean random variable with a given probability distribution. Diskretnyj analiz, 7:71–80, Novosibirsk, 1966.
11. A. Srinivasan and D. Zuckerman. Computing with very weak random sources. SIAM Journal on Computing, 28(4):1433–1459, 1999.