SHORTEST CONNECTIVITY
COMBINATORIAL OPTIMIZATION
VOLUME 17

Through monographs and contributed works the objective of the series is to publish state of the art expository research covering all topics in the field of combinatorial optimization. In addition, the series will include books which are suitable for graduate level courses in computer science, engineering, business, applied mathematics, and operations research. Combinatorial (or discrete) optimization problems arise in various applications, including communications network design, VLSI design, machine vision, airline crew scheduling, corporate planning, computer-aided design and manufacturing, database query design, cellular telephone frequency assignment, constraint directed reasoning, and computational biology. The topics of the books will cover complexity analysis and algorithm design (parallel and serial), computational experiments and applications in science and engineering.
Series Editors
Ding-Zhu Du, University of Minnesota
Panos M. Pardalos, University of Florida

Advisory Editorial Board
Alfonso Ferreira, CNRS-LIP ENS London
Jun Gu, University of Calgary
David S. Johnson, AT&T Research
James B. Orlin, M.I.T.
Christos H. Papadimitriou, University of California at Berkeley
Fred S. Roberts, Rutgers University
Paul Spirakis, Computer Tech Institute (CTI)
SHORTEST CONNECTIVITY An Introduction with Applications in Phylogeny
DIETMAR CIESLIK Ernst-Moritz-Arndt University, Greifswald, Germany Massey University, Palmerston North, New Zealand
Springer
Library of Congress Cataloging-in-Publication Data A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 0-387-23538-8
e-ISBN 0-387-23539-6
Printed on acid-free paper.
© 2005 Springer Science+Business Media, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed in the United States of America.
9 8 7 6 5 4 3 2 1
springeronline.com
SPIN 11336228
CONTENTS

PREFACE

1 TWO CLASSICAL OPTIMIZATION PROBLEMS
  1.1 The Fermat-Torricelli point
  1.2 Minimum Spanning Trees

2 GAUSS' QUESTION
  2.1-2.5 Gauss' question and their conversion to Steiner's Problem; Examples and Exercises; References; A first analysis of Steiner's Problem; Steiner's Problem in graphs

3 WHAT DOES SOLUTION MEAN?
  3.1 A metaphysical approach
  3.2 Does a solution exist?
  3.3 Does an algorithm exist?
  3.4 Does an efficient algorithm exist?
  3.5 Does an approximation exist?

4 NETWORK DESIGN PROBLEMS
  4.1 An overview of applications
  4.2 Several variants

5 A NEW CHALLENGE: THE PHYLOGENY
  5.1 Phylogenetic Trees
  5.2 Phylogenetic Spaces
  5.3 Applications and related questions

6 AN ANALYSIS OF STEINER'S PROBLEM IN PHYLOGENETIC SPACES
  6.1 Difficulties
  6.2 More about trees
  6.3 Cluster Analysis
  6.4 Spanning Trees
  6.5 Counting the elements in discrete metric spaces
  6.6 Fermat's Problem in several discrete metric spaces

7 TREE BUILDING ALGORITHMS
  7.1 Tree building methods - an overview
  7.2 Maximum Parsimony Method
  7.3 The perfect phylogeny problem
  7.4 Pair Group Methods
  7.5 Steinerization
  7.6 Handling more than one tree

REFERENCES
INDEX
PREFACE
The problem of "Shortest Connectivity" has a long and convoluted history. Usually, the problem is known as Steiner's Problem and it can be described more precisely in the following way: Given a finite set of points in a metric space, search for a network that connects these points with the shortest possible length. This shortest network must be a tree and is called a Steiner Minimal Tree (SMT). It may contain vertices different from the points which are to be connected. Such points are called Steiner points.

Steiner's Problem seems disarmingly simple, but it is rich with possibilities and difficulties, even in the simplest case, the Euclidean plane. This is one of the reasons that an enormous volume of literature has been published, starting in the seventeenth century and continuing today. Over the years Steiner's Problem has taken on an increasingly important role. More and more real-life problems arise which use Steiner's Problem or one of its relatives as an application, as a subproblem or as a model.

We will discuss the problem of "Shortest Connectivity" as a general approach to investigate real structures in nature. We will see that this involves the identification of a combinatorial structure that requires the smallest number of changes. It is often said that this principle abides by Ockham's razor, according to which the best hypothesis is the one requiring the smallest number of assumptions.¹

At first we will give an overview of Steiner's Problem and its relatives as one of the most interesting optimization problems in the intersection of combinatorics and geometry. In this sense, the present book is an introduction to the theory of "Shortest Connectivity".
We will see that Steiner's Problem is the core of the so-called "Geometric Network Design Problems", where the general problem can be stated as follows: given a configuration of vertices and/or edges, find a network which contains these objects, satisfies some predetermined requirements, and which minimizes a given objective function that depends on several distance measures.

¹ Roughly speaking: Do not increase the number of entities unnecessarily.

Secondly, we will discuss a new challenge, namely to create trees which reflect the phylogeny, which is the evolutionary history of "living entities". For 3.5 billion years, since life on earth began, evolution has created a remarkable variety of organisms. Millions of different species are alive today, while countless have become extinct. To describe the evolution of these species is a fundamental problem that has been of interest at least since Charles Darwin first proposed the theory of evolution more exactly.

Trees are widely used to represent evolutionary relationships. In biology, for example, the dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by a splitting of one population into two or more populations that do not crossbreed, rather than by a mixing of two populations into one. The principle of Maximum Parsimony involves the identification of a combinatorial structure that requires the smallest number of evolutionary changes. Note that here, minimizing the number of assumptions does not mean minimizing the steps of an evolution;² it means that among all possible structures we seek one which satisfies only one, and moreover a natural, condition.

We will consider the problem of reconstruction of phylogenetic trees in our sense of shortest connectivity. To do this we introduce the so-called phylogenetic spaces. These are metric spaces whose points are arbitrary words generated by characters from some alphabet, and the metric measuring "similarity" of the words is generated by a cost measure on the characters. The "central dogma" will be: A phylogenetic tree is an SMT in a suitably chosen phylogenetic space. In any case this topic contains many problems for further research.
The aim in this graduate-level text is to outline the key mathematical concepts that underpin the important questions in applied mathematics. These concepts involve discrete mathematics (particularly graph theory), optimization, computer science, and several ideas in biology.

² Whatever that means!

Acknowledgements. I thank all people who supported my research and gave me helpful advice on how to write this book: A. Dress (Bielefeld), W.M. Fitch (Irvine), P. Gardner (Palmerston North), R. Graham (La Jolla), M.D. Hendy (Palmerston North), K. Huber (Uppsala), A. v. Haeseler (Jülich/Düsseldorf), A.O. Ivanov (Moscow), A. Kemnitz (Braunschweig), V. Moulton (Uppsala), P. Pardalos (Gainesville), D. Penny (Palmerston North), H.J. Prömel (Berlin), J. MacGregor Smith (Amherst), M. Steel (Christchurch), A.A. Tuzhilin (Moscow), D.M. Warme (Alexandria) and J. Wills (Siegen). I thank Tim White (Palmerston North) and my student K. Kruse for proofreading the manuscript. Heidrun G. Kohler (Greifswald) gave many remarks on writing this book in a suitable style. Moreover, I thank my colleagues H.-R. Frieling, W. Girbardt and Passauer for helpful technical support.

I thank the Institute of Fundamental Sciences, Massey University, New Zealand; the von Neumann Institut for Computing, Forschungszentrum Jülich, Germany; and the Allan Wilson Centre for Molecular Evolution and Ecology, Massey University, New Zealand for hosting me during the winter 2001/02, the spring of 2002, and the spring of 2003, respectively.
TWO CLASSICAL OPTIMIZATION PROBLEMS
Scientific or engineering applications usually require the solution of mathematical optimization problems. Such applications span a wide range, from modelling the evolution of species in biology to modelling soap films for grids of wires; from the design of collections of data to the design of heating or air-conditioning systems in buildings; and from the creation of oil and gas pipelines to the creation of communication networks, road and railway lines. These are all network design problems of significant importance and nontrivial complexity. The network topology and design characteristics of these systems are classical examples of optimization problems. The general network design problem is this: for a given configuration of vertices and/or edges, find a network which contains these objects, fulfills some predetermined requirements and minimizes a given objective function. This is quite general and models a wide variety of problems. Two classical optimization problems represent the parsimonious view of the world: the Fermat-Steiner-Weber Problem and the problem of minimum spanning trees.
1.1 THE FERMAT-TORRICELLI POINT

The problem discussed here has a long and strange history; moreover, it has gone by many names. Players from many fields of study have stepped on its stage, and some of them have stumbled. It is usual to credit the Italian mathematicians with proposing and solving the problem: The problem was posed by Fermat early in the 17th century at the end of his book Treatise on Minima and Maxima [159], and was stated as follows: Given three points in the plane, find a fourth point such that the sum of its distances to the three given points is minimal.

The problem seems disarmingly simple, but is so rich in possibilities and traps that it has generated an enormous literature dating back to the seventeenth century, and continues to do so. We will come across these more than once in our considerations.

Around 1640 Torricelli solved this problem: He asserted that, assuming that the given points form a triangle in which all angles are less than 120°, the circles which circumscribe the equilateral triangles constructed on the sides of and outside of the given triangle intersect in the desired point, called the Torricelli point. Note that, in general, the Torricelli point is not one of the well-known points for triangles; it has its own character. Shortly afterwards, in 1647, Cavalieri's Exercitationes Geometricae showed that the three lines joining the Torricelli point to the given points form angles of 120° with each other.

Over the centuries Fermat's Problem was rediscovered and generalized by other mathematicians, and it became well established in the mathematical folklore. A history of Fermat's Problem is given by Boltyanski et al. [48], Scriba and Schreiber [391], and Wesolowsky [455]. In the nineteenth century Steiner studied this problem and generalized it to include an arbitrarily large set of points in the plane. About one hundred years later Courant and Robbins [116] wrote:

"A very simple but instructive problem was treated by Jacob Steiner, the famous representative of geometry at the University of Berlin in the early nineteenth century. Three villages A, B, C are to be joined by a system of roads of minimum total length."¹

¹ But it should be noted what Kuhn, compare [455], said: "Although this very gifted geometer (Steiner) of the 19th century can be counted among the dozens of mathematicians who have written on the subject, he does not seem to have contributed anything new, either to its formulation or its solution."

In other terms, we are interested in

Fermat's Problem
Given: A finite set of points in the Euclidean plane (or in a Euclidean space).
Find: A point such that the sum of the distances to all points of the set is as small as possible.

This point will be called a Torricelli point. Here, the Euclidean plane is the affine plane equipped with the norm $\|\cdot\|$ defined by
$$\|(x, y)\| = \sqrt{x^2 + y^2}. \tag{1.1}$$
Let N be the set of given points. To establish the existence of a Torricelli point we note firstly that the so-called Fermat function $F_N$,
$$F_N(w) = \sum_{v \in N} \|v - w\|, \tag{1.2}$$
which is to be minimized, is continuous, and secondly that we have to search for a Torricelli point only in the convex hull
$$\operatorname{conv} N, \tag{1.3}$$
which is a compact set. This implies that $F_N$ attains a minimum value.²

In the twentieth century, the problem passed to those who claimed there was a use for it. Weber uses in his book Über den Standort der Industrien [449] a weighted three point version of the problem to depict industrial location minimizing transport cost. A mathematical appendix to his book, written by Pick, gives a geometrical construction procedure to find the optimum location, and discusses the conditions under which one of these points is the optimum. We will follow these considerations. Let
$$N = \{(x_i, y_i) : i = 1, \dots, n\} \tag{1.4}$$
be a set of n points of the Euclidean plane. Then the Fermat function is given by
$$f(x, y) = F_N(w) = \sum_{i=1}^{n} \sqrt{(x_i - x)^2 + (y_i - y)^2}, \tag{1.5}$$
where w = (x, y).

² Note that the Torricelli point can be one of the given points.

If we differentiate f and set the partial derivatives equal to zero to obtain the first order conditions for optimality, we have the following observations:
Lemma 1.1.1 Let $f(x, y) = F_N(w)$ be the Fermat function for the set $N = \{(x_i, y_i) : i = 1, \dots, n\}$ of points in the Euclidean plane. Then the following conditions are necessary and sufficient³ for the minimality of f outside the set N itself:
$$\sum_{i=1}^{n} \frac{x_i - x}{\sqrt{(x_i - x)^2 + (y_i - y)^2}} = 0 \tag{1.6}$$
and
$$\sum_{i=1}^{n} \frac{y_i - y}{\sqrt{(x_i - x)^2 + (y_i - y)^2}} = 0, \tag{1.7}$$
where $(x, y) \neq (x_i, y_i)$ for any i = 1, ..., n.
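As a small numerical sketch of the first order conditions, the following Python code evaluates the Fermat function and its two partial derivatives at a point outside N. The function names `fermat_function` and `fermat_gradient` are illustrative choices, not notation from the text.

```python
import math

def fermat_function(points, w):
    """Sum of the Euclidean distances from w to every given point."""
    return sum(math.dist(p, w) for p in points)

def fermat_gradient(points, w):
    """Partial derivatives of the Fermat function at w.

    w must not coincide with a given point, because the function
    is not differentiable there.
    """
    gx = gy = 0.0
    for xi, yi in points:
        d = math.hypot(xi - w[0], yi - w[1])
        if d == 0.0:
            raise ValueError("gradient is undefined at a given point")
        gx -= (xi - w[0]) / d
        gy -= (yi - w[1]) / d
    return gx, gy

# Corners of a square: by symmetry the centre satisfies both conditions.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(fermat_gradient(square, (1.0, 1.0)))
print(fermat_function(square, (1.0, 1.0)))
```

Both partial derivatives vanish at the centre of the square, which is therefore a candidate for the Torricelli point of its four corners.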
Let q be the Torricelli point for $v_1, \dots, v_n$, and assume that q is not one of the given points. Defining the vectors $u_1, \dots, u_n$ by
$$u_i = v_i - q, \tag{1.8}$$
i = 1, ..., n, the equations (1.6) and (1.7) can be written as the vector equality
$$\sum_{i=1}^{n} \frac{u_i}{\|u_i\|} = o, \tag{1.9}$$
where o is the zero vector. The inner product square of this equation is
$$n + 2 \sum_{1 \le i < j \le n} \frac{\langle u_i, u_j \rangle}{\|u_i\| \, \|u_j\|} = 0. \tag{1.10}$$
Here $\frac{\langle u_i, u_j \rangle}{\|u_i\| \, \|u_j\|}$ is the cosine of the angle $\alpha_{ij}$ between the segments from the points $v_i$ and $v_j$ to the Torricelli point q, respectively. Hence,
$$\sum_{1 \le i < j \le n} \cos \alpha_{ij} = -\frac{n}{2}. \tag{1.11}$$

³ Note that $F_N$ is a convex function!

On the other hand, from the vector equation (1.9) we find by inner product multiplication with $u_j / \|u_j\|$ the equation
$$1 + \sum_{i \neq j} \cos \alpha_{ij} = 0 \tag{1.12}$$
for any j = 1, ..., n. Then (1.11) and (1.12) form a system of n + 1 equations for $(n^2 - n)/2$ unknown variables, which can be solved uniquely for n = 3 and with one free parameter for n = 4. In geometric terms this says the following: The segments from three given points to the Torricelli point make angles of 120° with each other, provided that the given points form a triangle in which each angle is less than 120°. For four given points the sum of neighboring angles for the segments from the Torricelli point equals 180°, provided that the given points form a convex quadrilateral. For n > 4 the equations (1.6) and (1.7) cannot, in general, be solved explicitly for (x, y), see Bajaj [26].

These facts are helpful in deciding whether or not a Torricelli point can be constructed with compass and ruler.
(i) The compass construction: Given two points we can use the compass to draw a circle, centered at one of them and passing through the other.
(ii) For any two different points, the ruler can be used to join them with a line segment, which can be extended as far as we like.
Then, we have
Theorem 1.1.2 Let N be a finite set of n points in the Euclidean plane.
(a) If n = 2 then any point in the segment created by the two given points is a Torricelli point for N.
(b) (Torricelli 1646, Cavalieri 1647) Let n = 3. If the convex hull of N forms a triangle in which each angle is less than 120°, then the Torricelli point for $N = \{v_1, v_2, v_3\}$ can be found with the following construction:
1. Find an equilateral triangle drawn along one side, for instance $v_1 v_2$, with the third node v';
2. Construct the circle C circumscribing the equilateral triangle;
3. The Torricelli point is the point where the segment $v' v_3$ intersects the circle C.
Otherwise, if one of the angles is at least 120°, one of the given points is the Torricelli point, namely the point at which this angle is present.
(c) (Fagnano 1775) Let n = 4. If the given points form a convex quadrilateral, then the Torricelli point is the intersection point of the two diagonals.
(d) If n > 4 then no general construction of the Torricelli point with compass and ruler exists.
Figure 1.1 Cavalieri's construction
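The three-point construction can be sketched in Python as follows. This is only an illustration under the assumption that all angles of the triangle are less than 120°; the name `torricelli_point_triangle` and the helper `cross` are hypothetical. The code erects the equilateral triangle on the side $v_1 v_2$ away from $v_3$, takes its circumscribed circle, and intersects that circle with the segment from the new apex to $v_3$.

```python
import math

def cross(o, a, b):
    """Signed area test: positive if b lies to the left of the line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def torricelli_point_triangle(v1, v2, v3):
    """Torricelli point of a triangle whose angles are all below 120 degrees."""
    ex, ey = v2[0] - v1[0], v2[1] - v1[1]
    c60, s60 = 0.5, math.sqrt(3.0) / 2.0
    # The two possible apexes: v2 rotated about v1 by +60 and by -60 degrees.
    candidates = [
        (v1[0] + c60 * ex - s60 * ey, v1[1] + s60 * ex + c60 * ey),
        (v1[0] + c60 * ex + s60 * ey, v1[1] - s60 * ex + c60 * ey),
    ]
    side = cross(v1, v2, v3)
    # Step 1: the apex must lie outside the triangle, i.e. opposite to v3.
    apex = next(p for p in candidates if cross(v1, v2, p) * side < 0)
    # Step 2: centre of the circumscribed circle of the equilateral triangle.
    cx = (v1[0] + v2[0] + apex[0]) / 3.0
    cy = (v1[1] + v2[1] + apex[1]) / 3.0
    # Step 3: second intersection of the line apex -> v3 with that circle.
    dx, dy = v3[0] - apex[0], v3[1] - apex[1]
    t = -2.0 * ((apex[0] - cx) * dx + (apex[1] - cy) * dy) / (dx * dx + dy * dy)
    return (apex[0] + t * dx, apex[1] + t * dy)

print(torricelli_point_triangle((0.0, 0.0), (4.0, 0.0), (1.0, 3.0)))
```

At the returned point the three segments to the given points should meet at angles of 120°, which gives a convenient check of the result.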
Theorem 1.1.2(d) says that Fermat's Problem in the Euclidean plane turns out to be highly intractable: It cannot be solved exactly under models of computation with the four basic arithmetic operations and taking square roots. This leaves us only numerical or symbolic approximation methods. It was Weiszfeld in 1937 who provided a practical method for finding the Torricelli point for a large number of given points. This method is an iteration procedure.

In view of 1.1.1 it is clear that the following is true for a finite set N which contains at least three points and is not collinear⁴:
(a) Palermo [332]: The Torricelli point is uniquely determined.
(b) Kupitz et al. [271], [320]: If the point q is outside of N then the condition
$$\sum_{v \in N} \frac{v - q}{\|v - q\|} = o$$
is sufficient and necessary for q to be the Torricelli point.
(c) Kupitz et al. [271], [320]: If the Torricelli point q is in N then the condition
$$\Big\| \sum_{v \in N \setminus \{q\}} \frac{v - q}{\|v - q\|} \Big\| \le 1$$
holds true.

⁴ Let $N = \{v_1, \dots, v_n\}$ be a collinear set of points appearing in this order on the line. If n is an odd number then $v_{(n+1)/2}$ is the Torricelli point. If n is an even number then any point on the segment $v_{n/2} v_{n/2+1}$ is a Torricelli point.

In effect, the following algorithm attempts to solve the first order conditions written in 1.1.1 iteratively. Weiszfeld asserted that such a sequence converges to the Torricelli point. This assertion has been discussed in Illgen [233], Krarup, Pruzan [268] and Kuhn [270].

Algorithm 1.1.3 (Weiszfeld) Let N be a finite set in the Euclidean plane. Then the following procedure finds a Torricelli point for N iteratively:
(a) If for a point $q \in N$ it holds that
$$\Big\| \sum_{v \in N \setminus \{q\}} \frac{v - q}{\|v - q\|} \Big\| \le 1$$
then q is an exact solution for Fermat's Problem;
(b) Otherwise
1. Choose an error estimate $\varepsilon$;
2. Choose $q^{(0)}$ in conv N;
3. For k = 0, 1, ... do
$$q^{(k+1)} = \frac{\sum_{v \in N} v / \|v - q^{(k)}\|}{\sum_{v \in N} 1 / \|v - q^{(k)}\|}$$
until the desired accuracy $\varepsilon$ is reached.
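A minimal Python sketch of this procedure is given below. Several details are assumptions made for illustration only: the names `unit_vector_sum` and `weiszfeld`, the centroid as starting point (it always lies in the convex hull), and the stopping rule that halts when two successive iterates are closer than `eps`. The given points are assumed to be pairwise distinct.

```python
import math

def unit_vector_sum(points, q):
    """Sum of the unit vectors pointing from q to every given point other than q."""
    sx = sy = 0.0
    for x, y in points:
        d = math.hypot(x - q[0], y - q[1])
        if d > 0.0:
            sx += (x - q[0]) / d
            sy += (y - q[1]) / d
    return sx, sy

def weiszfeld(points, eps=1e-10, max_iter=100_000):
    # Step (a): a given point is optimal if the unit vectors nearly cancel.
    for p in points:
        if math.hypot(*unit_vector_sum(points, p)) <= 1.0:
            return p
    # Step (b): iterate, starting from the centroid (assumed starting point).
    n = len(points)
    q = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for x, y in points:
            d = math.hypot(x - q[0], y - q[1])
            if d == 0.0:
                # The update is undefined when an iterate hits a given point.
                raise ZeroDivisionError("iterate coincides with a given point")
            num_x += x / d
            num_y += y / d
            denom += 1.0 / d
        new_q = (num_x / denom, num_y / denom)
        if math.dist(new_q, q) < eps:
            return new_q
        q = new_q
    return q

print(weiszfeld([(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]))
```

For a triangle with an angle of at least 120° the function already stops in step (a) and returns the corresponding given point.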
Weiszfeld's algorithm is simple. However, its rate of convergence is not very attractive, since the convergence is slow in the vicinity of the given points.⁵ Xue, Wang [467] discuss this observation. A further disadvantage of the Weiszfeld procedure is that it fails if one of the iterated points $q^{(k)}$ falls on a given point; the reason for this is that the Fermat function $F_N$ is non-differentiable there. This problem can be avoided by replacing $F_N$ with a hyperbolic approximation. An example is the following: Define the distance function in the Fermat function by
$$d(v_i, w) = \sqrt{(x_i - x)^2 + (y_i - y)^2 + \eta},$$
where $\eta > 0$ is a very small real number.
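The effect of such a smoothing can be sketched in Python. The code assumes that the approximation adds a small positive constant `eta` under the square root; this specific form, and the names `smoothed_fermat` and `smoothed_weiszfeld_step`, are assumptions for illustration. With this distance the update step stays well defined even when the current iterate is a given point.

```python
import math

def smoothed_fermat(points, w, eta=1e-8):
    """Fermat function with the smoothed distance sqrt(dx^2 + dy^2 + eta)."""
    return sum(math.sqrt((x - w[0]) ** 2 + (y - w[1]) ** 2 + eta)
               for x, y in points)

def smoothed_weiszfeld_step(points, q, eta=1e-8):
    """One fixed-point step for the smoothed function; never divides by zero."""
    num_x = num_y = denom = 0.0
    for x, y in points:
        d = math.sqrt((x - q[0]) ** 2 + (y - q[1]) ** 2 + eta)
        num_x += x / d
        num_y += y / d
        denom += 1.0 / d
    return (num_x / denom, num_y / denom)

triangle = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
# Starting exactly on a given point is harmless for the smoothed step.
print(smoothed_weiszfeld_step(triangle, triangle[0]))
```

Each smoothed distance exceeds the true distance by at most the square root of `eta`, so the smoothed function differs from the Fermat function by at most n times that amount.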
In view of the many contributions to the Fermat problem, its popularity through the ages, and its natural applications to various practical questions, it is hopeless to expect a complete list of the many facets of the problem. Moreover, location analysis, as the theory of the "generalized" Fermat problem, has attracted the attention of researchers from many academic disciplines including many applied fields. This tremendous interest in location modelling is the result of several factors. In the introduction to the first issue of the journal Location Science the editors wrote:

First, location decisions are frequently made at all levels of human organization from individuals and households to firms, governments, and international agencies. Second, such decisions are often strategic in nature; that is, they involve significant capital resources and their economic effects are long term in nature. Third, they frequently impose economic externalities. Such externalities include economic development, as well as pollution and congestion. Fourth, location models are often extremely difficult to solve, at least optimally. Even some of the most basic models are computationally intractable for all but the smallest problem instances. In fact, the computational complexity of location models is a major reason that the widespread interest in formulating and implementing such models did not occur until the advent of high speed digital computers. Finally, location models are application specific. Their structural form, "the objectives, constraints and variables", is determined by the particular location problem under study. Consequently, there does not exist a general location model that is appropriate for all, or even most, applications.

⁵ Drezner et al. [135] give an example which gives the algorithm a very hard time.

It is well-known that solutions of Fermat's problems depend essentially on the way in which the distances in space are determined. Surveys in the form of monographs are given by
1. W. Domschke, A. Drexl: "Logistik: Standorte", 1982, [128].
2. R.F. Love, J.G. Morris, G.O. Wesolowsky: "Facilities Location", 1989, [292].
3. H.W. Hamacher: "Mathematische Lösungsverfahren für planare Standortprobleme", 1995, [206].
4. D. Cieslik: "Steiner Minimal Trees", 1998, [92].
5. V. Boltyanski, H. Martini, V. Soltan: "Geometric Methods and Optimization Problems", 1999, [48].
6. A. Schöbel: "Locating Lines and Hyperplanes", 1999, [384].

There are several collections of works on Fermat's Problem and its relatives: [33], [74], [78], [%I, [134], [149], [151], [224], [234], [250], [272], [285], [345], [422], [455] and [467].

Let N be the set of given points. In applied mathematics the Fermat function
$$F_N(w) = \sum_{v \in N} \|v - w\|$$
is usually called the median function, and a Torricelli point is called a median of N. Also of practical interest is the so-called center function
$$G_N(w) = \max_{v \in N} \|v - w\|, \tag{1.17}$$
which is to be minimized. A solution point is called a center of N. Of course, this is a completely different question, and has other solution strategies. For us it will only be necessary to collect several observations.
Observation 1.1.4 Let N be a finite set of given points. Let $F_N$ and $G_N$ be the median and center function for N, respectively. Then
$$G_N(w) \le F_N(w) \le |N| \cdot G_N(w) \tag{1.18}$$
holds for each point w.

The search for a center can be described in the sense of covering: We consider balls in the plane defined by
$$B_r(w) = \{z : \|z - w\| \le r\}, \tag{1.19}$$
where $r \ge 0$ is a real number and w a point in the plane. The boundary
$$\operatorname{bd} B_r(w) = \{z : \|z - w\| = r\} \tag{1.20}$$
is called a circle (with radius r around the center w). Then consider the following
Problem of minimal covering
Given: A finite set N of points in the plane.
Find: A ball $B_r(w)$ which covers N with minimal radius r.

The circle is uniquely determined and called the minimum covering circle (MCC) of N. This simple question is documented to have been raised already in the middle of the 19th century by Sylvester [420].⁶
Observation 1.1.5 Let N be a finite set of points in the plane. Then the center w of the MCC is the optimal site, and satisfies
$$G_N(w) = \min!$$

⁶ But it is probably older.

The following properties for the Problem of minimal covering are fundamental, compare [206] or [346]:
(i) At least two given points lie on the MCC.
(ii) If there are only two given points on the MCC $\operatorname{bd} B_r(w)$, these form a diameter of length 2r.
(iii) If three or more given points lie on the MCC (which is then fully determined), three among them form an acute triangle.
(iv) Any circle satisfying one of the previous two properties and whose ball covers all given points is the MCC.

Using these facts it is easy to find the MCC for any set of given points. There are several collections of works on the center problem and its relatives: [33], [128], [134], [206], [272], [346] and [384].
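Properties (i) to (iv) suggest a simple, if slow, search: test every circle that has two given points as a diameter and every circle through three given points, and keep the smallest one whose ball covers all points. The following Python sketch does exactly that; the function names and the tolerance `tol` are illustrative assumptions, and faster methods exist.

```python
import math
from itertools import combinations

def circle_from_two(a, b):
    """Circle with the segment ab as a diameter."""
    center = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    return center, math.dist(a, b) / 2.0

def circle_from_three(a, b, c):
    """Circle through three points, or None if they are collinear."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if d == 0.0:
        return None
    sa, sb, sc = a[0]**2 + a[1]**2, b[0]**2 + b[1]**2, c[0]**2 + c[1]**2
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy), math.dist((ux, uy), a)

def covers(circle, points, tol=1e-9):
    center, r = circle
    return all(math.dist(center, p) <= r + tol for p in points)

def minimum_covering_circle(points):
    """Brute-force MCC of a non-empty list of points."""
    if len(points) == 1:
        return points[0], 0.0
    candidates = [circle_from_two(a, b) for a, b in combinations(points, 2)]
    candidates += [c for c in (circle_from_three(*t)
                               for t in combinations(points, 3)) if c]
    feasible = [c for c in candidates if covers(c, points)]
    return min(feasible, key=lambda c: c[1])

print(minimum_covering_circle([(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]))
```

The search inspects on the order of n³ candidate circles and checks each against all n points, so it is only meant for small point sets.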
1.2 MINIMUM SPANNING TREES
The minimum spanning tree problem is one of the most typical and well-known problems of combinatorial optimization; methods for its solution have generated important ideas of modern combinatorics and have played a central role in the design of computer algorithms. The problem is usually stated as follows: Given a weighted (connected) graph, one would then wish to select for construction a set of communication links that would connect all the vertices and have minimal total cost.⁷

First, we have to introduce some basic knowledge about graphs and networks. Graphs are among the most basic of all mathematical structures. Correspondingly, they have many different versions, representations and incarnations. Intuitively speaking, a network is a set of points and a set of connections where each connection joins one point to another and has a certain length. We will describe the combinatorial structure of such a network as a graph G which is defined to be a pair (V, E) where
(a) V is any finite set of elements, called vertices, and
(b) E is a finite family of elements which are unordered pairs of vertices, called edges.

⁷ Weighted means edge-weighted and the total cost is the sum of the cost of the edges in such a network.

The notation $e = \underline{uv}$ means that the edge e joins the vertices u and v. In this case, we say that the vertices u and v are incident to this edge and that u and v are the endvertices of e. Two vertices u and v are called adjacent in the graph G if $\underline{uv}$ is an edge of G.⁸
$N(v) = N_G(v)$ denotes the set of all vertices adjacent to the vertex v and is called the set of all neighbors of v in G. For a vertex v of a graph G the degree $g_G(v)$ is defined as the number of edges which are incident to v. If G has no parallel edges then the cardinality of $N(v) = N_G(v)$ is the degree of the vertex:
$$g_G(v) = |N_G(v)|. \tag{1.21}$$
If we sum up all the vertex degrees in a graph, we count each edge exactly twice, once from each of its endvertices. Thus,

Observation 1.2.1 In any graph G = (V, E) the equality
$$\sum_{v \in V} g_G(v) = 2 \cdot |E| \tag{1.22}$$
holds. In particular, in every graph the number of vertices with odd degree is even.
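The counting argument can be checked on a small example. In the following Python sketch a graph is stored as a vertex list and a list of edges given as pairs of endvertices; this representation and the name `degrees` are choices made for illustration.

```python
from collections import Counter

def degrees(vertices, edges):
    """Degree of every vertex; each edge contributes 1 to both endvertices."""
    deg = Counter({v: 0 for v in vertices})
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
deg = degrees(vertices, edges)

# Sum of degrees versus twice the number of edges.
print(sum(deg.values()), 2 * len(edges))
# Number of vertices with odd degree.
print(sum(1 for d in deg.values() if d % 2 == 1))
```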
A graph is called regular, or more exactly regular of degree g, if each vertex has degree exactly g. In view of 1.2.1 we find

Observation 1.2.2 Let G = (V, E) be a graph which is regular of degree g. Then
$$|E| = \frac{g}{2} \cdot |V|. \tag{1.23}$$

A graph which is regular of degree 1 is called a perfect matching, of degree 2 is a collection of so-called cycles, and of degree 3 is called a cubic graph.

⁸ In any case, we assume that $u \neq v$, i.e. we do not admit loops.
A graph G is said to be a complete graph if every two vertices are adjacent. A complete graph with n vertices has exactly
$$\binom{n}{2} = \frac{n(n-1)}{2} \tag{1.24}$$
edges, and is regular of degree n - 1.
Let G = (V, E) be a graph. Then G' = (V', E') is called a subgraph of G if V' is a subset of V and E' is a subset of E such that any edge in E' joins vertices from V'. In other words,
$$V' \subseteq V \tag{1.25}$$
and
$$E' \subseteq \{\underline{uv} \in E : u, v \in V'\}. \tag{1.26}$$
Let $W \subseteq V$ be a set of vertices, then
$$G[W] = (W, \{\underline{uv} \in E : u, v \in W\}) \tag{1.27}$$
is called the induced subgraph of W in G = (V, E), i.e. all edges of G that connect vertices of W are also edges of G[W]. The union of two graphs G = (V, E) and G' = (V', E') is defined by
$$G \cup G' = (V \cup V', E \cup E'). \tag{1.28}$$
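A short Python sketch of the induced subgraph and of the union of two graphs follows. It assumes a graph is stored as a pair of a vertex set and a set of edges, each edge being a `frozenset` of its two endvertices; the function names are illustrative.

```python
def induced_subgraph(graph, subset):
    """Keep the vertices in subset and every edge with both endvertices in it."""
    vertices, edges = graph
    w = set(subset) & set(vertices)
    return w, {e for e in edges if e <= w}

def union(g1, g2):
    """Union of two graphs: unite the vertex sets and the edge sets."""
    return g1[0] | g2[0], g1[1] | g2[1]

V = {1, 2, 3, 4}
E = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4), (1, 3)]}
triangle = induced_subgraph((V, E), {1, 2, 3})
pendant = ({3, 4}, {frozenset((3, 4))})
print(triangle)
print(union(triangle, pendant) == (V, E))
```

Storing edges as unordered sets mirrors the definition of an edge as an unordered pair of vertices and excludes parallel edges by construction.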
A chain is a sequence $v_1, e_1, v_2, e_2, v_3, \dots, v_m, e_m, v_{m+1}$ of edges and vertices of G such that the edge $e_i$ is incident to the vertices $v_i$ and $v_{i+1}$ for any index i = 1, ..., m. A chain in which each vertex appears at most once is called a path; more exactly, the path interconnecting the vertices $v_1$ and $v_{m+1}$. Then the number m denotes the length of the path. A single vertex is a path of length 0.
A cycle is a chain with at least one edge and with the following properties: No edge appears twice in the sequence and the two endvertices of the chain are the same. A graph which does not contain a cycle is called acyclic.

A key notion in graph theory is that of a connected graph. It is intuitively clear what this should mean, and it is also easy to formulate this property: A graph G = (V, E) is called a connected graph if for any two vertices there is a path (or, equivalently, a chain) interconnecting them. Let G = (V, E) be a graph and let v and v' be two vertices of G. Clearly,

Observation 1.2.3 The relation "There is a path in G connecting v and v'" is an equivalence relation on V.
The equivalence classes of this relation divide V into subsets, which create connected subgraphs of G. These classes are called the connected components, or briefly the components, of the graph G. A component is a maximal subgraph that is connected. A connected graph has exactly one component. An edge e of a graph G = (V, E) is called a bridge if $G' = (V, E \setminus \{e\})$ contains one more component than G.

Let G = (V, E) and G' = (V', E') be two graphs which are connected and have at least one vertex in common: $V \cap V' \neq \emptyset$. Then $G \cup G'$ is connected.

In a natural sense, graph theory is the study of connectivity, and, indeed, this term will play an essential role in our investigations. In all cases we look for graphs which interconnect the points of a space: The problem of a minimum spanning tree is the theory of "shortest connectivity". The length minimality forces the graph to be without cycles. Thus we are interested in this kind of graph.
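Components and bridges can be computed directly from these definitions. The Python sketch below finds the components by breadth-first search and tests an edge for being a bridge by removing it and counting components again. It assumes a graph without parallel edges, given as a vertex list and a list of pairs; the function names are illustrative.

```python
from collections import deque

def components(vertices, edges):
    """Connected components as a list of vertex sets (breadth-first search)."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adjacent[x]:
                if y not in seen:
                    seen.add(y)
                    component.add(y)
                    queue.append(y)
        result.append(component)
    return result

def is_bridge(vertices, edges, e):
    """An edge is a bridge if deleting it increases the number of components."""
    remaining = [x for x in edges if x != e]
    return len(components(vertices, remaining)) > len(components(vertices, edges))

vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(components(vertices, edges))
print(is_bridge(vertices, edges, (3, 4)), is_bridge(vertices, edges, (1, 2)))
```

Recomputing all components for every tested edge is wasteful on large graphs, but it follows the definition literally.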
A tree is defined to be a connected graph without cycles.⁹ A forest is defined as a graph whose connected components are trees. That means a forest is an acyclic graph, or equivalently, a graph in which each edge is a bridge. A vertex with degree one is called a leaf. It is easy to see that each tree with more than one vertex has at least two leaves. A vertex in a tree that is not a leaf is called an internal vertex.
Observation 1.2.4 Let G = (V, E) be a graph with n vertices, where n > 1.¹⁰ Then the following properties are pairwise equivalent (and each characterizes a tree):
(a) G is connected and has no cycles.
(b) G is connected and contains exactly n - 1 edges.
(c) G has exactly n - 1 edges and has no cycles.
(d) G is maximally acyclic; that means G has no cycles, and if a new edge is added to G, exactly one cycle is created.
(e) G is minimally connected; that means G is connected, and if any edge is removed, the remaining graph is not connected.
(f) Each pair of vertices of G is connected by exactly one path.

⁹ Trees were first used in 1847 by Kirchhoff in his work on electrical networks; they were later redeveloped and named by Cayley in order to enumerate different isomers of specific chemical molecules.
¹⁰ By definition a graph with one vertex and without edges is also a tree.

The proof is intuitively clear and given in any textbook of graph theory. Usually, we will denote a tree with the letter T. The unique path connecting the vertices v and v' in the tree T is denoted by T(v, ..., v'). A connected subgraph of a tree is called a subtree. A path is either a tree with exactly two leaves or it is a single vertex. A star is a tree with exactly one vertex that is not a leaf. A tree is called binary if every internal vertex has degree three.

As a consequence of our considerations, we consider a tree T = (V, E) with n vertices. Let $n_i$ be the number of vertices of degree i and let $\Delta = \Delta(T)$ be the maximum degree in the tree T. Then, of course,
\[ n_1 + n_2 + \dots + n_\Delta = n. \tag{1.29} \]
In view of 1.2.1 and 1.2.4, the sum of all degrees equals twice the number of edges, so we have
\[ n_1 + 2 n_2 + \dots + \Delta \cdot n_\Delta = 2(n - 1). \]
When we subtract this equation from two times (1.29) we get

Observation 1.2.5 It holds that
\[ n_1 = 2 + n_3 + 2 n_4 + \dots + (\Delta - 2) \cdot n_\Delta \]
for any tree, where $n_i$ denotes the number of vertices of degree $i$ and $\Delta(T)$ is the maximum degree in the tree.

Consequently, considering only trees without vertices of degree two, a binary tree has the maximum possible number of internal vertices for a given number of leaves. A further consequence of 1.2.4 is

Observation 1.2.6 Let $G$ be a forest with $n$ vertices and $c$ components. Then $G$ contains exactly $n - c$ edges.
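The counting identity of Observation 1.2.5 can be checked mechanically on small trees. The following Python sketch is only an illustration under stated assumptions: a tree is given as a list of edges, and the helper names `degree_counts` and `leaf_identity_holds` are hypothetical, not notation from the text.

```python
from collections import Counter

def degree_counts(edges):
    """Return a Counter mapping each degree i to n_i, the number of vertices of degree i."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

def leaf_identity_holds(edges):
    """Check n_1 = 2 + n_3 + 2*n_4 + ... + (Delta - 2)*n_Delta for a tree given by its edges."""
    n = degree_counts(edges)
    # Vertices of degree two contribute (2 - 2) * n_2 = 0, so they can be skipped.
    return n[1] == 2 + sum((i - 2) * n_i for i, n_i in n.items() if i >= 3)

# A star with four leaves, a path, and a binary tree with four leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
path = [(0, 1), (1, 2), (2, 3)]
binary = [("a", "s"), ("b", "s"), ("s", "t"), ("c", "t"), ("d", "t")]
```

For the binary tree the identity reads 4 = 2 + 2, which matches the remark that, without vertices of degree two, binary trees have the most internal vertices for a given number of leaves.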
Let $G = (V, E)$ be a graph. A subgraph $G' = (V, E')$ is called a spanning tree of $G$ if $G'$ is a tree. If $G'$ is a spanning tree of $G$, then $G$ itself must be connected. Conversely, if $G = (V, E)$ is a connected graph, then $G$ contains a subgraph $G' = (V, E')$ that is minimal with respect to the property that $G'$ is connected. The graph $G'$ is a spanning tree of $G$. Hence,
Observation 1.2.7 A graph is connected if and only if it contains a spanning tree.

An immediate consequence is that a connected graph with $n$ vertices contains at least $n - 1$ edges.
Additionally, we assume that a function $f : E \to \mathbb{R}$ is given for the edges of the graph $G$. Usually, we assume that $f$ has only positive values and we call it a length function. Then we define the length of a subgraph $G' = (V', E')$ of the graph $G$ as
\[ L(G') := \sum_{e \in E'} f(e). \tag{1.32} \]
A (connected) graph equipped with a length function is called a network. If no length function is given explicitly, we will assume that $f \equiv 1$; i.e. the length of a subgraph is the number of its edges. We are interested in
The Minimum Spanning Tree Problem
Given: A network $G = (V, E, f)$.
Find: A spanning tree $T = (V, E')$, $E' \subseteq E$, which minimizes the length $L(T)$.

A solution is called a minimum spanning tree for the network $G$. This seems to be the first network optimization problem ever studied. Its history dates at least to 1926. Borůvka produced the first fully realized minimum spanning tree algorithm, and it has been rediscovered several times.
Algorithm 1.2.8 (Borůvka [53], Sollin, in [39]) Given a connected graph $G = (V, E)$ with a length function $f : E \to \mathbb{R}$, a minimum spanning tree $T$ for $G$ can be found by the following procedure:
1. Start with the forest $G(0) = (V, \emptyset)$; $k := 0$;
2. While there is more than one tree in $G(k)$ do
(i) For each tree $T_i$ in $G(k)$ do
(ii) Select an edge $vv'$ with minimum length such that $v$ is a vertex in $T_i$ and $v'$ is a vertex in another tree $T_j$; Form the forest $G(k + 1)$ by joining all $T_i$ and $T_j$ of $G(k)$;
(iii) $k := k + 1$.

The second and most recently discovered of the classical algorithms is that of Kruskal, created in 1956:¹¹
Algorithm 1.2.9 (Kruskal [269]) Given a connected graph $G = (V, E)$ with a length function $f : E \to \mathbb{R}$, a minimum spanning tree $T$ for $G$ can be found by the following procedure:
1. Start with the forest $T = (V, \emptyset)$;
2. Sequentially choose the shortest edge that does not form a cycle with edges already chosen;
3. Stop when all vertices are connected, that is when $|V| - 1$ edges have been chosen.

Or in a "dual" version
Algorithm 1.2.10 Given a connected graph $G = (V, E)$ with a length function $f : E \to \mathbb{R}$, a minimum spanning tree $T$ for $G$ can be found by the following procedure:
1. Start with the graph $G = (V, E)$;
2. Sequentially delete the longest edge that does not disconnect the remaining graph;
3. Stop when the graph does not contain a cycle, that is when $|V| - 1$ edges remain.

¹¹ Later we will discuss a third approach, namely Prim's algorithm.

Both algorithms 1.2.9 and 1.2.10 are greedy, meaning that at each step of the process an optimal choice is made from the remaining available data. Once again, if what appears to be the best choice locally turns out to be the best choice globally, then the greedy algorithm will lead to an optimal solution. A nice description of the difference between the two techniques is given by Lovász et al. [289], [290]:
There is this story about the pessimist and the optimist: They each get a box of assorted candies. The optimist always picks the best; the pessimist eats the worst (to save the better candies for later). So the optimist always eats the best available candy, and the pessimist always eats the worst available candy; and yet, they end up with eating the same candies.

In view of the many contributions to the problem of constructing minimum spanning trees, its popularity through the ages, and its natural applications to various practical questions, it is hopeless to expect a complete list of the many facets of the problem. Moreover, the theory of interconnecting networks, as the theory of the "generalized" minimum spanning tree problem, has attracted the attention of researchers from many academic disciplines, including many applied fields. For further reading see [80], [126] and [246]. A history of the minimum spanning tree problem was written by Graham and Hell [189].
Minimum spanning trees prove important for several reasons:
They can be computed easily (see below) and they create a sparse interconnecting graph that reflects a lot about the set of the given points.
They have obvious applications in the design of computer, communication and transportation networks, wiring connections, and piping in a flow network.
Besides numerous network design applications, they play an important role in new areas of research, such as sequence alignments and construction of evolutionary trees.
They provide a way to identify classes in sets of points: Deleting the long edges from a minimum spanning tree leaves components that define natural classes in the given set.
They can be used to give approximate solutions to computationally hard problems, such as Steiner's Problem and the Traveling Salesman Problem.
They often occur as subproblems in solutions of other problems.
Moreover, on the theoretical side, the so-called greedy method, typical of all the minimum spanning tree construction algorithms, is an important concept that can be applied to various other problems and is studied in its general form in the theory of matroids, compare [246] or [266].
Related to this problem we are interested in a "geometric" version of the minimum spanning tree (MST) problem:
The Problem of MST
Given: A finite set $N$ of points in a metric space $(X, \rho)$.
Find: A tree $T = (N, E)$ with shortest possible length.

If we look for an MST for $N$ we search for a minimum spanning tree in the complete graph consisting of the vertices from $N$ and all edges connecting each pair of different vertices, i.e. for $G = (N, \binom{N}{2})$. The length function is defined by
\[ f(vv') = \rho(v, v'), \quad \text{for } v, v' \in N. \tag{1.33} \]
Note that using 1.2.9, we have a method to find an MST which does not use any knowledge of the "geometry" of the space $(X, \rho)$ except for an "oracle" which gives the mutual distances $\rho(v, v')$ for any two points $v, v' \in N$. On the other hand, when we have a geometric structure in which a measure of distances between the points is defined, we may expect that an MST can be found very fast: Instead of the $n(n - 1)/2$ input data (namely the lengths of the edges of the complete graph) the search for an MST requires a suitably chosen graph with $n$ vertices ($2n$ input data using the coordinates of the given points) and possibly a number of edges which is linearly bounded by the number of points.
In the Euclidean plane such a fast algorithm does indeed exist: The geometric properties are characterized by the concept of a Voronoi diagram. Consider a finite set $N = \{v_1, \dots, v_n\}$ of points. We define a partition of the plane into "spheres of influence" $V_1, \dots, V_n$, called Voronoi cells, around the given points such that any point in such a cell is closer to the terminal point in its cell than to the terminals of other cells, namely
\[ V_i = \{ x : \|x - v_i\| \le \|x - v_j\| \text{ for all } j \ne i \} \]
for all $i = 1, \dots, n$. The Voronoi cells cover the plane. The graph on the terminals of $N$ with an edge wherever two regions share a common side is called a Delaunay triangulation (DT) for $N$. It is sufficient to look for a minimum spanning tree in the DT for $N$ to find an MST for $N$. Since a DT is a planar graph, there are at most $3n - 6$ edges. Thus, the application of Kruskal's algorithm needs provably less time in the worst case¹²; compare Preparata, Shamos [%I].
In most real-life network optimization situations requiring an MST, it is not enough for the spanning tree to satisfy just the minimum-length condition. The MST must also satisfy some additional constraints. This often makes the problem NP-hard, whereas the unconstrained Minimum Spanning Tree Problem is efficiently solvable. Many modified (minimum) spanning tree problems are presented in the literature; some examples taken from [179], [182], [246], [266], [307] and [465] include:
Find a spanning tree interconnecting all vertices with minimal maximum degree.
Find a spanning tree interconnecting all vertices with at least $k$ leaves in the tree.
Find a spanning tree isomorphic to a given tree.
Find a spanning tree with small diameter.
We will discuss the geometric versions of all these problems.
¹² We will discuss the running times of algorithms more exactly later.
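The remark that Algorithm 1.2.9 needs nothing but a distance "oracle" can be sketched in code. The following Python sketch is a minimal illustration, not the author's implementation: the function name `kruskal_mst` and the union-find helper `find` are assumptions introduced here, and `math.dist` stands in for the metric $\rho$ in the Euclidean case (any other distance function can be passed as `dist`).

```python
import math
from itertools import combinations

def kruskal_mst(points, dist=math.dist):
    """Kruskal's procedure on the complete graph over `points`.

    Returns (edges, total_length); `dist` plays the role of the distance oracle.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All n(n-1)/2 edges of the complete graph, sorted by length.
    candidates = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree, total = [], 0.0
    for d, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different trees, so no cycle arises
            parent[ri] = rj
            tree.append((points[i], points[j]))
            total += d
            if len(tree) == len(points) - 1:
                break  # |V| - 1 edges have been chosen
    return tree, total
```

The sketch sorts all $n(n-1)/2$ candidate edges; restricting the candidates to the edges of a Delaunay triangulation, as described above, would shrink that list to at most $3n - 6$ entries without changing the loop.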
GAUSS' QUESTION
At first glance it seems that our two problems in the previous chapter do not have many facets in common. Fermat's problem is a typical one in the class of geometric optimization problems, and the problem of a minimum spanning tree in the class of combinatorial optimization problems. Moreover, we used very different methods to find solutions. Now, consider:
2.1 GAUSS' QUESTION AND ITS CONVERSION TO STEINER'S PROBLEM
On March 19, 1836 the astronomer Schuhmacher wrote a letter to his friend the mathematician Gauß, in which he expressed surprise about a specific case of the Fermat problem: He considered four points $v_1, v_2, v_3, v_4$ in the plane which form a quadrilateral such that the segments $v_1v_2$ and $v_3v_4$ are parallel and the lines of the segments $v_1v_3$ and $v_2v_4$ meet in one point $v$ outside. Now the Torricelli point $q$ of these four points is the intersection point of the diagonals $v_1v_4$ and $v_2v_3$. Schuhmacher did not understand the fact that, if the segment $v_3v_4$ runs to the point $v$ then the point $q$ runs in the same way to $v$, but this cannot be, since the Torricelli point of three points is not necessarily one of the given points.¹ Gauß [181] answered on March 21² that Schuhmacher did not consider the Fermat problem; instead, since the two coinciding points count twice in the limit, he looked for a solution of
\[ \|v_1 - w\| + \|v_2 - w\| + 2 \|v - w\| \to \min. \]

¹ See our considerations in 1.1.2.
² It is remarkable that in that age a letter inside Germany was delivered in one day.
More important was the next remark of Gauß. He said that it is natural to consider the following, more general problem:
Ist bei einem 4Eck ... von dem kürzesten Verbindungssystem die Rede ..., bildet sich so eine recht interessante mathematische Aufgabe, die mir nicht fremd ist, vielmehr habe ich bei Gelegenheit eine Eisenbahnverbindung zwischen Harburg, Bremen, Hannover, Braunschweig ... in Erwägung genommen ....
In English: "How can a railway network of minimal length which connects the four German cities Bremen, Harburg (today part of the city of Hamburg), Hannover, and Braunschweig be created?"³
In such a generalized Fermat problem we are given a certain number of points in the plane which are to be connected by a system of curves of smallest total length. The concrete problem by Gauß was completely discussed by Bopp [50] in 1879.⁴ Today it is easy to see that a solution is given by a network in which Bremen, Harburg and Hannover are interconnected by their Torricelli point and Hannover and Braunschweig are connected by a straight line.⁵
Perhaps starting with the book What is Mathematics [116] by Courant and Robbins in 1941, Gauß' Problem became popularized under the name of

Steiner's Problem
Given: A finite set of points in a plane (or in another metric space).
Find: A network which connects all points of the set with minimal length.

Steiner's Problem for three points, however, is also a special case of Fermat's Problem. If four or more points are given, then Steiner's Problem is independent from Fermat's, and asserts its own character.

³ A picture of this letter can be found on the cover of the book Approximation Algorithms [43a].
⁴ Martini in [18] names an older source for this generalized problem, namely Lamé and Clapeyron in 1827, but he doesn't give an exact reference. Moreover, Scriba and Schreiber [391] give a discussion of the origins of the problem.
⁵ In our figure this is marked with bold line style; the thin line style depicts an MST.
Figure 2.1 Gauss' problem
Now, we mainly refer to the misleading title "Steiner's Problem", which was attributed by Courant and Robbins. They referred in their book to neither Fermat for the $n = 3$ problem nor Gauß for the general problem. An extensive discussion of reasons for this incorrect naming is contained in [271], [386] and [391].⁶

⁶ It seems that Courant and Robbins knew of a report by Steiner on "Fermat's Problem" (!) to the Prussian Academy of Sciences in 1837.
Notice that the authors generated both the mistake in priority and the great interest of many scientists in this problem. The popularity of their book has been the main reason that the misnomer "The Steiner Problem" or "Steiner's Problem" has stuck, and that the interest in this problem has spread.
Geometric optimization problems are inherently not pure combinatorial problems since the optimal solution often belongs to an infinite feasible set, the entire real Euclidean space. Steiner's Problem is typical for a large number of so-called "Geometric Network Design Problems" which act in a geometric structure. But often it is necessary to combine geometric and combinatorial methods to find a solution, and this is the approach we take here.
Any network solving Steiner's Problem must be a tree, which is called a Steiner Minimal Tree (SMT). It may contain vertices different from the points which are to be connected. Such points are called Steiner points. Given a set of points, it is a priori unclear how many Steiner points one has to add in order to construct an SMT. Furthermore, we are interested in the location of the Steiner points. A Steiner point is the Torricelli point of its neighbors. That means: Fermat's Problem is the local version of Steiner's Problem. On the other hand, if the number and location of the Steiner points are known then an SMT is an MST for the union of the given and the Steiner points.
Until 1961 it was not even known that Steiner's Problem is finitely solvable. There are infinitely many points in the plane, and even though most of them are probably irrelevant, it is not obvious that any algorithm exists. Then Melzak [305] established many basic properties of an SMT: Without loss of generality, the following is true for any SMT $T$ for a finite set $N$ of points in the Euclidean plane:
(i) The degree of each vertex is at most three.
(ii) The degree of each Steiner point equals three.
(iii) Any Steiner point is the Torricelli point of its neighbors; and two edges incident to a Steiner point meet at an angle of 120°. Consequently, a Steiner point is uniquely located in relation to its neighbors.
(iv) There are at most $|N| - 2$ Steiner points; equality holds if and only if the vertices from $N$ are the leaves of $T$ and the Steiner points are of degree three.
(v) An SMT has at most $2|N| - 3$ edges; equality holds if and only if the vertices from $N$ are the leaves of $T$ and the Steiner points are of degree three.
(vi) When there is a Steiner point in the tree $T$, then at least one of these points has two given points as neighbours.
(vii) The SMT is an MST for the set $N \cup Q$, where $Q$ is the set of Steiner points of $T$.
As a consequence of all these statements it is sufficient to develop solution methods only for specific kinds of trees: Let $T = (V, E)$ be a tree for $N = \{v_1, \dots, v_n\}$, $n > 2$, with
\[ g_T(v) = \begin{cases} 1 & \text{if } v \in N, \\ 3 & \text{if } v \in V \setminus N. \end{cases} \]
Such a tree will be called a full tree.
Second, Melzak gave a finite solution method to Steiner's Problem, using a set of Euclidean (that is ruler and compass) constructions. The central idea is given in the Torricelli construction given in the chapter before: In the three-point problem, a replacement point can be substituted for two of the given points without changing the length of the tree. In the general version of the problem the algorithm must guess which pair is to be replaced, which may potentially involve trying all possible guesses. After one pair of points in the subset has been replaced by a single point, each subsequent step of the algorithm replaces either two given points, a given point and a replacement point, or two replacement points with another replacement point until the subset is reduced to three points.⁷ Once the Steiner point for those three points has been found, the algorithm works backwards, attempting to determine the Steiner point corresponding to each replacement point. An attempt can fail because of contradictory constraints on the placement of Steiner points. Now we give a complete list of the instructions of this method:

⁷ Surprisingly, the Melzak algorithm cannot be extended to higher-dimensional Euclidean spaces, not even to spaces of dimension three. The reason is that for two given points there are an infinite number of replacement points.
Algorithm 2.1.1 (Melzak [305]) Let $T = (V, E)$ be a full tree for the finite set $N$ of points. Then do
1. $V' := V$; $N' := N$;
2. (Reduction stage) $Q := V' \setminus N'$; if $Q$ is empty then goto 4.;
3. Let $q$ be in $Q$ such that $q$ is adjacent to $v_1$ and $v_2$ in $N'$; Delete $v_1$, $v_2$ and $q$; Add a substitution point $v_{12}$ that forms an equilateral triangle with $v_1$ and $v_2$; $V' := V' \cup \{v_{12}\} \setminus \{v_1, v_2, q\}$; $N' := N' \cup \{v_{12}\} \setminus \{v_1, v_2\}$; goto 2.;
4. (Recovery stage) Connect the last two points of $N'$ by an edge;
5. Reverse the order of the reduction steps and bring back each pair $v_1$ and $v_2$ at each recovery step;
6. Let $C$ be the circle circumscribing $v_1$, $v_2$ and $v_{12}$; If the arc $v_1v_2$ of $C$ intersects the edge incident to $v_{12}$ at the point $v'$, then $v'$ is the Steiner point joining $v_1$ and $v_2$; in this case connect these points and discard $v_{12}$; goto 5.

The proof of correctness is to apply the construction of 1.1.2(b): Let $q$ be a Steiner point adjacent to the given points $v_1$ and $v_2$, and let $v_3$ be its third neighbor. The points $v_1$, $v_2$ and $v_{12}$ form an equilateral triangle. Since the Steiner point $q$ is the Torricelli point for $v_1$, $v_2$ and $v_3$ it makes angles of 120° with the edges to each of them. If a quadrilateral is inscribed in a circle, the sum of opposite angles equals 180°. Thus the Steiner point $q$ is necessarily located on the circle circumscribing $v_1$, $v_2$ and $v_{12}$. The theorem of Ptolemaius says
\[ \|v_1 - v_2\| \cdot \|v_{12} - q\| = \|v_1 - q\| \cdot \|v_2 - v_{12}\| + \|v_2 - q\| \cdot \|v_1 - v_{12}\|, \]
and consequently, since the three sides of the equilateral triangle have equal length,
\[ \|v_2 - q\| + \|v_1 - q\| = \|v_{12} - q\|. \]
This means that
\[ \|v_1 - q\| + \|v_2 - q\| + \|v_3 - q\| = \|v_{12} - q\| + \|v_3 - q\| \]
achieves a minimal value if and only if $q \in \overline{v_{12}v_3}$.
It is obvious that using Melzak's algorithm to find an SMT, although effective, is extremely redundant and inefficient; more exactly it takes exponential time. There are two causes of the exponential running time: The main reason is the large number of trees which are to be considered. Second, each step chooses one of two possible substitution points because there are two equilateral triangles for a given side. Since the correctness of the choice cannot be seen until the tree has been constructed or demonstrated to be impossible, backtracking is necessary. Hence, we require $O(2^k)$ time, where $k$ is the number of Steiner points in the given tree. Hwang [229] has described an implementation of Melzak's construction which eliminates the second cause of exponential behavior.
In general, to determine an SMT for a given finite set of points we have to consider many different trees, and compare their lengths in order to single out the shortest ones. Unfortunately, this needs an astronomical number of computational steps. Although exponential-time algorithms have been found for Steiner's Problem, no polynomial-time algorithms have yet been found and the prospects for such an algorithm are not good.
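The two ingredients of a Melzak reduction step, the choice between two substitution points and the length identity from Ptolemy's theorem, can be checked numerically. The Python sketch below is an illustration only: the function names `equilateral_apexes` and `point_on_opposite_arc` are hypothetical, and points are assumed to be coordinate pairs in the Euclidean plane.

```python
import math

def equilateral_apexes(a, b):
    """Return both points that form an equilateral triangle with a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    c, s = 0.5, math.sqrt(3) / 2  # cos 60 degrees, sin 60 degrees
    left = (a[0] + c * dx - s * dy, a[1] + s * dx + c * dy)    # b rotated by +60 degrees about a
    right = (a[0] + c * dx + s * dy, a[1] - s * dx + c * dy)   # b rotated by -60 degrees about a
    return left, right

def point_on_opposite_arc(a, b, apex, t):
    """Point on the circumcircle of a, b, apex, on the arc from a to b avoiding apex (0 < t < 1)."""
    # For an equilateral triangle the circumcentre is the centroid.
    cx, cy = (a[0] + b[0] + apex[0]) / 3, (a[1] + b[1] + apex[1]) / 3
    ux, uy = a[0] - cx, a[1] - cy
    wx, wy = b[0] - cx, b[1] - cy
    # The short way round from a to b (120 degrees) is the arc that avoids the apex.
    sign = 1.0 if ux * wy - uy * wx > 0 else -1.0
    phi = sign * t * 2 * math.pi / 3
    return (cx + ux * math.cos(phi) - uy * math.sin(phi),
            cy + ux * math.sin(phi) + uy * math.cos(phi))

def ptolemy_gap(a, b, apex, t):
    """|a - q| + |b - q| - |apex - q| for q on the arc; should vanish up to rounding."""
    q = point_on_opposite_arc(a, b, apex, t)
    return math.dist(a, q) + math.dist(b, q) - math.dist(apex, q)
```

Because `equilateral_apexes` returns two candidates, a procedure that cannot tell in advance which side of the segment is correct has to try both, which is the source of the factor $2^k$ mentioned above.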
2.2 EXAMPLES AND EXERCISES
For an introduction to Steiner's Problem it is helpful to investigate several specific cases to explore the difficulties and surprising twists of the problem.
I. Show that the degree of each vertex is at most three, and hence, that the degree of each Steiner point equals three. It is helpful to observe that any Steiner point is the Torricelli point of its neighbors. Moreover, we then have that two edges incident to a Steiner point meet at an angle of 120°.
II. Not every locally minimal tree, however, is a solution of minimal length overall, that is, an SMT. Large-scale rearrangements of the Steiner points may be necessary to transform a network into a shortest possible tree, which is a globally minimal tree. To see this we investigate the following example: Consider the four corners of a rectangle in the Euclidean plane measuring three units by four units. An MST for these points has length 10. There are two locally minimal trees with two Steiner points. Each arrangement forms a tree that has three edges connected to each Steiner point at 120°. If the Steiner points are arranged parallel to the width, the locally minimal tree that results is 9.928... units long. If the Steiner points are arranged parallel to the length, a locally minimal tree results with a length of 9.196.... Consequently, only in the last case do we have an SMT. Ollerenshaw (compare [147]) proved that if two full trees exist for the four points, the one with the longer edge between the two Steiner points is the shorter tree, i.e. the SMT.
Moreover, this consideration shows that a solution of Steiner's Problem is not always uniquely determined: For four points forming a square, we have two equivalent (equal length) solutions.
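The two lengths quoted in this example can be reproduced by placing the Steiner points explicitly. The following Python sketch is an illustration under stated assumptions: the rectangle has its corners at $(0,0)$ and $(\text{central}, \text{other})$, the names `double_y_tree` and `tree_length` are hypothetical, and the offset of the Steiner points follows from the 120° condition.

```python
import math

def double_y_tree(central, other):
    """Full tree on the corners of a rectangle; the edge between the two
    Steiner points is parallel to the side of length `central`."""
    # The 120 degree condition puts each Steiner point on the midline, at
    # horizontal distance (other / 2) / tan(60 degrees) from the nearer side.
    offset = other / (2 * math.sqrt(3))
    if central < 2 * offset:
        raise ValueError("no full tree with this orientation")
    s1, s2 = (offset, other / 2), (central - offset, other / 2)
    left = [(0.0, 0.0), (0.0, other)]
    right = [(central, 0.0), (central, other)]
    return [(c, s1) for c in left] + [(c, s2) for c in right] + [(s1, s2)]

def tree_length(edges):
    """Total Euclidean length of a list of edges."""
    return sum(math.dist(u, v) for u, v in edges)

along_length = tree_length(double_y_tree(4, 3))  # Steiner points along the side of length 4
along_width = tree_length(double_y_tree(3, 4))   # Steiner points along the side of length 3
```

Summing the five edges gives the closed form $\text{central} + \sqrt{3} \cdot \text{other}$, that is $4 + 3\sqrt{3}$ and $3 + 4\sqrt{3}$ for the two arrangements, and equal values for a square.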
III. Let $N$ be the set of nodes of a regular $n$-gon in the plane, $n = |N| \ge 3$. Find an SMT for $N$. For $n = 3$ we seek a Torricelli point. For $n = 4$ the example above will be helpful, where, roughly speaking, the "Double Y" is shorter than the "X". It is not simple to see (compare [141]) that for $n \ge 6$ there is no Steiner point in the tree, meaning the SMT is an MST with length equal to $(n - 1) \cdot l$, where $l$ is the length of a side. Jarník and Kössler proved this result in 1934 for $n \ge 13$. It was another fifty years until the proof by Du et al., compare [314].
IV. A set $N = \{(i, 0), (i, 1) : i = 0, \dots, n - 1\}$ is called a ladder. Chung and Graham [84] examined ladders and determined the length of SMT's for these sets. Particularly, they demonstrated that there are arbitrarily large sets of points for which the SMT cannot be separated, that means cannot be divided into full trees.
Burkard et al. [60] describe a method that always finds a solution for Steiner's Problem for ladders of the kind $N = \{(i \cdot b, 0), (i \cdot b, 1) : i = 0, \dots, n - 1\}$, where $b \le 1$.
The subject becomes more difficult if we consider grids of arbitrary dimension. A nice representation of this question has been given in [175] and [176].

Figure 2.2 Two locally minimal trees

V. Suppose we wish to find a network that will connect a set of given points. One way to do this is to use an MST, which uses only edges joining pairs of the given points. We saw that such a network is easy to find. Another is to use an SMT. Obviously, the length of the SMT is less than or equal to the length of the MST. How much shorter can it get? Consider three points which form the corners of an equilateral triangle of unit side length. An MST for these points has length 2. An SMT uses one Steiner point, which is uniquely determined by the condition that the three angles at this point are equal, and consequently equal 120°. With the help of a simple calculation, using the cosine law, we find the length of the SMT to be $3 \cdot \sqrt{1/3} = \sqrt{3}$. So the ratio between the lengths of both networks is $\sqrt{3}/2 = 0.866025\ldots$. Is there a finite set of points for which the ratio is smaller?
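The "simple calculation" for the equilateral triangle can be written out as follows. This is a worked sketch, with $r$ introduced here (it is not a symbol from the text) for the common distance from the Steiner point $q$ to each of the three corners.

```latex
% Cosine law in the triangle v_1 q v_2: the side v_1 v_2 has length 1,
% both legs have length r, and the angle at q equals 120 degrees.
\begin{align*}
1 &= r^2 + r^2 - 2 r^2 \cos 120^\circ = 2r^2 + r^2 = 3r^2
   \quad\Longrightarrow\quad r = \sqrt{1/3}, \\
L(\mathrm{SMT}) &= 3r = 3 \cdot \sqrt{1/3} = \sqrt{3}, \qquad L(\mathrm{MST}) = 2, \\
\frac{L(\mathrm{SMT})}{L(\mathrm{MST})} &= \frac{\sqrt{3}}{2} = 0.866025\ldots
\end{align*}
```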
VI. Related to Steiner's Problem, we will require that the minimal network has at most $k$ Steiner points, where $k \ge 0$ is a predetermined integer independent of the number of given points. Such a network must be a tree also, and is called a $k$-SMT. This problem was introduced independently by C. [87] in 1982 and Georgakopoulos and Papadimitriou [183] in 1987. The combinatorial structures of $k$-SMT's and SMT's are quite different. Particularly, in contrast to I., we find Steiner points of degree 4 in $k$-SMT's.⁸
It is a difficult task to discuss all these examples in spaces other than the Euclidean plane.
2.3 REFERENCES
Steiner's Problem is one of the most famous combinatorial-geometrical problems. It is the core of the so-called Geometric Network Design, but has itself two origins: Fermat's Problem and the Minimum Spanning Tree problem. Consequently, in the last three decades the investigations into and, naturally, the publications about Steiner's Problem have increased rapidly. The articles that have been written on Steiner's Problem and its relatives are nearly countless.
The first survey of Steiner's Problem in the Euclidean plane was presented by Gilbert and Pollak in 1968 [186]; they christened the terms "Steiner Minimal Tree" for the shortest interconnecting network and "Steiner points" for the additional vertices. It is well-known that solutions of network design problems depend essentially on the way in which the distances in space are determined. Clearly, this is true for Steiner's Problem. Consequently, there are many metric spaces⁹ to be considered.
Surveys in form of monographs are given by
1. S. Voß: "Steiner-Probleme in Graphen", 1990, [439].
2. F.K. Hwang, D.S. Richards, P. Winter: "The Steiner Tree Problem", 1992, [231].
3. A.O. Ivanov, A.A. Tuzhilin: "Minimal Networks - The Steiner Problem and Its Generalizations", 1994, [238].
4. D. Cieslik: "Steiner Minimal Trees", 1998, [92].
5. A.O. Ivanov, A.A. Tuzhilin: "Branching Solutions of One-Dimensional Variational Problems", 2000, [235].
6. D. Cieslik: "The Steiner Ratio", 2001, [99].
7. H.J. Prömel, A. Steger: "The Steiner Tree Problem", 2002, [355].
Surveys in journals are given by Harris [212], Hwang and Richards [230], and Winter [464].
There are several collections about Steiner's Problem and its relatives: [79], [143], [239], [333], [441] and [435].
A nice representation of the complete subject has been given in [44], [43], [108], [175], [176], [219], [234], [389] and [422].
In this sense it is strange that people "discover" Steiner's Problem again and again, and prove "facts" which have already been proven a dozen times.¹⁰

⁸ But not Steiner points of higher degree, see [89] and [369].
⁹ See the next section.
2.4 A FIRST ANALYSIS OF STEINER'S PROBLEM
We start with a general analysis of Steiner's Problem in arbitrary metric spaces. We describe several basic facts about the combinatorial and geometrical structure of SMT's. Later we will discuss more detailed facts that arise if we restrict ourselves to specific cases.
2.4.1 Metric spaces
Distance is the mathematical description of the idea of proximity, and consequently, we may assume (and it is not hard to see) that a solution of Steiner's Problem depends essentially on the way in which a distance in the space is determined. The following term was introduced by Fréchet in 1906: A pair $(X, \rho)$ is called a metric space if $X$ is a nonempty set of elements called the points, and $\rho : X \times X \to \mathbb{R}$ is a real-valued function satisfying:
(i) $\rho(x, y) \ge 0$ for all $x, y$ in $X$;
(ii) $\rho(x, y) = 0$ if and only if $x = y$;
(iii) $\rho(x, y) = \rho(y, x)$ for all $x, y$ in $X$; and
(iv) $\rho(x, y) \le \rho(x, z) + \rho(z, y)$ for all $x, y, z$ in $X$ (triangle inequality).

¹⁰ One of these discoveries is the fact that the degree of a Steiner point in an SMT in Euclidean spaces of arbitrary dimension equals 3.
Usually, such a function $\rho$ is called a metric.¹¹ We will say that the quantity $\rho(x, y)$ is the distance between the points $x$ and $y$. If $\rho$ satisfies (ii) only in the weaker form
(ii') $\rho(x, x) = 0$ for all $x$ in $X$;
we say that $\rho$ is a pseudometric. If the function $\rho$ satisfies the conditions (i), (ii') and (iii) it is called a dissimilarity.¹²
A metric, pseudometric or dissimilarity $\rho$ on a finite set $X$ of $n$ points can be specified by an $n \times n$ matrix of (nonnegative) real numbers. (Actually $\binom{n}{2}$ numbers suffice because of (ii') and (iii).)
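For a finite set given by such a matrix, the three notions can be told apart by checking the axioms directly. The Python sketch below is a minimal illustration, assuming the matrix is a square list of lists; the function name `classify_distance_matrix` and the tolerance parameter `tol` are hypothetical.

```python
from itertools import product

def classify_distance_matrix(d, tol=1e-12):
    """Return 'metric', 'pseudometric', 'dissimilarity', or None for a square matrix d."""
    idx = range(len(d))
    nonnegative = all(d[i][j] >= 0 for i, j in product(idx, idx))       # (i)
    zero_diagonal = all(d[i][i] == 0 for i in idx)                      # (ii')
    symmetric = all(d[i][j] == d[j][i] for i, j in product(idx, idx))   # (iii)
    if not (nonnegative and zero_diagonal and symmetric):
        return None
    triangle = all(d[i][j] <= d[i][k] + d[k][j] + tol
                   for i, j, k in product(idx, idx, idx))               # (iv)
    if not triangle:
        return "dissimilarity"
    # (ii): distinct points must have positive distance.
    separated = all(d[i][j] > 0 for i, j in product(idx, idx) if i != j)
    return "metric" if separated else "pseudometric"
```

The triangle test runs over all ordered triples, so it takes on the order of $n^3$ comparisons; that is acceptable for the small point sets used as examples here.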
Let $(X, \rho)$ be a metric space. If $X' \subseteq X$, then the restriction $\rho'$ of the metric $\rho$ on $X' \times X'$ is a metric on $X'$. In what follows we regard $(X', \rho')$ as a metric space and call it a subspace of $(X, \rho)$.
A graph $G = (V, E)$ is embedded in $(X, \rho)$ such that
(i) $V$ is a (finite) subset of $X$.
(ii) $E$ is the set of all unordered pairs $vv'$ of points $v$ and $v'$ in $V$.
(iii) The metric $\rho$ induces a length function for the graph, so that for each edge $vv'$ a length is given by $\rho(v, v')$.

¹¹ The axioms are not independent: (i) is a consequence of (iv).
On the other hand,
Observation 2.4.1 A metric $\rho$ can be defined equivalently by
(ii) $\rho(x, y) = 0$ if and only if $x = y$; and
(iv') $\rho(x, y) \le \rho(x, z) + \rho(y, z)$ for all $x, y, z$ in $X$.

¹² We will give the reason for this name later. There are various measures of dissimilarity, and not all of them yield a metric, but many do.
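A short derivation, sketched here as a check and not taken from the text, shows why the two conditions (ii) and (iv') of Observation 2.4.1 already yield symmetry, nonnegativity and the usual triangle inequality.

```latex
\begin{align*}
% Symmetry: put z = x in (iv') and use rho(x,x) = 0 from (ii); then swap the roles of x and y.
\rho(x,y) &\le \rho(x,x) + \rho(y,x) = \rho(y,x)
  \quad\text{and likewise}\quad \rho(y,x) \le \rho(x,y),
  \quad\text{hence } \rho(x,y) = \rho(y,x). \\
% Nonnegativity: apply (iv') to the pair (x,x) with third point y.
0 = \rho(x,x) &\le \rho(x,y) + \rho(x,y) = 2\,\rho(x,y),
  \quad\text{hence } \rho(x,y) \ge 0. \\
% Triangle inequality (iv): rewrite the last term of (iv') by symmetry.
\rho(x,y) &\le \rho(x,z) + \rho(y,z) = \rho(x,z) + \rho(z,y).
\end{align*}
```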
(iv) We define the length of the graph $G$ in $(X, \rho)$ as the total length of $G$:
\[ L(G) = \sum_{vv' \in E} \rho(v, v'). \tag{2.2} \]
In general we will consider graphs and their embedding in a metric space at the same time. In each case it will be easy to see whether we use combinatorial or metric/geometric facts.
Steiner's Problem is the "Problem of Shortest Connectivity". Since the demand of shortness forces the network to be cycle free it is only necessary to consider trees:

Observation 2.4.2 Steiner's Problem is only interested in trees.

Let $N$ be a finite set of points in the metric space $(X, \rho)$. For a given natural number $k$ and for $k$ points $v_1, \dots, v_k \in X \setminus N$, let $T(k, v_1, \dots, v_k)$ be a spanning tree of minimal length in the complete graph with the set $N \cup \{v_1, \dots, v_k\}$ of vertices, where the length of the graph is induced by the metric $\rho$ as defined in (2.2).¹³ If there is both a number $k'$ and points $w_1, \dots, w_{k'}$ such that the value
\[ L(T(k', w_1, \dots, w_{k'})) \]
is minimal among all candidates $T(k, v_1, \dots, v_k)$, then we call $T(k', w_1, \dots, w_{k'})$ a Steiner Minimal Tree (SMT) for $N$, and the points $w_1, \dots, w_{k'}$ are called Steiner points. That means, an SMT for $N$ is a minimum spanning tree on $N \cup Q$, where $Q$ is a set of additional vertices inserted into the metric space in order to achieve a minimal solution.
It is not true that there is an SMT for any given finite set in each metric space, but for all spaces considered in this book any given finite set has an SMT; this implies that the set $Q$ of additional vertices is a finite set as well.¹⁴ In the remainder of this section, we will discuss which properties an SMT possesses, under the assumption that an SMT exists.

¹³ Recall that a minimum spanning tree can be found easily.
¹⁴ Examples for spaces in which there does not always exist an SMT are given in the next chapter.
Observation 2.4.3 Let $(X, \rho)$ be a metric space and let $N$ be a finite set of points in $X$. Without loss of generality, the following is true for any SMT $T = (V, E)$ for $N$:
(a) $g_T(v) \ge 1$ for each vertex $v$ in $V$;
(b) $g_T(v) \ge 3$ for each Steiner point $v$ in $V$.
Proof. (a) is an obvious fact, since $T$ is a tree which connects all vertices.
It is impossible for a Steiner point $v$ to have degree one, since the edge $vv'$ which joins $v$ with the remaining tree has a positive length, which contradicts the minimality requirement. The triangle inequality of the metric $\rho$ implies (b) in the following way: Let $v$ be a Steiner point of degree two with neighbors $w$ and $w'$. Then we may replace the two edges $wv$ and $vw'$ by the edge $ww'$. Because
\[ \rho(w, w') \le \rho(w, v) + \rho(v, w'), \tag{2.4} \]
the new tree is not longer than the old.
Moreover, a Steiner point v in an SMT T can be of degree two. Then
$p(w, v) + p(v, w') = p(w, w')$ (2.5)
holds for $\{w, w'\} = N_T(v)$.15 Now, we will prove that the number of Steiner points cannot increase arbitrarily:
Observation 2.4.4 It is sufficient to consider only finite trees as candidates for an SMT.
Proof. Let T = (V, E) be a tree interconnecting a finite set $N = \{v_1, \ldots, v_n\}$ of points. The number of vertices in T is bounded; more precisely: if $|V| \geq 2n^2$, then there exists a tree interconnecting N which is a proper subtree of T and consequently has a shorter length. To show this we distinguish between two cases:
15 This observation will be helpful in several investigations. In some proofs we will use Steiner points of degree two.
Gauss' question
Case 1: For any two points v and v' in N the path $T(v, \ldots, v')$ contains at most 2n vertices. Then we define the graph G by
$G = \bigcup_{i=1}^{n-1} T(v_i, \ldots, v_{i+1})$.
The graph G interconnects all points of N by edges of T and contains at most $2n(n-1) = 2n^2 - 2n < 2n^2 \leq |V|$ vertices. Hence, a spanning tree of G is a proper subtree of T and must be shorter.

Case 2: There are two points v and v' in N such that the path $T(v, \ldots, v')$ has more than 2n vertices. Then $T(v, \ldots, v')$ contains at least n + 1 Steiner points, each of which is of degree at least three, see 2.4.3. If $T(v, \ldots, v')$ is removed from T we get the graph G. We observe that G is a forest with at least n + 1 connected components. Hence, at least one component does not contain a point of N. If we remove this component from the tree T we get a shorter tree.
We can determine a sharp upper bound for the number of Steiner points explicitly:

Observation 2.4.5 Let (X, p) be a metric space and let N be a finite set of points in X. Without loss of generality,
$|V \setminus N| \leq |N| - 2$, (2.6)
hence
$|V| \leq 2 \cdot |N| - 2$ (2.7)
and
$|E| \leq 2 \cdot |N| - 3$ (2.8)
are true for any SMT T = (V, E) for N. Equality holds if and only if the vertices from N are the leaves of T and the Steiner points are of degree three.

Proof. In 2.4.4 we found that it is sufficient to consider finite trees. Hence, the first assertion is a consequence of counting the vertex degrees under the conditions of 2.4.3.
The number of edges in a tree is one less than the number of vertices. Consequently, the third inequality must hold. The discussion of equality follows immediately from 1.2.5.
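The counting argument behind these bounds can be sketched as follows. This is a reconstruction that assumes only the degree conditions of 2.4.3 and the fact that a tree on $|V|$ vertices has $|V| - 1$ edges; the symbol $s$ for the number of Steiner points is introduced here purely for illustration.

```latex
\begin{align*}
2\,(|V| - 1) \;=\; 2\,|E| \;=\; \sum_{v \in V} g_T(v)
  &\;\geq\; 1 \cdot |N| + 3 \cdot s, \qquad s := |V \setminus N|, \\
2\,(|N| + s - 1) \;\geq\; |N| + 3s
  &\;\Longrightarrow\; s \;\leq\; |N| - 2, \\
|V| \;=\; |N| + s \;\leq\; 2\,|N| - 2,
  &\qquad |E| \;=\; |V| - 1 \;\leq\; 2\,|N| - 3 .
\end{align*}
```

Equality in the first line requires every given point to have degree exactly one and every Steiner point to have degree exactly three, which matches the equality case stated in the observation.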
Another observation for trees with Steiner points will frequently be helpful:
Observation 2.4.6 Let T = (V, E) be an SMT for N. If $V \setminus N$ is nonempty then it contains at least one Steiner point adjacent to two given points.
Proof. Assume that each Steiner point is adjacent to at most one vertex in N. The set $V' = V \setminus N$ induces in T a subgraph G' = (V', E'). Since each Steiner point has degree at least three in T, it then has at least two neighbors in V', and it follows from 1.2.1 that $|E'| \geq |V'|$. This contradicts the fact that the forest G' has at most $|V'| - 1$ edges, compare 1.2.6.
An SMT is a finite tree. The number of such trees for a finite set of given points (vertices) must also be finite. In other words,
Observation 2.4.7 It is sufficient to consider only a finite number of trees as candidates for an SMT.

It will be helpful to associate a matrix to a graph: Let G = (V, E) be a graph and assume that the vertices are labelled, i.e. $V = \{v_1, \ldots, v_n\}$. Then we define the adjacency matrix $A(G) = (a_{ij})_{i,j=1,\ldots,n}$ with
$a_{ij} = 1$ if the vertices $v_i$ and $v_j$ are adjacent, and $a_{ij} = 0$ otherwise.
These matrices contain the complete information about the structure of graphs. Consequently, many matrix calculations have a meaning in the sense of graph theory.16 The adjacency matrix of the graph G does depend on the labelling of the vertices of G; that is, a different labelling of the vertices may result in a different matrix, but they are closely related in that one can be obtained from the other simply by interchanging rows and columns. A matrix whose entries are only 0 or 1 is called a binary or Boolean matrix. Using adjacency matrices we can describe the length of a graph G by
In other words,
Observation 2.4.8 For a given topology of a tree its length in a metric space is a linear function of the metric.

Steiner point locations in the space are not prespecified from a candidate list of point locations, but we may assume that the set of Steiner points is contained in a suitably bounded subset of the space. Here, a set W of points in a metric space (X, p) is called bounded if
Equivalently, we consider balls in the space defined by
16 We will discuss this further in the next section.

where $r \geq 0$ is a real number and z is a point of the space. Then it is easy to see that the set W is bounded if and only if there exist a nonnegative real r and a point z such that
$W \subseteq B_r(z)$. (2.12)

Observation 2.4.9 Let N be a finite set of points in a metric space (X, p). Then we may assume that the set $V \setminus N$ of Steiner points of an SMT T = (V, E) for N is contained in a bounded subset of X:
where v is an arbitrary point in N and $r = L(X, p)(\text{MST for } N)$, the length of a minimum spanning tree for N.
Note that it is not simple to describe a small set containing all Steiner points. Such a set is usually called a Steiner hull of N. A known Steiner hull allows confinement of the construction of the tree within a given set. Hence, the smaller a Steiner hull is, the better it is. On the other hand, if the Steiner points in Q have been localized, an SMT for N is simple to find, since

Observation 2.4.10 Let N be a finite set of points in a metric space. Then an SMT T = (V, E) for N is an MST for V.

Comparing all these facts, the search for an SMT for a finite set of points in a metric space forces investigations of two specific questions: How many Steiner points are used in an SMT? Where are these Steiner points located in the space? Unfortunately, these questions cannot be solved independently from the construction of the shortest tree itself. For a complete discussion of these difficulties see [92], [230], [231] and [464], or the next chapter.

What are the spaces for which an SMT always exists? Such a tree necessarily exists if the bounded subset which contains the Steiner points is compact.17 In this case we must consider, for each tree of a finite number of trees, the value of the function (2.9). More precisely:

17 Actually, in several interesting cases it will be finite.
I. Considering "continuous" spaces, it is sufficient for an SMT to exist if the metric space has the following four properties:
(i) (X, p) is complete;
(ii) (X, p) is finitely compact, i.e. each bounded and closed subset is compact;
(iii) Each pair of points in (X, p) can be connected with a geodesic curve, i.e. a curve of shortest length;18
(iv) For all points x, x' in (X, p), the distance p(x, x') is equal to the length of a geodesic curve joining x and x'.

The following classes of metric spaces satisfy the four properties and thus in each case we establish the existence of an SMT for a finite set N of points with the help of a compactness argument:

(a) Euclidean spaces are classical examples for Steiner's Problem.

(b) Finite-dimensional Banach spaces. Since these spaces play an important role in both theoretical questions and in applications we will describe them more extensively. In his book Geometrie der Zahlen [310], published in 1896, Minkowski proved a number of results by geometrical arguments, using the idea of normed spaces, which is based on the assumption that to each vector can be assigned its "length" or norm satisfying some "natural" conditions. A convex and compact body B of the d-dimensional affine space $A^d$ centered in the origin o is called a unit ball, and induces a norm $\|\cdot\| = \|\cdot\|_B$ in the corresponding d-dimensional linear space $A_d$ according to the so-called Minkowski functional:
$\|v\|_B = \inf\{t > 0 : v \in tB\}$ for any v in $A_d \setminus \{o\}$, and $\|o\|_B = 0$.

On the other hand, let $\|\cdot\|$ be a norm in $A_d$, which means that $\|\cdot\| : A_d \to \mathbb{R}$ is a real-valued function satisfying
(i) positivity: $\|v\| \geq 0$ for any v in $A_d$;
(ii) identity: $\|v\| = 0$ if and only if v = o;
(iii) homogeneity: $\|tv\| = |t| \cdot \|v\|$ for any v in $A_d$ and any real t; and
(iv) triangle inequality: $\|v + v'\| \leq \|v\| + \|v'\|$ for any v, v' in $A_d$.

18 This is the specific form of Steiner's Problem for two given points.
Then $B = \{v \in A_d : \|v\| \leq 1\}$ is a unit ball in the above sense. It is not hard to see that the correspondence between unit balls B and norms $\|\cdot\|$ is unique, that is, a norm is completely determined by its unit ball and vice versa. Consequently, such a space is uniquely defined by an affine space $A^d$ and a unit ball B. It is called a Banach-Minkowski space, and is abbreviated as $M_d(B)$. A Banach-Minkowski space $M_d(B)$ is a complete metric linear space if we define the metric by
$p(v, v') = \|v - v'\|_B$. (2.14)
Usually, a (finite- or infinite-dimensional) linear space which is complete with regard to its given norm is called a Banach space. Essentially, every Banach-Minkowski space is a finite-dimensional Banach space and vice versa.19
Observation 2.4.11 Segments in a Banach-Minkowski space are shortest curves (in the sense of inner geometry). They are the unique shortest curves if and only if the unit ball is strictly convex.20

Roughly speaking, the observation that a straight line is the shortest distance between two points is Steiner's Problem for a set of two points. In particular, we consider finite-dimensional spaces with p-norm, defined in the following way: For $x = (x_1, \ldots, x_d)$ we define the norm by
$\|x\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}$,

19 An infinite-dimensional Banach space is often called a Banach-Wiener space, compare [460]. The structure of such spaces is intrinsically more complicated than that of the finite-dimensional ones.
20 The fourth problem of Hilbert is to characterize all geometries in which segments (convex hulls of two different points) are shortest curves (in the sense of inner geometry). In particular, Hilbert asks for the construction of all these metrics and the study of the individual geometries. Hilbert's problem is a program of research about the foundations of geometry. The major contributions were the books The Geometry of Geodesics [61] by Busemann in 1955 and Hilbert's Fourth Problem [347] by Pogorelov in 1979. For a historical discussion compare [11] and [468].
where $1 \leq p < \infty$ is a real number. If p runs to infinity then we get the so-called Maximum norm
$\|x\|_\infty = \max\{|x_i| : i = 1, \ldots, d\}$.
In each case we obtain a Banach-Minkowski space written as $L_p^d$.
(c) Compact manifolds.

For more facts about metric/geometric properties of several continuous spaces compare [262], [281], [297], [364], [381], [411] and [426].
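The p-norms introduced under (b) can be evaluated directly. The following is a minimal sketch, assuming vectors are given as plain Python sequences of numbers; the function names `p_norm` and `p_distance` are hypothetical and chosen only for this illustration.

```python
import math


def p_norm(x, p):
    """Return the p-norm of the vector x for 1 <= p <= infinity."""
    if p < 1:
        raise ValueError("p must be at least 1 for a norm")
    if math.isinf(p):
        # Limit case p -> infinity: the Maximum norm.
        return max(abs(c) for c in x)
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def p_distance(v, w, p):
    """Distance induced by the p-norm: the norm of the difference v - w."""
    return p_norm([a - b for a, b in zip(v, w)], p)
```

For instance, `p_distance((0, 0), (3, 4), 2)` evaluates the Euclidean distance, while `p = 1` gives the rectilinear distance and `p = math.inf` the distance induced by the Maximum norm.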
II. Concerning "discrete" spaces we make the following definition: A metric space (X, p) is called a discrete metric space if any bounded set is finite. In other words, if a subset W is bounded, then also
$|W| < \infty$. (2.18)
Consequently, an SMT exists for any finite set of points in such spaces. Examples are:
(a) Finite metric spaces.
(b) Graphs (a specific case of finite metric spaces21).
(c) Let Z be the set of all integers, then $Z^d$ equipped with a rectilinear, Euclidean or any other "desired chosen" distance is a discrete metric space.
(d) Spaces of words with phylogenetic (= space measured evolutionary) distances.

Note that an infinite set with the so-called discrete metric, which defines the distance between two different points to be 1, is not a discrete metric space.22 For more facts about metric/geometric properties of several discrete spaces compare [246] and [476].

21 An introduction to the theory of graphs was given in the previous chapter; the representation as metric spaces will be described at the end of the present chapter.
22 But, of course, for any given set of points in such a space there exists an SMT, namely the MST.
2.4.2 More facts in the Euclidean plane
Of course, if we investigate a more specific metric space, we find further facts about Steiner Minimal Trees. The Euclidean plane is defined in the affine plane with the Euclidean metric $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ between the points $(x_1, y_1)$ and $(x_2, y_2)$, derived from a norm $\|\cdot\|$:
$\|(x, y)\| = \sqrt{x^2 + y^2}$. (2.19)
Steiner's Problem looks for a shortest network and in particular for a curve C of shortest length joining two points. For our purposes, we regard a geodesic curve as any curve of shortest length. If we parametrize the curve C by a differentiable map $\gamma : [0, 1] \to \mathbb{R}^d$ we define
length of $C = \int_0^1 \|\dot{\gamma}(t)\| \, dt$. (2.20)
It is not hard to see that among all differentiable curves C from the point v to the point v' the segment
$\overline{vv'} = \{tv + (1 - t)v' : 0 \leq t \leq 1\}$ (2.21)
minimizes the length of C. Moreover, as a consequence of 2.4.11:

Observation 2.4.12 All segments and no other sets of points are geodesic curves in the Euclidean plane.

Consequently, in the Euclidean plane SMTs always exist, and we may represent a graph G = (V, E) embedded in the plane so that
(i) V is a finite set of points;
(ii) Each edge $\overline{vv'}$ is a geodesic curve, which means a shortest curve in the sense of inner geometry. We may assume that $\overline{vv'}$ is a segment;23
(iii) Each edge $\overline{vv'}$ has length $\|v - v'\|$;
(iv) The length of G is defined by
$L(G) = \sum_{\overline{vv'} \in E} \|v - v'\|$. (2.22)

23 This justifies the double meaning of $\overline{vv'}$ as an edge of a graph and as a segment.
Using our first example in section 2.2, we have

Observation 2.4.13 Let N be a finite set of points in the Euclidean plane. Without loss of generality, in an SMT T = (V, E) for N a given point can have degree 1, 2 or 3; a Steiner point always has degree 3.

Moreover, paying attention to 2.4.5, we find

Observation 2.4.14 An SMT for n given points has exactly n - 2 Steiner points if and only if each given point is of degree one.
A tree with the property described in the last observation is called a full tree. It is a binary tree, i.e. it contains only leaves and internal vertices of degree three. The following property of full trees can be empirically observed: "Typical" sets of given points in the Euclidean plane usually do not have SMTs which are full trees. That is, their SMTs tend to be unions of small full trees.24 We decompose a given tree for N into full trees by the following procedure:

Procedure 2.4.15 Let T = (V, E) be a tree for N, that means $N \subseteq V$, and let v be a point in N with g(v) > 1.

1. Define $G = (V \setminus \{v\}, E \setminus \{vv' : v' \text{ is a neighbor of } v\})$. (G is a forest with g(v) components $G_i = (V_i, E_i)$, i = 1, ..., g(v).)

2. Define for i = 1, ..., g(v) the graph $G(i) = (V_i \cup \{v_i\}, E_i \cup \{v_i v' : v' \text{ is a neighbor of } v \text{ in } T \text{ and } v' \text{ is in } V_i\})$, where $v_i$ is not in V.

If we repeat this procedure we obtain a family of trees in which, for each tree, the degree of any vertex which is a given point equals one.
Observation 2.4.16 Let T = (V, E) be a tree for N. Then the number of full trees of T is
$1 + \sum_{v \in N} (g_T(v) - 1)$. (2.23)

24 The fastest exact algorithms (in practice) for Steiner's Problem use two phases: first a small but sufficient collection of full SMTs is generated and then an SMT is constructed from this collection. See [444].
To estimate the total number of full trees more exactly, denote by f(n) the number of such trees with n given and n - 2 Steiner points. Then f(2) = 1. If one removes a given point and also its adjacent Steiner point, one obtains a full tree. This shows that every full tree with n + 1 given points can be obtained from a full tree with n given points by adding a Steiner point in one of the 2n - 3 edges and adding a new edge. Hence,
$f(n + 1) = (2n - 3) \cdot f(n)$. (2.24)
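The growth of f(n) can be checked numerically. The sketch below simply iterates the recursion (2.24) from f(2) = 1; the function names are hypothetical. The comparison with the product of odd numbers 1 · 3 · 5 · ... · (2n - 5) is an elementary consequence of the recursion and is included only as a sanity check.

```python
def count_full_trees(n):
    """f(n) from the recursion f(2) = 1, f(m + 1) = (2m - 3) * f(m)."""
    if n < 2:
        raise ValueError("need at least two given points")
    f = 1
    for m in range(2, n):
        # Step from m to m + 1 given points: 2m - 3 edges can be subdivided.
        f *= 2 * m - 3
    return f


def odd_product(k):
    """Product k * (k - 2) * ... down to 1 for odd k; empty product 1 if k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result
```

Calling `count_full_trees(n)` for n = 2, ..., 6 walks through the first few values of the recursion, and `odd_product(2 * n - 5)` agrees with it for every n, which illustrates how quickly the number of candidate full trees grows.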
A solution of this recursive equation is given by

Observation 2.4.17 There are
$f(n) = \frac{(2n - 4)!}{2^{n-2} \, (n - 2)!}$
pairwise distinct full trees with n leaves. Consequently, if we ignore the numbering of the internal vertices, we have to check
distinct full trees.

Remember that it is not simple to describe a Steiner hull of N.
Observation 2.4.18 In the Euclidean plane the convex hull of the set of given points is a Steiner hull.

In other words, there is a Steiner hull which is a polygon. Cockayne [110] was the first to find that an improved polygonal hull can be obtained by repeatedly deleting triangles from the boundary of the convex hull of the given set. In the following description, let N be a finite set of points in the Euclidean plane.

1. Start with the convex hull conv N;

2. Let v and v' be two points of N such that $\overline{vv'}$ is contained in the boundary of conv N. If there is a third point w in N such that the triangle conv{v, v', w} contains no other point of N and the angle at w is not less than 120° then no edge of the SMT is within conv{v, v', w}; the new boundary of the Steiner hull is obtained by replacing the segment $\overline{vv'}$ by the segments $\overline{vw}$ and $\overline{wv'}$.

If the hull then becomes self-intersecting in some of the given points, the original problem can be decomposed into two or more smaller problems. Weng [452] has generalized this concept and gives a method to construct Steiner polygons by repeatedly deleting m-gons, where m is at most the number of given points. He has also shown the uniqueness of the Steiner polygons obtained by this method.
It is an interesting question to decide which of all these facts are true in higher-dimensional Euclidean spaces, or more generally, in metric spaces.25 A helpful discovery in the investigations of Steiner's Problem is the observation that the degrees of vertices of SMTs in finite-dimensional Banach spaces are bounded by a quantity which only depends on the space:
Observation 2.4.19 Consider d-dimensional Banach spaces with a smooth norm. Then it holds that
(a) (C. [92]) The degree of each vertex in an SMT is at most 2d.
(b) (Lawlor, Morgan [275], Swanepoel [415]) d + 1 edges can meet at a Steiner point of an SMT, but never d + 2.
In particular, the degree of Steiner points in Euclidean spaces is independent of the dimension:

Theorem 2.4.20 In Euclidean spaces of any dimension the degree of a Steiner point in an SMT equals 3.

Proof. The equation (1.11) also holds true in d dimensions. Hence we have
that is, an inequality which is satisfied only for $n \leq 3$.

For the planar case we know more about the vertex degrees.
Observation 2.4.21 Consider SMTs in a Banach plane equipped with a unit ball B. Then
(a) (C. [91], Swanepoel [416]) For the degrees of the vertices the following holds true: If B is an affinely regular hexagon, then the degree is at most 6, otherwise at most 4.
(b) (Morgan et al. [315]) At most four edges come together in a Steiner point.
2.5 STEINER'S PROBLEM IN GRAPHS
Connectivity is also a very important concept in combinatorial optimization. We will discuss this concept in the sense of Shortest Connectivity in metric spaces.
2.5.1 The metric closure of a network
Here we consider networks. These are (connected) graphs G = (V, E) equipped with a length function $f : E \to \mathbb{R}$. This function on the edges of G is constrained to take only strictly positive values.26 The simplest question, which will be of great importance in further considerations, is to look for the "geodesic curves", which are the interconnecting chains of shortest length between vertices in the network:
The Shortest Path Problem
Given: A network G = (V, E, f) and two vertices v and v' of G.
Find: A path connecting v and v' with minimal length.
A solution is called a shortest path (between the vertices v and v' in G). With this in mind each network is a metric space, more precisely

Observation 2.5.1 Let G = (V, E) be a connected graph equipped with a length function $f : E \to \mathbb{R}$. Define the distance function p on V so that
p(v, v') = the length of a shortest path between the vertices v and v' in G,
for two different vertices v and v', and p(v, v) = 0. Then (V, p) is a metric space.

The space (V, p) is called the metric closure $G_f$ of a graph G = (V, E) with length function $f : E \to \mathbb{R}$. We can also define $G_f$ as the complete graph on V such that the length of an edge vv' in $G_f$ is the length of a shortest path between v and v' in G. Then we call $G_f$ the distance graph of the network G = (V, E, f). Note that G is a subgraph of $G_f$, but the restriction of p to the edges of G need not equal f.

26 Nevertheless, saying it explicitly, sometimes we will use a length function which has the value 0 for several edges.
The problem of finding shortest paths in a graph with a length function is easy to solve by the so-called dynamic programming technique, which is a rather general method for solving combinatorial problems having the property that their optimal solution can be computed recursively from solutions to subproblems. More precisely, we use the following observation, called Bellman's principle of optimality, which is indeed the core of dynamic programming:

Observation 2.5.2 (Bellman [37]) Let G = (V, E, f) be a network, and let v and v' be two vertices of G. If e = wv' is the final edge of some shortest path v, ..., w, v' from v to v', then v, ..., w (that is, the path without the edge e) is a shortest path from v to w.
Roughly speaking: An optimal strategy contains only optimal substrategies. The observation gives immediately

Algorithm 2.5.3 (Dijkstra [125]) Let G = (V, E, f) be a network. A shortest path between the vertices v and v' can be found by the following procedure:

1. Start with the vertex v; label v with 0: L(v) := 0; all other vertices are unlabelled;

2. Determine $\min\{L(v_1) + f(v_1 v_2)\}$, where $v_1$ and $v_2$ are adjacent vertices, $v_1$ labelled and $v_2$ not; choose $\hat{v}_1$ and $\hat{v}_2$ which attain the minimum; label $\hat{v}_2$ by $L(\hat{v}_2) = L(\hat{v}_1) + f(\hat{v}_1 \hat{v}_2)$;

3. Repeat the second step until v' is labelled.
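The labelling procedure of 2.5.3 can be sketched as follows. This is an illustrative implementation, not the book's formulation: instead of scanning all labelled/unlabelled pairs in step 2, it keeps candidate labels in a priority queue, a common implementation technique. The graph representation (a dictionary mapping each vertex to a dictionary of neighbors and positive edge lengths) and the function name are assumptions made for this example.

```python
import heapq


def shortest_path_labels(graph, source, target=None):
    """Return the labels L(w) of all vertices labelled before stopping.

    graph: dict mapping vertex -> dict of neighbour -> positive edge length.
    If target is given, the search stops as soon as target is labelled.
    """
    labels = {}                      # permanently labelled vertices
    candidates = [(0, source)]       # (tentative label, vertex)
    while candidates:
        length, w = heapq.heappop(candidates)
        if w in labels:
            continue                 # outdated queue entry, w already labelled
        labels[w] = length           # minimum of L(v1) + f(v1 v2) over the frontier
        if w == target:
            break
        for u, edge_length in graph[w].items():
            if u not in labels:
                heapq.heappush(candidates, (length + edge_length, u))
    return labels
```

Called without a target, the function labels every vertex reachable from the source, so repeating it for each source yields all pairwise distances.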
For all labelled vertices w the quantity L(w) is the length of a shortest path connecting v and w: p(v, w) = L(w).

Now it is easy to construct the metric closure $G_f$: it is sufficient to apply 2.5.3 |V| times.27 When we are only interested in the metric p we can find the metric closure in a simpler way:

27 When the algorithm in 2.5.3 runs until all vertices are labelled, the algorithm creates a tree T = (V, F) in which the unique path from v to any other vertex v' is a shortest path interconnecting these points in G. T is called the distance tree related to v.
Algorithm 2.5.4 (Floyd [166]) Let $G = (V = \{v_1, \ldots, v_n\}, E, f)$ be a network. The metric closure $G_f = (V, p)$ can be found by the following procedure:

1. for $v_i v_j \notin E$ define $f(v_i v_j) = \infty$;

2. for i := 1 to n do for j := 1 to n do $p(v_i, v_j) := f(v_i v_j)$;

3. for i := 1 to n do for j := 1 to n do for k := 1 to n do
   if $p(v_j, v_i) + p(v_i, v_k) < p(v_j, v_k)$ then $p(v_j, v_k) := p(v_j, v_i) + p(v_i, v_k)$
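The three steps of 2.5.4 translate almost literally into code. The sketch below is a minimal illustration under stated assumptions: edges are given as a dictionary keyed by two-element `frozenset`s, the function name `metric_closure` is hypothetical, and p(v, v) is set to 0 as in Observation 2.5.1.

```python
import math


def metric_closure(vertices, lengths):
    """Return p as a dict with p[u, v] = length of a shortest path from u to v.

    vertices: list of vertex names.
    lengths:  dict mapping frozenset({u, v}) -> positive edge length f(uv).
    """
    # Steps 1 and 2: missing edges get length infinity, p(v, v) = 0.
    p = {
        (u, v): 0 if u == v else lengths.get(frozenset((u, v)), math.inf)
        for u in vertices
        for v in vertices
    }
    # Step 3: the outermost loop runs over the intermediate vertex v_i.
    for i in vertices:
        for j in vertices:
            for k in vertices:
                if p[j, i] + p[i, k] < p[j, k]:
                    p[j, k] = p[j, i] + p[i, k]
    return p
```

Note that the intermediate vertex must be the outermost loop variable, exactly as in step 3 of the algorithm; permuting the loops would in general not produce shortest path lengths.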
In particular, the function f = 1 is a length function. It measures the distance by counting the number of edges in the path.
A first example: Let $A = A(G) = (a_{ij})_{i,j=1,\ldots,n}$ be the adjacency matrix for the graph $G = (V = \{v_1, \ldots, v_n\}, E)$. Then, obviously, the equation $a_{ij} = 1$ means that there is a chain of length 1 from $v_i$ to $v_j$. Now consider
$A^k = (a_{ij}^{(k)})_{i,j=1,\ldots,n}$,
the k-th power of A. Using induction it is not hard to see that the equation $a_{ij}^{(k)} = m$ means that there are m different chains of length exactly k from $v_i$ to $v_j$. Hence, the graph G is connected if and only if for any pair of distinct vertices $v_i$ and $v_j$ there is a number k = k(i, j) between 1 and n - 1 such that $a_{ij}^{(k)} > 0$:

Remark 2.5.5 Let $G = (V = \{v_1, \ldots, v_n\}, E)$ be a connected graph, let A = A(G) be its adjacency matrix and let $A^k = (a_{ij}^{(k)})_{i,j=1,\ldots,n}$, k = 1, 2, .... Then
$p(v_i, v_j) = \min\{k : a_{ij}^{(k)} > 0\}$
holds true for any two distinct vertices $v_i$ and $v_j$.

The quantity
diam G = max{p(v, v') : v, v' ∈ V}
is called the diameter of the graph G = (V, E).28 Of course, for any connected graph G it holds that diam G ≤ |V| - 1. This implies that, using the adjacency matrix, we have to check only the powers up to k = |V| - 1 to decide if a graph is connected or not.
A complete overview of the theory of shortest paths in networks is given by Huckenbeck [227].
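The connectivity test via powers of the adjacency matrix described in this subsection can be sketched as follows. The code is a plain illustration with hypothetical function names; matrices are lists of lists of integers, and a "chain" of length k is counted by the corresponding entry of the k-th power.

```python
def mat_mult(a, b):
    """Product of two square integer matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def chain_counts(adj, k):
    """k-th power of the adjacency matrix (k >= 1); entry (i, j) counts chains of length k."""
    power = [row[:] for row in adj]
    for _ in range(k - 1):
        power = mat_mult(power, adj)
    return power


def is_connected(adj):
    """Check whether some power A^k, 1 <= k <= n - 1, is positive for each pair i != j."""
    n = len(adj)
    power = [row[:] for row in adj]
    seen = [[adj[i][j] > 0 for j in range(n)] for i in range(n)]
    for _ in range(2, n):            # k = 2, ..., n - 1
        power = mat_mult(power, adj)
        for i in range(n):
            for j in range(n):
                if power[i][j] > 0:
                    seen[i][j] = True
    return all(seen[i][j] for i in range(n) for j in range(n) if i != j)
```

This is far from the most efficient connectivity test (a simple graph search is faster), but it mirrors the matrix argument of the text directly.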
2.5.2 The Question
The central question of "Shortest Connectivity" in networks is

Steiner's Problem in Graphs
Given: A connected graph G = (V, E) with a length function $f : E \to \mathbb{R}$ and a nonempty subset N of V.
Find: A connected subgraph G' = (V', E') of G such that $N \subseteq V'$ and the total length of the edges in E' is minimal.
This formulation is equivalent to our definition in the section before. This can be seen as follows: First, a solution G' = (V', E') of Steiner's Problem must be a tree, because it is connected and acyclic. Consider the vertices in $V' \setminus N$. Such vertices v with $g_{G'}(v) \geq 3$ are Steiner points, and the vertices v with $g_{G'}(v) = 2$ lie on a shortest path between Steiner points and given points of N. In other terms, we consider Steiner's Problem in the metric closure $G_f$. The length of the SMT in both graphs must be the same. In this sense, Steiner's Problem in graphs is a special case of the problem in metric spaces.29 Since each finite metric space is equivalent to some network, compare [204], we have

Observation 2.5.6 Steiner's Problem in networks and in finite metric spaces are essentially the same.

28 For disconnected graphs this quantity is undefined, or $\infty$.
29 In particular, there is no loss of generality in requiring that the length function satisfy the triangle inequality: if it does not, construct the metric closure.
Steiner's Problem in graphs was originally formulated by Hakimi [203] in 1971. Since then, the problem has received considerable attention in the literature. A collection of equivalent formulations for Steiner's Problem in graphs is given in [257]. Two specific cases are well-known:

|N| = 2: We search for a shortest path interconnecting the two points in N. Here there does not exist a Steiner point, so any internal vertex on the path has degree 2. To find such paths we use the dynamic programming strategy of algorithm 2.5.3.

N = V: Here Steiner points are not necessary; we look for a minimum spanning tree. This is easy to do using the greedy strategy of algorithm 1.2.9.

Two algorithms, which generalize our specific cases, create an SMT in graphs. These algorithms are given by Dreyfus, Wagner and Hakimi. The Dreyfus and Wagner solution method breaks the problem down into subproblems, and each of these subproblems themselves into subproblems, etc., until the subproblems can be solved with the help of a shortest path technique.

Algorithm 2.5.7 (Dreyfus and Wagner [133]) Let G = (V, E, f) be a network. Let $N \subseteq V$ be a set of given points. Then an SMT for N in G can be found by the following procedure:

1. (Initialization) For all vertices v, v' compute p(v, v') in G;

2. (Recursion) Perform the following calculations for all k from 2 to |N| - 1:
- For all $K \subseteq N$ such that |K| = k and for all $v \in V \setminus K$, compute
  $L_v(K \cup \{v\}) = \min\{L(K' \cup \{v\}) + L((K \setminus K') \cup \{v\}) : \emptyset \neq K' \subset K\}$;
- For all $K \subseteq N$ such that |K| = k and for all $v \in V \setminus K$, compute
  $L(K \cup \{v\}) = \min\{\min_{w \in K} \{p(v, w) + L(K)\}, \min_{w \notin K} \{p(v, w) + L_w(K \cup \{w\})\}\}$,

where L(K) denotes the length of an SMT for K and $L_v(K \cup \{v\})$ denotes the length of a shortest tree for $K \cup \{v\}$ satisfying the additional constraint that v has degree at least two.
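The recursion of 2.5.7 can be sketched in code. The version below is a common bitmask formulation of the Dreyfus-Wagner dynamic program, not a literal transcription of the two formulas above: subsets K of given points are encoded as bit masks, one given point serves as a root, and the two minimizations are merged by allowing the attachment vertex w to range over all of V in the metric closure. All names are hypothetical, and edges are assumed to be given as a dictionary keyed by two-element `frozenset`s.

```python
import math


def metric_closure(vertices, lengths):
    """All-pairs shortest path lengths p[u, v] (initialization step)."""
    p = {(u, v): 0 if u == v else lengths.get(frozenset((u, v)), math.inf)
         for u in vertices for v in vertices}
    for i in vertices:
        for j in vertices:
            for k in vertices:
                if p[j, i] + p[i, k] < p[j, k]:
                    p[j, k] = p[j, i] + p[i, k]
    return p


def smt_length(vertices, lengths, given):
    """Length of an SMT for the given points in the network (vertices, lengths)."""
    given = list(given)
    if len(given) < 2:
        return 0
    p = metric_closure(vertices, lengths)
    root, rest = given[-1], given[:-1]
    full = (1 << len(rest)) - 1
    # best[K][v]: length of a shortest tree connecting the subset K of rest with v.
    best = [None] * (full + 1)
    for i, t in enumerate(rest):
        best[1 << i] = {v: p[t, v] for v in vertices}
    for K in range(1, full + 1):
        if best[K] is not None:
            continue                       # single given point, already initialized
        # split[v]: v joins two subtrees, i.e. v has degree at least two.
        split = {}
        for v in vertices:
            value = math.inf
            sub = (K - 1) & K              # enumerate proper nonempty subsets of K
            while sub:
                value = min(value, best[sub][v] + best[K ^ sub][v])
                sub = (sub - 1) & K
            split[v] = value
        # Attach v to the best branching vertex w by a shortest path.
        best[K] = {v: min(split[w] + p[w, v] for w in vertices) for v in vertices}
    return best[full][root]
```

Like the algorithm as stated in the text, this sketch returns only the length; recovering the tree itself would require recording which subset split and which attachment vertex achieved each minimum.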
The algorithm, as stated above, only computes the length of an SMT T. For the explicit construction of T, the algorithm has to be supplemented by a backtracking procedure. The time complexity is $O(3^n k + 2^n k^2 + k^3)$, where n = |N| and k = |V|. Hence, the algorithm is exponential in the number of given points and polynomial in the number of other vertices.
On the other hand, Hakimi proposed that a minimum spanning tree be calculated for each of the possible subsets of vertices, from just the set of given points through to the complete set of vertices:
Algorithm 2.5.8 (Hakimi [203], Lawler [274]) Let G = (V, E, f) be a network. Let $N \subseteq V$ be a set of given points. Then an SMT for N in G can be found by the following procedure:

1. Compute shortest paths between all pairs of vertices; replace the edge lengths with the shortest path lengths, adding edges to the graph where necessary;30

2. For each possible subset $V' \subseteq V \setminus N$ such that $0 \leq |V'| \leq |N| - 2$, find a minimum spanning tree $T(N \cup V')$ in the induced subgraph $G_f[N \cup V']$;

3. Select the shortest spanning tree from the ones computed in step 2; transform it into a tree of the original graph, i.e., replace each edge of the spanning tree with the edges of the shortest path between the vertices.

The time complexity of the algorithm is $O(n^2 \cdot 2^{k-n} + k^3)$, where n = |N| and k = |V|. Hence, the algorithm is polynomial in the number of given points and exponential in the number of the other vertices.
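Steps 2 and 3 of 2.5.8 can be sketched as a direct enumeration. The code below is a minimal illustration that assumes the metric closure p has already been computed (step 1) and is passed in as a dictionary `p[u, v]`; the function names are hypothetical, and only the length of the best tree is returned, not the tree itself. The minimum spanning trees are found with a simple Prim-style loop.

```python
from itertools import combinations


def mst_length(points, p):
    """Length of a minimum spanning tree on points in the complete graph with lengths p."""
    points = list(points)
    in_tree = {points[0]}
    total = 0
    while len(in_tree) < len(points):
        # Cheapest edge leaving the current tree.
        length, v = min(
            (p[u, w], w) for u in in_tree for w in points if w not in in_tree
        )
        total += length
        in_tree.add(v)
    return total


def smt_length_by_enumeration(p, vertices, given):
    """Try every set V' of at most |N| - 2 additional vertices (N assumed nonempty)."""
    given = list(given)
    others = [v for v in vertices if v not in given]
    best = mst_length(given, p)                      # V' empty
    for size in range(1, max(0, len(given) - 2) + 1):
        for extra in combinations(others, size):
            best = min(best, mst_length(given + list(extra), p))
    return best
```

The bound |V'| ≤ |N| - 2 used in the loop is the one from 2.4.5, which is what keeps the enumeration finite and independent of how large the remaining vertex set is beyond its subsets of that size.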
A polyhedral approach for Steiner's Problem in graphs is given by Aneja [16], Grötschel and Monma [195], and Lucena and Beasley [36]: For each edge $e \in E$, a variable $x_e$ is introduced. We consider the vector space $\mathbb{R}^E$. Each subset $F \subseteq E$ induces an incidence vector $x^F = (x^F_e)_{e \in E}$ in $\mathbb{R}^E$ by defining $x^F_e = 1$ if $e \in F$, and $x^F_e = 0$ otherwise. Conversely, each 0/1-vector x in $\mathbb{R}^E$ induces a subset $F = \{e \in E : x_e = 1\}$ of the edge set E of G. Then Steiner's Problem can be formulated as the following integer linear program:

minimize $\sum_{e \in E} f(e) \, x_e$

subject to

$\sum_{vv' \in E,\; v \in W,\; v' \in V \setminus W} x_{vv'} \geq r_{uu'}$ for all pairs $u, u' \in V$, $u \neq u'$, and for all $W \subset V$ with $u \in W$, $u' \notin W$ (where $r_{uu'} = 1$ for all $u, u' \in N$ and $r_{uu'} = 0$ otherwise);

$x_{vv'} \in \{0, 1\}$ for all $vv' \in E$.

30 In other words, determine the metric closure of G.
Branch and bound is a technique for the complete enumeration of all possible solutions without having to consider them one by one. To apply this method to a combinatorial minimization problem, we need two steps:
Branch: A given subset of the possible solutions can be partitioned into at least two (nonempty) subsets;
Bound: For a subset obtained by branching iteratively, a lower bound on the length of any solution within this subset can be computed.

Such an algorithm for Steiner's Problem in graphs was first developed by Shore, Foulds and Gibbons [396]. Another branch and bound approach that uses heuristics to provide good lower bounds and is based on an integer programming formulation is given by Khoury and Pardalos [256]. Other approaches to solve Steiner's Problem in networks are given by C. et al. [102] and [103]. All known exact algorithms for Steiner's Problem in graphs are in some way enumerative algorithms. However, they differ in how the enumeration is done and how clever their strategies for avoiding total enumeration are.31 Consequently, all of these algorithms need exponential time. But this is not a surprise, since

Remark 2.5.9 (Karp [251]) Steiner's Problem in graphs is NP-hard.32

Steiner's Problem remains NP-hard if any of the following conditions hold:
- All edge lengths are equal, i.e. the length of a subgraph is its number of edges [251].
- The graph is bipartite [177].33
- The graph is a hypercube [169].
- The graph G is planar [177], [355].34

31 For the problem of enumerating all solutions see [104].
32 For information about the complexity of problems see the next chapter.
33 A graph G = (V, E) is called bipartite if it is possible to partition V into subsets $V_1$ and $V_2$ such that every edge joins a vertex of $V_1$ to a vertex of $V_2$. A well-known characterization is

Theorem 2.5.10 A connected graph is bipartite if and only if it contains no cycle of odd length.

Sketch of the proof. If a graph G = (V, E) contains an odd cycle then it cannot possibly be bipartite. Now suppose that G contains no odd cycle, then choose any vertex v of G and create a partition by
$V_1 = \{w \in V : p(v, w) \text{ is an even number}\}$, (2.32)
$V_2 = \{w \in V : p(v, w) \text{ is an odd number}\}$. (2.33)
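The partition used in this proof sketch can be computed by a breadth-first search, which yields the edge-count distance p(v, w) from the chosen vertex. The following is a minimal sketch assuming a connected graph given as a dictionary of adjacency lists; the function name is hypothetical.

```python
from collections import deque


def bipartition(graph, start):
    """Split a connected graph by the parity of the distance from start.

    graph: dict mapping vertex -> iterable of neighbours.
    Returns (V1, V2) with even and odd distances, or None if some edge
    joins two vertices of the same class (so the graph is not bipartite).
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    # An edge inside one parity class closes a cycle of odd length.
    for v in graph:
        for w in graph[v]:
            if dist[v] % 2 == dist[w] % 2:
                return None
    even = {w for w in dist if dist[w] % 2 == 0}
    odd = {w for w in dist if dist[w] % 2 == 1}
    return even, odd
```

On a cycle with four vertices the function returns the two pairs of opposite vertices, whereas on a triangle it returns `None`.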
Corollary 2.5.11 All trees are bipartite.
³⁴ A graph $G = (V, E)$ is called planar if it can be embedded into the plane such that no two curves which are the embeddings of the edges intersect each other outside of the vertices. More precisely, planarity asserts that it is possible to represent the graph in the plane in such a way that the vertices correspond to distinct points and the edges to simple Jordan curves connecting the points of its endvertices such that every two curves are either disjoint or meet only at a common endpoint. An embedding of a planar graph determines a partition of the plane into regions. Exactly one of these regions is unbounded. The number of regions can be computed by the classical formula of Euler:
Theorem 2.5.12 Let $G = (V, E)$ be a connected and planar graph, and let $f$ denote the number of regions (including the single unbounded region) of an embedding of $G$ in the plane. Then
$|V| - |E| + f = 2$.  (2.34)
Consequently, the number of regions is uniquely determined by the number of vertices and edges, i.e. by the combinatorial structure of the graph.
Corollary 2.5.13 Under the assumption of 2.5.12 it holds that
$|E| \le 3|V| - 6$ and $f \le 2|V| - 4$.
• The graph is a grid [178].
Restricting $\mathcal{NP}$-hard algorithmic problems regarding arbitrary graphs to a smaller class of graphs will sometimes, yet not always, result in polynomially solvable problems. For instance Steiner's Problem in graphs is polynomially solvable if any of the following conditions hold:
• The graph $G$ is planar and in addition all given points lie on the boundary of at most $m$ faces of the embedding of $G$, where $m$ is a number independent of the number of points [43], [231] or [356].
• The graph is strongly chordal, meaning that every cycle with four or more edges has a chord and every cycle with an even number of six or more edges has a chord dividing the cycle into two parts, each containing an odd number of edges [457].³⁵
• The graph is a permutation graph [111].
Steiner's Problem in graphs can be solved in linear time if any of the following conditions hold:
• The graph is a series-parallel network [463].
• The graph is a Halin network. A Halin network is a graph formed by embedding a tree without degree-2 vertices into the plane and connecting its leaves by a cycle that crosses none of its edges [464].
• The graph is a partial 2-tree. Partial 2-trees are precisely those graphs which contain no subgraph homeomorphic to the complete graph with four vertices [440].
• The graph is a double tree [101].
Surveys on Steiner's Problem in graphs can be found in [231], [355], [439] and [464].
³⁵ This result cannot be extended to chordal graphs since then Steiner's Problem is $\mathcal{NP}$-complete [457].
WHAT DOES SOLUTION MEAN?
Philosophy is written in this grand book of the universe, which stands continually open to our gaze .... It is written in the language of mathematics.
Galileo Galilei
Roughly speaking in English: The essentially scientific part in any theory is the mathematical one. The essence of the application of mathematics to any branch of science is the recognition and exploitation of regularity, which may be rigid and striking or a dimly observed tendency hardly distinguishable amidst a general confusion. In this sense, we will discuss several scientific problems.
3.1
A METAPHYSICAL APPROACH
In investigating a "real world problem" we make a lot of assumptions. The logical combination of these assumptions yields hints to the solution of the problem. Doing so we invoke the so-called metaphysics. Davies [121] describes this in the following way:
In Greek philosophy, the term "metaphysics" originally means "that which comes after physics". It refers to the fact that Aristotle's metaphysics was found, untitled, placed after his treatise on physics. But metaphysics soon came to mean those topics that lie beyond physics (we would today say beyond science) and yet may have a bearing on the nature of scientific inquiry. So metaphysics means the study of topics about physics (or science generally), as opposed to the scientific subject itself. Traditional metaphysical problems have included the origin, nature and purpose of the universe, how the world of appearances presented to our sense relates to its underlying "reality" and order, the relationship between mind and matter, and the existence of free will. Clearly, science is deeply involved in such issues, but empirical science alone may not be able to answer them, or any "meaning-of-life" questions.
Mathematics gives the possibility to order and to verify scientific facts. In other words, mathematics is the logical part of metaphysics. In this sense, mathematics cannot be a scientific theory. Moreover, this is true starting from a different point of view. Davies continues:
Modern philosophy has been strongly influenced by the work of Karl Popper, who argues that in practice scientists rarely use inductive reasoning in the way described. When a new discovery is made, scientists tend to work backward to construct hypotheses consistent with that discovery, and then go on to deduce other consequences of those hypotheses that can in turn be experimentally tested. If any one of these predictions turns out to be false, the theory has to be modified or abandoned. The emphasis is thus on falsification, not verification. A powerful theory is one that is highly vulnerable to falsification, and so can be tested in many detailed and specific ways. If the theory passes those tests, our confidence in the theory is reinforced. A theory that is too vague or general, or makes predictions concerning only circumstances beyond our ability to test, is of little value.
Clearly, the construction of hypotheses cannot use scientific methods; it has to use logic and a verification scheme - in other terms, Mathematics. So we formulate: Mathematics is not a scientific theory. Without mathematics science is impossible. Moreover, Russell [372] states:
The question which Kant put at the beginning of his philosophy, namely "How is pure mathematics possible?" is an interesting and difficult one, to which every philosophy which is not purely sceptical must find an answer.
In other words, mathematics is an essential part of any scientific theory. Brown [57] gives a discussion of this claim. In particular, he named the following aspects which are important to us¹:
(1) Mathematical results are certain
(2) Mathematics is objective
(3) Proofs are essential
(6) Mathematics is wedded to classical logic
(7) Mathematics is independent of sense experience
(8) The history of mathematics is cumulative
(10) Some mathematical problems are unsolvable in principle
We will use this scheme to analyse network design problems. First, we will describe the main questions in this sense.
A problem consists of either a question to be answered, a requirement to be fulfilled, or the search for an optimal candidate among several objects. Remember Steiner's Problem: In a metric space let a finite set $N$ of points be given. We seek a network interconnecting the points of $N$ with minimal length. This is a very general question using only two restrictions:
• The network has to connect the given points. The concrete kind of the network is not predetermined.
• Only the total length of the network is minimized. This is obviously a natural demand in a metric space.
¹ For other perspectives on mathematics see [123], [154], [200], [312] and [382].
We saw in the chapter before that, if a network exists which satisfies these properties, then it must be a (finite) tree, called a Steiner Minimal Tree (SMT). That we are only interested in finite trees was not stated in the description of the problem; it is a consequence of our logical analysis.
3.2
DOES A SOLUTION EXIST?
The existence of a solution means that there is an object which fulfils the condition of the problem without creating contradictions, in itself or in accordance with other objects in mathematics. It is not true that there is an SMT for any given finite set in each metric space, if we require that the set of additional vertices is finite as well. Examples of spaces in which there does not always exist an SMT are given below.
Consider three points $v_1$, $v_2$ and $v_3$ which form the nodes of an equilateral triangle in the Euclidean plane. An SMT uses one Steiner point $q$, which is uniquely determined by the condition that the three angles at this point are equal, and consequently equal 120°.² Now, remove $q$ from the plane, and we cannot find an SMT for $v_1$, $v_2$ and $v_3$ in this new metric space.
Baronti, Casini and Papini [33] consider $c_0$, the usual space of (infinite) sequences of reals equipped with the supremum-norm. They show that there are three points (sequences) in $c_0$ without a Torricelli point.
Ivanov, Ryzhikow, Tuzhilin [236]: Let $X$ be the set of all positive integers. A metric is defined by
Then consider the three-element set
in the complete metric space
² Remember that $q$ is the Torricelli point for $v_1$, $v_2$ and $v_3$.
The triangle spanned by $N$ is equilateral, since the length of each of its sides equals 2. Hence, the length of an MST for $N$ is 4. On the other hand, for any point $q \notin N$ we have $\rho(v, q) > 1$, therefore the length of an arbitrary tree constructed for $N \cup \{q\}$ is strictly more than 3. But for $q = (t, t, t)$, $t > 1$, we have
when $t \to \infty$. Thus, there does not exist an SMT for $N$ in $(X^3, \rho)$.
A complete description of all metric spaces in which Steiner's Problem is solvable is not known and this situation is unlikely to change, because the class of all metric spaces is too big. So it is necessary to prove the existence of an SMT for each specific metric space independently.
3.3
DOES AN ALGORITHM EXIST?
Here we discuss algorithmic solutions of problems. First we will describe general facts about the design and analysis of algorithms which solve problems, and then we will apply these considerations to Steiner's Problem and its relatives.
An algorithm for a problem is a step-by-step procedure, which, when applied to any instance of the problem, produces a solution after a finite number of steps. For centuries almost all mathematicians believed that any mathematical problem could be solved using an algorithm. However, this view has been questioned over the course of time as more and more problems have arisen for which no algorithmic solution has been found.
Algorithms are formulated with respect to a specific model of computation, which describes what steps are possible. The number of available elementary operations, whatever "elementary" means in the particular context, is limited, and the same is true for the number of steps. Depending on this description we can say whether an algorithm exists or not. Three approaches are commonly used:
One model of computation is the Random Access Machine (RAM). A RAM models a one-accumulator computer whose instructions are not permitted to modify themselves. The memory consists of a sequence of registers, each of which is capable of holding an integer of arbitrary size. An upper bound of the number of registers that can be used does not exist. The program is merely a sequence of instructions, working step by step. For more information about the RAM see Aho, Hopcroft, Ullman [1].
Another model is the Turing Machine (TM). It consists of a finite state control, a read-write head, and a tape made up of a two-way infinite sequence of tape cells. Each instruction in a program for a TM specifies the straightforward changing of a word on the tape. For more information about the TM see Garey and Johnson [179].
The approaches of the RAM and the TM are essentially the same. The equivalence of these different definitions of the term "algorithm" suggests that they are appropriate for capturing the algorithm concept. This proposition, known as Church's thesis, was first put forward in 1935: The only effectively computable functions are those definable using TMs.
In Euclidean spaces an algorithm uses ruler and compass constructions. To show that such a strategy does not exist requires Galois theory; see Artin [19] or Stewart [409].
In many metric spaces we must be able to deal with real numbers, rather than integers. Hence, we will adopt a variant of the RAM in which each register is capable of holding a real number. The following operations are available in unit time: The elementary arithmetic operations, comparisons between two real numbers, $k$-th roots, exponential and trigonometric functions, in general analytic functions. The so-called real-RAM is described by Preparata and Shamos [351]. It closely reflects the kinds of programs that are typically written in high-level algorithmic languages, in which it is common to treat variables of the type 'real' as having unlimited precision, and we ignore such questions as how a real number can be read or written in unit time.
The relationship between TM/RAM and real-RAM is still an open question.
More specific forms and descriptions of algorithms are closely connected with concrete problems and will be discussed in their own environments.³
³ For a readable description of the theoretical aspects of computer science see Harel [210], [211].
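To make the Turing machine model concrete, the following minimal sketch simulates a one-tape machine with a finite state control, a read-write head and a two-way infinite tape. The names `run_turing_machine` and `INCREMENT`, the state names and the step limit are our own choices for illustration; the sample program adds one to a binary number.

```python
def run_turing_machine(program, tape, state="start", blank="_", max_steps=10_000):
    """Run a one-tape Turing machine and return the non-blank part of the tape.

    program maps (state, symbol) to (new_state, written_symbol, move),
    where move is -1 (left), 0 (stay) or +1 (right).
    The tape is stored sparsely in a dict, modelling a two-way infinite tape.
    """
    cells = {i: s for i, s in enumerate(tape)}
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step limit reached")
        symbol = cells.get(head, blank)
        # One instruction: change state, overwrite the current cell, move the head.
        state, cells[head], move = program[(state, symbol)]
        head += move
        steps += 1
    used = [i for i, s in cells.items() if s != blank]
    if not used:
        return ""
    return "".join(cells.get(i, blank) for i in range(min(used), max(used) + 1))

# Binary increment: walk to the right end, then propagate the carry leftwards.
INCREMENT = {
    ("start", "0"): ("start", "0", +1),
    ("start", "1"): ("start", "1", +1),
    ("start", "_"): ("carry", "_", -1),
    ("carry", "1"): ("carry", "0", -1),
    ("carry", "0"): ("halt", "1", 0),
    ("carry", "_"): ("halt", "1", 0),
}

print(run_turing_machine(INCREMENT, "1011"))
```

The transition table is the whole "program"; the simulator itself only looks up one entry per step, which mirrors the description of a TM instruction as a local change on the tape.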
3.4
DOES AN EFFICIENT ALGORITHM EXIST?
We are not interested only in the creation of some algorithm, but also in the amount of time the algorithm takes to run. We wish to distinguish fast solution methods from slower ones; clearly this requires us to formulate some objective notions of how to measure algorithm efficiency. It should be emphasized that although faster computers can produce solutions more rapidly than slower computers, the main advances resulted from improvements in the understanding of the mathematical structure of the underlying problems.
A problem is usually expressed in terms of several input parameters which are described but whose values are left unspecified. In most cases, there exist two or more algorithms for solving a given problem. If we have in mind the implementation of the algorithm on a machine, there is a feature that must be compared in deciding on one algorithm rather than another, namely the time taken (which depends on the number of times each step is executed), the so-called time complexity. This quantity depends on the size of the input parameters.⁴ We may assume that for a size $n$ the time complexity $t(n)$ is a function, where in general, but not exclusively, $t(n) \ge n$.
In the following discussion we will use the phrase "on the order of" to express lower and upper bounds. More precisely: Let $f$ and $g$ be functions from the positive integers to the real numbers. Then:
(i) The function $g(n)$ is said to be of order at least $f(n)$, denoted $\Omega(f(n))$, if there are positive constants $c$ and $n_0$ such that $g(n) \ge c \cdot f(n)$ for all $n \ge n_0$.
(ii) The function $g(n)$ is said to be of order at most $f(n)$, denoted $O(f(n))$, if there are positive constants $c$ and $n_0$ such that $g(n) \le c \cdot f(n)$ for all $n \ge n_0$.
(iii) The function $g(n)$ is said to be of order $f(n)$, denoted $\Theta(f(n))$, if $g(n) = \Omega(f(n))$ and $g(n) = O(f(n))$.
That is, $f(n)$ and $g(n)$ both grow at the same rate; only the multiplicative constants may be different.
This notation allows us to concentrate on the dominating term in an expression describing a lower or upper bound and to ignore any multiplicative constants. The time complexity of an algorithm expressed in terms of any of these notations is, in general, referred to as asymptotic time complexity because it reflects the behavior of the algorithm for sufficiently large values of the problem size.
⁴ All questions, definitions and investigations about algorithms will be done in view of our original problem, namely the search for shortest trees. Hence, for our considerations we will use the number of given points as the size of the input.
It is not hard to see that these "Order"-notations have the following properties:
(a) $g(n) = O(f(n))$ if and only if $f(n) = \Omega(g(n))$.
(b) The order of the sum of two functions is given by the order of the faster growing function: $f(n) + g(n) = O(\max\{f(n), g(n)\})$.
(c) If $f(n)$ is a polynomial of degree $k$ then $f(n) = O(n^k)$.
(d) The relation represented by "$O$" is transitive.
(e) For the logarithmic order $O(\log n)$ the base is irrelevant since $\log_b n = \log_a n \cdot \log_b a$.
(f) Exponential functions grow faster than polynomial functions: $n^k = O(b^n)$ for all $k > 0$ and $b > 1$. Conversely, logarithmic functions grow more slowly than polynomial functions.
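As a small worked example of the three definitions, consider the polynomial $g(n) = 3n^2 + 5n + 2$, which we choose here purely for illustration. The constants below are one possible choice of witnesses, not the only one.

```latex
% For all n >= 1 we have 5n <= 5n^2 and 2 <= 2n^2, hence
\[
  3n^2 \;\le\; 3n^2 + 5n + 2 \;\le\; 3n^2 + 5n^2 + 2n^2 \;=\; 10\,n^2 .
\]
% Lower bound: c = 3,  n_0 = 1  gives  g(n) = \Omega(n^2).
% Upper bound: c = 10, n_0 = 1  gives  g(n) = O(n^2).
% Both together:
\[
  g(n) = \Theta(n^2).
\]
```

The same argument, bounding every lower-order term by the leading power, gives property (c) for an arbitrary polynomial of degree $k$ with positive leading coefficient.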
A broader and more detailed discussion of the growth of functions is given by Aigner [3]. For our purpose we will use the following "classes of complexity", which are defined in terms of the input size $n$:

Order           Name of the "class"   Remark
$O(1)$          constant              execution time is independent of the input size
$O(\log n)$     logarithmic           the base is irrelevant
$O(n)$          linear
$O(n \log n)$   log-linear            the base is irrelevant
$O(n^2)$        quadratic
$O(n^3)$        cubic
$O(n^k)$        polynomial            $k$ is a fixed positive integer
Note that the previous table shows the "fast" algorithms, this table the "slow" ones:

Order           Name of the "class"   Remark
$O(c^n)$        exponential           $c > 1$ is a fixed positive real number
$O(n!)$         factorial             Stirling's formula: $n! \approx \sqrt{2\pi n}\,(n/e)^n$; Stirling's inequalities: $e\,(n/e)^n \le n!$ and $n! \le e\,n\,(n/e)^n$
O(2"")          superexponential

In particular, we say that the time of an algorithm is polynomially bounded (briefly polynomial) if there is a positive integer $k$ such that the time complexity depending on $n$ is $O(n^k)$. However, it is important to note that here we are considering worst-case performance. There exist certain problems for which the average-case performance of a polynomial algorithm is often worse than that of certain exponential algorithms for relatively large input size.⁵
Let $\Pi$ be a problem which is algorithmically solvable. Then the computational complexity of $\Pi$ is defined as the minimum of the time complexity among all algorithms solving $\Pi$:
complexity of $\Pi = \min\{\text{time of } M : M \text{ is an algorithm which solves } \Pi\}$.  (3.3)
Obviously, the complexity of each concrete algorithm is an upper bound for the complexity of a problem. Determining the complexity of a problem requires a two-sided attack:
(i) Finding an upper bound - the minimum complexity over all known algorithms for solving the problem.
(ii) Finding a lower bound - the largest function $f$ for which it can be (mathematically) proved that all possible algorithms for solving the problem are required to have complexity at least as high as $f$.
Our ultimate goal is to make these bounds coincide. An algorithm which realizes this coincidence is called optimal. A gap between (i) and (ii) tells us that more research is needed to achieve this goal.
⁵ E.g. the linear programming problem (compare [333] and [365]).
As introductory examples we consider several elementary problems which we will often use in the description of our algorithms. The elements of a set $U$, called the universe, are said to satisfy a partial order $\preceq$ if
(i) $\preceq$ is reflexive: For all $x \in U$ it holds that $x \preceq x$;
(ii) $\preceq$ is antisymmetric: If $x \preceq y$ and $y \preceq x$ then $x = y$;
(iii) $\preceq$ is transitive: For any three elements $x$, $y$ and $z$, if $x \preceq y$ and $y \preceq z$ then $x \preceq z$.
The pair $(U, \preceq)$ is called a partially ordered set, or shortly a poset. The relation $\subseteq$ of set inclusion is a partial order on any collection of sets. $\preceq$ is called a linear order if, additionally,
(iv) For any two elements $x$ and $y$ of $U$, $x \preceq y$, $x = y$ or $y \preceq x$.
One example of such a linearly ordered universe is defined over the set of letters of an alphabet $A$ which are in a predetermined order. This induces a lexicographic order on the set of all words over $A$. It is customary to use the symbol $\prec$ to denote $\preceq$ and $\ne$. For a sequence $S = \{x_1, \ldots, x_n\}$ whose elements are drawn from a linearly ordered universe $(U, \preceq)$ consider the following problems:
• Let $k$ be an integer satisfying $1 \le k \le n$. The problem of selection calls for finding the $k$th smallest element of $S$. In order to determine this element, we must examine each element of $S$ at least once. This establishes a lower bound of $\Omega(n)$ for any algorithm which solves the problem.⁶ On the other hand, there is an algorithm which runs in linear time, see [1] or [5].
• The problem of sorting is defined as follows: given a set $S$ in random order, arrange the elements of $S$ in nondecreasing order. There are $n!$ possible permutations of the input and consequently $\log n! = \Omega(n \log n)$ comparisons are needed to distinguish among them. It is well known that there are algorithms which run in $O(n \log n)$ time, see [1] or [5].
⁶ A sequence $S$ is called sorted if $i < j$ implies that $x_i \preceq x_j$. If $S$ were presented in sorted order then selection could be accomplished with a trivial constant-time operation.
• Let $x$ be an element of the universe $U$. In the problem of searching we seek $x$ in $S$. In general, this problem needs $O(n)$ time, but if $S$ is sorted we can find $x$ using binary search in $O(\log n)$ time.
As other examples consider our techniques for constructing shortest paths in, and the metric closure of, a network $G = (V, E, f)$, with $|V| = n$, $|E| = m$ and an integer-valued length-function $f$:
• Dijkstra's algorithm 2.5.3 determines the distances from a given vertex to all other vertices in quadratic time.⁷
• Floyd's method 2.5.4 consumes time cubic in $n$ to create the metric closure of $G$.
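The two running times just quoted can be read off directly from the loop structure. The following sketch shows the usual array-based forms of both methods; it may differ in detail from 2.5.3 and 2.5.4, and the matrix representation, the function names and the sample network are our own assumptions.

```python
import math

def dijkstra_quadratic(length, source):
    """Distances from source; length[u][v] is the edge length or math.inf.

    Each of the n rounds scans all vertices, which gives quadratic time.
    """
    n = len(length)
    dist = [math.inf] * n
    dist[source] = 0
    done = [False] * n
    for _ in range(n):
        # Pick the unfinished vertex with the smallest tentative distance.
        u = min((v for v in range(n) if not done[v]), key=lambda v: dist[v])
        done[u] = True
        for v in range(n):
            if dist[u] + length[u][v] < dist[v]:
                dist[v] = dist[u] + length[u][v]
    return dist

def metric_closure(length):
    """All-pairs distances by three nested loops, hence cubic in n."""
    n = len(length)
    d = [row[:] for row in length]
    for v in range(n):
        d[v][v] = 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

INF = math.inf
network = [
    [INF, 4, 1, INF],
    [4, INF, 2, 5],
    [1, 2, INF, 8],
    [INF, 5, 8, INF],
]
print(dijkstra_quadratic(network, 0))
print(metric_closure(network))
```

Running the quadratic method once from every vertex also yields the metric closure in cubic time, so on dense networks neither approach beats the other asymptotically.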
Significantly improving the complexity of determining the metric closure is still an open problem.
The class of problems which is solvable by an algorithm running in polynomially bounded time is usually denoted by $\mathcal{P}$. In theoretical computer science a problem is said to be efficiently solvable if it is in $\mathcal{P}$. This observation has led to the widely accepted consensus that feasible problems should have polynomial time complexity. This is reasonable, as polynomial time complexity does not depend on the machine model provided realistic machines are considered.⁸,⁹ A problem for which it is conjectured that no polynomial algorithm exists is
⁷ Or, using so-called Fibonacci heaps, in $O(n \log n + m)$ time [171].
⁸ Particularly, the concepts of TM and RAM are equivalent [1]. More precisely, let $t$ be the function that bounds the time which an algorithm needs on the simulating machine. Then

                  simulating TM    simulating RAM
simulated TM      -                $O(t \log t)$
simulated RAM     $O(t^3)$         -

In other terms, an algorithm which is polynomially bounded on the TM is polynomially bounded on the RAM as well and vice versa.
⁹ The natural answer that a linear time algorithm is efficient, and an exponential time one not, is to be read with care: Consider two algorithms whose running times are $t_1(n) = c \cdot n$ and $t_2(n) = 2^{n/c}$, where $c$ is a "very large" number. Then the second algorithm is faster for all practical purposes. What does "very large" mean? In particular, consider the following family of numbers
said to be intractable. For instance, we saw that the problem of a shortest path in a network is in $\mathcal{P}$; but the problem of a longest path is intractable, see Garey and Johnson [177].
The class $\mathcal{NP}$ is the class of decision problems that can be solved in polynomially bounded time in a nondeterministic way. In a nondeterministic algorithm, a state may determine many successor states, and each of these is followed up on simultaneously. In other words,
• $\mathcal{NP}$ is the class of problems for which it is "easy", i.e. achievable in polynomially bounded time, to check the correctness of a claimed solution; while
• $\mathcal{P}$ is the class of problems that are "easy" to solve.
Clearly,
$\mathcal{P} \subseteq \mathcal{NP}$.  (3.6)
Moreover, for any problem $\Pi$ in $\mathcal{NP}$ there exists a polynomial $p$ such that
complexity of $\Pi \le O(2^{p})$.  (3.7)
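To illustrate what "easy to check" means, consider the decision version of Steiner's Problem in graphs: is there a tree connecting the given vertices with length at most a given bound? The following sketch checks a claimed solution in polynomial time. The input format (edge lengths keyed by `frozenset` pairs), the function name and the sample graph are assumptions made here for illustration.

```python
def verify_steiner_certificate(edge_length, terminals, bound, tree_edges):
    """Check a claimed solution of the decision version of Steiner's Problem in graphs.

    edge_length maps frozenset({u, v}) to the length of the edge uv (no loops).
    tree_edges is the certificate: vertex pairs claimed to form a tree that
    connects all (at least one) terminals with total length at most bound.
    """
    edges = [frozenset(e) for e in tree_edges]
    if len(set(edges)) != len(edges) or any(e not in edge_length for e in edges):
        return False
    if sum(edge_length[e] for e in edges) > bound:
        return False
    vertices = set(terminals)
    for e in edges:
        vertices |= e
    # A connected graph is a tree exactly when it has |V| - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    adj = {v: [] for v in vertices}
    for e in edges:
        u, w = tuple(e)
        adj[u].append(w)
        adj[w].append(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices

# Three terminals a, b, c around one possible Steiner point s.
star = {frozenset(p): 1 for p in [("a", "s"), ("b", "s"), ("c", "s")]}
star.update({frozenset(p): 2 for p in [("a", "b"), ("b", "c"), ("a", "c")]})
print(verify_steiner_certificate(star, {"a", "b", "c"}, 3,
                                 [("a", "s"), ("b", "s"), ("c", "s")]))
```

Checking costs time linear in the size of the certificate, whereas no polynomial method is known for finding such a tree in the first place.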
A problem is $\mathcal{NP}$-hard if it is as "hard" as any problem in $\mathcal{NP}$; it is $\mathcal{NP}$-complete if it is both $\mathcal{NP}$-hard and in $\mathcal{NP}$. More exactly, a problem in $\mathcal{NP}$ is defined to be $\mathcal{NP}$-complete if all other problems in $\mathcal{NP}$ can be reduced to it with the help of a transformation which takes polynomial time.
There is a straightforward strategy for proving new problems $\mathcal{NP}$-complete, once we have at least one (suitably chosen) known $\mathcal{NP}$-complete problem available. To prove that the problem $\Pi_1$ is $\mathcal{NP}$-complete, we merely show that
1. $\Pi_1 \in \mathcal{NP}$; and
2. Some known $\mathcal{NP}$-complete problem $\Pi_2$ can be transformed to $\Pi_1$, using at most polynomial time.
$\mathcal{NPC}$ denotes the class of all $\mathcal{NP}$-complete problems. All the problems in this class are believed to be intractable.
Then, also for a moderate choice of an integer $n$, $c(n)$ is a large number. Moreover, consider the value $c(c(5))$. Larger numbers are described by Conway and Guy [114].
An important open question in the theory of computation is whether the containment of these classes is proper; meaning, is $\mathcal{P} \subset \mathcal{NP}$? Usually, this statement is held to be true, and is called Cook's hypothesis, first stated in 1971 [115]. Note that the statements
• $\mathcal{P} \ne \mathcal{NP}$, i.e. $\mathcal{P} \subset \mathcal{NP}$;
• $\mathcal{NPC} \cap \mathcal{P} = \emptyset$; and
• $\mathcal{NPC} \cup \mathcal{P} \subset \mathcal{NP}$, i.e. $\mathcal{NPC} \cup \mathcal{P} \ne \mathcal{NP}$;¹⁰
are pairwise equivalent, compare Garey and Johnson [179].
Roughly speaking, the class of $\mathcal{NPC}$ problems has the following properties:
(i) If an efficient solution is found for one, then it will work for all;
(ii) No such general solution has been found for any; but
(iii) There is no proof that an efficient solution cannot exist.
We assume that Cook's hypothesis is true. By now there are several thousands of problems known to be $\mathcal{NP}$-complete. For none of these has a polynomial algorithm been found. Furthermore, Strassen [412]: "The evidence in favor of Cook's hypothesis is so overwhelming, and the consequences of their failure are so grotesque, that their status may perhaps be compared to that of physical laws rather than that of ordinary mathematical conjectures." In other terms: in "our world" $\mathcal{P} \ne \mathcal{NP}$ holds.¹¹
In this book we adopt the convention of referring to the optimization version of an $\mathcal{NP}$-hard decision problem as being $\mathcal{NP}$-hard, even when the corresponding decision problem is in fact known to be $\mathcal{NP}$-complete.
¹⁰ The set
$\mathcal{NPI} := \mathcal{NP} \setminus (\mathcal{P} \cup \mathcal{NPC})$  (3.8)
consists of the problems having "intermediate" difficulty between $\mathcal{P}$ and $\mathcal{NPC}$. It is reasonable to ask if there is any "usual" problem that is a candidate for membership in $\mathcal{NPI}$. A potential member is the problem of graph isomorphism, which we will discuss later.
¹¹ There are "worlds" in which $\mathcal{P} = \mathcal{NP}$ and others in which $\mathcal{P} \ne \mathcal{NP}$. Furthermore, if a "world" is chosen at random, the probability is 1 that it will be a world in which $\mathcal{P} \ne \mathcal{NP}$. For a proof and a broader discussion see Schöning and Pruim [385].
A well-known guide through the world of $\mathcal{NP}$-completeness is the book of Garey and Johnson [179].
When constructing trees of minimal length for a finite set $N$ of points in a metric space $(X, \rho)$, we are interested in the time complexity of these algorithms. This complexity depends on $n = |N|$ only, because we assume that the distance $\rho(x, y)$ can be found in constant time for any points $x$ and $y$ of the space. That means we assume that there is an algorithm, called an oracle, giving the value $\rho(x, y)$ for any input points $x$ and $y$ in constant time. For several metric spaces such an oracle can use a variety of different methods. For a short collection of these methods see [92]. Using an oracle we find
Theorem 3.4.1 The time complexity of finding an MST in a metric space is of order $\Theta(n^2)$.
Proof. For graphs as metric spaces we can define the length of any edge with an arbitrary positive number. Any algorithm that finds an MST must process all these values. Consequently, $\Omega(n^2)$ is a lower bound for complete graphs. The upper bound is given by
Algorithm 3.4.2 (Dijkstra [125], Prim [352]¹²) Given a network $(G, f)$, a minimum spanning tree $T$ can be found by the following procedure:
1. Choose a vertex $v$ arbitrarily;
2. Add a shortest edge which joins the subtree containing $v$ with a vertex outside of this subtree;
3. Repeat step 2 and stop when all vertices are connected.
This algorithm runs in $O(m + n \log n)$ time, where $m$ is the number of edges and $n$ is the number of vertices of $G$, see [5] or [330]. Using the fact that in the complete graph there are $\binom{n}{2} = O(n^2)$ edges, we have the assertion.
¹² The approaches to the algorithms are similar; Prim's paper appeared earlier but Dijkstra was apparently unaware of it.
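The procedure 3.4.2 can be sketched for a finite point set with a distance oracle as follows. This is a simple array-based variant that calls the oracle $O(n^2)$ times, not the heap-based version behind the $O(m + n \log n)$ bound; the names `prim_mst` and `euclidean` are ours.

```python
import math

def prim_mst(points, rho):
    """Minimum spanning tree of a finite point set under a distance oracle rho.

    Returns (edges, total_length); edges are pairs of indices into points.
    Every round scans all points once, so rho is called O(n^2) times.
    """
    n = len(points)
    if n == 0:
        return [], 0.0
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known distance from the tree to each point
    parent = [None] * n
    best[0] = 0.0              # step 1: start from an arbitrary vertex
    edges, total = [], 0.0
    for _ in range(n):
        # Step 2: take the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((parent[u], u))
            total += best[u]
        for w in range(n):
            if not in_tree[w]:
                d = rho(points[u], points[w])
                if d < best[w]:
                    best[w], parent[w] = d, u
    return edges, total

def euclidean(p, q):
    """One possible oracle: the Euclidean distance."""
    return math.dist(p, q)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(prim_mst(square, euclidean))
```

Because the oracle is passed in as a function, the same routine works unchanged for other metrics, for instance the rectilinear distance `lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])`.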
As described, Prim's algorithm runs in $O(n^2)$ time, while Kruskal's algorithm takes $O(m \log m)$ time. Thus Prim's algorithm is faster on dense graphs, while Kruskal's is faster on sparse graphs. There are several minimum spanning tree algorithms for graphs that are asymptotically faster than Prim's/Dijkstra's and Kruskal's algorithms. All of these methods use data structures that are more complicated than those of the algorithms that we have discussed. The following sources give better methods to find a minimum spanning tree in a graph: Yao [469] describes an $O(m \cdot \log \log n)$ algorithm and Gabow et al. [172] have found a method with running time $O(m \cdot \log \beta(m, n))$, where
$\beta(m, n) = \min\{i : \log^{(i)} n \le m/n\}$, with $\log^{(1)} z = \log z$ and $\log^{(i+1)} z = \log \log^{(i)} z$.
A complete discussion of minimum spanning tree strategies in networks is given by Tarjan [423], [424].
Remember that the problem of shortest networks is given in a space with a geometric structure. Geometric versions of combinatorial optimization problems have attracted considerable interest. As a rule, easy problems become easier when restricted to geometric spaces, and hard ones become no easier. We will see that a minimum spanning tree in the Euclidean plane can be found faster than in a general space.
The problem of finding an MST for a set of points in an affine space differs from the problem of finding a minimum spanning tree in a general network in the following sense: The input consists of the numbers describing the coordinates of the points, with the edges and their weights being implicitly defined by an analytical system. Hence, it is useful and interesting to consider if the geometric nature of the problem can be exploited to obtain fast algorithms for finding an MST. So, it is not astonishing that the time needed to find an MST in such a space is substantially shorter than the time $O(n^2)$ in 3.4.1.
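For comparison with Prim's method, Kruskal's algorithm can be sketched with a simple union-find structure. The edge-list format and the name `kruskal_mst` are assumptions made here; the sample edges are chosen only for illustration.

```python
def kruskal_mst(n, edges):
    """Kruskal's method on vertices 0..n-1; edges is a list of (length, u, v).

    The sort costs O(m log m); the union-find work stays within that order.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree, total = [], 0
    for length, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:               # the edge joins two different components
            parent[ru] = rv
            tree.append((u, v))
            total += length
    return tree, total

sample_edges = [(4, 0, 1), (1, 0, 2), (2, 1, 2), (5, 1, 3), (8, 2, 3)]
print(kruskal_mst(4, sample_edges))
```

The running time depends on the number of edges only, which is why this method profits directly whenever the geometry allows us to restrict attention to a sparse candidate graph.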
Theorem 3.4.3 The complexity of finding an MST in the Euclidean plane and, moreover, over the class of all two-dimensional Banach spaces, is of order $\Omega(n \log n)$.
Proof. Let $t_1, \ldots, t_n$ be integers. We transform each integer $t$ into a pair $(t, 0)$. Then all pairs lie as points on a line in the plane. If we use an algorithm to find an MST for these points we find the order in the line and hence an order of the integers. But it is well known that finding an ordering of $n$ numbers needs at least $\Omega(n \log n)$ time, see [1].
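The reduction used in this proof can be made explicit. The sketch below sorts distinct integers by building an MST of the points $(t, 0)$ and walking along it. The quadratic Prim routine is used only to keep the example self-contained; any MST algorithm would serve the argument. The name `sort_via_mst` and the restriction to distinct inputs are our assumptions.

```python
import math

def sort_via_mst(numbers):
    """Sort distinct integers by building an MST of the points (t, 0) on a line."""
    points = [(t, 0) for t in numbers]
    n = len(points)
    if n <= 1:
        return list(numbers)
    # Prim's method, recording the tree as adjacency lists.
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [None] * n
    best[0] = 0.0
    adj = [[] for _ in range(n)]
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            adj[u].append(parent[u])
            adj[parent[u]].append(u)
        for w in range(n):
            if not in_tree[w]:
                d = math.dist(points[u], points[w])
                if d < best[w]:
                    best[w], parent[w] = d, u
    # For distinct collinear points the MST is the path through them in sorted order.
    leaves = [i for i in range(n) if len(adj[i]) == 1]
    start = min(leaves, key=lambda i: numbers[i])
    order, prev, cur = [], None, start
    while cur is not None:
        order.append(numbers[cur])
        nxt = [w for w in adj[cur] if w != prev]
        prev, cur = cur, (nxt[0] if nxt else None)
    return order

print(sort_via_mst([5, -2, 9, 0, 3]))
```

Reading the order off the tree takes only linear time, so an MST algorithm faster than $n \log n$ would yield a sorting algorithm faster than $n \log n$, which is the contradiction the proof relies on.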
For some metric spaces we have a geometric structure in which a measure of distances between the points is defined: Instead of the $n(n-1)/2$ input data items (namely the lengths of the edges of the complete graph), the search for an MST requires a suitably chosen graph with $n$ vertices ($2n$ input data using the coordinates of the given points) and possibly $O(n)$ edges. Hence, we may expect that an MST can be found very quickly if we use the geometry of the problem. In several specific spaces such a fast algorithm does indeed exist: The geometric properties are characterized by the concept of a Voronoi diagram. Such a diagram is a partition of the plane into closed regions $V_1, \ldots, V_n$ around the given points (the terminals) $v_1, \ldots, v_n$ such that any point in a region is closer to the terminal in its region than to the terminals of any other regions:
for all $i = 1, \ldots, n$, whereby $\|\cdot\|$ denotes a norm. The regions $V_i$ are called the Voronoi cells of the diagram.¹³ The two different points $v_i$ and $v_j$ divide the plane into two parts by defining
and the "opposite part",
The set $X(v_i, v_j)$ is called the dominance region (or the Leibnizian halfspace) of $v_i$ over $v_j$. Obviously, in the dominance region $X(v_i, v_j)$, the distance of any point to $v_i$ is less than or equal to the distance to $v_j$. Hence, we can also define the Voronoi cell $V_i$ as
¹³ In the Euclidean plane the Voronoi cells are convex polygons. This is not true in some more general spaces, compare [261], [276] and [277].
The Voronoi cells cover the space. The points in the intersection of the Voronoi cells $V_i$ and $V_j$ are equidistant to $v_i$ and $v_j$, $i \ne j$. The locus of these points,
is called the bisector between $v_i$ and $v_j$.¹⁴
Many computational geometry algorithms have been developed for efficiently constructing Voronoi diagrams in the Euclidean plane, see Okabe, Boots, Sugihara [325] or Preparata, Shamos [351]. The graph on the terminals of $N$ with an edge if the two Voronoi cells share a common side is called a Delaunay triangulation (DT) for $N$. In other words, the DT for $N$ is a straight-line dual graph of a Voronoi diagram for $N$. It is sufficient to look for a minimum spanning tree in the DT for $N$ to find an MST for $N$. Since a DT is a planar graph, there are at most $3n - 6$ edges. Thus, the application of Kruskal's algorithm needs $O(n \log n)$ time. Hence
Remark 3.4.4 (Preparata, Shamos [331]) In the Euclidean plane, the lower bound of 3.4.3 is realized, that means we can find an MST for $n$ points in $O(n \log n)$ time.
Now, consider Steiner's Problem. In chapter 2 we showed that Melzak's algorithm 2.1.1 needs exponential time. In 1971 Cook [115] proved that if a polynomial-time algorithm could be found for any single problem in $\mathcal{NPC}$, that algorithm could be used to solve all other problems in $\mathcal{NP}$ efficiently. Later,
Theorem 3.4.5 (Garey, Graham and Johnson [177]) Steiner's Problem in the Euclidean plane is $\mathcal{NP}$-hard.
It is, however, not known to be $\mathcal{NP}$-complete since its membership in $\mathcal{NP}$ is open.^15 But the following related problem is $\mathcal{NP}$-complete:
The Discrete Version of Steiner's Problem
Given: A set $N$ of points in the Euclidean plane with integer coordinates and an integer $L$.
Find: A tree $T$ interconnecting the points of $N$ such that all Steiner points have integer coordinates, and the discrete length (meaning that all distances are rounded up) of $T$ is less than or equal to $L$.
^14 In the Euclidean plane the bisectors are lines. This is not true in several other planes, particularly in the plane with rectilinear norm.
^15 The central problem here is the following: Let $a_0, a_1, \ldots, a_n$ be integers. Is $\sum_{i=1}^{n} \sqrt{a_i} \le a_0$? For a complete discussion of this question see [6].
For more details and other computational remarks see [231]. Hanan's theorem shows that Steiner's Problem in the plane with rectilinear norm $\mathcal{L}_1^2$, contrary to the Euclidean case $\mathcal{L}_2^2$, is a special case of Steiner's Problem in graphs, so that any graph theoretic method to find an SMT in graphs can also be used to find an SMT in the rectilinear plane. Unfortunately,
Theorem 3.4.6 (Garey, Johnson [178]) Steiner's Problem in the plane with rectilinear distance is $\mathcal{NP}$-complete.
Clearly, the following question is of interest: Are there metric spaces in which an MST is always a solution to Steiner's Problem? Yes, there are such spaces, in particular, the real line and trees. A more general example is the following: Let $(X, \rho)$ be an ultrametric space, that is $\rho(v, w) \le \max\{\rho(v, u), \rho(w, u)\}$ for any points $u, v, w$ in $X$. It is easy to see that
Lemma 3.4.7 The following is true for all ultrametric spaces $(X, \rho)$: If $\rho(v, u) \ne \rho(w, u)$, then $\rho(v, w) = \max\{\rho(v, u), \rho(w, u)\}$.
That means that all triangles in $(X, \rho)$ are isosceles triangles where the base is the shorter side. Now, we prove
Theorem 3.4.8 Steiner's Problem in ultrametric spaces is to find an MST.
Proof. Let $T = (V, E)$ be an SMT for $N$. Let $Q$ denote the set of all Steiner points in $T$, i.e., $Q = V \setminus N$. Suppose that $Q$ is nonempty.
In view of 2.4.6, there is a Steiner point $q$ in $Q$ such that $q$ is adjacent to two vertices $v$ and $v'$ in $N$. Having such a Steiner point and using 3.4.7, we may assume that $\rho(v, v') = \rho(v, q)$. The tree $T' = (V, E \setminus \{vq\} \cup \{vv'\})$ has the same length as $T$, and it is an SMT for $N$, too. If $g_{T'}(q) \ge 3$ we repeat this procedure. If $g_{T'}(q) = 2$ we find an SMT with a smaller number of Steiner points than $T$, since no Steiner point has degree smaller than 2. Hence, we have proved that Steiner's Problem in an ultrametric space is the same as finding an MST.
A specific ultrametric space can be created over any set $X$ of points:
$$\rho(u, v) = \begin{cases} 0 & : u = v \\ 1 & : \text{otherwise} \end{cases}$$
In $(X, \rho)$ any tree $T = (N, E)$ has the shortest possible length $|N| - 1$.
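The strong triangle inequality and the isosceles property of Lemma 3.4.7 can be checked by brute force on small examples. The sketch below does this for the 0/1 metric just defined and, as a second example that is not taken from the text, for the 2-adic distance on integers; all function names are illustrative.

```python
from itertools import permutations


def discrete_distance(u, v):
    """The 0/1 metric: distinct points are at distance 1."""
    return 0 if u == v else 1


def two_adic_distance(u, v):
    """2-adic distance on integers: 2**(-k), where 2**k exactly divides u - v."""
    if u == v:
        return 0.0
    diff, k = abs(u - v), 0
    while diff % 2 == 0:
        diff //= 2
        k += 1
    return 2.0 ** (-k)


def is_ultrametric(points, rho):
    """Check rho(v, w) <= max(rho(v, u), rho(w, u)) for all triples."""
    return all(rho(v, w) <= max(rho(v, u), rho(w, u))
               for u, v, w in permutations(points, 3))


def triangles_are_isosceles(points, rho):
    """Lemma 3.4.7: if rho(v, u) != rho(w, u) then rho(v, w) is the larger one."""
    return all(rho(v, w) == max(rho(v, u), rho(w, u))
               for u, v, w in permutations(points, 3)
               if rho(v, u) != rho(w, u))


def path_length(points, rho):
    """Length of the spanning path through the points in the given order."""
    return sum(rho(a, b) for a, b in zip(points, points[1:]))
```

For the 0/1 metric every spanning tree of $N$, for instance the path used in `path_length`, has length $|N| - 1$.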
Whenever we face a new problem we clearly would like to invent an efficient algorithm for solving it. If we find that the problem is $\mathcal{NP}$-complete, what should we do? Of course, proving that our problem is in $\mathcal{NPC}$ does not solve it. From a practical point of view, Hu and Shing [226] suggest the following ideas:
(a) Ignore whatever is known and invent your own algorithm. Maybe it is an efficient solution for your problem.
(b) Consider special cases of the problem.
(c) Find a heuristic or approximate approach.
(d) When the problem is large, decompose it into several small problems and solve them individually. Later, after these small problems are solved, somehow piece them together to get a solution to the original problem.
(e) Use well-known general methods, and hope that they are helpful.
We use these approaches in the present book. Unfortunately, in general, there is no systematic method for discovering new combinatorial algorithms for our problems. And we do not expect to create such methods, because the class of problems is too big.
3.5 DOES AN APPROXIMATION EXIST?
We have seen that for most spaces, all known deterministic methods for finding SMTs need exponential time. This reinforces the interest and the recent emphasis on the development of polynomial-time approximations and heuristics for Steiner's Problem. More precisely, the interest in approximation and heuristic algorithms arises for a number of reasons:
Sometimes, optimal solution methods are not known, but a natural idea gives a heuristic to generate a tree with a short length.
Only problems of relatively small size can currently be solved by using optimal solution algorithms. Consequently, large problems must be tackled via an approximation technique.
Finding optimal solutions of problems often takes much computation time. That means the time required is exponential or a polynomial of high degree. (If the degree of the polynomial is greater than 3 then the algorithm is already almost useless for most practical applications.)
For specific inputs to a problem an algorithm works exactly; maybe it is not far from the solution in the general case.
Heuristic solutions can be used as upper bounds to improve the efficiency of optimal solution algorithms.
Let $N$ be a finite set of points in a metric space $(X, \rho)$. Consider an approximation algorithm or a heuristic $M$ for Steiner's Problem. Then, of course,
$$L(X, \rho)(M(N)) \ge L(X, \rho)(\text{SMT for } N). \qquad (3.16)$$
The function we use to measure the deviation from the optimum may take several forms. We could choose to measure the difference between the length of the computed solution and the length of the optimal solution. But this measure does not make sense here, since
Observation 3.5.1 (Widmayer [459]) Unless $\mathcal{P} = \mathcal{NP}$, no polynomial time approximation algorithm $M$ for Steiner's Problem in networks can guarantee
$$L(M(N)) - L(\text{SMT for } N) < K, \qquad (3.17)$$
where $N$ is a given set of vertices in the network, and $K$ is some fixed constant.
More often we consider the quantity
$$\text{error}(M) = \max_{N} \frac{L(M(N))}{L(\text{shortest tree for } N)}.$$
This measures the quality of an approximation algorithm by its performance ratio. In view of (3.16) it holds that
$$\text{error}(M) \ge 1,$$
and this leads to the following definitions:
(i) A shortest tree is approximable within some constant $1 + \epsilon$ if there exists a polynomially bounded algorithm $M_\epsilon$ such that $\text{error}(M_\epsilon) \le 1 + \epsilon$ for all inputs.
(ii) A shortest tree is approximable within $1 + \epsilon$ for all reals $\epsilon > 0$ if there exists a polynomially bounded algorithm $M_\epsilon$ such that
$$\text{error}(M_\epsilon) \le 1 + \epsilon$$
for all inputs. The $1 + \epsilon$ notation is suggestive of the fact that the closer we are to 1, the better the approximation algorithm.
(iii) A fully polynomial approximation scheme is an algorithm $M_\epsilon$ that for each instance $N$ of the problem and each $\epsilon > 0$ finds a solution for $N$ satisfying
$$L(M_\epsilon(N)) \le (1 + \epsilon) \cdot L(\text{shortest tree for } N)$$
and whose running time is bounded by a polynomial in $1/\epsilon$ and the length of the input for $N$.
It is observed that optimization problems which are hard in the sense of computational complexity display different kinds of behaviour in the sense of approximation.^16 Approximations and heuristics differ in the following sense: For an approximation algorithm, we can estimate the performance ratio with mathematical methods; for a heuristic algorithm, however, we only have experimental results or plausible reasons for the description of the performance ratio.
^16 Performance guarantees must consider the worst-case behavior of an approximation, and they may not reflect how well the approximation actually performs in practice. Thus, performance guarantees should not be the only criterion in evaluating an approximation. Running time, ease of implementation, and empirical analysis are at least as important for the practitioner.
For a complete discussion of theoretical aspects see Ausiello et al. [22], Garey and Johnson [179], Hochbaum [221], Lengauer [282], and Vazirani [433].
We have established that it is simple (in any sense) to find an MST. Moreover, the construction of an MST does not need any geometry, it uses only the mutual distances between points. Hence it is possible to create a solution technique which can be applied in all metric spaces. Consequently, we are interested in the quantity
$$m(X, \rho) := \inf \left\{ \frac{L(\text{SMT for } N)}{L(\text{MST for } N)} : N \subseteq X \text{ is a finite set} \right\},$$
which is called the Steiner ratio of the metric space $(X, \rho)$. The quantity $m(X, \rho) \cdot L(\text{MST for } N)$ makes a convenient lower bound for the length of an SMT for $N$ in $(X, \rho)$; roughly speaking, $m(X, \rho)$ says how much the total length of an MST can be decreased by allowing Steiner points. Obviously, $m(X, \rho) \le 1$ holds for any metric space $(X, \rho)$. On the other hand, a tight lower bound for the Steiner ratio of any metric space is given by
Theorem 3.5.2 (E. F. Moore) The Steiner ratio of every metric space obeys
$$m(X, \rho) \ge \frac{1}{2} = 0.5. \qquad (3.23)$$
This is the best possible bound.
Proof. Let $T$ be an SMT for a finite set $N$ in $X$. Consider the graph $G$ obtained by replacing each edge of $T$ by two parallel edges. Since an even number of edges is incident with each vertex of $G$ the graph $G$ has an Eulerian cycle^17 which has the length $2 \cdot L(T)$ and is a tour through $N$. This tour is not shorter than a minimal tour in which no Steiner point exists. If we delete any edge of the minimal tour we obtain a tree interconnecting $N$ without Steiner points. Hence,
$$L(\text{MST for } N) \le 2 \cdot L(T) = 2 \cdot L(\text{SMT for } N) \qquad (3.24)$$
which implies the first assertion. Next we show that the lower bound 0.5 is the best possible over the class of all metric spaces. Let $G = (V, E)$ be a star with $n$ leaves. All edges have unit length. The leaves form the set $N$ of given points. Then an MST for $N$ has length $2(n - 1)$ and an SMT with the internal vertex of the star as Steiner point has the length $n$. Hence, the ratio between the two lengths is $n/(2n - 2)$, which tends to 0.5 as $n$ tends to infinity.
^17 This is defined as a cycle that uses each edge exactly once, compare 4.2.19.
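The star example can be replayed numerically. The following sketch computes the MST length of the $n$ leaves with a simple Prim routine over the leaf-to-leaf distances (all equal to 2) and compares it with the length $n$ of the star; the helper names are illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def mst_length(points, dist):
    """Prim's algorithm on the complete graph given by the distance function."""
    points = list(points)
    best = {p: dist(points[0], p) for p in points[1:]}  # cheapest link to the tree
    total = 0
    while best:
        p = min(best, key=best.get)
        total += best.pop(p)
        for q in best:
            best[q] = min(best[q], dist(p, q))
    return total


def leaf_distance(u, v):
    """Distance between two leaves of a star with unit edges (via the centre)."""
    return 0 if u == v else 2


def star_ratio(n):
    """L(SMT) / L(MST) for the n leaves of the star, n >= 2."""
    smt_length = n  # the star itself, centre used as Steiner point
    return Fraction(smt_length, mst_length(range(n), leaf_distance))
```

The gap to the bound, `star_ratio(n) - Fraction(1, 2)`, equals $1/(2(n-1))$ and therefore shrinks to zero without ever vanishing.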
With 3.4.8 and 3.5.2 in mind we have that the Steiner ratio of metric spaces lies precisely in the range between 0.5 and 1. This is even true for spaces of finite cardinality. Ivanov and Tuzhilin show in [240] that for any real number between 0.5 and 1 there is a metric space with this Steiner ratio.
In view of (3.24) we can find a tree interconnecting a set of $n$ points in a metric space in $O(n^2 \cdot \log n)$ time^18 with length at most twice that of an SMT. The performance ratio of an MST as an approximation of Steiner's Problem in a metric space $(X, \rho)$ is
$$\text{error}(\text{MST}) = \frac{1}{m(X, \rho)}.$$
With these facts in mind, we are only interested in approximations and heuristics satisfying one or both of the following properties:
The running time of the algorithm is at most the time to compute an MST in this space.
The error is at most $1/m$, where $m$ is the Steiner ratio of the space.
The proof of the theorem 3.5.2 immediately suggests an approximation algorithm for Steiner's Problem in graphs:
Algorithm 3.5.3 (Kou, Markowsky, Berman [267]) A finite set $N$ of $n$ vertices in a network $G = (V, E, f)$ is given. Then,
1. Describe the metric closure $G^f$: For all $v, v' \in N$, $v \ne v'$, determine the distances $\rho(v, v')$ and the shortest paths $G(v, \ldots, v')$;
2. Find an MST $T = (N, F)$ for $N$ in the metric space $(V, \rho)$; Set $F' := \bigcup_{vv' \in F} G(v, \ldots, v')$; and Set $V' := \bigcup_{uu' \in F'} \{u, u'\}$;
3. While there is a cycle $G_c$ in $(V', F')$ delete any edge from $G_c$; Delete leaves which are not members of $N$.
^18 Or faster using more specific techniques in several metric spaces; but not faster than $\Omega(n \cdot \log n)$, see the previous section.
It is easy to see, compare [267], that the algorithm 3.5.3 is a $1/m(G^f) \le 2$ approximation algorithm for Steiner's Problem in graphs, and runs in cubic time.^19 Note that the proof of 3.5.2 can be used to show a slightly stronger result, namely
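The three steps of Algorithm 3.5.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [267]: the network is assumed connected and given as a symmetric dictionary `graph[u][v] = length`, shortest paths come from Dijkstra's algorithm, and cycles in step 3 are broken by taking a spanning tree of the path union with Kruskal's algorithm (one admissible way of "deleting any edge" from each cycle). All function names are illustrative.

```python
import heapq
from itertools import combinations


def dijkstra(graph, source):
    """Shortest-path distances and predecessors from `source`."""
    dist, pred, heap = {source: 0}, {}, [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for w, length in graph[u].items():
            nd = d + length
            if nd < dist.get(w, float("inf")):
                dist[w], pred[w] = nd, u
                heapq.heappush(heap, (nd, w))
    return dist, pred


def kruskal(vertices, weighted_edges):
    """Spanning forest of minimum length; edges are (length, u, v) triples."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for length, u, v in sorted(weighted_edges, key=lambda e: e[0]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((length, u, v))
    return tree


def kmb_steiner_tree(graph, terminals):
    """Return (edge set, length) of a tree spanning `terminals` in `graph`."""
    terminals = list(terminals)
    # Step 1: distances and shortest paths between all pairs of terminals.
    sp = {t: dijkstra(graph, t) for t in terminals}
    closure = [(sp[u][0][v], u, v) for u, v in combinations(terminals, 2)]
    # Step 2: MST in the metric closure, expanded into the underlying paths.
    path_edges = set()
    for _, u, v in kruskal(terminals, closure):
        pred = sp[u][1]
        while v != u:
            path_edges.add(frozenset((v, pred[v])))
            v = pred[v]
    # Step 3a: break cycles by keeping a spanning tree of the path union.
    vertices = {x for e in path_edges for x in e}
    weighted = [(graph[a][b], a, b) for a, b in map(tuple, path_edges)]
    adj = {x: set() for x in vertices}
    for _, a, b in kruskal(vertices, weighted):
        adj[a].add(b)
        adj[b].add(a)
    # Step 3b: repeatedly delete leaves that are not terminals.
    keep = set(terminals)
    leaves = [x for x in adj if len(adj[x]) == 1 and x not in keep]
    while leaves:
        x = leaves.pop()
        (y,) = adj.pop(x)
        adj[y].discard(x)
        if len(adj[y]) == 1 and y not in keep:
            leaves.append(y)
    edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return edges, sum(graph[a][b] for a, b in map(tuple, edges))
```

On the star with three unit edges and the leaves as terminals, the expanded paths recover the star itself, so the heuristic is optimal there although the MST in the metric closure has length 4.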
Corollary 3.5.4 Let $N$ be a finite set of $n$ points in a metric space $(X, \rho)$. Then
$$L(\text{MST for } N) \le \left(2 - \frac{2}{n}\right) \cdot L(\text{SMT for } N). \qquad (3.26)$$
Proof. A spanning tree satisfying the above inequality arises, as in the proof of 3.5.2, by omitting the longest path between consecutive leaves in the tour. Consequently,
$$L(\text{MST for } N) \le 2 \cdot \left(1 - \frac{1}{nl(N)}\right) \cdot L(\text{SMT for } N),$$
where $nl(N)$ denotes the number of leaves in an SMT for $N$. Then note that by 2.4.5 any leaf of an SMT is a given point, implying $nl(N) \le n$.
We say that a (finite) set $N_0$ of points in a metric space $(X, \rho)$ achieves the Steiner ratio if
$$\frac{L(\text{SMT for } N_0)}{L(\text{MST for } N_0)} = m(X, \rho). \qquad (3.28)$$
An immediate consequence of the last corollary is
Corollary 3.5.5 Let $(X, \rho)$ be a metric space with Steiner ratio 1/2.^20 Then there does not exist a finite set of points in $X$ which achieves the Steiner ratio.
^19 For methods to improve the running time of 3.5.3 see [353].
^20 We will see that such spaces indeed exist.
In 3.5.4 we offer the bound $2 - 2/n$, which stood for many years as the best known bound of any approximation until Zelikovsky [473] recently improved upon it. This method gives the hope that our goal to construct better approximations is realizable. As an example consider three points which form the nodes of an equilateral triangle of unit side length in the Euclidean plane. An MST for these points has length 2. An SMT uses one Steiner point. Consequently, with the help of a simple calculation we find that the length of the SMT is $3 \cdot \frac{1}{\sqrt{3}} = \sqrt{3}$. So we have an upper bound for the Steiner ratio of the Euclidean plane:
$$m \le \frac{\sqrt{3}}{2} = 0.86602\ldots \qquad (3.29)$$
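The equilateral-triangle computation can be checked numerically. The sketch below places the triangle at concrete coordinates (an arbitrary choice) and uses the fact that for an equilateral triangle the Steiner point coincides with the centroid.

```python
import math

# Equilateral triangle with unit side length (coordinates chosen for convenience).
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


# An MST uses any two of the three unit sides.
mst_length = dist(triangle[0], triangle[1]) + dist(triangle[1], triangle[2])

# For an equilateral triangle the Steiner point is the centroid.
steiner_point = (sum(p[0] for p in triangle) / 3, sum(p[1] for p in triangle) / 3)
smt_length = sum(dist(steiner_point, p) for p in triangle)

ratio = smt_length / mst_length
```

Here `smt_length` agrees with $\sqrt{3}$ and `ratio` with $\sqrt{3}/2$ up to rounding, so this single point set already shows that the Euclidean Steiner ratio cannot exceed $0.86602\ldots$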
Similarly it is often simple to determine an upper bound for the Steiner ratio of a specific space, since we have only to find a finite set of points with an interconnecting tree shorter than the MST. On the other hand, it will be hard to determine sharp upper bounds, good lower bounds or the exact value of the quantity $m(X, \rho)$. To show this let us consider the history of the determination of the Euclidean Steiner ratio: A long-standing conjecture, given by Gilbert and Pollak in 1968, asserts that in the above inequality (3.29), equality holds; that is $m = \sqrt{3}/2$ is the Steiner ratio of the Euclidean plane. This was the most important conjecture in the area of Steiner's Problem in the following years. Many people have tried to show this: Pollak [348] and Du, Yao and Hwang [145] have shown that the conjecture is valid for sets $N$ consisting of $n = 4$ points; Du, Hwang and Yao [142] extended this result to the case $n = 5$, and Rubinstein and Thomas [368] have done the same for the case $n = 6$. On the other hand, many attempts have been made to estimate the Steiner ratio for the Euclidean plane from below:

$m \ge 1/\sqrt{3} = 0.57735\ldots$    Graham, Hwang, 1976, [190]
$m \ge \left(2\sqrt{3} + 2 - \sqrt{7 + 2\sqrt{3}}\right)/3 = 0.74309\ldots$    Chung, Hwang, 1978, [86]
$m \ge 4/5 = 0.8$    Du, Hwang, 1983, [138]
$m \ge 0.82416\ldots$    Chung, Graham, 1983, [85]
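The decimal values in the list of lower bounds can be reproduced with a few lines of arithmetic. The closed form used below for the 1978 bound is an assumption in the sense that it is the expression which reproduces the quoted value $0.74309\ldots$; the dictionary keys are only labels.

```python
import math

s3 = math.sqrt(3)

lower_bounds = {
    "Graham, Hwang 1976": 1 / s3,
    "Chung, Hwang 1978": (2 * s3 + 2 - math.sqrt(7 + 2 * s3)) / 3,  # assumed closed form
    "Du, Hwang 1983": 4 / 5,
    "Chung, Graham 1983": 0.82416,  # only the decimal value is quoted
}

upper_bound = s3 / 2  # from the equilateral triangle, inequality (3.29)

# The bounds increase over time and all stay below the upper bound.
values = list(lower_bounds.values())
is_increasing = all(a < b for a, b in zip(values, values[1:]))
```

Each successive bound closes part of the gap to $\sqrt{3}/2 = 0.86602\ldots$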
Finally, in 1990, Du and Hwang [139], [140] created many new methods and succeeded in proving the Gilbert-Pollak conjecture completely.^21
^21 This mathematical fact appeared in The New York Times, October 30, 1990 under the title "Solution to Old Puzzle: How Short a Shortcut?"
For most metric spaces the exact value of the Steiner ratio is still unknown. For a broader discussion of the concept of the Steiner ratio and more knowledge of its values for specific spaces compare C. [92]. In particular, for the following specific (Banach-Minkowski) planes we do know the exact value for the Steiner ratio:

The norm is essentially    Unit ball                   Steiner ratio
rectilinear                parallelogram               $2/3 = 0.6666\ldots$
Euclidean                  ellipse                     $\sqrt{3}/2 = 0.86602\ldots$
                           affinely regular hexagon    $3/4 = 0.75$
An interesting problem, but one which seems very difficult, is to determine the range of the Steiner ratio $m_d(B)$ for $d$-dimensional Banach spaces equipped with a unit ball $B$, depending on the value $d$. More precisely: Determine the best possible constants $c_d$ and $C_d$ such that
$$c_d \le m_d(B) \le C_d$$
for all unit balls $B$ of $A_d$. Both the values $C_d$ and $c_d$ are attained by certain $d$-dimensional Banach spaces. This follows from the continuity of the Steiner ratio as a function of the space and the Blaschke selection theorem.^22
The quantity $C_d$ is defined as the upper bound of all numbers $m_d(B)$ ranging over all unit balls $B$ of $A_d$. Of course, $C_1 = 1$.
Conjecture 3.5.6 For $d = 2, 3, \ldots$
$$C_d = m(d, 2),$$
where $m(d, 2)$ denotes the Steiner ratio of the $d$-dimensional Euclidean space.^23
^22 Affine and convex geometry are parts of the geometry for Banach-Minkowski spaces. Consequently, in our investigations the idea of convexity plays a central role and we will often use arguments from these geometries. For textbooks see Leichtweiß [281] or Valentine [431].
^23 If we have an analytic formula describing the norm, we also have the possibility of estimating the Steiner ratio with direct calculations. In particular, for $d$-dimensional $\mathcal{L}_p$-spaces; abbreviated by $m(d, p)$, $d = 1, 2, \ldots$, $1 \le p \le \infty$. For these quantities compare [7], [8], [10], [9] and [286].
This conjecture is open for all values of $d$, even in the planar case, where we only know
see [137], [139] and [140]. Surprisingly, the conjecture remains true if the dimension runs to infinity:
Theorem 3.5.7 (C. [95], [100]) $\{C_d\}_{d = 1, 2, \ldots}$ is a decreasing and convergent sequence with
$$\lim_{d \to \infty} C_d = \lim_{d \to \infty} m(d, 2). \qquad (3.32)$$
On the other hand, the quantity $c_d$, defined as the lower bound of all numbers $m_d(B)$ ranging over all unit balls $B$ of $A_d$, is of interest. Of course, $c_1 = 1$.
Conjecture 3.5.8 For $d = 2, 3, \ldots$
$$c_d > 1/2.$$
This conjecture is open, except in the planar case, where we know $c_2 = 2/3$; see Gao et al. [174].
Now, we are interested in normed spaces which are not necessarily finite-dimensional. Here, we have to define the Steiner ratio more carefully, in the following way:
$$m(X) = \inf \left\{ \frac{L(\text{SMT for } N)}{L(\text{MST for } N)} : N \subseteq X \text{ a finite set for which an SMT exists} \right\}. \qquad (3.33)$$
Theorem 3.5.9 (C. [101]) Let $X$ be an infinite-dimensional Banach space, then
$$0.5 \le m(X) \le \inf\{m(d, 2) : d \text{ a positive integer}\} = \lim_{d \to \infty} m(d, 2), \qquad (3.34)$$
where $m(d, 2)$ denotes the Steiner ratio of the $d$-dimensional Euclidean space.
Chung and Gilbert [83] showed $\lim_{d \to \infty} m(d, 2) \le 0.66984\ldots$, which implies
Corollary 3.5.10 Let $X$ be an infinite-dimensional Banach space. Then
$$0.5 \le m(X) \le 0.66984\ldots$$
Arora [17], [18] presents a polynomial-time approximation scheme for Steiner's Problem in Euclidean spaces.
Remark 3.5.11 (Garey, Graham, Johnson [177]) No fully polynomial approximation scheme exists for Steiner's Problem in the Euclidean plane unless every problem in $\mathcal{NP}$ has a polynomial time solution.
Karp [252] gave a partitioning algorithm for the travelling salesman problem, but commented that the algorithm can be modified in a straightforward way into a probabilistic fully polynomial approximation scheme $M$ which yields a candidate solution to Steiner's Problem in $O(n \log n)$ time, where $n$ is the number of points, such that $\text{error}(M) \le 1 + \epsilon$ with probability one.
The Steiner ratio is a quantity to describe a worst-case scenario. On the other hand, the average-case is also of interest. More precisely: Distribute $n$ points $v_1, \ldots, v_n$ by a suitable random process in the space $(X, \rho)$ and then ask for the expected value $E(n) = E(X, \rho)(n)$ of
$$\frac{L(\text{SMT for } \{v_1, \ldots, v_n\})}{L(\text{MST for } \{v_1, \ldots, v_n\})}. \qquad (3.36)$$
Very little is known about these functions. Clearly,
$$E(X, \rho)(n) \ge \inf \left\{ \frac{L(X, \rho)(\text{SMT for } N)}{L(X, \rho)(\text{MST for } N)} : N \subseteq X, |N| \le n \right\}. \qquad (3.37)$$
Values of $E(n) = E(X, \rho)(n)$ for specific spaces and distributions of points are given by [186], [232] and [444].
NETWORK DESIGN PROBLEMS
A general network design problem is as follows: given a configuration of vertices, find a network which contains these objects, fulfils some predetermined requirements and minimizes a given objective function. This formulation is quite general and models a wide variety of problems. Here we find Steiner's Problem and its relatives in many applications. The main reason for this fact was also described by Weber [449]: Wenn schon einmal Theorie getrieben werden soll, ..., so ist als eine ihrer Formen auch diejenige nötig, die die Abstraktion auf die Spitze treibt. In concise English: If you consider a problem, think about all consequences. One of the main groups of tasks considers connectivity problems, while other questions deal with problems of order, maximal flow and others (compare [2], [27], [98], [243] and [361]). We will focus on the connectivity questions.
4.1 AN OVERVIEW OF APPLICATIONS
Today we can say that Steiner's Problem is one of the most famous combinatorial-geometrical problems next to the traveling salesman problem. McGregor Smith [398] presents a classification of applications for network design problems. Generalizing this we have:
Large Region Networks
The metric in large geographic regions is given by the shortest great circle distance between the points on the (Euclidean) sphere. Large region location problems arise when the difference between the Euclidean and the great circle metrics is considerable. Location of international headquarters or distribution centers and planning of oil or natural gas pipelines or long distance telephone lines are examples; see [Xi], [237], [367], [370], [422] and [472].
Regional Networks
Consider inter-urban networks, like communication networks, railway lines and interstate highway networks. The solution of network design problems in this area, whether approximate or exact, can provide guidelines for the layout of the network and the necessary amounts of material [454]. More complicated versions of Steiner's Problem can accommodate the need to avoid certain geographic features or to find the shortest connections along preexisting networks. For example the following problem was discussed by Trietsch [427]:
Given: A family of mutually disjoint trees in the Euclidean plane.
Find: A set of new links and probably additional points that interconnect all these trees with minimal length.
In this sense the following idea is of interest: In 1992, Smith and Shor [404] introduced the notion of a so-called Greedy Tree (GT) for a set $N$ of points in a Euclidean space as follows:
1. Start with all points of $N$, regarded as a forest of $n = |N|$ single vertices;
2. At any stage, add the shortest possible segment to the current forest, which causes two trees to merge;
3. Continue until the forest is completely merged into one tree.
Greedy Trees $T$ are simple to construct and have the following properties:
(a) $T$ is an MST for $V$.
(b) Any edge $e \in E$ which connects two points of $N$ is also an edge of a (suitably chosen) MST for $N$.
(c) The GT $T$ is no longer than an MST for $N$. Hence,
$$\frac{L(\text{SMT for } N)}{L(T)} \ge \frac{L(\text{SMT for } N)}{L(\text{MST for } N)} \ge m,$$
where $m$ denotes the Steiner ratio of the space.^2
^1 Under certain conditions such problems belong to tasks of "classic" convexity, see [306].
Love and Morris [291] study a variety of mathematical forms as samples for intercity, urban and rural road distances. They found that the $p$-norms, and linear combinations of these, are the best possible; where these norms in the affine plane $A_2$ with the standard basis are defined as follows: Let $p$ be a real number, at least 1. Then
$$\|(x_1, x_2)\|_p = \left(|x_1|^p + |x_2|^p\right)^{1/p}.$$
For $p = 1$ we have the rectilinear, and for $p = 2$ the Euclidean norm. If the value $p$ runs to infinity we obtain the plane normed by
$$\|(x_1, x_2)\|_\infty = \max\{|x_1|, |x_2|\}.$$
Recent surveys are given by [92] and [292].
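The $p$-norms discussed above are easy to evaluate directly. The following sketch uses the standard definition of the $p$-norm and of its limit for $p \to \infty$; the function name `p_norm` is illustrative.

```python
import math


def p_norm(v, p):
    """p-norm of a vector for 1 <= p <= infinity (math.inf gives the maximum norm)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)
```

For the vector (3, 4) this gives the rectilinear length for `p = 1`, the Euclidean length for `p = 2`, and values approaching the maximum norm as `p` grows.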
Minimal Surfaces
What is the least-area surface on a tetrahedral frame? Because soap film tends to minimize surface energy a good candidate may be obtained by dipping a tetrahedral wire frame in soap solution, see [117] and [313].^3 The same technique is used to find an SMT for points in the Euclidean plane, see [116] and [328]. In this sense there are many similarities between Steiner's Problem and minimal surfaces, which are helpful to attack these problems.
^2 It is conjectured that the ratio between an SMT and a GT is greater than the Steiner ratio of the space:
$$\inf \left\{ \frac{L(\text{SMT for } N)}{L(\text{GT for } N)} : N \subseteq \mathcal{L}_2^2 \text{ is a finite set} \right\} = \frac{2\sqrt{3}}{2 + \sqrt{3}} = 0.9282\ldots$$
^3 A specific question, the so-called "Double Soap Breakthrough", is ranked by the Encyclopedia Britannica as second only to Wiles' proof of Fermat's Last Theorem, compare [314].
Facility Location
We consider the following generalization of Fermat's Problem: Let $N = \{v_1, \ldots, v_n\}$ be a set of $n$ points, and let $A = \{a_i\}_{i = 1, \ldots, n}$ be a sequence of positive weights. Then the problem is to find the minimal value of the (generalized) Fermat function
$$F_{N,A,B}(x) = \sum_{i=1}^{n} a_i \cdot \|x - v_i\|_B,$$
where $\|\cdot\|_B$ is a norm derived from the unit ball $B$. Then
$$L_{N,A,B} = \{x : F_{N,A,B}(x) \text{ is minimal}\}$$
is a non-empty, convex and compact set. Such a problem pertains to the so-called continuous location theory. The aim of the problem is to place a new facility on a territory in order to minimize the cost of servicing customers. Here the territory is the entire space, the customers are situated at the points $v_i$ and the cost is a weighted sum of the distances to the new facility. Most of the results in the previous chapters remain valid. The following fact plays a central role:
Theorem 4.1.1 (Durier [149]) Let $N$ denote a finite set of points in a finite-dimensional Banach space with unit ball $B$, and let $A$ be a sequence of positive weights. Then the following conditions are pairwise equivalent:
(i) $L_{N,A,B} \cap \text{conv } N \ne \emptyset$ for each $N$ and each $A$;
(ii) $L_{N,A,B} \cap \text{aff } N \ne \emptyset$ for each $N$ and each $A$;
(iii) $L_{N,B} \cap \text{conv } N \ne \emptyset$ for each $N$;
(iv) $L_{N,B} \cap \text{aff } N \ne \emptyset$ for each $N$.
These facts are helpful t o characterise Euclidean spaces:
Observation 4.1.2 (Durier [149]) Let $M_d(B)$ be a Banach-Minkowski space, where the dimension $d$ is greater than 8. Suppose that the condition
is true for all subsets $N$ with three or four elements. Then $M_d(B)$ is an inner product space.
In such spaces the following method gives a solution of Fermat's problem:
Algorithm 4.1.3 (Weiszfeld Algorithm) Let $N$ be a finite set in the Euclidean space and let $A$ be a sequence of positive weights. Choose an error estimate $\epsilon$.
1. Choose $q(0)$ in conv $N$;
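Only the first step of the algorithm survives in the text above. As a hedged illustration, the sketch below implements the commonly stated form of the Weiszfeld iteration, in which each new iterate is the average of the given points weighted by $a_i$ divided by the current distance; this update rule, the stopping test and all names are assumptions and not a reconstruction of the missing steps.

```python
import math


def fermat_value(q, points, weights):
    """Weighted sum of Euclidean distances from q to the given points."""
    return sum(w * math.dist(q, p) for p, w in zip(points, weights))


def weiszfeld(points, weights, eps=1e-9, max_iter=10_000):
    """Approximate a Torricelli point of `points` with positive `weights`."""
    dim, total = len(points[0]), sum(weights)
    # Start at the weighted centroid, which lies in the convex hull.
    q = tuple(sum(w * p[k] for p, w in zip(points, weights)) / total
              for k in range(dim))
    for _ in range(max_iter):
        numerator, denominator = [0.0] * dim, 0.0
        for p, w in zip(points, weights):
            d = math.dist(p, q)
            if d < eps:
                # Simplification: stop if the iterate lands on a given point.
                return tuple(float(c) for c in p)
            for k in range(dim):
                numerator[k] += w * p[k] / d
            denominator += w / d
        new_q = tuple(c / denominator for c in numerator)
        if math.dist(new_q, q) < eps:  # error estimate reached
            return new_q
        q = new_q
    return q
```

For three equally weighted corners of an equilateral triangle the iteration stays at the centroid, in line with the equilateral-triangle example of section 3.5.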
A mechanical interpretation in the Euclidean plane is obvious: Let $N$ be a set of $n$ points and let $\{a_i\}_{i = 1, \ldots, n}$ be a sequence of weights. If $n$ forces of corresponding intensities $a_i$ exert their effects from the Torricelli point $q$ to the points in $N$, then this force system is in equilibrium.
A general strategy for determining a Torricelli point is solving a specific linear programming problem, at least approximately:
Theorem 4.1.4 (C. [88]) Let $N = \{v_1, \ldots, v_n\}$ be a finite set of points in the Banach-Minkowski space $M_d(B)$. Assume that we have a Torricelli point in conv $N$. Fermat's Problem can be solved in the following way:
(a) If the unit ball
$$B = \bigcap_{j=1}^{s} K(z_j, 1) \qquad (4.5)$$
is a convex polytope^4, where
$$K(z_j, 1) = \{v \in A_d : (v, z_j) \le 1\} \qquad (4.6)$$
are halfspaces of $A_d$, then we can find a Torricelli point in $L_{N,B}$ by solving the following linear programming problem:
(b) If $B$ is an arbitrary unit ball of the affine space, but not a convex polytope, choose a positive real number $r$, construct the convex polytope $P$ in the Hausdorff-distance to $B$ and solve Fermat's Problem for $N$ in $M_d(P)$ in the sense of (a). The relative error for a Torricelli point $q$ can be estimated by:
where $F_{N,P}$ and $F_{N,B}$ are the Fermat functions in $M_d(P)$ and $M_d(B)$, respectively, and $\lambda B(2)$ is a largest among all Euclidean balls in $B$.
^4 A convex polytope is the convex hull of a finite set of points, or, equivalently, the bounded intersection of a finite set of halfspaces.
Macro Scale Networks
Chemical processing plants, urban arterial systems, cable television and similar intra-urban systems are typical applications. Often connection structures have to be designed in an environment with pronounced inner structure, see [280] and [303]. In these situations the rectilinear metric is often used. If the structure of the possible connections is predetermined it is also possible to formulate the problem as a network design problem in graphs.
Mines
The network of drive shafts for an underground mine may be a specific Steiner Minimal Tree in the three-dimensional Euclidean space: over the life of a mine, ore is generally transported from a number of access points at ore deposits, whose locations are known, to the surface along a network of sloping ramps and vertical shafts. See [453].
Intermediate Scale Networks
Electric, heating and air-conditioning systems in buildings are examples of network optimization problems where Steiner points can reduce the overall minimum cost solution of the network. The rectilinear metric is the most frequent measure of the distance in these applications. For models and methods see [231] and [401].
Communication Networks
During the last couple of decades technology advances have prompted an explosion in the development of communication networks. It is a widely held opinion that efficient and stable networking is necessary to preserve a competitive edge in today's society. Various objectives play important roles in network design. Of course, here the networks are considered as graphs, and Steiner's Problem is tackled directly. Compare [322], [397].
Micro Scale Networks
Perhaps one of the most practical applications of Steiner's Problem is the design of electronic circuits. The creation of Very Large Scale Integration (VLSI) networks is an example of Steiner's Problem where the overall interconnecting length of the network is crucial for the solution. In this class of applications, the rectilinear metric is again the most frequent metric, since wires on a circuit generally lie in only two directions, vertical and horizontal. The problem, known as the "Rectilinear Steiner Problem" or as the "Manhattan Steiner Problem", was first investigated by Hanan [207] in 1966. He showed that Steiner's Problem in the plane with rectilinear norm, contrary to the Euclidean case, is a special case of Steiner's Problem in graphs. It is also conceivable to use a linear combination of rectilinear and maximum norm; particularly consider the norm
which is a combination of the rectilinear and supremum norms. The geodesic curves are piecewise segments creating angles of $0$, $\pi/4$, $\pi/2$ or $3\pi/4$ with the x-axis. Compare [231] and [282].
A more general approach uses uniform orientation metrics. These are given by unit balls $B^{(\lambda)}$ which are regular $2\lambda$-gons with the x-axis being a diagonal direction. Then
(a) $B^{(2)}$ is a square, creating the rectilinear norm,
(b) $B^{(3)}$ is a regular hexagon,
(c) $B^{(4)}$ creates the so-called octolinear norm (4.9), and
(d) we may define the set $B^{(\infty)}$ as a circle.
Compare [55] and [321].
Here the classical problem received new importance in the development of techniques for VLSI layout, see [81], [265], [311], [425] and [471]. Rectilinear SMTs in the plane can today be computed quickly for realistic instances occurring in VLSI design, see [81] and [444].
The practical importance of Steiner's Problem lies essentially in the following aspects:
SMTs minimise length across all structures connecting the elements of a network.
In contrast t o hISTs, where the connections are unnaturally limited to be a t tlie vertices t o be connected, in SLITS there are no restrictions on the location of such brancliings.
The "probability" that two networks overlap is smaller if the networks are short. In this case it is highly desirable t o achieve optimal solutions. A one percent increase in wire length could mean a considerable loss of performance in the corresponding chip.
A4short network of wires on an integrated circuit requires less time to charge and discharge than a long network and thus increases the circuit's speed of operation.
Protein structures One of the key issues in biochemistry today is predicting the three-dimensional structure of proteins from the primary sequence of amino acids. Steiner's Problem in three-dimensional Euclidean space might help explain the reason for these long molecular chains. In order to examine this potential application area and others related to it, possible linkages between the objective function of Steiner's Problem and objective functions of these applications in the biochemical sciences need to be examined, see [399] and [405]. In particular, it is conjectured that the optimal configuration of any point set in three-dimensional Euclidean space is an infinite chain of face-sharing tetrahedra, also known as a triple helix or ribbon-sausage.

More precisely: A conjecture, posed by Gilbert and Pollak, stated that the Steiner ratio of three-dimensional Euclidean space is achieved when the given points are the nodes of a regular simplex. In 1992, Smith [403] showed that this is not true. Moreover, the ratios between the lengths of an SMT and an MST for the nodes of specific sets of points have been found:
(i) Chung, Gilbert [83]: 0.81305..., considering a regular simplex;
(ii) Smith [403]: 0.81119..., investigating a regular octahedron; and
(iii) Du, Smith [144]: 0.78419..., using the method which we will describe below.

Let's look at sausages in three-dimensional Euclidean space:

1. Start with a ball $B(v_1)$, that is a congruent copy of the unit ball $B$ of the space around the point $v_1$;
2. Successively add balls so that the n-th ball $B(v_n)$ added is always touching the $\min(3, n-1)$ most recently added balls.

This procedure uniquely^5 defines an infinite sequence of interior-disjoint numbered balls. The centers $v_1, v_2, \ldots, v_n, \ldots$ of these balls form a discrete point set $N(\infty)$, which is called the (infinite) sausage. The first n points of the sausage will be called the "n-point sausage" $N(n)$. Note that $N(4)$ is a simplex.

^5 Up to motion.

Du and Smith [144] present many properties of the sausage, in particular that for a finite number of points it is

$$\frac{L(\text{SMT for } N(n))}{L(\text{MST for } N(n))} \leq \frac{L(\text{SMT for } N(4))}{L(\text{MST for } N(4))},$$

which is a finite version of

$$\frac{L(\text{SMT for } N(\infty))}{L(\text{MST for } N(\infty))} < \frac{L(\text{SMT for } N(4))}{L(\text{MST for } N(4))}.$$

The following value of the Steiner Ratio of three-dimensional Euclidean space was conjectured by Smith and MacGregor Smith in their paper [405]:
Even if the conjectured value turns out to be incorrect, it is a very good upper bound on the exact value of the Steiner Ratio. It seems that there probably is no finite set of points in three-dimensional Euclidean space which achieves the Steiner ratio.
Evolutionary Networks Molecular sequences are used to reconstruct the course of evolution. Since the evolution of a set of species is assumed to have proceeded from a common ancestral species in a tree-like branching of species, this process is generally modeled by a tree.^6 The key question is the reconstruction of this tree based on the contemporary data. Molecular data comes in the form of

- DNA sequences, which are information-containing molecules composed of nucleotides from an alphabet of four letters; or
- Proteins, which are the operational molecules, composed of sequences of amino acids from an alphabet of 20 letters; or
- RNA sequences, which stand between both and are composed of nucleotides from an alphabet of four letters.

^6 More precisely, by a rooted tree, since all species are related.

The relationship of DNA, RNA and protein as described by the Central Dogma of Molecular Biology can be summed up as follows:

Integral form: DNA makes RNA makes protein.
Differential form: Changed DNA can make changed protein.

In a shortest network for this abstract arrangement of given points, the Steiner points correspond to the most plausible ancestors, and edges correspond to relationships between ancestor and descendant organisms that assume the fewest mutations. The principle of Maximum Parsimony involves the identification of a combinatorial structure that requires the smallest number of evolutionary changes. It is often said that this principle abides by Ockham's razor, according to which the best hypothesis is the one requiring the smallest number of assumptions. To consider the problem of reconstruction of evolutionary (phylogenetic) trees in the sense of shortest connectivity, we introduce so-called phylogenetic spaces. These are metric spaces whose points are arbitrary words generated by letters from some (finite) alphabet, with a metric measuring the "sameness" of the words which is generated by a cost measure on the alphabet. In other words,
A most parsimonious tree is an SMT in some phylogenetic space.

Sankoff and Rousseau [379] discuss methods to find the location of the Steiner points for a given combinatorial tree in several compact metric spaces. Surveys are given in [168], [394], [438] and [448]. We will discuss these questions in later chapters.^7
Databases A relational database can be described as a graph. Then the problem of finding a subtree interconnecting several items is Steiner's Problem in this graph, see [23].
4.2 SEVERAL VARIANTS

Steiner's Problem is a natural question. We find it, and its relatives, in many network design problems. Consequently, Steiner's Problem is not only a single problem but also a group of related problems, specifications and generalizations of the original problem by Steiner.

^7 They will be the main problems of the second part of this book.
4.2.1 The restricted problem
Related to Steiner's Problem, we can require that the minimal network has at most k Steiner points, where $k \geq 0$ is a predetermined integer independent of the number of the given points:

The k-restricted Steiner's Problem
Given: A finite set $N$ of points in a metric space $(X, \rho)$ and a nonnegative integer k.
Find: A network $G = (V, E)$ with
(i) $N \subseteq V$,
(ii) $|V \setminus N| \leq k$, and
(iii) $L(X, \rho)(G)$ is minimal.

Such a network must also be a tree, so it is called a k-SMT. This problem was introduced independently by C. [87] in 1982 and by Georgakopoulos and Papadimitriou [183] in 1987. If we do not allow Steiner points, then we are looking for an MST. Note that k-SMTs are not simplifications of SMTs because there is no freedom to insert Steiner points in a k-SMT. The following problems are often confused in the literature: the problem of finding a k-SMT and the problem of finding an SMT with at most k Steiner points. But the second problem can be unsolvable. Additionally, the combinatorial structures of k-SMTs and SMTs are quite different.^8 Note that a 1-SMT is not necessarily a star, since the Steiner point need not be connected to all given points. In order to obtain methods to determine a k-SMT we need two assumptions on the "geometry" of the metric space $(X, \rho)$:
Assumption A: There is a (positive) integer $c = c(X, \rho)$, dependent on the space only, such that the degree of any Steiner point in each k-SMT for a given set in $(X, \rho)$ is at most c.

^8 As an introductory example consider the four points (1, 0), (-1, 0), (0, 1) and (0, -1), which are the corners of a square in the Euclidean plane. A 0-SMT, that is an MST, has the length $3\sqrt{2} = 4.242\ldots$. A 1-SMT with Steiner point (0, 0) is of length 4 and an SMT with two Steiner points has the length $\sqrt{6} + \sqrt{2} = 3.863\ldots$. We will discuss this example more extensively later.
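The three lengths quoted in the square example can be checked numerically. The following is a minimal sketch, not part of the original text: it computes each tree as a Euclidean minimum spanning tree on the given points plus a fixed set of Steiner points. The positions $(a, a)$ and $(-a, -a)$ with $a = (3 - \sqrt{3})/6$ for the two Steiner points are an assumption, obtained from the 120 degree angle condition; the helper name `mst_length` is illustrative only.

```python
import math

def mst_length(points):
    """Length of a Euclidean minimum spanning tree (Prim's algorithm)."""
    points = list(points)
    # best[i] = cheapest known connection of point i to the growing tree
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0.0
    while best:
        i = min(best, key=best.get)
        total += best.pop(i)
        for j in best:
            best[j] = min(best[j], math.dist(points[i], points[j]))
    return total

corners = [(1, 0), (-1, 0), (0, 1), (0, -1)]
a = (3 - math.sqrt(3)) / 6  # assumed offset of the two Steiner points

length_0_smt = mst_length(corners)                        # no Steiner point
length_1_smt = mst_length(corners + [(0, 0)])             # Steiner point (0, 0)
length_smt = mst_length(corners + [(a, a), (-a, -a)])     # two Steiner points
```

Under these assumptions the three values should agree with $3\sqrt{2}$, 4 and $\sqrt{6} + \sqrt{2}$, and they decrease as more Steiner points are allowed.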
Let us start with an example showing that such numbers indeed exist for specific spaces. Let $T = (V, E)$ be a k-SMT interconnecting a finite set of points in the Euclidean plane. Consider a vertex $v$ of $T$ and two of its neighbors $v_1$ and $v_2$. Since $T$ is a tree of minimal length, the side $v_1 v_2$ has maximal length in the triangle $K$ spanned by the nodes $v$, $v_1$ and $v_2$. Hence, the angle at $v$ is the greatest in $K$ and is therefore at least 60°. Consequently, if we apply this argument to all neighbors of $v$, we see that there are not more than six of these angles at $v$. So, the degree of $v$ is at most six.

Assume that the degree of a Steiner point $v$ in $T$ equals 6. Let $v_1, \ldots, v_6$ be the neighbors placed in a cyclic order around $v$. Then $\mathrm{conv}\{v, v_i, v_{(i+1) \bmod 6}\}$ is an equilateral triangle for all $i = 1, \ldots, 6$, and consequently $v_1, \ldots, v_6$ form a regular hexagon. Let $T' = (\{v_1, \ldots, v_6\}, E')$ be an MST for the neighbors of $v$, so we have

$$L(T') = 5 \cdot l < 6 \cdot l = \sum_{i=1}^{6} \|v - v_i\|, \qquad (4.10)$$

where $l$ denotes the side-length of the hexagon. Then the tree

$$T'' = (V \setminus \{v\},\ E \cup E' \setminus \{v v_i : i = 1, \ldots, 6\}) \qquad (4.11)$$

is shorter than $T$ (and contains one Steiner point less). This contradicts the assumption that $T$ was a k-SMT. Hence, in the Euclidean plane we may assume that such a number $c$ exists and is at most 5.

The systematic study of such upper bounds is the core of determining the quantity $c(X, \rho)$. It is not hard to see that the number $c = c(X, \rho)$ can be determined by considering 1-SMTs only. Recall the Steiner ratio $m(X, \rho)$, which is a measure of the quality of the Minimum Spanning Tree as a solution of Steiner's Problem in the metric space $(X, \rho)$. If $m(X, \rho) = 1$, then any SMT and any k-SMT is an MST. Otherwise, if $m(X, \rho)$ is less than one, then we may assume that $c(X, \rho) \geq 3$.^9
^9 For Banach-Minkowski spaces, the existence of the number c is shown in [90] and [113], and a complete discussion of the values for this quantity is given.

To understand the following facts we define Hadwiger numbers of such spaces. Let $B$ be a unit ball of the d-dimensional affine space $A_d$. A translation of $B$ is a congruent copy of $B$ moved to another location in space while the original orientation of $B$ is preserved. The Hadwiger number $z_d(B)$ for $B$ in $A_d$, or the Hadwiger number of the Banach-Minkowski space $M_d(B)$, is the maximum number of nonoverlapping translations of $B$ which can be brought into contact with $B$. Therefore, this number is also called the kissing number of $B$. Grünbaum [196] showed that

$$z_d(B) = \max\{|W| : W \subseteq \text{boundary of } B,\ \|w - w'\| \geq 1,\ w, w' \in W,\ w \neq w'\}. \qquad (4.12)$$
Theorem 4.2.1 (C. [92], [93]) For the Banach-Minkowski space $M_d(B)$, the quantity $c$ is at most the Hadwiger number.
Assumption B(k): For each number n between 3 and $k \cdot c(X, \rho) - k + 1$ there is an algorithm $S_n$ for finding the location of at most k Steiner points in a shortest tree for a finite set of n points in which each Steiner point has degree at most $c(X, \rho)$. For such algorithms in some classes of metric spaces see [92], [107], [231] and [292].

The bound of Theorem 4.2.1 was independently proved by Robins and Salowe [362] for finite-dimensional $L_p$ spaces. Unfortunately, our knowledge about the exact values of the Hadwiger number $z_d(B)$ for a unit ball $B$ in a d-dimensional affine space is limited. But we do know the following:

Observation 4.2.2 Let $z_d(B)$ be the Hadwiger number of the unit ball $B$ in the d-dimensional affine space $A_d$. Then it holds that
(a) (Hadwiger [201]) $z_d(B) \leq 3^d - 1$.
(b) (Groemer [192]) $z_d(B) = 3^d - 1$ if and only if $B$ is a parallelepiped.
A famous controversy between Gregory and Newton in 1694 concerned the determination of $z_3(B(2))$, the Hadwiger number of the three-dimensional Euclidean space, compare [283] and [421]. It is not simple to see that this number is 12, [4]. Odlyzko and Sloane [324] find good estimates of $z_d(B(2))$ for the dimension d between 1 and 24. A nice overview of the kissing numbers is included in the book about sphere packings by Zong [475].

Observation 4.2.3 (Larman, Zong [273]) The Hadwiger number $z_d(B(p))$ of $L_p$-spaces grows exponentially in the dimension d:

$$3^{0.1072\ldots d\,(1+o(1))} \leq z_d(B(p)) \leq 3^d. \qquad (4.13)$$
Moreover, the Hadwiger numbers for planar convex bodies are completely determined:

Observation 4.2.4 (Grünbaum [196])

$$z_2(B) = \begin{cases} 8 & : B \text{ is a parallelogram} \\ 6 & : \text{otherwise.} \end{cases}$$
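The characterization (4.12) makes lower bounds on planar Hadwiger numbers easy to certify: it suffices to exhibit points on the boundary of the unit ball whose pairwise distances in the corresponding norm are at least 1. The sketch below, which is an illustration and not part of the original text, does this for the square (maximum norm) and for the Euclidean disc. It only certifies the lower bounds 8 and 6; the matching upper bounds are the content of the observation. All function and variable names are illustrative.

```python
import math
from itertools import combinations

def max_norm(v):
    return max(abs(v[0]), abs(v[1]))

def euclid_norm(v):
    return math.hypot(v[0], v[1])

def min_pairwise_distance(points, norm):
    """Smallest distance between two distinct points, measured in `norm`."""
    return min(norm((p[0] - q[0], p[1] - q[1]))
               for p, q in combinations(points, 2))

# Eight boundary points of the square [-1, 1]^2: corners and edge midpoints.
square_contacts = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)
                   if (x, y) != (0, 0)]

# Six boundary points of the Euclidean unit disc, 60 degrees apart.
disc_contacts = [(math.cos(i * math.pi / 3), math.sin(i * math.pi / 3))
                 for i in range(6)]

square_ok = min_pairwise_distance(square_contacts, max_norm) >= 1
disc_ok = min_pairwise_distance(disc_contacts, euclid_norm) >= 1 - 1e-12
```

The small tolerance in `disc_ok` only absorbs floating-point rounding, since neighbouring points on the disc are at distance exactly 1.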
And,

Theorem 4.2.5 (Swanepoel [414]) The number of edges incident to a Steiner point of a k-SMT in a Banach-Minkowski plane $M_2(B)$ can never be six or more, except for the case when the unit ball $B$ is an affinely regular hexagon.
It seems that the classification of all planes in which the maximal degree is exactly 5 is too hard, because we even have difficulties deciding this question in the Euclidean plane. Here the quantity $c$ equals 4, see [369]. For more facts about this quantity in Banach-Minkowski spaces, compare [92] and [99].
Theorem 4.2.6 (C. [94]) Let $(X, \rho)$ be a metric space which fulfils both the assumptions A and B(k). Let $N$ be a finite set in $X$. Then a k-SMT for $N$ can be found by the following procedure:

1. Compute an MST $T(0) = (N, E)$ for $N$;
2. For all subsets $N'$ of $N$ with $n = |N'| = 3, \ldots, \min\{|N|, kc - k + 1\}$ do:
   Generate all partitions of $N'$ in subsets $N'_1, \ldots, N'_t$; then
   (i) Use algorithm $S_n$ to find a shortest tree $H(N'_i) = (V'_i, E'_i)$ for $N'_i$;
   (ii) $G := (N \cup \bigcup_{i=1}^{t} V'_i,\ E \cup \bigcup_{i=1}^{t} E'_i)$;
   (iii) Determine a minimum spanning tree $T(N')$ in $G$;
3. A shortest tree in the family
   $$\{T(0)\} \cup \{T(N') : N' \subseteq N,\ |N'| = 3, \ldots, \min\{|N|, kc - k + 1\}\}$$
   is a k-SMT for $N$ in $(X, \rho)$.

The algorithm enumerates all subsets $N'$ and computes for each $N'$ first a family of shortest trees and then a minimum spanning tree in the graph $G$. This algorithm has a running time of
Hence, assuming that $k = O(1)$, we have

Corollary 4.2.7 A k-SMT for a finite set of n points in a metric space which fulfils the assumptions A and B(k) can be found in $O(n^{ck-k+2} \log n)$ time.

Clearly, if $N$ is a finite set of given points, then we have

$$\begin{aligned}
L(\text{MST for } N) &= L(0\text{-SMT for } N) \\
&\geq L(1\text{-SMT for } N) \\
&\geq \cdots \\
&\geq L((|N|-3)\text{-SMT for } N) \\
&\geq L((|N|-2)\text{-SMT for } N) \\
&= L(\text{SMT for } N) \\
&\geq \tfrac{1}{2} \cdot L(\text{MST for } N).
\end{aligned}$$
Moreover it is shown, see [94], that

$$L(k\text{-SMT for } N) \geq L((k-1)\text{-SMT for } N) \cdot \frac{k}{k + \gamma(X, \rho)} \qquad (4.15)$$

for all $k > 0$, where

Now we can attack
The Weighted Modification of Steiner's Problem
Given: A finite set $N$ of points in a metric space $(X, \rho)$ and a predetermined nonnegative real number $\alpha$.
Find: A connected graph $G = (V, E)$ such that $N \subseteq V$ and the modified length

$$C(G) = C(\alpha, \rho)(G) = \alpha \cdot |V \setminus N| + L(X, \rho)(G) \qquad (4.17)$$

is minimal.

Such a graph must be a tree and is called a Steiner Minimal Tree weighted by the real $\alpha$, or briefly an SMT($\alpha$), for $N$ in $(X, \rho)$. For $\alpha = 0$ we obtain a usual SMT. For $\alpha > 0$ an SMT($\alpha$) can assume different structures than those available to SMTs. More precisely, an SMT(0) is an ordinary SMT, while on the other hand, if $\alpha$ is the length of an MST for a finite set $N$ of points, then an SMT($\alpha$) must be an MST. Consequently, the number of Steiner points produced decreases as the weight $\alpha$ runs from zero to infinity.

Consider the following introductory example: Interconnect the four points (1, 0), (-1, 0), (0, 1) and (0, -1), which are the corners of a square in the Euclidean plane:

Shortest tree            Length L(.)                          Number of Steiner points
$T_0$ = MST = 0-SMT      $3\sqrt{2} = 4.242\ldots$            0
$T_1$ = 1-SMT            4                                    1
$T_2$ = SMT = 2-SMT      $\sqrt{6} + \sqrt{2} = 3.863\ldots$  2

Figure 4.1 Minimal Networks (panels: MST = 0-SMT; 1-SMT; SMT = 2-SMT)

Then it is easy to calculate that $T_1$ is an SMT(0.2) and $T_0$ is an SMT(0.4). Underwood [430] presents many properties of SMT($\alpha$)s in the Euclidean plane
and a modified Melzak procedure which computes an SMT($\alpha$) for a given set of points.

In (4.15) we saw that the best addition of k Steiner points to the initial set of given points cannot drastically improve the approximation in comparison to the best addition of k - 1 Steiner points, if k is a large number. More precisely: Let $N$ be a finite set of points in a metric space. Then the relative defect when going from a (k - 1)-SMT to a k-SMT for $N$ is monotonically decreasing in k and tends to zero as k runs to infinity. This fact is useful to estimate the number k for k-SMTs depending on the number $\alpha$ for SMT($\alpha$)s:
Theorem 4.2.8 Let $(X, \rho)$ be a metric space which fulfils both the assumptions A and B(k). Let $N$ be a finite set in $X$. If we seek an SMT($\alpha$) for the set $N$ of given points in $(X, \rho)$, we are only interested in the k-SMTs for $N$ with

$$k \leq \frac{2}{\alpha} \cdot L(\text{MST for } N), \qquad (4.18)$$

where $\gamma(X, \rho)$ is defined in (4.16).^10

Using the theorem we have an a priori bound for an SMT($\alpha$), namely

and a better bound, found during the computation from the (k - 1)th step to the kth step, namely
^10 Since we assume that $c(X, \rho) \geq 3$ we have ...
Paying attention to 4.2.7 we find

Corollary 4.2.9 The search for an SMT($\alpha$) for a set $N$ of n given points consumes

$$O\!\left(n^{\frac{2}{\alpha}(c-1) \cdot L + 2} \log n\right) \qquad (4.21)$$

time, where $L = L(\text{MST for } N)$ and $c = c(X, \rho)$.
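For the introductory square example of this subsection, the modified length (4.17) can be tabulated directly. The sketch below is an illustration, not part of the original text. It assumes the three tree lengths quoted in the example ($3\sqrt{2}$, 4 and $\sqrt{6} + \sqrt{2}$ for zero, one and two Steiner points) and compares only these three candidates, which suffices because an SMT($\alpha$) with j Steiner points can be replaced by a j-SMT without increasing the modified length. The names `trees`, `modified_length` and `best_tree` are illustrative.

```python
import math

# (length, number of Steiner points) for the three trees of the square example
trees = {
    "T0": (3 * math.sqrt(2), 0),                 # MST = 0-SMT
    "T1": (4.0, 1),                              # 1-SMT
    "T2": (math.sqrt(6) + math.sqrt(2), 2),      # SMT = 2-SMT
}

def modified_length(length, steiner_points, alpha):
    """C(G) = alpha * |V \\ N| + L(G), as in (4.17)."""
    return alpha * steiner_points + length

def best_tree(alpha):
    """Name of the candidate tree with the smallest modified length."""
    return min(trees, key=lambda name: modified_length(*trees[name], alpha))
```

With these values, `best_tree(0.2)` selects $T_1$ and `best_tree(0.4)` selects $T_0$, in agreement with the statement in the text; for $\alpha = 0$ the ordinary SMT $T_2$ is selected.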
4.2.2 A monotonic iterative procedure to approximate trees of minimal length
Using 1-SMTs we can find a fast-running and, in general, good iterative procedure to approximate shortest trees. We apply a procedure for creating a 1-SMT repeatedly, meaning that we start with the given finite set, and successively add Steiner points, one Steiner point at a time. Note that once added, a Steiner point cannot be removed.^11 We call such a method a monotonic iterative algorithm. During the course of such an algorithm, a sequence of sets of points is constructed such that

$$L(X, \rho)(\text{MST for } V_{k+1}) \leq L(X, \rho)(\text{MST for } V_k), \qquad (4.23)$$

for $k \geq 0$. It is, however, possible that such constructions do not produce an SMT, though the method appears to produce shorter trees on average than other known heuristics in many metric spaces.^12 The iterated 1-Steiner heuristic of Kahng and Robins [247] is an example of a monotonic iterative algorithm. We generalize this greedy strategy in the following way:

^11 In this sense, this method is greedy.
^12 Salowe and Warme [377] show that for a specific set of given points in the plane with rectilinear distance, such a monotonic iterative algorithm does not construct an SMT. But empirical evidence suggests that in general this procedure creates a tree whose length is not far from the length of an SMT.

Procedure 4.2.10 Let $N$ be a finite set of n points in a metric space $(X, \rho)$ which satisfies the assumptions A and B(1).
1. Determine an MST $T^{(0)} = (V_0, E_0)$ for $N$;
2. For $k \geq 1$ find a 1-SMT $T^{(k)} = (V_k, E_k)$ for $V_{k-1}$;
3. Terminate as soon as one of the following things is true:
   - n - 2 iterations have been executed;
   - $L(X, \rho)(T^{(k)}) = m(X, \rho) \cdot L(X, \rho)(T^{(0)})$;
   - $L(X, \rho)(T^{(k)}) = L(X, \rho)(T^{(k-1)})$.^13
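The steps above can be sketched for the plane with rectilinear distance. This is a minimal illustration, not the author's implementation: it assumes that the best single Steiner point may be searched among the grid points formed by the x- and y-coordinates of the current point set (the Hanan grid), it finds each 1-SMT by brute force as the best MST over all such candidates, and it stops after n - 2 iterations or when an iteration brings no improvement. All names are illustrative.

```python
from itertools import product

def rect(p, q):
    """Rectilinear (Manhattan) distance."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def mst_length(points, dist=rect):
    """Length of a minimum spanning tree (Prim's algorithm)."""
    points = list(points)
    best = {i: dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0
    while best:
        i = min(best, key=best.get)
        total += best.pop(i)
        for j in best:
            best[j] = min(best[j], dist(points[i], points[j]))
    return total

def iterated_one_steiner(terminals, dist=rect):
    """Greedy sketch: add one Steiner point at a time, never remove one."""
    points = list(terminals)
    steiner = []
    current = mst_length(points, dist)
    for _ in range(max(len(terminals) - 2, 0)):
        xs = sorted({p[0] for p in points})
        ys = sorted({p[1] for p in points})
        best_len, best_pt = current, None
        for cand in product(xs, ys):          # assumed candidate set
            if cand in points:
                continue
            length = mst_length(points + [cand], dist)
            if length < best_len:
                best_len, best_pt = length, cand
        if best_pt is None:                   # no improvement: terminate
            break
        points.append(best_pt)
        steiner.append(best_pt)
        current = best_len
    return current, steiner

length, added = iterated_one_steiner([(0, 1), (1, 0), (2, 1), (1, 2)])
```

For the four points in the usage line the MST has rectilinear length 6, and the sketch adds the single Steiner point (1, 1), giving a tree of length 4.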
Clearly, this method only consumes polynomially bounded time, namely

$$O(n^2) + (n - 2) \cdot O(n^{c(X, \rho)+1} \log n) = O(n^{c(X, \rho)+2} \log n). \qquad (4.24)$$
Moreover, let $t_{1\text{-SMT}}(n)$ be the time required to find a 1-SMT for a set of n given points. We may assume that

1. $t_{1\text{-SMT}}$ is polynomially bounded;
2. $t_{1\text{-SMT}}(n)$ is at least the time to find an MST for n given points, and consequently
   $$t_{1\text{-SMT}}(n) \geq \Omega(n \log n); \qquad (4.25)$$
3. $t_{1\text{-SMT}}$ is an increasing function in the size of the input.
All these facts imply

Remark 4.2.11 The procedure 4.2.10 runs in polynomially bounded time. If $t_{1\text{-SMT}}(n)$ is the time needed to find a 1-SMT for n given points then 4.2.10 needs

$$O(n \cdot t_{1\text{-SMT}}(n)) \leq O(n^{c(X, \rho)+2} \log n). \qquad (4.26)$$

^13 Note that these conditions are not independently valid. In particular, if the first or the second holds, then the third also holds.

For applications of this strategy in the rectilinear plane and in networks see [471] and [193], respectively. The length of the tree produced by the algorithm 4.2.10 is at most the length of an MST, and on the other hand we have

Observation 4.2.12 Let $T^{(k)}$ be the tree for a given finite set constructed by 4.2.10 in the kth step. Then
We define the relative performance ratio of the metric space $(X, \rho)$ by

$$\frac{1}{\mathrm{error}(X, \rho)(k : k')} = \inf\left\{ \frac{L(X, \rho)(T^{(k)} \text{ for } N)}{L(X, \rho)(T^{(k')} \text{ for } N)} : N \text{ is a finite set in } (X, \rho) \right\}. \qquad (4.28)$$

We have

$$\frac{L(T^{(k)})}{L(T^{(k')})} \geq \frac{L(T^{(k)})}{L(T^{(0)})} \geq \frac{L(\text{SMT for } N)}{L(\text{MST for } N)}.$$

This implies
Observation 4.2.13 For the relative performance ratio of the metric space $(X, \rho)$ the inequality

$$1 \leq \mathrm{error}(X, \rho)(k : k') \leq \frac{1}{m(X, \rho)} \leq 2 \qquad (4.31)$$

holds.

And moreover,

Theorem 4.2.14 (C. [97]) For the relative performance ratio of the metric space $(X, \rho)$ satisfying the assumptions A and B(1) it holds that

$$1 \leq \mathrm{error}(X, \rho)(k : k - 1) \leq 1 + \frac{\gamma(X, \rho)}{k} \qquad (4.32)$$

for all $k > 0$, where $\gamma(X, \rho)$ is defined in (4.16).
Now we have two performance error bounds: the absolute, a priori, bound given in 4.2.13 and the relative, a posteriori, one given in 4.2.14. If k runs to infinity then, by 4.2.14, the relative performance ratio error(X, ρ)(k : k - 1) tends to one. Of course, we can also apply algorithm 4.2.10 in metric spaces which do not satisfy the assumptions A and/or B(1), but then we do not obtain the nice performance ratio of 4.2.14.
4.2.3 Component-size bounded Steiner Trees
There is an approximation method for Steiner's Problem which uses trees that can contain Steiner points, but not in an arbitrary sense: Let $N$ be a finite set of points in a metric space $(X, \rho)$. Let $T = (V, E)$ be a tree interconnecting $N$. For such trees we assume that the degree of each given point is at least one and the degree of each Steiner point in $V \setminus N$ is at least three. However, a given point in such a tree need not be a leaf. When a given point $v$ is not a leaf, $T$ can be decomposed (by splitting at the given point) into several smaller trees, so that given points only occur as leaves. More precisely:
1. Define $G = (V \setminus \{v\},\ E \setminus \{vv' : v' \text{ is a neighbor of } v\})$. ($G$ is a forest with $g(v)$ components $G_i = (V_i, E_i)$, $i = 1, \ldots, g(v)$.)
2. Define for $i = 1, \ldots, g(v)$ the graph $G^{(i)} = (V_i \cup \{v_i\},\ E_i \cup \{v_i v' : v' \text{ is a neighbor of } v \text{ in } T \text{ and } v' \text{ is in } V_i\})$, where $v_i$ is not in $V$.

In this way, every tree interconnecting $N$ is decomposed into so-called full components. The size of a full component is the number of given points in the full component.
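The decomposition can be sketched as a small graph routine. The following is an illustrative sketch under stated assumptions, not the author's method: the tree is given as a list of edges between hashable vertex labels, the given points are passed as a set, and two edges are put into the same full component exactly when they can be linked through Steiner points only. This yields the same components as splitting at every given point that is not a leaf. The function names are hypothetical.

```python
from collections import defaultdict

def full_components(edges, terminals):
    """Split a tree into full components (lists of edges)."""
    adjacent = defaultdict(list)
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)
    seen = set()
    components = []
    for u, v in edges:
        if frozenset((u, v)) in seen:
            continue
        seen.add(frozenset((u, v)))
        stack, component = [(u, v)], []
        while stack:
            a, b = stack.pop()
            component.append((a, b))
            for x in (a, b):
                if x in terminals:
                    continue  # do not grow a component through a given point
                for y in adjacent[x]:
                    edge = frozenset((x, y))
                    if edge not in seen:
                        seen.add(edge)
                        stack.append((x, y))
        components.append(component)
    return components

def component_size(component, terminals):
    """Number of given points in a full component."""
    return len({x for edge in component for x in edge if x in terminals})

given = {"a", "b", "c", "d", "e"}
tree = [("a", "s"), ("b", "s"), ("s", "c"), ("c", "t"), ("t", "d"), ("t", "e")]
sizes = sorted(component_size(c, given) for c in full_components(tree, given))
```

In the usage example the given point `c` has degree two, so the tree splits at `c` into two full components of size 3 each; a spanning tree without Steiner points splits into single edges, that is, full components of size 2.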
A k-size tree for $N$ is a tree interconnecting all points of $N$ with all full components of size at most k. A k-size SMT is the shortest one among all k-size trees.

The k-size Steiner's Problem
Given: A finite set $N$ of points in a metric space $(X, \rho)$ and an integer $k \geq 2$.
Find: A network $G = (V, E)$ such that
(i) $N \subseteq V$,
(ii) Every full component contains at most k given points, and
(iii) $L(X, \rho)(G)$ is minimal.

For k = 2 we look for an MST. For every $k \geq 4$ this problem is NP-hard, [355].
Clearly, we are interested in the greatest lower bound for the ratio between the lengths of an SMT and a k-size SMT for the same set of points in a metric space:

$$m^{(k)} = m^{(k)}(X, \rho) = \inf\left\{ \frac{L(\text{SMT for } N)}{L(k\text{-size SMT for } N)} : N \subseteq (X, \rho) \text{ is a finite set} \right\}. \qquad (4.33)$$

This quantity is called the k-size-Steiner ratio of the metric space $(X, \rho)$. In any metric space $(X, \rho)$ a 2-size SMT is an MST. Hence, the 2-size-Steiner ratio is the Steiner ratio:

$$m^{(2)}(X, \rho) = m(X, \rho). \qquad (4.34)$$
Furthermore,

Observation 4.2.15 For the k-size-Steiner ratio $m^{(k)}$, $k \geq 2$, the following is known:
(a) (Zelikovsky [473]) For any metric space $(X, \rho)$ it holds that

(Du [136]) This lower bound is the best possible one over the class of all metric spaces.
(b) (Du [136]) For any metric space $(X, \rho)$ it holds that

where $r = \lfloor \log_2 k \rfloor$.

Now we can describe the performance ratio of approximations for Steiner's Problem more exactly. Zelikovsky [473] showed that there exists a polynomial-time approximation $A$ for Steiner's Problem in a metric space $(X, \rho)$ with performance ratio

$$\mathrm{error}(A) = \frac{1}{2} \cdot \left( \frac{1}{m^{(2)}(X, \rho)} + \frac{1}{m^{(3)}(X, \rho)} \right),$$

provided that an SMT for three given points can be computed in polynomial time. Using a similar idea, Berman and Ramaiyer [40] showed that there is a polynomial-time approximation $A_k$ with performance ratio

$$\mathrm{error}(A_k) \leq \frac{1}{1 \cdot 2} \cdot \frac{1}{m^{(2)}(X, \rho)} + \frac{1}{2 \cdot 3} \cdot \frac{1}{m^{(3)}(X, \rho)} + \frac{1}{3 \cdot 4} \cdot \frac{1}{m^{(4)}(X, \rho)} + \cdots + \frac{1}{k-1} \cdot \frac{1}{m^{(k)}(X, \rho)}, \qquad (4.38)$$
provided that for any k an SMT for k points can be computed in polynomial time.

Clearly, we are interested in the k-size-Steiner ratio for specific spaces. For the plane with rectilinear distance we have

k         $m^{(k)}$          Source
= 2       2/3                Hwang [228]
= 3       4/5                Berman and Ramaiyer [40]
≥ 4       (2k - 1)/(2k)      Borchers et al. [52]
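The bound of Berman and Ramaiyer can be evaluated with exact rational arithmetic. The sketch below is an illustration under stated assumptions, not part of the original text. It reads (4.38) as the sum of $\frac{1}{(i-1)i} \cdot \frac{1}{m^{(i)}}$ for $i = 2, \ldots, k-1$ plus the final term $\frac{1}{k-1} \cdot \frac{1}{m^{(k)}}$, and it takes the rectilinear values 2/3, 4/5 and (2k - 1)/(2k) for the k-size-Steiner ratios. Both the reading of the formula and the function names are assumptions.

```python
from fractions import Fraction

def rectilinear_ratio(k):
    """Assumed k-size-Steiner ratio of the rectilinear plane."""
    if k == 2:
        return Fraction(2, 3)
    if k == 3:
        return Fraction(4, 5)
    return Fraction(2 * k - 1, 2 * k)

def performance_bound(k, ratio=rectilinear_ratio):
    """Evaluate the assumed form of (4.38) for a given ratio function."""
    total = sum(Fraction(1, (i - 1) * i) / ratio(i) for i in range(2, k))
    return total + Fraction(1, k - 1) / ratio(k)

bounds = {k: performance_bound(k) for k in (2, 3, 4)}
```

Under these assumptions, k = 2 reproduces the MST guarantee 3/2 for the rectilinear plane, k = 3 gives 11/8, and k = 4 gives 75/56.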
Such nice results for the Euclidean plane are not yet known. Borchers and Du [51] determine the k-size-Steiner ratio for graphs exactly: For $k = 2^r + s$, where $0 \leq s < 2^r$, this quantity is
4.2.4 The relative neighborhood problem
The MST problem has numerous applications in geometric network design. We saw, and will see again, that it is useful in approximation algorithms for some NP-hard problems. Consequently, it is of interest to investigate the geometric structure of MSTs more thoroughly. Let $N$ be a finite set of points in a metric space $(X, \rho)$. Two points $v$ and $v'$ of $N$ are said to be relative neighbors if and only if $\rho(v, v') \leq \rho(v, w)$ or $\rho(v, v') \leq \rho(v', w)$ for all $w \in N$. The geometric interpretation for this is that the so-called lune of $v$ and $v'$,

contains no points of $N$. Now:

The Relative Neighborhood Problem
Given: A finite set $N$ of points in a metric space $(X, \rho)$.
Find: A graph $G = (N, E)$ in which all relative neighbors are connected by an edge.
A solution to this problem is called a relative neighborhood graph (RNG) for $N$. For a finite set of points the RNG and MST are relatives:

$$\text{MST} \subseteq \text{RNG}.$$

We now prove this fact.

Theorem 4.2.16 (Katajainen [253]) An MST for $N$ in a metric space is a subgraph of the RNG for $N$.
Proof. Let $vv'$ be an edge in an MST $T = (N, E)$ and assume that it is not an edge in the RNG. This would imply that there exists a point $w$ of $N$ which is inside the lune of $v$ and $v'$. Without loss of generality, we can assume that there is a path in $T$ which connects $w$ to $v$ and which does not contain the edge $vv'$. Let $T' = (N, E \setminus \{vv'\} \cup \{v'w\})$. $T'$ is another spanning tree for $N$ whose total length is less than $L(T)$ since $\rho(v, v') > \rho(v', w)$. This contradiction proves the assertion.
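The statement of the theorem can be observed on small point sets. The following is a brute-force sketch for illustration only, assuming distinct points in the Euclidean plane: it builds the RNG directly from the definition of relative neighbors in cubic time (not the quadratic method cited in the text), builds an MST with a simple version of Prim's algorithm, and compares the edge sets. All names are illustrative.

```python
import math
from itertools import combinations

def relative_neighborhood_graph(points, dist=math.dist):
    """All pairs {v, u} with dist(v, u) <= max(dist(v, w), dist(u, w)) for every other w."""
    edges = set()
    for v, u in combinations(points, 2):
        d = dist(v, u)
        if all(d <= max(dist(v, w), dist(u, w))
               for w in points if w != v and w != u):
            edges.add(frozenset((v, u)))
    return edges

def mst_edges(points, dist=math.dist):
    """Edge set of a minimum spanning tree (simple Prim variant)."""
    points = list(points)
    in_tree = {points[0]}
    edges = set()
    while len(in_tree) < len(points):
        v, u = min(((a, b) for a in in_tree for b in points if b not in in_tree),
                   key=lambda e: dist(*e))
        in_tree.add(u)
        edges.add(frozenset((v, u)))
    return edges

sample = [(0, 0), (2, 0), (1, 2), (4, 1), (3, 3), (5, 4)]
rng = relative_neighborhood_graph(sample)
mst = mst_edges(sample)
```

For any such sample every MST edge should also appear in the RNG, while the RNG may contain additional edges.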
We have established that in an RNG for a given finite set $N$ two points $v$ and $v'$ are adjacent if and only if

Katajainen [254] presented a method for computing all relative neighbors, constructing the RNG for a set of given points in quadratic time.^14

^14 Moreover, for a finite set of points the RNG and DT are relatives of the MST. More precisely, in the plane with a norm derived from a smooth unit ball $B$ we have

$$\text{MST} \subseteq \text{RNG} \subseteq \text{DT}.$$

Here, a graph $G = (N, E)$ is called the Delaunay triangulation (DT) for $N$ if $G$ has the following property: An edge $vv'$ is in $E$ if and only if there is a homothetic copy $rB + u$ (with a real number $r > 0$ and a vector $u$ of the space) such that

and

$$w \notin \mathrm{int}(rB + u) \qquad (4.43)$$

for all $w \in N \setminus \{v, v'\}$. This is the so-called empty circle condition, which means that a triangle appears in the DT if and only if its circumcircle encloses none of the other given points. A DT is not necessarily a triangulation with minimal length. To find a triangulation (that is, a maximal planar graph) that minimizes the length is not known to be NP-hard, nor is it known to be solvable in polynomial time. For more information on these graphs compare Bern [41], Bern and Eppstein [42], and Eppstein [155].

4.2.5 Steiner's Problem in spaces with a weaker triangle inequality

Up to now, we have used the triangle inequality as a property of the metric. It is conceivable that slight violations of the triangle inequality should not be too deleterious with respect to performance guarantees of an approximation. Andreae and Bandelt [15] consider the deviation from the triangle inequality captured by a parameter $\tau$ in the following relaxation:

for all $v, v', w \in X$. Such a parametrized triangle inequality is given in the situation that the input data are from a fixed range of values. Assume that all distances under consideration are bounded by real numbers $L$ and $U$ in the following way:

$$L \leq \rho(v, v') \leq U \qquad (4.45)$$

for different points $v$ and $v'$. For instance, for a network $G$ we have $L = 1$ and $U = \mathrm{diam}\, G$. If $L > 0$ then $\rho(v, w) + \rho(w, v') \geq 2L$, so that $U(\rho(v, w) + \rho(w, v')) \geq 2L\rho(v, v')$. Hence,

Observation 4.2.17 The metric $\rho$ satisfies the inequality (4.44) with the parameter

$$\tau = \frac{U}{2L} \geq \frac{1}{2}. \qquad (4.46)$$

This scenario applies to the minimum spanning tree approximation for Steiner's Problem: When the parameter $\tau$ approaches 1/2, the performance guarantee factor 2 decreases and eventually reaches 1; recall 3.5.2. We can see that the factor decreases when we make the additional assumption that, for some $\tau$ with $0 < \tau \leq 1$, the set $N$ of given points satisfies the following inequality:

for all $v, v' \in N$ and $w \in X \setminus N$. Then the following is true:
Theorem 4.2.18 (Andreae, Bandelt [15]) Let $(X, \rho)$ be a metric space, and let $N$ be a finite subset of $X$ with $|N| = n > 1$. Let $0 < \tau \leq 1$. Suppose that $N$ satisfies equation (4.47) with respect to $\tau$. Let $T$ be an SMT and $T_1$ be an MST for $N$ in $(X, \rho)$. Then

if $\tau > n/(2n - 2)$, and $L(T_1) = L(T)$ otherwise.

The following example shows that the bound given in 4.2.18 is the best possible: Consider $X = N \cup \{z\}$ with the distances $\rho(v, v') = 2\tau$ for different points $v$ and $v'$ of $N$, and $\rho(v, z) = 1$.
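The threshold $n/(2n - 2)$ in this example can be seen by comparing the few possible tree shapes. The sketch below is an illustration under stated assumptions, not part of the original text: since $z$ is the only possible Steiner point, a tree either avoids $z$ (an MST of length $(n-1) \cdot 2\tau$) or joins $j \geq 2$ given points to $z$ and attaches the remaining $n - j$ given points by edges of length $2\tau$. The function names are hypothetical.

```python
from fractions import Fraction

def mst_length(n, tau):
    """MST on the n given points: n - 1 edges of length 2 * tau."""
    return (n - 1) * 2 * tau

def shortest_tree_length(n, tau):
    """Shortest tree for N in X = N + {z}, by enumerating the tree shapes."""
    candidates = [mst_length(n, tau)]          # z is not used
    for j in range(2, n + 1):                  # z is joined to j given points
        candidates.append(j * 1 + (n - j) * 2 * tau)
    return min(candidates)

n = 4
threshold = Fraction(n, 2 * n - 2)             # equals 2/3 for n = 4
```

For `n = 4` the star through $z$ has length 4. With $\tau = 1$ the MST has length 6 and is strictly longer, at $\tau = 2/3$ both lengths coincide, and for smaller $\tau$ the MST is itself a shortest tree.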
4.2.6 Eulerian cycles and the Chinese Postman Problem
If we allow more than one edge in $E$ to join two vertices in $V$, meaning that we allow parallel edges in the graph, we shall call the pair $(V, E)$ a multigraph. In this sense, any graph is also a multigraph. Let $G$ be a graph. An Eulerian chain of $G$ (Eulerian cycle of $G$, respectively) is defined as a chain (cycle, respectively) that uses each edge of $G$ exactly once.^15 A graph which contains an Eulerian cycle is called an Eulerian graph. One of the oldest combinatorial problems, accredited to Euler and written in the terminology of graph theory, can be stated as follows: When does a multigraph have an Eulerian chain or an Eulerian cycle?^16 The answer is:

Remark 4.2.19 (Euler) A multigraph has an Eulerian cycle if and only if it is connected and all vertices have even degree.

^15 Note that an Eulerian cycle is not a cycle in the usual sense, since it can contain a vertex more than once.
^16 This is a generalization of the so-called "Königsberger Brückenproblem" (in English: the problem of the bridges in the Prussian city of Königsberg). For a history of this problem and its influence on the development of the theory of graphs see Prömel [354] and Sachs [374].

The proof of 4.2.19 is well known and yields an algorithm for finding such a cycle effectively: Start with a cycle through the multigraph and add "detour" cycles until all edges are in the tour^17:
Algorithm 4.2.20 (Hierholzer, compare [246], [266]) Let $G = (V, E)$ be an Eulerian graph. Choose a vertex $v_1$ arbitrarily and apply the following recursive procedure Euler($G$, $v_1$) to find an Eulerian cycle:

1. Set $C := v_1$; $v := v_1$;
2. If $g_G(v) = 0$ then goto 4, else let $w \in N_G(v)$, $e = vw$;
3. Set $C := C, e, w$ and $v := w$; Set $E := E \setminus \{e\}$; goto 2;
4. Let $C = v_1, e_1, v_2, e_2, \ldots, v_k, e_k, v_{k+1}$; For $i := 1$ to $k$ do $C_i :=$ Euler($G$, $v_i$);
5. Set $C = C_1, e_1, C_2, e_2, \ldots, C_k, e_k, v_{k+1}$.
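The same idea can be written without recursion. The following is a minimal sketch, not a transcription of the procedure above: it uses the common stack-based variant of Hierholzer's method, in which a walk is extended until it gets stuck and vertices are then emitted while backtracking, so that detour cycles are spliced in automatically. It assumes that the input multigraph is Eulerian (connected, all degrees even) and is given as a list of vertex pairs; the function name is illustrative.

```python
from collections import Counter

def eulerian_cycle(edges, start):
    """Return an Eulerian cycle as a list of vertices, assuming one exists."""
    adjacent = {}
    for index, (u, v) in enumerate(edges):
        adjacent.setdefault(u, []).append((v, index))
        adjacent.setdefault(v, []).append((u, index))
    used = [False] * len(edges)
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        incident = adjacent.get(v, [])
        # discard edges already traversed from the other endpoint
        while incident and used[incident[-1][1]]:
            incident.pop()
        if incident:
            w, index = incident.pop()
            used[index] = True
            stack.append(w)          # extend the current walk along edge vw
        else:
            cycle.append(stack.pop())  # v is finished: emit it
    return cycle[::-1]

# Two triangles sharing vertex 1, plus a doubled edge between 2 and 3.
multigraph = [(1, 2), (2, 3), (3, 1), (1, 4), (4, 5), (5, 1), (2, 3), (3, 2)]
tour = eulerian_cycle(multigraph, 1)
```

Each edge is pushed and popped a bounded number of times, so the running time of this sketch is linear in the number of vertices and edges.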
It is not hard to see that this algorithm runs in $O(|V| + |E|)$ time. Furthermore, the remark 4.2.19 also has two simple consequences: Firstly, a multigraph has an (open) Eulerian chain if and only if it is connected and has exactly two vertices of odd degree. Moreover,

Observation 4.2.21 Any graph contains a chain that uses each edge exactly twice.

The Euler problem is a purely combinatorial question.^18

^17 In other words, a connected graph is Eulerian if and only if the set of edges can be partitioned into cycles.
^18 A question similar to the problem of Euler was raised by Hamilton in 1856. Let $G$ be a graph. A Hamiltonian cycle is a cycle that contains all vertices of $G$. The problem is to decide whether or not $G$ has a Hamiltonian cycle; if so then $G$ is called a Hamiltonian graph. Hamilton's problem sounds quite similar to Euler's, but this is not the case, as there is an essential difference: An Eulerian cycle also contains all vertices of the graph, but a Hamiltonian cycle need not contain all edges. And indeed, no efficient method is known to check whether a given graph has a Hamiltonian cycle. The problem is NP-complete [251].

Now we are interested in the optimization version: making a given (connected) graph Eulerian by adding edges. This problem was introduced by the Chinese mathematician Guan [197] and later named:
The Chinese Postman Problem
Given: A network $G = (V, E, f)$.
Find: Positive integers $n(e)$ for each edge $e \in E$, such that for the multigraph $G' = (V, E')$ the following properties hold:
(i) $G'$ arises from $G$ by taking $n(e)$ copies of each edge $e \in E$;
(ii) $G'$ contains an Eulerian cycle, and
(iii) The length of $G'$ is minimal.
In view of 4.2.21 it makes no sense to traverse an edge more than twice; in other words, we may assume that $n(e) = 1$ or $2$. So the problem is to find a subset $\bar{E} \subseteq E$ with minimal value for $L(V, \bar{E})$ such that the multigraph $G' = (V, E \cup \bar{E})$ contains an Eulerian cycle.19 Here we have to use so-called matchings. A matching of a graph is a subset of edges such that no two edges share a common vertex. A perfect matching is a matching in which each vertex has degree exactly 1, i.e. it is 1-regular. Of course, a perfect matching exists only for an even number of vertices, and we know by 1.2.1 that in every graph the number of vertices with odd degree is even. In view of these observations we find
Algorithm 4.2.22 Let $G = (V, E, f)$ be a network. Then we find a Chinese Postman Tour by the following procedure:
1. Let $V'$ be the vertices of $G$ which have an odd degree; Compute a perfect matching $M = (V', E')$ with minimal length; Create $G \cup M$;
2. Find an Eulerian cycle in $G \cup M$;
3. Transform it into a solution of the Chinese Postman Problem of the original graph $G$, i.e. replace each edge of $M$ with the edges of the shortest path between the vertices.
19 Of course, if all vertices in the network have even degree, then the problem is already solved.
To apply this algorithm, we need a method of constructing a perfect matching with minimal length for a set containing an even number of points:
The Minimum Perfect Matching Problem
Given: A set $N = \{v_1, \ldots, v_n\}$ of points in a metric space $(X, \rho)$, where $n$ is an even number.
Find: A perfect matching $M$ for $N$ such that the length of $M$ is minimal.
Introduce a variable $x_{ij} = x_{ji}$ for each edge between $v_i$ and $v_j$, and let $f_{ij} = \rho(v_i, v_j)$. Then the minimum perfect matching problem can be formulated as the following integer linear program:
$$\text{minimize } \sum_{1 \leq i < j \leq n} f_{ij} x_{ij}$$
subject to
$$\sum_{j=1,\, j \neq i}^{n} x_{ij} = 1, \quad i = 1, \ldots, n; \qquad x_{ij} \in \{0, 1\}, \quad 1 \leq i < j \leq n.$$
This is a simple description, but not an easy strategy, since integer linear programming is NP-complete [179]. On the other hand, it is known that the minimum perfect matching problem can be solved in cubic time, compare [266] or [333]. Moreover, for solving this problem in the Euclidean plane Varadarajan [434] has found an $O(n^{3/2} \log^5 n)$-time algorithm. Consequently, the overall time of 4.2.22 is also cubic.
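For very small point sets the minimum perfect matching can be found by exhaustive search, which is handy for checking a faster implementation. The sketch below is not one of the cubic-time methods cited above: it is an exponential-time dynamic program over subsets, it assumes Euclidean points, and the name `min_perfect_matching` is hypothetical.

```python
from functools import lru_cache
from math import dist


def min_perfect_matching(points):
    """Exact minimum-length perfect matching by dynamic programming over subsets.

    Exponential in len(points); intended only for small illustrative inputs.
    Returns (total_length, list of index pairs).
    """
    n = len(points)
    if n % 2:
        raise ValueError("a perfect matching needs an even number of points")

    @lru_cache(maxsize=None)
    def best(mask):
        # mask has a 1-bit for every point that is still unmatched
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1       # lowest unmatched index
        rest = mask & ~(1 << i)
        result = (float("inf"), ())
        for j in range(i + 1, n):
            if rest >> j & 1:                     # try matching i with j
                sub_len, sub_pairs = best(rest & ~(1 << j))
                cand = dist(points[i], points[j]) + sub_len
                if cand < result[0]:
                    result = (cand, ((i, j),) + sub_pairs)
        return result

    length, pairs = best((1 << n) - 1)
    return length, list(pairs)
```

Each returned pair $(i, j)$ corresponds to a variable $x_{ij} = 1$ in the integer program; all other variables are 0.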
4.2.7 The traveling salesman problem
The following is a well-studied problem in combinatorial optimization.20
The Traveling Salesman Problem
Given: A finite set $N$ of points in a metric space $(X, \rho)$.
Find: A cycle $G = (N, E)$ embedded in $(X, \rho)$ such that $L(X, \rho)(G)$ is minimal.
A solution is called a Traveling Salesman Tour (TST) for $N$. Since we are in a metric space, we may assume that a TST does not pass a vertex more than once. Moreover, the relation to the graph version is easy to see: additional points are not necessary and we obtain
20 This problem is perhaps the most investigated problem of this class; but in several years we will probably see that Steiner's Problem is discussed more due to its wide application.
Observation 4.2.23 We may understand the Traveling Salesman Problem for a given finite set $N$ of points in a metric space $(X, \rho)$ as a problem in the complete graph $(N, \binom{N}{2})$, equipped with the length function $f$ defined by
$$f(\underline{vv'}) = \rho(v, v'). \tag{4.49}$$
For the Traveling Salesman Problem a greedy technique is not "ideally" helpful: Consider the four points $v_1 = (0, 3)$, $v_2 = (-8, 3)$, $v_3 = -v_1$ and $v_4 = -v_2$ in the Euclidean plane. A tour obtained by a greedy method is $v_1, v_2, v_4, v_3, v_1$ of length $22 + 2 \cdot \sqrt{73} = 39.08\ldots$, whereas a TST is given by $v_1, v_2, v_3, v_4, v_1$ of length 36.21
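The four-point example can be checked numerically. The sketch below assumes that "greedy" means repeatedly taking the shortest edge that keeps every degree at most two and closes no cycle early, and it takes the coordinates $v_3 = (0, -3)$ and $v_4 = (8, -3)$; all names in the code are hypothetical.

```python
from itertools import permutations
from math import dist, sqrt

# coordinates of the example; v3 = -v1 and v4 = -v2 (assumed reading)
points = {"v1": (0, 3), "v2": (-8, 3), "v3": (0, -3), "v4": (8, -3)}


def tour_length(order):
    """Length of the closed tour visiting the named points in the given order."""
    return sum(dist(points[a], points[b])
               for a, b in zip(order, order[1:] + order[:1]))


def greedy_edge_tour():
    """Shortest edges first, keeping degree <= 2 and avoiding early cycles."""
    names = list(points)
    edges = sorted((dist(points[a], points[b]), a, b)
                   for i, a in enumerate(names) for b in names[i + 1:])
    degree = {v: 0 for v in names}
    comp = {v: v for v in names}               # naive component labels
    chosen = []
    for _, a, b in edges:
        if degree[a] < 2 and degree[b] < 2 and comp[a] != comp[b]:
            chosen.append((a, b))
            degree[a] += 1
            degree[b] += 1
            old, new = comp[b], comp[a]
            for v in names:                    # merge the two components
                if comp[v] == old:
                    comp[v] = new
    # close the Hamiltonian path with the edge between its two end vertices
    ends = [v for v in names if degree[v] < 2]
    chosen.append((ends[0], ends[1]))
    return sum(dist(points[a], points[b]) for a, b in chosen)


greedy = greedy_edge_tour()
optimal = min(tour_length(("v1",) + p) for p in permutations(("v2", "v3", "v4")))
```

Under these assumptions `greedy` equals $22 + 2\sqrt{73} \approx 39.09$ (the edges of lengths 6, 8, 8 and $\sqrt{292}$), while the brute-force minimum `optimal` is 36.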
It is NP-hard to find a TST, compare [177] or [266]. Hence, we will discuss several approximation algorithms.22 Lower bounds can be found by using spanning trees. First observe that if we have any cycle through the given points and remove one edge then we get a spanning tree. Hence,
$$L(\text{TST for } N) > L(\text{MST for } N). \tag{4.50}$$
On the other hand, with the same proof as that given for 3.5.2, we find
$$L(\text{TST for } N) \leq 2 \cdot L(\text{MST for } N).$$
These observations imply
21 The Traveling Salesman Problem is closely related to the problem of Hamiltonian cycles:
Observation 4.2.24 The problem of whether a given graph has a Hamiltonian cycle can be reduced to the Traveling Salesman Problem.
Proof. Let $G = (V, E)$ be a graph with $n$ vertices. We construct a network $G' = (V, \binom{V}{2})$ with length function $f$ given by
$$f(\underline{vv'}) = \begin{cases} 1 & : \underline{vv'} \in E \\ 2 & : \text{otherwise} \end{cases}$$
Then $G$ contains a Hamiltonian cycle if and only if there is a TST in $G'$ with length exactly $n$.
22 The more general version that allows an arbitrary length function is essentially more difficult than the problem in the metric closure. Not only is it NP-hard to solve this problem exactly, but also approximately; compare [375].
Algorithm 4.2.25 Let $N$ be a finite set of points in a metric space. Then
1. Find an MST $T$ for $N$;
2. Double every edge of $T$ to obtain an Eulerian graph $G$;
3. Find an Eulerian cycle in $G$;
4. Output the tour that visits all vertices of $G$ in the order of their first appearance in the cycle.
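The four steps above can be condensed for points in the Euclidean plane. The sketch below relies on one shortcut that is an assumption of this example rather than part of the algorithm as stated: a depth-first preorder of the MST lists the vertices in the order of their first appearance in one particular Eulerian cycle of the doubled tree, so the doubled graph is never built explicitly. The names `prim_mst` and `double_tree_tour` are hypothetical.

```python
from math import dist


def prim_mst(points):
    """Return MST edges (parent, child) of the complete Euclidean graph (O(n^2) Prim)."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n          # cheapest known connection to the tree
    parent = [None] * n
    if n:
        best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((parent[u], u))
        for v in range(n):
            d = dist(points[u], points[v])
            if not in_tree[v] and d < best[v]:
                best[v], parent[v] = d, u
    return edges


def double_tree_tour(points):
    """Vertices in order of first appearance in an Euler walk of the doubled MST."""
    if not points:
        return []
    children = {i: [] for i in range(len(points))}
    for p, c in prim_mst(points):
        children[p].append(c)
    tour, stack = [], [0]
    while stack:                       # depth-first preorder = first appearances
        v = stack.pop()
        tour.append(v)
        stack.extend(reversed(children[v]))
    return tour
```

The returned list is read as a closed tour, that is, the last vertex is joined back to the first. By the triangle inequality its length is at most twice the length of the MST.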
But we can do better.
Algorithm 4.2.26 (Christofides [82]) Let $N$ be a finite set of points in a metric space. Then
1. Find an MST $T = (N, E)$ for $N$;
2. Let $V'$ be the vertices of $T$ which have odd degree; Compute a perfect matching $M = (V', E')$ with minimal length; Create $T \cup M$;
3. Find an Eulerian cycle $G$ in $T \cup M$;
4. Create from $G$ a tour that visits the vertices of $N$ in order of their first appearance in $G$.
For the properties of both methods, we have:

Algorithm    Performance ratio    Running time
4.2.25       2                    quadratic
4.2.26       1.5                  cubic
To prove the 1/2 error bound, recall that the graph $G$ consists of $T$ and $M$, hence the length of the resulting tour $\tau$ satisfies
Let $V' = \{v_1, v_2, \ldots\}$ be the set of odd-degree vertices in $T$, in the order that they appear in the shortest tour $\mathcal{T}$. Consider the two perfect matchings of $V'$:
By the triangle inequality
Substituting (4.50) and (4.55) in (4.52) gives the assertion.
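The chain of inequalities behind the ratio 3/2 can be written out as follows. This is a sketch of the standard argument for Christofides' method; the names $M_1$, $M_2$, the index $2m$ for the number of odd-degree vertices, and the use of $\tau$ for the resulting tour are notational assumptions made here, and the displays are deliberately left unnumbered.

```latex
\begin{align*}
  % the tour shortcuts an Eulerian cycle of T \cup M
  L(\tau) &\le L(T) + L(M), \\
  % the two alternating matchings along the shortest tour
  M_1 &= \{\underline{v_1v_2}, \underline{v_3v_4}, \ldots, \underline{v_{2m-1}v_{2m}}\}, \qquad
  M_2 = \{\underline{v_2v_3}, \underline{v_4v_5}, \ldots, \underline{v_{2m}v_1}\}, \\
  % M_1 \cup M_2 is a cycle through V' in tour order; apply the triangle inequality
  L(M_1) + L(M_2) &\le L(\text{TST for } N), \\
  % M is a shortest perfect matching of V'
  L(M) &\le \min\{L(M_1), L(M_2)\} \le \tfrac{1}{2}\, L(\text{TST for } N), \\
  % combine with L(MST for N) < L(TST for N)
  L(\tau) &\le L(\text{MST for } N) + \tfrac{1}{2}\, L(\text{TST for } N)
          < \tfrac{3}{2}\, L(\text{TST for } N).
\end{align*}
```

The third line uses that the odd-degree vertices are listed in the order in which the shortest tour visits them, so replacing each stretch of the tour between consecutive odd-degree vertices by a single edge cannot increase the length.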
Finding a better approximation algorithm for the case of a general metric space is currently one of the high-profile open problems in the area of network design. For references to the Traveling Salesman Problem and its variations see [42], [119], [127], [199], [245], [266], [333] and [334].
4.2.8 Shortest multiple-edge-connected networks
We consider multiply connected graphs: For a positive integer $k$ a $k$-edge-connected graph is a graph such that for each pair of distinct vertices there are $k$ edge-disjoint paths between them.23 Equivalently,
Theorem 4.2.27 A graph $G = (V, E)$ is $k$-edge-connected if and only if $G' = (V, E \setminus E')$ is connected for any set $E' \subseteq E$ of at most $k - 1$ edges.
For a proof compare [47]. In this sense, a connected graph is 1-edge-connected.24 A TST is a 2-edge-connected graph.
Theorem 4.2.30 A graph is 2-edge-connected if and only if two of its edges lie on a common cycle.
The problem of finding a shortest graph that multiply connects a finite set of points has applications in the study of fault tolerance of networks. We will search for such graphs, but using a weaker condition.
$k$-edge-connected Steiner's Network Problem
Given: A finite set $N$ of points in a metric space $(X, \rho)$ and a positive integer $k$.
Find: A graph $G = (V, E)$ embedded in $(X, \rho)$ such that
(i) $N \subseteq V$;
(ii) For each pair of distinct vertices in $N$ there are $k$ edge-disjoint paths between them25; and
(iii) $L(X, \rho)(G)$ is minimal.
A solution is called a $k$-edge-connected Steiner Minimal Network ($k$-edge-SMN) for $N$. Clearly, a 1-edge-SMN is an SMT for $N$. A 2-edge-SMN is a network of minimal length containing no bridge, and
holds true for each finite set of given points.
23 Note that the degree of each vertex in a $k$-edge-connected graph is at least $k$.
24 Another form of multiple connectedness is the so-called vertex-connectivity: A collection of paths in a graph $G$ joining two vertices $v$ and $v'$ is called independent if any two paths $G_1$ and $G_2$ in this collection have only the vertices $v$ and $v'$ in common. A graph $G$ is said to be $k$-vertex-connected, $k$ a positive integer, if it has at least $k + 1$ vertices and any two distinct vertices of $G$ can be joined by at least $k$ independent paths. Here, of course, parallel edges are not helpful. Note the following theorem taken from [47]:
Theorem 4.2.28 A graph $G = (V, E)$ is $k$-vertex-connected if and only if $G[V \setminus V']$ is connected for any set $V' \subseteq V$ of at most $k - 1$ vertices.
1-vertex-connectivity is the same as 1-edge-connectivity. For $k > 1$ this is not true: each $k$-vertex-connected graph is also $k$-edge-connected, but not vice versa. Briefly, we call a graph $k$-connected if it is $k$-vertex-connected.
Theorem 4.2.29 Let $G$ be a connected graph with at least three vertices. Then the following statements are pairwise equivalent:
(i) The graph $G$ is 2-connected.
(ii) For any two vertices of $G$, there is a cycle containing both.
(iii) For any vertex and any edge of $G$, there is a cycle containing both.
(iv) For any two edges of $G$, there is a cycle containing both.
(v) For any two vertices and one edge of $G$, there is a path containing all three.
(vi) For any three distinct vertices of $G$, there is a path containing all three.
(vii) For any three distinct vertices of $G$, there is a path containing any two of them, which does not contain the third.
25 Note that this condition need not hold for any pair of vertices in $V$! But, obviously, it is satisfied if $G$ is $k$-edge-connected, and, moreover, if $G$ is $k$-vertex-connected.
The edge-connected SMN problem is NP-hard, since it is a generalization of Steiner's Problem.26 An approximation is given by the following observations: An MST is an approximation for a 1-edge-SMN and we have an approximation for a 2-edge-SMN, namely for a TST. Consequently,
Algorithm 4.2.31 Let $N$ be a finite set of points in a metric space. The following algorithm is an approximation for a $k$-edge-SMN:
1. Let $k = 2 \cdot l + \delta$, where $\delta = 0$ or $\delta = 1$;
2. If $l > 0$, then apply algorithm 4.2.26 to find $T_S$ which approximates a TST for $N$;
3. If $\delta \neq 0$, then apply algorithm 1.2.9 to find an MST $T_K$ for $N$;
4. Construct a spanning multigraph for $N$ consisting of $l$ duplications of $T_S$ and $\delta$ copies of $T_K$.
It is easy to verify that this algorithm approximates a $k$-edge-connected spanning network in cubic time. Moreover,
Observation 4.2.32 (Du, Hu, Jia [466]) For the algorithm $A_k$ in 4.2.31,
holds.
26 For the similar problem of constructing a $k$-edge-connected graph from a given network by adding the minimum number of edges see [62] and [429]. About algorithms for minimum-length vertex-connectivity compare [21]. For references for approximation algorithms for the problem of finding highly connected subgraphs of a network and its variations see [258].
To discuss this observation more precisely, we compare the length of a $k$-edge-SMN for $N$ with the length of such a network which does not use Steiner points, a so-called $k$-edge-connected Minimum Spanning Network ($k$-edge-MSN) for $N$:
$$r_k(X, \rho) := \inf \left\{ \frac{L(k\text{-edge-SMN for } N)}{L(k\text{-edge-MSN for } N)} : N \subseteq (X, \rho) \text{ is a finite set} \right\}. \tag{4.57}$$
The quantity $r_k(X, \rho)$ is called the $k$-connected Steiner ratio. Of course,
holds for any metric space $(X, \rho)$. The following facts are essentially deeper:
Theorem 4.2.33 (Du, Hu, Jia [466]) In any metric space $(X, \rho)$ the inequality
holds for $k \geq 2$.
Clearly, for more specific spaces we expect better estimates. And indeed, from [466] and [225], we have the following bounds for $r_k$ ($k$ arbitrary, 2, 3, ...):

Euclidean plane:    $\geq \sqrt{3}/2 = 0.86602\ldots$;  $\leq \ldots = 0.93301\ldots$;  $\ldots = 0.97850\ldots$
Rectilinear plane:  $\ldots = 0.85714\ldots$;  $\ldots = 0.875$

4.2.9 The Bounded Degree Minimum Spanning Tree Problem
Applications often would benefit from having MSTs with low maximum vertex degree. In this sense, we consider methods to solve
The Bounded Degree Minimum Spanning Tree (BDMST) Problem
Given: A finite set $N$ of points in a metric space and an integer $\beta > 1$.
Find: A tree interconnecting the points of $N$ in which each vertex has degree at most $\beta$ and with minimal total length.
A BDMST of maximum degree $\beta$ is called a $\beta$-MST. In particular, for $\beta = 2$ we look for a path of shortest length through the points, which is a computationally hard problem.27 And, indeed, the BDMST problem is NP-hard for any fixed number $\beta > 1$ in general, see Garey and Johnson [179]. For $\beta = 3$ we look for a so-called binary MST. Our investigations will concentrate on spaces in which BDMSTs for several values $\beta$ have a solution which can be found in polynomially bounded time.
Observation 4.2.34 Assume that there exists a number $c' = c'(X, \rho)$ such that the degree of any vertex in an MST in a metric space $(X, \rho)$ is at most $c'$. Then a $\beta$-MST with $\beta \geq c'$ can be found in polynomially bounded time.
Proof. A BDMST can be computed as efficiently as an ordinary MST, since the degree constraints are automatically satisfied.
For specific spaces such a number $c'$ indeed exists. We saw this in the beginning of the present section.28 In particular, we found that for points in the Euclidean plane any MST has degree at most six, and a perturbation argument shows that there always exists an MST with degree at most five. However, it is interesting to consider the construction of trees with even smaller degree bounds, namely between two and five. The complexity of finding a $\beta$-MST in the Euclidean plane and in the plane with rectilinear norm is shown in the following table:
27 This problem is similar to the Travelling Salesman Problem.
28 For finite-dimensional Banach spaces we have
Theorem 4.2.35 (C. [92], [93]) For Banach-Minkowski spaces the quantity $c'$ is at most the Hadwiger number.
Robins and Salowe investigated $\mathcal{L}_p$-spaces more extensively. In general, these numbers grow exponentially in the dimension:
Theorem 4.2.36 (Robins, Salowe [362]) Let $c'(B(p))$ be the maximum degree of a minimum-degree MST over any finite set of points in the $d$-dimensional $\mathcal{L}_p$-spaces. Then this number has the following properties:
C ; ( B ( ~ ) )=
~ ( 2 ~ 1: . ~ ~ ~ ~ ' "
(4.60)
c:i(~(p)) =
a(&.
(4.61)
c&(B(x)) =
2 " 1 - 0 ( ~ ) k ( l / a ( p ) ) + ( l - " ( ~ ) )l d l l ( l - a ( ~ ) ) ) ) )
where a ( p ) = 1/2P,1 2".
< p < x; (4.62)
$\beta$      Euclidean plane    Rectilinear plane
2        NP-hard            NP-complete
3        NP-hard            NP-complete
4        Open               Polynomial
> 4      Polynomial         Polynomial
For a complete proof and additional comments compare [96]. In view of the hardness of finding a $\beta$-MST for $\beta = 2, 3, 4$, approximation techniques are of interest:
$\beta$ =    Performance ratio    Source
The performance ratio is given relative to the length of an ordinary MST.
A more general version of degree-constrained trees is:
The generalized BDMST Problem
Given: A set $N = \{v_1, \ldots, v_n\}$ of (labelled) points in a metric space and a sequence
$$\{\beta_1, \ldots, \beta_n\} \subseteq \{1, 2, \ldots\} \cup \{\infty\} \tag{4.63}$$
of positive integers.
Find: A tree $T = (N, E)$ interconnecting the points of $N$ with minimal length such that no vertex $v_i$ has a degree greater than $\beta_i$, $i = 1, \ldots, n$.
This problem is a generalization of the BDMST problem in which we assume that $\beta_1 = \ldots = \beta_n = \beta$, and of the MST problem in which we assume that $\beta_1 = \ldots = \beta_n = \infty$. It is easy to see that
Observation 4.2.37 A solution of the generalized BDMST problem exists if and only if
$$\sum_{i=1}^{n} \beta_i \geq 2(n - 1).$$
Clearly, we look for an ordinary MST if all $\beta_i = \infty$. More generally, if just one degree constraint for just one point is given, meaning only for one value $j$ is $\beta_j \neq \infty$, the problem is solvable in polynomial time using a "quasi-greedy" algorithm, because this problem is linear-time equivalent to the unconstrained minimum spanning tree problem, see [173], [307]. The general case in which more than one degree is constrained has been shown to be NP-hard [179]. Our well-known methods for computing an MST create a heuristic to find degree-constrained trees with minimal length: In each step where a new edge is added check whether the constraints are satisfied. In general the problem of finding a spanning tree with a bounded number of leaves is NP-complete [179]. We will now see that the number of leaves in a $\beta$-MST for a set of $n$ points cannot be too large. Consider such a tree, then, in view of 1.2.5, we have
where $n_i$ denotes the number of vertices of degree $i$ and $\Delta$ is the maximum degree in the tree. Hence,
Theorem 4.2.38 (C. [96]) In each metric space where a number $c'$ exists, a $\beta$-MST has at most
leaves, where $\Delta = \min\{\beta, c'\}$.
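The heuristic mentioned above, running an MST method and checking the degree constraints whenever an edge is added, might look as follows when built on Kruskal's method. This is a sketch for Euclidean points only; the function name and the convention of passing `float("inf")` for an unconstrained vertex are assumptions of this example, and as a heuristic it gives no guarantee of optimality.

```python
from math import dist


def degree_constrained_kruskal(points, bounds):
    """Kruskal-style heuristic: scan edges by length and skip any edge that
    would close a cycle or push an endpoint above its degree bound.

    Returns a list of edges (i, j), or None if the scan ends without a
    spanning tree (the heuristic can fail even when a feasible tree exists).
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    degree = [0] * n
    tree = []
    for _, i, j in edges:
        if degree[i] < bounds[i] and degree[j] < bounds[j] and find(i) != find(j):
            parent[find(i)] = find(j)         # merge the two components
            degree[i] += 1
            degree[j] += 1
            tree.append((i, j))
    return tree if len(tree) == n - 1 else None
```

For a centre point with four neighbours at unit distance, unconstrained bounds give the star, while `bounds = [2] * 5` forces two of the outer points to be attached through longer edges.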
4.2.10 Small diameters and spanners
The previous problems have all been based on the length of the tree constructed. We now consider another criterion for the quality of a tree:
The Minimum Diameter Spanning Tree Problem
Given: A finite set $N$ of points in a metric space.
Find: A tree $T = (N, E)$ interconnecting the points of $N$ such that
$$\operatorname{diam} T = \max_{v, v' \in N} L(T(v, \ldots, v')) = \text{Min!} \tag{4.66}$$
A solution is called a Minimum Diameter Spanning Tree (MDST) for $N$. It is easy to see that a typical MST can have large diameter. But surprisingly, it can be shown that there exists an MDST such that the longest path in the tree consists of no more than three edges:
Lemma 4.2.39 (Ho et al. [220]) For any set of given points in the Euclidean plane there is an MDST in which there are at most two internal vertices.
Proof. Let $T = (N, E)$ be an MDST. We perform a sequence of diameter-preserving transformations until it is in the above form. Let
be a longest path in $T$. Consider the forest $G = (N, E \setminus E')$. For each vertex $v$ of $T'$, let $T_v$ denote the tree in $G$ containing $v$. For any other vertex $u$, let $P_u$ denote the vertex $v$ such that $u$ is in $T_v$. Then we construct a new tree
$$T_1 = T' \cup (N, \{\underline{uP_u}\}). \tag{4.68}$$
$T_1$ has the same diameter as $T$, since the distance between any two vertices cannot increase except when they are in the same tree $T_v$, and in that case, by the assumption for $T'$, the distance of each point to $v$ is less than the distance from $v$ to the endvertices of $T'$. Now suppose that $T'$ has four or more edges and the length of the path $v_1, v_2, v_3$ is at most half the length of $T'$. Form a tree $T_2$ by removing every edge $\underline{uv_2}$ and reconnecting each such vertex $u$ to $v_3$. This can only decrease the lengths of paths already going through $v_3$ in $T_1$. So the only pairs of vertices with increased path lengths are those newly connected to $v_3$. But the length of any such path is at most twice the length of the path $v_1, v_2, v_3$, so the diameter of $T_2$ is no more than that of $T$. Each repetition of this transformation decreases the number of edges in $T'$ until it is at most three, and preserves the property that each vertex is within one edge of $T'$.
Consequently,
Theorem 4.2.40 (Ho et al. [220]) In the Euclidean plane, we can find an MDST in cubic time.
The problem of finding a spanning tree whose length and diameter are both minimal is NP-hard, see Ho et al. [220].
A related problem: Let $t > 1$ be a real number. Consider a finite set $N$ of points in a metric space $(X, \rho)$. We intend to design a graph $G = (N, E)$ that approximates the complete graph $(N, \binom{N}{2})$ with length-function $f: \binom{N}{2} \to \mathbb{R}$, $f(\underline{vv'}) = \rho(v, v')$, in the following sense:
1. $|E| = O(|N|)$. In particular, this is satisfied if $G$ is a planar graph.
2. For each pair $v, v' \in N$ there is a shortest path $T(v, \ldots, v')$ in $G$ that connects the vertices $v$ and $v'$, and it holds that
$$L(T(v, \ldots, v')) \leq t \cdot \rho(v, v'). \tag{4.69}$$
Such a network is called a $t$-spanner for $N$. The existence of $t$-spanners in each Banach-Minkowski ... 1, and let $t > 1$ be a given real number. Then there exists a number $c = c(M_d(B), t)$ such that each finite set $N$ of points in $M_d(B)$ has a $t$-spanner with at most $c \cdot |N|$ edges. An improved version of this theorem gives a procedure to construct $t$-spanners. The procedure is the following greedy algorithm:
Procedure 4.2.42 A network $G = (V, E, f)$ and a real number $t > 1$ are given.
1. Sort the edges in $E$ in non-decreasing order of the lengths;
2. Let $E' := \emptyset$; $G' := (V, E')$;
3. For each edge $\underline{vv'}$ from the sorted list of $E$ do: if $\rho_{G'}(v, v') > t \cdot f(\underline{vv'})$ then $E' := E' \cup \{\underline{vv'}\}$ and $G' := (V, E')$;
4. Stop when all edges are checked.
Then $G' = (V, E')$ is a $t$-spanner of $G$. This procedure needs $O(n^3 \log n)$ time. For a faster method to construct spanners in the $\mathcal{L}_p$-spaces see Chandra et al. [76], [77]. Spanners in Euclidean spaces are discussed by Salowe [376]. An application of our considerations in networks is given by Prisner [333]: of course, if $T$ is a spanning tree for the graph $G$, we have $\rho_G(v, v') \leq \rho_T(v, v')$ for all vertices $v$ and $v'$. Prisner constructs spanning trees with $\rho_T(v, v') \leq t \cdot \rho_G(v, v')$ for specific classes of networks and numbers $t$.
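The greedy spanner procedure can be sketched for the complete graph on a finite set of points in the Euclidean plane. Distances in the current spanner $G'$ are computed with Dijkstra's method, cut off at $t$ times the length of the edge under test; the name `greedy_spanner` and the restriction to Euclidean input are assumptions of this example.

```python
import heapq
from math import dist


def greedy_spanner(points, t):
    """Greedy t-spanner of the complete Euclidean graph on `points` (t > 1)."""
    n = len(points)
    adj = {i: [] for i in range(n)}            # adjacency lists of the spanner G'

    def spanner_distance(src, dst, cutoff):
        # Dijkstra in G', ignoring paths longer than `cutoff`
        seen = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > seen.get(u, float("inf")):
                continue                       # stale heap entry
            for v, w in adj[u]:
                nd = d + w
                if nd <= cutoff and nd < seen.get(v, float("inf")):
                    seen[v] = nd
                    heapq.heappush(heap, (nd, v))
        return float("inf")

    # step 1: edges in non-decreasing order of length
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    kept = []
    for length, i, j in edges:
        # step 3: keep the edge only if G' has no path of length <= t * length
        if spanner_distance(i, j, t * length) > t * length:
            adj[i].append((j, length))
            adj[j].append((i, length))
            kept.append((i, j))
    return kept
```

For four equally spaced points on a line and $t = 1.5$, only the three edges between neighbouring points are kept, since every longer edge is already matched exactly by a path in $G'$.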
A N E W CHALLENGE: THE PHYLOGENY
As it became accepted that evolution was to be understood in terms of Mendelian genetics and Darwinian natural selection, so too it became clear that this understanding could not be sought only at a qualitative level. A fundamental problem is the reconstruction of species' evolutionary past, which is called the phylogeny of those species. Trees are widely used to represent evolutionary relationships. In biology, for example, the dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by the splitting of one population into two or more populations that do not crossbreed, rather than from the mixing of two populations into one. Here, the high level history of life is ideally organized and displayed as a tree. A phylogenetic tree is an evolutionary tree for a given set of taxa.1 Trees may also be used to classify individuals of the same species. In historical linguistics, trees have been used to represent the evolution of languages, while in the branch of philology known as stemmatology, trees may represent the way in which different versions of a manuscript arose through successive copying. Often trees are used to describe the relatedness of objects which have developed from a common ancestor. In [222] we find an evolutionary tree showing the architectural connections and influences during the development of parallel computers from the early 1950s; in [360] we see a tree showing the history of the common computer languages. We will discuss the problem of reconstruction of phylogenetic trees in our sense of shortest connectivity. To do this we introduce so-called phylogenetic spaces. These are metric spaces whose points are arbitrary words generated by letters (or symbols) from some (finite) alphabet, and whose metric measures "sameness" of words according to some cost measure on the letters, or a similarity of the words generated by a scoring system.
1 Such a tree may be called a "phylogeny", a "dendrogram", or a "cladogram". We will define phylogenetic trees more precisely in the following sections.
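One concrete way to measure the "sameness" of two words is the edit distance: the minimum total cost of substitutions, insertions and deletions that turn one word into the other. The sketch below is only an illustration of such a word metric, not the definition of phylogenetic spaces used later; the function name and the default unit costs are assumptions. With unit costs this is the Levenshtein distance, which is a metric on words.

```python
def edit_distance(a, b, sub_cost=lambda x, y: 0 if x == y else 1, indel_cost=1):
    """Minimum total cost of substitutions, insertions and deletions turning a into b.

    `sub_cost` is a (hypothetical) cost measure on pairs of letters and
    `indel_cost` the cost of inserting or deleting one letter.
    """
    # prev[j] = cost of turning the first i-1 letters of a into the first j letters of b
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + indel_cost,             # delete ca
                            curr[j - 1] + indel_cost,         # insert cb
                            prev[j - 1] + sub_cost(ca, cb)))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, `edit_distance("ACGT", "AGT")` is 1 (one deletion). A different cost measure on the letters can be supplied through `sub_cost`; whether the result is still a metric then depends on that cost measure.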
5.1 PHYLOGENETIC TREES
Nothing in biology makes sense except in the light of evolution.
Theodosius Dobzhansky
The most surprising application of Steiner's Problem is in the area of phylogenetics. Trees are widely used to represent evolutionary, historical, or hierarchical relationships in various fields of classification. The underlying principle of phylogeny is to try to group "living entities" according to their level of similarity. In biology for example, such trees ("phylogenies") typically represent the evolutionary history of a collection of extant species or the line of descent of some gene. No two members of a species are exactly the same - each has slight modifications from their parents. As environmental conditions change, nature will favour that branch of a species with some particular modification; as time goes on another mutation of the basic stock will become dominant. In this way, all species are continually evolving. This evolution occurs in a number of ways at the same time: some species die out and some become new species in their own right. This was already seen by Darwin [120]. He recognised that the characteristics which identified the species could indicate a history of descent, that is, a tree of evolution. Darwin wrote:
The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green and budding twigs may represent existing species; and those produced during each former year may represent the long succession of extinct species ... The limbs divided into great branches, and these into lesser and lesser branches, were themselves once, when the tree was small, budding twigs; and this connexion of the former and present buds by ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups ...
From the first growth of the tree, many a limb and branch has decayed and dropped off, and these lost branches of various sizes
may represent those whole orders, families, and genera which have now no living representatives, and which are known to us only from having been found in a fossil state ... As buds give rise by growth to fresh buds, and these, if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications.
Historically, this was a new idea: The concept of species having a continuity through time was only developed in the late 17th century; higher life forms were no longer thought to transmute into different kinds during the lifetime of an individual. It took over 150 years from the development of this concept before a rooted tree was proposed by Darwin. Note that in Darwin's fundamental book The origin of species [120] there is exactly one figure, and this shows the description of the evolutionary history by a tree. In other words, Darwin means that his theory of evolution, today called Darwinism, implies the existence of an evolutionary tree. The phylogenetic tree can therefore be thought of as a central metaphor for evolution, providing a natural and meaningful way to order data, and with an enormous amount of evolutionary information contained within its branches. Clearly, this idea is attractive, but how are we to find the tree? Note that there are several difficulties, even in the definition of the problem: What is the tree of life? A tree which is given by a classification or the evolutionary tree? What is the mechanism of evolution? Darwin provided mutation and natural selection, which suggested a scientific model for the relation of species. Darwin's evolutionary tree is neither obvious, nor easy to find.
There must be some criterion for deciding which of the many phylogenies that may be drawn most closely resembles the actual evolutionary changes. Darwin saw another difficulty in the underlying problems. In a letter to Huxley he wrote: "The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature."
Considering the origin of life: Was there just one, or more than one "starting point"? What do we know about the last universal common ancestor, if it exists? It has been argued that the "Tree of Life" is perhaps really a "Web of Life", as mechanisms such as hybridization, recombination and swapping of genes probably play a role in evolution.
A nice representation of this subject has been given by Davies [122], Pennisi [336], and Ward and Brownlee [443]. A survey about What Evolution Is was given by Mayr [301]. For the history of Darwin's theory compare Bowler [54] and Weber [450]. Each species can be described in terms of a sequence of specific values, called characters. These characters were originally morphological, that is, derived from an analysis of an organism's form and structure, but how are these values measurable? In biology, "characters" describe attributes of the species under consideration and are the data that biologists typically use to reconstruct phylogenetic trees. We wish to consider characters for species in a morphological sense. To do this we assume that there is given a (finite or infinite) state space $C$ of characters. We also assume that there is a metric in $C$. Discrete character data are those for which a function $f$ assigns a character state $f_{ij}$ to each taxon $i$ for each character $j$. The most important problem in morphological phylogenetics is selecting the characters. Here opposing side picking out is the favourite method. On the other hand, characters must be coded if there are more than two distinct possibilities. We think of characters as independent variables. This assumption is common to virtually all character-based methods. If we could not assume independence, we would be forced to take covariance among characters into account, and the computational methods would by necessity become more complicated. Another assumption required of character data is that the characters be homologous, that means that a character must be defined in such a way that all of the states observed over taxa for that particular character must have been derived from a corresponding state observed in the common ancestor of those taxa. As sequence data became readily available, an end to this conflict was predicted.
Now, the biological units are written in words constructed from the letters corresponding either to amino acids, which generate proteins, or to nucleotides forming DNA or RNA molecules. By comparing such words one can construct
evolutionary (phylogenetic) trees showing how closeness of the words in the tree corresponds to the closeness of the unit. In other words,
The Phylogenetic Tree Problem
Given: A set of sequences, each representing a taxon.
Find: Their phylogenetic tree, representing their evolutionary history.
The set of leaves represents the given taxa, the internal vertices are the ancestors, and the root of the tree represents the common ancestor of all. The phylogenetic tree of life shows when groups of organisms arose and gives the basic relationships between them. Molecular sequence data was first used by Fitch and Margoliash in their landmark paper [161] from 1967 dealing with cytochrome c sequences. The basic idea in that field is that species (given by their sequences) which appear to be closely related should have diverged more recently than species which appear to be less closely related. To find such a phylogenetic tree we construct a metric space which forms a model for the phylogeny. More precisely, Bern and Graham [45]:
David Sankoff of the University of Montreal and other investigators defined a version of the Steiner problem in order to compute plausible phylogenetic trees. The workers first isolate a particular protein that is common to the organism they want to classify. For each organism they then determine the sequence of the amino acids that make up the protein and define a point at a position determined by the number of differences between the corresponding organism's protein and the protein of other organisms. Organisms with similar sequences are thus defined as being close together and organisms with dissimilar sequences are defined as being far apart. In a shortest network for this abstract arrangement of given points, the Steiner points correspond to the most plausible ancestors; and edges correspond to relations between organisms and ancestor that assume the fewest mutations.
The latter remark explains the importance of trees having the least possible length in phylogenetic spaces for evolutionary relation investigation.
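The role of a Steiner point as a plausible ancestor can be seen in a toy case. The sketch below assumes equally long, already aligned sequences compared by the number of differing positions (the Hamming distance) and a star-shaped tree with one hypothetical ancestor; under these assumptions a column-wise most frequent letter minimizes the total number of differences to the given sequences. The function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def median_sequence(seqs):
    """Column-wise most frequent letter: a candidate ancestor for a star tree."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


seqs = ["AAC", "ACA", "CAA"]
ancestor = median_sequence(seqs)
star_length = sum(hamming(ancestor, s) for s in seqs)
```

Here the three given words are pairwise at distance 2, so any tree that joins them directly has length 4, whereas the star through the ancestor `AAA` has length 3: the added point shortens the network.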
This approach to Evolution Theory was suggested first by Fitch [162] in 1971, and also explicitly written by Foulds et al. [170], [395] in 1979. Unfortunately, this idea does not give a simple method.² Again, Bern and Graham [45]:

Since the phylogenetic Steiner problem is no easier than other Steiner problems, however, the problem - except as it is applied to small numbers of organisms - has served more as a thought experiment than as a practical research tool.

In other words, reliable tree building algorithms do not (yet) exist. On the other hand, for specific questions, examples, and investigations this approach will be helpful. Hence, it seems impossible to describe the "Great Darwin Tree" since the diversity of the living world is staggering: more than two million existing species of plants and animals have been named and described; many more - both existing and past - remain to be discovered. On the other hand, it will be useful to describe the phylogeny between several organisms by their DNA sequences taken from their genomes. On this topic Vingron et al. [436] wrote

Many similar DNA sequences from different species have common ancestors in evolution. The relationship among sequences are described by a phylogenetic tree. Phylogenetic trees do not merely allow for an exact classification of life forms, but also give hints to yet unknown properties of organisms, as well as insight into mechanisms of evolution. This holds true even for comparatively short periods of time, for example the evolution of the HIV virus. ... The notion of a Steiner Tree subsumes both tree topology and multiple alignment. In a graph that has biological sequences as nodes, edges represent evolutionary operations that modify sequences. This view of the problem ... unifies two optimization steps that are commonly treated separately - the Multiple Alignment and the Parsimony problem. By treating the two problems at the same time one can hope for better results in terms of the simplicity of the resulting tree.
² And seems to have been rather forgotten in the field of biology after tree-building program packages became widely available.

The principle of Maximum Parsimony involves the identification of a combinatorial structure that requires the smallest number of evolutionary changes. It is often said that this principle abides by Ockham's razor, according to which the best hypothesis is the one requiring the smallest number of assumptions. Or in other words:
(a) It is futile to do with more what can be done with fewer.
(b) More precisely in Latin: Entia non sunt multiplicanda praeter necessitatem.
(c) More roughly spoken: Keep it simple.
This is true, but not in a simple sense. Cavalli-Sforza [72]:
... it does not necessarily follow that a method of tree reconstruction minimizing the number of mutations is the best or uses all the information contained in the sequences. The minimization of the number of mutations is intuitively attractive because we know that mutations are rare. There may be some confusion, however, between the advantage of minimizing the number of mutations and sometimes invoked parallel of Ockham's razor ..., which was developed in the context of medieval theology. The extrapolation of Ockham's razor to the number of mutations in an evolutionary tree is hardly convincing.

Note that in this case minimizing the number of assumptions does not mean minimizing the number of mutations, or the steps of an evolution; it means that among all possible network structures we seek one which satisfies only few conditions. With the "razor", Ockham cuts out all superfluous, redundant explanations. As a conclusion, we find that Steiner Minimal Trees in sequence spaces are Maximum Parsimony Trees. And in this sense, we will investigate Steiner's Problem in spaces of sequences equipped with any desired chosen metric.⁴ It means that among all possible structures we seek one which satisfies only one condition, namely that of minimal length. What other condition can be more natural in a metric space? For the biological background and a more detailed discussion of these problems see Graur and Li [191], v. Haeseler and Liebers [202], and Page and Holmes [331]. In particular, a broader discussion of the application of the principle of Maximum Parsimony can be found in Farris and Kluge [158].

³ For a broader philosophical discussion of Ockham's razor see Brown [57] and Russel [371], [373].
⁴ Note that parsimony does not point to the root of the tree. To find the root, we need additional information.
Note that this approach to describing the evolutionary history has a deep consequence for the following question: Is evolution a scientific theory? On this topic, Hendy [214] recalls:
I began a mathematical study into evolution, after attending a debate, at Massey University in 1973, between a creationist and a local scientist, on the Theory of Evolution. The creationist made reference to the work of the philosopher of scientific process, Karl Popper. Popper [350] had stated that "Darwinism is not a testable scientific theory, but a metaphysical research program - a possible framework for testable scientific theory". I discussed this issue with a colleague at Massey University, David Penny, who had a research interest in molecular evolution. David suggested a mechanism that might provide a testable hypothesis that we could apply to the theory of evolution to meet Popper's criterion for a scientific theory. We succeeded in this quest [339], using the tree building method of "Maximum Parsimony" to derive evolutionary trees from a number of independent protein sequences, for a common set of mammalian species. We then compared the resultant trees.

Compare also Penny, Hendy and Poole [342]. Moreover, in this sense, each organism is an experiment for the hypothesis of biology, in particular, of evolution. The principle of Ockham's razor suggests that one should choose the simplest possible hypothesis. For more facts about the denial of the theory of evolution compare Pigliucci [344].

Note an essential difference in the application of Steiner's Problem in engineering and in biology. In the first case we search for a tree which is as short as possible: an approximation may be acceptable. In the second case we look for the shortest tree (or all shortest trees); i.e., we are interested in an exact solution. Here, an approximation gives only an upper bound for the length of an SMT. Moreover, the idea grew out of an investigation into the accuracy of an SMT. It is not possible to directly test the "accuracy" of such a tree-building method, as the "true evolutionary tree" is not, and in general cannot be, known with certainty.⁵

⁵ For example, consider the phylogenetic tree for Darwin's finches in [188].
5.2 PHYLOGENETIC SPACES
Einstein said: "God does not play dice." He was right. God plays scrabble.
Philip Gold

We will introduce metric spaces which are of interest to describe the genetic data in evolutionary processes. Here the input data is a set of sequence information. The sequence information is usually DNA, RNA or protein sequences. In more detail:
DNA sequences are the information-containing molecules and are composed of nucleotides from an alphabet of four letters.⁶ The DNA of an organism plays a central role in its existence. Its sequential arrangement forms chromosomes. These strings may be millions of nucleotides long, measured in base pairs (bp). The entire set of genetic information of an organism is called its genome. Fitch [163] gives the following exemplary genome sizes:

Domain      Organism  Size (bp)
Viruses     HIV       9 · 10^3
Bacteria    E. coli   4 · 10^6
Eukaryotes  mammals   3 · 10^9

Roughly speaking, the order of genome size is kbp, Mbp and Gbp for Viruses, Prokarya and Eukarya, respectively.

Proteins, which are the operational molecules, are composed of amino acids from an alphabet of 20 letters. Typical proteins contain about 300 amino acids (aa), but there are proteins with fewer than 100 or as many as 5000 aa. Structural proteins act as tissue building blocks, whereas other proteins known as enzymes act as catalysts of chemical reactions.

RNA sequences, which stand between DNA and protein, are composed of nucleotides from an alphabet of four letters.

⁶ The informational aspect combined with the massive parallelism and the complementarity in the double strand present the possibility of a computing paradigm which is rather different from those customary in present-day computer science. For a survey about this "DNA Computing" see Păun et al. [358].
(It is remarkable that the molecules which are the carriers of information and the operational units which make life work are all linear polymers.) The Central Dogma of Molecular Biology⁷ describes the roles of these polymers: DNA acts as a template to replicate itself, DNA is also transcribed into RNA, and RNA is translated into protein. So we start our investigations with spaces of these sequences (strings) reflecting the "written nature of life".
5.2.1 Alphabets and words
An alphabet $A$ is a nonempty and finite set of distinguished letters (or symbols). If $A$ contains exactly one letter, all further discussed concepts and problems are senseless or trivial, respectively. Hence, we assume that $A$ contains at least two elements. If $A$ contains exactly two letters it is called a binary alphabet. Important examples of alphabets are:

$A = \{0, 1\}$ is an alphabet which plays a central role in coding theory. Moreover, we consider a word of 0's and 1's as a description of some individual, perhaps a genetic sequence in which each entry may take on one of two possible values.

$A = \{a, c, g, t\}$ is the alphabet which codes the nucleotides of a DNA molecule, where $a$ stands for adenine, $c$ for cytosine, $g$ for guanine and $t$ for thymine. A similar alphabet, namely $A = \{a, c, g, u\}$, is used for the nucleotides of RNA, where $u$ codes for uracil. Derived from this alphabet there is a binary alphabet $A' = \{r, y\}$ in which $r$ codes for a purine ($a$ or $g$), and $y$ codes for a pyrimidine ($c$ or $t$).

The amino acids commonly found in proteins are coded by the alphabet $A = \{ala, arg, \ldots, val\}$, where the letters abbreviate the amino acids alanine, arginine, ..., valine. In the usual genetic code $|A| = 20$ amino acids are coded.

The English language needs 26 letters: A, B, ..., Y, Z, and a letter for the empty space. German needs several letters more: Ä, Ö, Ü, ß.

⁷ Sometimes also called "The Holy Trinity of Molecular Biology".
A word over an alphabet $A$ is a finite sequence of letters from $A$. The length $|w|$ of the word $w$ is the number of letters composing it. We additionally define an empty word $\lambda$ of length 0. Note that the description of a word contains a left-to-right order of the letters. We will write $w = a_1 a_2 \ldots a_d$ for a word $w$ consisting of the letters $a_1, a_2, \ldots, a_d$ in this order; or using the notions for algorithms, $w = a[1]a[2] \ldots a[d]$ for a one-dimensional array; then we will also speak about sequences or strings. The letter $a_i = a[i]$ in the word, sequence respectively, is called the $i$-th position. We say that two words $w = a_1 a_2 \ldots a_d$ and $w' = b_1 b_2 \ldots b_{d'}$ over the same alphabet are equal, and we write $w = w'$, if $d = d'$ and $a_i = b_i$ for all $i = 1, \ldots, d$.

Let $w = a_1 a_2 \ldots a_d$ and $w' = b_1 b_2 \ldots b_{d'}$ be two words over the same alphabet $A$. The concatenation of $w$ and $w'$, written $ww'$, is the word $a_1 a_2 \ldots a_d b_1 b_2 \ldots b_{d'}$ over $A$. Hence, $|ww'| = |w| + |w'|$. Moreover, we will write $w^k = \underbrace{w \ldots w}_{k\text{-times}}$ and $w^0 = \lambda$ for each word $w$.

The set $A^d$ contains all words over $A$ with length exactly $d$. Clearly, $A^0 = \{\lambda\}$, $A^1 = A$, and
$$|A^d| = |A|^d. \qquad (5.1)$$
$A^{\le d}$ denotes the set of all words of length at most $d$; and we have
$$|A^{\le d}| = \sum_{i=0}^{d} |A|^i = \frac{|A|^{d+1} - 1}{|A| - 1}. \qquad (5.2)$$
In particular, when a set of words contains only words of a predetermined bounded length, then this set is finite. More about the combinatorics of words can be found in [288] and [296].

If there is an order $\le$ of the letters in $A$, then the set $A^d$ is endowed with the following partial order $\le_H$, where $w = a_1 a_2 \ldots a_d$ and $w' = b_1 b_2 \ldots b_d$:

1. $w \le_H w'$ if and only if $a_i \le b_i$ for all $i = 1, \ldots, d$; and
2. $w <_H w'$ if and only if $w \le_H w'$ and $w \ne w'$.

The set
$$A^* = \bigcup_{d \ge 0} A^d \qquad (5.3)$$
contains all words over the alphabet $A$. In any case, this is an infinite, but countable, set.⁸ The set $A^*$ equipped with concatenation as a binary operation forms a semigroup which contains the unity $\lambda$.

If there is an order $<$ of the letters in $A$, the set $A^*$ is endowed with the following linear order $<_L$ of the words, which is called the lexicographic order:⁹ For two words $w = a_1 a_2 \ldots a_d$ and $w' = b_1 b_2 \ldots b_{d'}$ we define $w <_L w'$ if either

1. $d < d'$ and $a_1 = b_1, \ldots, a_d = b_d$; or
2. $a_1 = b_1, \ldots, a_k = b_k$ for $k < d, d'$ and $a_{k+1} < b_{k+1}$.
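The two orders just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are modelled as Python strings, the order of the letters is taken to be the usual character order, and the function names `lex_less` and `hamming_leq` are hypothetical.

```python
def lex_less(w, v):
    """Return True if w <_L v in the lexicographic order on words."""
    k = 0
    # k becomes the length of the longest common prefix of w and v.
    while k < len(w) and k < len(v) and w[k] == v[k]:
        k += 1
    if k == len(w):
        # Case 1: w is a prefix of v; the inequality is strict only if v is longer.
        return len(w) < len(v)
    if k == len(v):
        # v is a proper prefix of w, hence v <_L w and not the other way round.
        return False
    # Case 2: the first differing position decides.
    return w[k] < v[k]


def hamming_leq(w, v):
    """Return True if w <=_H v, the componentwise order on words of equal length."""
    if len(w) != len(v):
        raise ValueError("the order <=_H compares words of the same length only")
    return all(a <= b for a, b in zip(w, v))


# Sample words over {a, c, g, t}, including the empty word.
WORDS = ["t", "ac", "", "acg", "act", "a"]
```

For strings, `lex_less` agrees with Python's built-in comparison `<`, whereas `hamming_leq` is only a partial order: the words `ag` and `ca` are incomparable.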
Any subset of $A^*$ is called a language over $A$. The set of all languages over an alphabet is not countable, since it is the power set of $A^*$. It is usual to classify languages which are generated by a finite grammar in the sense of the Chomsky hierarchy. For biological applications we use

Type  Class              Abbreviation  Application
1     Context-dependent  cs            Protein tertiary structure
2     Context-free       cf            Protein secondary structure
                                       Palindromic DNA structure
3     Regular                          Motifs and profiles

For a survey about Chomsky languages see Păun et al. [358]. In the strict sense all practical sequence spaces are languages of type 3, because they are finite.¹⁰ But this fact is not helpful, since the generating grammar is too big to be applicable. It is necessary to understand the phylogeny in the sense of cf-grammars.

⁸ To see the countableness, first count the word $\lambda$, then the members of $A$ itself, then the words of length 2, and so on.
⁹ Note that this order is fundamentally different from the order which we used to count $A^*$.
¹⁰ The finiteness comes from the finiteness of the real world; compare [184] and [316].
5.2.2 Two well-known examples
There are two well-known approaches for turning sets of words into metric spaces. First, the so-called sequence space, defined as follows: Let $d$ be a positive integer. We define the Hamming distance $\rho_H$ over $A^d$ as
$$\rho_H((a_1, \ldots, a_d), (b_1, \ldots, b_d)) = |\{ i : a_i \ne b_i \text{ for } i = 1, \ldots, d \}|. \qquad (5.4)$$
Then $(A^d, \rho_H)$ is a finite metric space. This space plays an important role in coding theory, namely in the sense of error-correcting codes. More precisely: If (code-)words are transmitted then it is possible that errors will arise and so the received words may differ from those that were sent. The basic idea behind an error-correcting code is to choose the words to be sufficiently different from each other so that even if some error in transmission occurs, each received word is closer to the transmitted word than to any other. This is the concept of distance between words.¹¹ Compare Hankerson et al. [208] or Schulz [388] for a common description of information and coding theory, and Casti [66] or Yockey [470] for its application in molecular biology.
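The idea of decoding a received word to the closest code-word can be made concrete with a minimal sketch. The code `CODE` below is a hypothetical example chosen so that any two code-words differ in at least three positions; the helper names `hamming`, `decode` and `flip` are likewise only illustrative.

```python
def hamming(w, v):
    """Number of positions in which the equal-length words w and v differ."""
    if len(w) != len(v):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(a != b for a, b in zip(w, v))


def decode(received, code):
    """Nearest-codeword decoding: the code-word closest to the received word."""
    return min(code, key=lambda cw: hamming(received, cw))


def flip(word, i):
    """Return the binary word with the digit at position i flipped."""
    return word[:i] + ("1" if word[i] == "0" else "0") + word[i + 1:]


# Hypothetical binary code; any two code-words differ in at least 3 positions.
CODE = ["00000", "11100", "00111", "11011"]
```

With this code, every single-digit transmission error is repaired: `decode(flip(cw, i), CODE)` returns `cw` again for each code-word `cw` and each position `i`.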
$(A^d, \rho_H)$ is also a graph, realized by defining $A^d$ as the set of vertices and making two vertices $w, w' \in A^d$ adjacent if and only if $\rho_H(w, w') = 1$. A specific space is the so-called $d$-dimensional hypercube
$$Q_d = (\mathbb{B}^d, \rho_H), \qquad (5.5)$$
where $\mathbb{B} = \{0, 1\}$. That is the graph whose set of vertices consists of all binary vectors of size $d$, with an edge joining two vectors if and only if they differ in exactly one coordinate.¹² The hypercube has the following properties:

(a) $Q_d$ has $2^d$ vertices and $d \cdot 2^{d-1}$ edges;
(b) it is a bipartite graph;
(c) each vertex in $Q_d$ has degree $d$;
(d) the diameter of $Q_d$ equals $d$, and for a given vertex $w$ there is a unique vertex $w'$ with $\rho_H(w, w') = d$; and
(e) we may also define $Q_d$ inductively by letting $Q_0$ be a single vertex and then obtaining $Q_d$ by taking two copies of $Q_{d-1}$ and joining corresponding vertices.¹³

¹¹ For example, when for any two different code-words $w$ and $w'$ of a binary code the inequality $\rho_H(w, w') \ge 2t + 1$, with a desired chosen integer $t$, holds, then the code can correct errors affecting up to $t$ binary digits.
¹² This is the specific form of the fact that each finite metric space is essentially a graph.

The metric space $(A^d, \rho_H)$ has a strange property: on one hand, it is a "big" space, since it contains $|A|^d$ many points; on the other hand, it is a "small" space, since its diameter equals $d$:
For some deep consequences of this observation for molecular evolution see Eigen [153].

Now consider the set $A^*$ of all words over the alphabet $A$. The edit distance $\rho_L$ between two words of not necessarily equal length is the minimal number of "edit operations" required to change one word into the other, where an edit operation is a deletion, insertion, or substitution of a single letter in either word. This distance is also called Levenshtein distance, since it was introduced by Levenshtein [284] in connection with error correcting codes. As an example consider the two German words $w$ = APFEL and $w'$ = PFERD, where we have $\rho_L(w, w') = 3$. It is not difficult to see that the problem to compute the Levenshtein distance between two words $w$ and $w'$ is solved by a serial algorithm in $O(|w| \cdot |w'|)$ time, through dynamic programming. We will discuss this method more precisely and generally later.

The set $A^*$ equipped with the Levenshtein distance is called the phylogenetic space (over $A$). $(A^*, \rho_L)$ is an infinite discrete metric space. More precisely: let $w$ and $w'$ be two words in $(A^*, \rho_L)$, $|A| \ge 2$. Then
$$\max\{|w|, |w'|\} \ge \rho_L(w, w') \ge \big| |w| - |w'| \big|. \qquad (5.7)$$
In particular, the second inequality implies that any bounded set of words is a finite set.

At first glance, it seems that the sequence spaces are subspaces of the phylogenetic space, but this is not true: Consider the two words $v = (ab)^d$ and $w = (ba)^d$; then $\rho_L(v, w) = 2$ but $\rho_H(v, w) = 2d$.

To extend the Hamming distance to a metric for all words we may use the following way: Let $A$ be a set of letters. Add a "dummy" letter "-" to $A$. We define a map
$$cl : (A \cup \{-\})^* \to A^* \qquad (5.8)$$
deleting all dummies in a word from $(A \cup \{-\})^*$. Then for two words $w$ and $w'$ in $A^*$ we define the extended Hamming-distance as
$$\min\{ \rho_H(v, v') : v, v' \in (A \cup \{-\})^*,\ |v| = |v'|,\ cl(v) = w,\ cl(v') = w' \}. \qquad (5.9)$$

¹³ In view of this fact we have that $Q_d$ must be Hamiltonian; compare [185].
Observation 5.2.1 The extended Hamming-distance coincides with the Levenshtein metric.

In one sense, the phylogenetic space is of interest in "pure" network design. Remember that we showed that the Steiner ratio of any metric space is at least 0.5, but we did not describe a space with this value as the Steiner ratio. To determine the Steiner ratio of the phylogenetic space, consider the words $w_i$ which consist of the letter $a$ repeated $d$ times, except the $i$-th position where another letter $b$ is located, $i = 1, \ldots, d$. Then define the set
$$N(d) = \{ w_i : i = 1, \ldots, d \} \qquad (5.10)$$
of $d$ points. For $i \ne j$ it holds that $\rho_L(w_i, w_j) = 2$. Hence,
$$L(\text{MST for } N(d)) = 2(d - 1). \qquad (5.11)$$
The word $w = a \ldots a$ has distance 1 to any $w_i$. Consequently, the star with the center $w$ and the leaves $w_i$, $i = 1, \ldots, d$, is an SMT for $N(d)$ for which
$$L(\text{SMT for } N(d)) = d. \qquad (5.12)$$
Both equations (5.11) and (5.12) give
$$\frac{L(\text{SMT for } N(d))}{L(\text{MST for } N(d))} = \frac{d}{2(d - 1)} \qquad (5.13)$$
for $d \ge 2$. Since this quotient tends to $1/2$ as $d$ grows, we have found a metric space whose Steiner ratio achieves the lower bound 0.5:

Theorem 5.2.2 For the Steiner ratio of the phylogenetic space $(A^*, \rho_L)$, $|A| \ge 2$, it holds that
$$m(A^*, \rho_L) = \frac{1}{2}. \qquad (5.14)$$

Note that we don't have a finite set $N_0$ of points such that
$$\frac{L(\text{SMT for } N_0)}{L(\text{MST for } N_0)} = \frac{1}{2}, \qquad (5.15)$$
and, moreover, in view of 3.5.5, we cannot find such a set.¹⁴
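The quantities used in this section can be checked numerically with a short sketch. It is an illustration only: `levenshtein` is the textbook dynamic-programming scheme with unit costs, `star_set` builds the words of $N(d)$ over the letters `a` and `b`, and `mst_length` is Prim's algorithm on the complete distance graph; all three names are hypothetical.

```python
def levenshtein(w, v):
    """Edit distance between w and v with unit costs, by dynamic programming."""
    prev = list(range(len(v) + 1))            # empty prefix of w against v[:j]
    for i, a in enumerate(w, start=1):
        curr = [i]                            # w[:i] against the empty word
        for j, b in enumerate(v, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute, or match for free
        prev = curr
    return prev[-1]


def star_set(d):
    """The d words of N(d): the letter a repeated d times with one b at position i."""
    return ["a" * i + "b" + "a" * (d - i - 1) for i in range(d)]


def mst_length(points, dist):
    """Length of a minimum spanning tree of the complete graph on points (Prim)."""
    best = {p: dist(points[0], p) for p in points[1:]}  # cheapest link to the tree
    total = 0
    while best:
        p = min(best, key=best.get)
        total += best.pop(p)
        for q in best:
            best[q] = min(best[q], dist(p, q))
    return total
```

For $d = 6$ the spanning tree has length $2(d - 1) = 10$, while the star with center `aaaaaa` has length 6, in line with (5.11) and (5.12). The sketch does not prove that the star is a shortest tree; that claim is taken from the text.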
5.2.3 Distance and similarity
In the biological context the equality of words makes no sense, since mutations do not allow identical sequences in reality. On the other hand, in biomolecular sequences, high sequence similarity usually implies significant functional and structural similarity.¹⁵ Let $A$ be an alphabet. We consider the set $A^*$ of all words over $A$. Our interest is to define measures on $A^*$ which reflect the "proximity" of two words. Here, two different approaches are to be distinguished: distance and similarity. Historically, the origin of the first was the result of investigations for a rigorous mathematical solution to an important biological problem; the second was the result of a heuristic approach. We will introduce both measures in the greatest possible generality. This is necessary, since evolution, as reflected at the molecular level, proceeds by a series of insertions, deletions and substitutions of letters, as well as other far rarer mechanisms which we ignore here, since we observe not complete genomes, only genes or other "smaller" words.¹⁶
A cost measure $(c, h)$ is given by

• A function $c : A \times A \to \mathbb{R}$, which satisfies the following conditions:
  (i) $c$ is non-negative: $c(a, b) \ge 0$;
  (ii) $c(a, a) = 0$; and
  (iii) $c$ is symmetric: $c(a, b) = c(b, a)$ for any $a, b \in A$.
• A positive real number $h$.

The substitution of a letter $b$ for a letter $a$ costs $c(b, a) = c(a, b)$. The insertion or deletion of a letter effectively transforms a non-gap letter in one word to a gap in the other. Since we do not know the direction of the change through time, it is useful to group both operations under the term indel. Each indel costs $h$. The distance $\rho(w, w')$ between two sequences $w, w' \in A^*$ according to a cost measure is the minimum of the costs running over all series of operations transforming $w$ into $w'$.

¹⁴ Similar considerations about the Steiner ratio of sequence spaces give
$$\frac{1}{2} \le m(A^d, \rho_H) \le \frac{d}{2(d - 1)}. \qquad (5.16)$$
Consequently,
$$m(A^d, \rho_H) \approx \frac{1}{2} \qquad (5.17)$$
if $d \gg 1$, see Foulds [167].
¹⁵ But note that the converse is, in general, not true. And in reality, for applications in biology it is sometimes necessary to take into account several other properties of the macro-molecules to measure their similarity, for instance structure, expression and pathway similarity, compare [248].
¹⁶ Note that gene trees and species trees may not match due to lineage sorting, hybridization, recombination and other events. We will discuss this question later more extensively.
Observation 5.2.3 The function $\rho$ is a pseudo-metric. If, moreover, the function $c$ satisfies the non-degeneracy property, i.e. that $c(a, b) = 0$ holds if and only if $a = b$, then $\rho$ is a metric.
Consequently a given cost measure for an alphabet $A$ generates a metric (or pseudo-metric) space $(A^*, \rho)$. Note that we do not assume that $c$ satisfies the triangle inequality, but we can assume this. The reason for this assumption is that even if we start with a cost measure $(c, h)$ that does not satisfy it, we can always define a new pair $(c', h)$ that does satisfy it and produces the same metric. Namely, if three letters $a_1$, $a_2$ and $a_3$ are such that $c(a_1, a_2) > c(a_1, a_3) + c(a_3, a_2)$, then every time we need to replace $a_1$ by $a_2$ we will not do it directly but rather replace $a_1$ by $a_3$ and later $a_3$ by $a_2$, producing the same effect at a lower cost. Moreover, using the same reasoning, the restriction of the metric $\rho$ to the alphabet itself need not be $c$. This is only true if the function $c$ satisfies the triangle inequality.¹⁷
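The passage from $c$ to a function $c'$ satisfying the triangle inequality can be sketched as a shortest-path closure over the letters. The following is a minimal illustration, assuming that only detours over other letters are considered (indels are left aside); the alphabet `LETTERS`, the table `RAW` and the function names are hypothetical.

```python
from itertools import product


def symmetric_cost(letters, raw):
    """Build a symmetric cost table with zero diagonal from unordered pairs."""
    c = {(a, a): 0 for a in letters}
    for (a, b), value in raw.items():
        c[a, b] = c[b, a] = value
    return c


def metric_closure(letters, c):
    """Cheapest substitution costs when detours over other letters are allowed."""
    closed = dict(c)
    # Floyd-Warshall: the intermediate letter k is the slowest-varying index.
    for k, a, b in product(letters, repeat=3):
        detour = closed[a, k] + closed[k, b]
        if detour < closed[a, b]:
            closed[a, b] = detour
    return closed


# Hypothetical costs violating the triangle inequality: c(a,c) > c(a,g) + c(g,c).
LETTERS = "acg"
RAW = {("a", "c"): 5, ("a", "g"): 1, ("c", "g"): 1}
COST = symmetric_cost(LETTERS, RAW)
CLOSED = metric_closure(LETTERS, COST)
```

Here the direct substitution of `a` by `c` costs 5, but going through `g` costs 2, so the closure lowers the entry to 2 and leaves the other entries unchanged.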
¹⁷ Compare our investigations about the metric closure of networks, see 2.5.1. This also gives hints for our later work.

An example for a cost measure is given by $c(a, b) = 1$ for any pair $a$ and $b$ of different letters and $h = 1$. This creates the Levenshtein distance discussed in the section before. Another example: For the cost measure $(c, h)$ defined by
and h = 4, we find p(agc, n3c) = 5, p(acg, a3c) = 7 and
The (pseudo-) distance $\rho(w, w')$ between two words $w$ and $w'$ is attained with some (finite) operation sequence transforming $w$ into $w'$. Moreover,

Observation 5.2.4 This metric space $(A^*, \rho)$ is a discrete one, that means, if for a subset $W$ of words over $A$ it holds that
$$\sup\{\rho(w, w') : w, w' \in W\} < \infty \qquad (5.18)$$
then also
$$|W| < \infty. \qquad (5.19)$$

To see this we recall that:
1. If we consider the substitutions, there are at most $|A|$ possible different letters at each position;
2. The "gap penalty" is chosen as a positive real. Hence, the distance between two words bounds the difference of their lengths:
$$\rho(w, w') \ge h \cdot \big| |w| - |w'| \big|, \qquad (5.20)$$
which is in any case a positive real if the words are of different lengths.
Consequently, in a bounded set of words there are at most finitely many different ones.
Another approach uses similarity. The procedure used to find such a quantity is called sequence alignment and depends on a scoring system.
Given two sequences $w$ and $w'$ over the same alphabet, an alignment of $w$ and $w'$ is a partial mapping from letters in $w$ to $w'$, or vice versa, which preserves the left-to-right ordering. Such an alignment can be represented by a diagram with aligned letters above each other, and unaligned letters placed opposite gaps. An alignment can be viewed as a way to extend the sequences to be of the same length using gaps or "dummy symbols". For instance consider the two words $w = ac^2g^2t^2$ and $w' = agct$. The following arrays are both alignments for $w$ and $w'$:

a c c g g t t
a g c t - - -

and

a c c g g t t
a - - g c t -

where "-" denotes a "dummy" symbol. In other words, we search for a diagram such that
(i) The elongated sequences are of the same length;
(ii) There is no position for which the elongated sequences both have a dummy (i.e. we do not use pairs of dummies).
That means, a pairwise alignment for two words $w$ and $w'$ over an alphabet $A$ is a $2 \times l$-array with values from $A \cup \{-\}$ and
$$\max\{|w|, |w'|\} \le l \le |w| + |w'|. \qquad (5.21)$$
Consequently, there are only finitely many alignments for a given pair of sequences. Consider two words $w = a_1 a_2 \ldots a_n$ and $w' = b_1 b_2 \ldots b_m$. To count alignments is to identify aligned pairs $(a_i, b_j)$ and simply to choose subwords of $w$ and $w'$ to align. This gives
$$\sum_{k \ge 0} \binom{n}{k} \binom{m}{k}$$
alignments.¹⁸ Hence,
$$\sum_{k \ge 0} \binom{n}{k} \binom{m}{k} = \binom{n + m}{n}. \qquad (5.22)$$

Observation 5.2.5 There are
$$\binom{n + m}{n} \qquad (5.23)$$
alignments of two words with $n$ and $m$ letters, respectively. In particular, if both words have the same length $n$ there are
$$\binom{2n}{n} \approx \frac{4^n}{\sqrt{\pi n}}. \qquad (5.24)$$
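The counting argument can be checked for small word lengths with a short sketch. Two counts are shown, both under hypothetical function names: `count_alignments` sums over the number $k$ of aligned pairs as above, and `count_alignments_distinct` counts the variant in which the column orders (a-, -b) and (-a, b-) are regarded as different. For the latter we assume the standard three-term recurrence (last column is an aligned pair, a gap in the first row, or a gap in the second row), which yields the Delannoy numbers.

```python
from functools import lru_cache
from math import comb, pi, sqrt


def count_alignments(n, m):
    """Alignments of words with n and m letters: choose k letters in each word."""
    return sum(comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


@lru_cache(maxsize=None)
def count_alignments_distinct(n, m):
    """Count when the order of adjacent gap columns matters (assumed recurrence)."""
    if n == 0 or m == 0:
        return 1  # only one way to align a word with the empty word
    return (count_alignments_distinct(n - 1, m)         # last column: letter over gap
            + count_alignments_distinct(n, m - 1)       # last column: gap over letter
            + count_alignments_distinct(n - 1, m - 1))  # last column: aligned pair


def central_estimate(n):
    """The approximation 4^n / sqrt(pi * n) for the binomial coefficient C(2n, n)."""
    return 4 ** n / sqrt(pi * n)
```

For the example words of lengths 7 and 4 this gives `count_alignments(7, 4) == comb(11, 4) == 330`, so even very short words admit hundreds of alignments.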
More about the combinatorics of alignments can be found in Waterman [447]. Furthermore, the elongated sequences in an alignment should be as similar as possible according to some predefined scoring system. Given an alignment between two words, we assign a score to it as follows: Each column of the alignment will receive a certain value depending on its contents and the total score for the alignment will be the sum of the values assigned to its columns. Let an alignment between two words be given. If a column has two identical symbols we will call it a match, two different symbols is called a mismatch, and finally, a space, that is a dummy in one row, is called a gap. More generally:
A scoring system $(p, g)$ is given by

• A symmetric function $p : A \times A \to \mathbb{R}$, and
• A non-positive real number $g$.

The array of $p$ is called the (substitution) score matrix. The value $p(a, b)$ scores pairs of aligned letters $a$ and $b$. The penalty $g$ is used to penalize gaps. In general, we assume that $p(a, a) > 0$, for $a \in A$, and $g < 0$.¹⁹

¹⁸ Here we do not count the pairs (a-, -b) and (-a, b-) as distinct. Otherwise, the number $f(n, m)$ of such alignments for two sequences of $n$ and $m$ letters fulfils the equality
$$f(n, m) = f(n - 1, m) + f(n, m - 1) + f(n - 1, m - 1), \qquad (5.25)$$
which does not have a nice explicit description. But it can be shown that
$$f(n, n) \sim (1 + \sqrt{2})^{2n + 1} \, n^{-1/2}, \qquad (5.26)$$
see [446].
¹⁹ And unlikely substitutions are penalized with a negative score.

Clearly, the selection of an appropriate score matrix is crucial for achieving "good" alignments. A scoring system assigns a value, called the score, to each possible alignment. The similarity $sim(w, w')$ between two sequences $w, w' \in A^*$ according to a scoring system is the maximum of the scores running over all alignments of $w$ and $w'$.²⁰ The concepts of distance and of similarity are essentially dual. More precisely:
Algorithm 5.2.6 Given a cost measure $(c, h)$ and a constant $K$, we can define a scoring system $(p, g)$ as follows:
$$p(a, b) = K - c(a, b), \qquad g = \frac{K}{2} - h, \qquad (5.27)$$
under the constraint $K \le 2h$.
And conversely, given a scoring system $(p, g)$ with the property that $p(a, a) = K$ for all $a \in A$, we can define a cost measure $(c, h)$ as follows:
$$c(a, b) = K - p(a, b), \qquad h = \frac{K}{2} - g,$$
under the constraints $K \ge \max\{p(a, b) : a, b \in A\}$, and $K \ge 2g$.

²⁰ In a biological context a scoring matrix $p$ is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment. Substitution matrices for amino acids are complicated because they reflect the chemical nature and the frequency of occurrence of the amino acids, see [20]. Such matrices for bases in DNA or RNA sequences are very simple: in most cases, it is reasonable to assume that a:t and g:c occur in roughly equal proportions. But sometimes the following score matrix is used:

In other words, we have the following interrelation between a cost measure $(c, h)$ and a scoring system $(p, g)$:
$$p(a, b) + c(a, b) = K, \qquad g + h = \frac{K}{2}$$
for all $a, b \in A$, which obviously reflects the duality. Roughly speaking, "large distance" is "small similarity" and vice versa. Moreover, distance computation can be reduced to similarity computation:

Theorem 5.2.7 (Smith, Waterman, Fitch [402], Setubal, Meidanis [394], Waterman [446]) A cost measure and the corresponding scoring system as in 5.2.6 are given for a certain value $K$. Let $w$ and $w'$ be words over $A$. Then
$$sim(w, w') + \rho(w, w') = \frac{K}{2} \cdot (|w| + |w'|).$$
Both the cost measure and the corresponding scoring system yield the same optimal alignments.²¹

²¹ Although with different scores. But using the formula given in 5.2.6 the distance is the same.

Sketch of the proof. Let $w$ and $w'$ be words of length $m$ and $n$ respectively, and let $\alpha$ be an alignment between $w$ and $w'$. We define a series $\sigma$ of operations transforming $w$ into $w'$ by dividing $\alpha$ into columns corresponding to the operations in a natural way: matches and mismatches of letters correspond to substitutions; gaps correspond to indels. We shall now compute the score of $\alpha$ and the cost of $\sigma$. Suppose there are exactly $l$ letters which are matched or mismatched in $\alpha$, occupying positions $w_i$ in $w$ and $w'_i$ in $w'$, $1 \le i \le l$. Suppose further that there are exactly $r$ gaps in $\alpha$. Then
$$score(\alpha) = \sum_{i=1}^{l} p(w_i, w'_i) + r g. \qquad (5.30)$$
On the other hand, the cost of $\sigma$ is
$$cost(\sigma) = \sum_{i=1}^{l} c(w_i, w'_i) + r h. \qquad (5.31)$$
Memberwise addition of (5.30) and (5.31) in conjunction with 5.2.6 gives
$$score(\alpha) + cost(\sigma) = l K + r \cdot \frac{K}{2}. \qquad (5.32)$$
Moreover the values of $l$ and $r$ are not independent: each match uses two letters and each gap uses one. Therefore, the total number of letters must be
$$m + n = 2l + r. \qquad (5.33)$$
Then (5.32) can be written as
$$score(\alpha) + cost(\sigma) = \frac{K}{2} \cdot (m + n). \qquad (5.34)$$
Since this is true for any alignment, we have one half of the assertion. The other half follows similarly.
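The identity (5.34) can be observed numerically with a small sketch. It assumes the relations $p + c = K$ and $g + h = K/2$ that underlie (5.32), and it computes both optimal values with one generic dynamic-programming routine over alignments; for the cost side this equals the distance only when a direct substitution is never more expensive than a detour, which holds for the sample costs used here. All function names are hypothetical.

```python
def align_value(w, v, pair, gap, best):
    """Optimal alignment value; best is max for scores and min for costs."""
    prev = [j * gap for j in range(len(v) + 1)]       # empty prefix of w against v[:j]
    for i, a in enumerate(w, start=1):
        curr = [i * gap]                              # w[:i] against the empty word
        for j, b in enumerate(v, start=1):
            curr.append(best(prev[j] + gap,               # a opposite a gap
                             curr[j - 1] + gap,           # b opposite a gap
                             prev[j - 1] + pair(a, b)))   # a aligned with b
        prev = curr
    return prev[-1]


def dual_scoring(c, h, K):
    """Scoring system (p, g) with p + c = K and g + h = K/2."""
    return (lambda a, b: K - c(a, b)), K / 2 - h


def unit_cost(a, b):
    """Substitution cost of the Levenshtein distance."""
    return 0 if a == b else 1


def similarity_plus_distance(w, v, c, h, K):
    """Return sim(w, v) + rho(w, v) for a cost measure and its dual scoring system."""
    p, g = dual_scoring(c, h, K)
    return align_value(w, v, p, g, max) + align_value(w, v, c, h, min)
```

For the Levenshtein costs ($c = 1$, $h = 1$, $K = 2$) and the words APFEL and PFERD the distance is 3 and the similarity is 7, which add up to $\frac{K}{2}(5 + 5) = 10$.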
All these considerations imply that, from the mathematical standpoint, an alignment and an edit transformation are equivalent ways to describe a relationship between two words. An alignment can be easily converted to its dual edit transformation and vice versa: two opposing letters that mismatch in an alignment correspond to a substitution; a gap in the first word of an alignment corresponds to an insertion of the opposing letter into the first word; and a gap in the second word corresponds to a deletion of the opposing letter from the first word. Thus the edit distance of two words is given by the alignment minimizing the number of opposing letters that mismatch plus the number of letters opposite gaps. But we should note what Gusfield [198] wrote:

Although an alignment and an edit transcript are mathematically equivalent, from a modeling standpoint, an edit transcript is quite different from an alignment. An edit transcript emphasizes the putative mutational events (point mutations in the model so far) that transform one string to another, whereas an alignment only displays a relationship between two strings. The distinction is one of process versus product. Different evolutionary models are formalized via different permitted string operations, and yet these can result in the same alignment. So an alignment alone blurs the mathematical model. This is often a pedantic point but proves helpful in some discussions of evolutionary modeling.

We will switch between the concepts of edit transformations and alignments whenever it is convenient to do so.
A simplified scoring system, called a match-mismatch-gap system, is given if all matches have the same value $M = p(a, a)$ and likewise all mismatches have the same value $m = p(a, b)$, $a \neq b$. Of course, we assume that $M > 0$ and $g < 0$. Additionally, a substitution $(a, b)$ must be "cheaper" than two indels $(a, -)$, $(-, b)$. Hence, we have

Corollary 5.2.8 Let $(M, m, g)$ be a scoring system with only values for matches, mismatches and gaps. Then a cost measure $(c, h)$ having $c(a, a) = 0$ and $c(a, b) = c > 0$ is given by
$$c = M - m \quad \text{and} \quad h = \frac{M}{2} - g, \qquad (5.35)$$
provided that $M \geq m \geq 2g$, in which at least one inequality is strict, $M > 0$, and $g < 0$.
As examples we consider several standard systems:

I. The Levenshtein distance, that is $c = 1$ and $h = 1$. We may choose match score $M = 2$, mismatch score $m = 1$ and gap score $g = 0$. More generally, if we wish to measure the distance by
$$\rho(w, w') = \#\,\text{substitutions} + h \cdot \#\,\text{indels}, \qquad (5.36)$$
for $h \geq 1$ (i.e. gaps are $h$ times as costly as substitutions), we may choose $M = 2$, $m = 1$ and $g = 1 - h$.

II. The standard match-mismatch-gap system $(1, -1, -2)$ implies the cost measure $c = 2$ and $h = 5/2$.

III. A "normed" match-mismatch-gap system with one free parameter is given by $(1, m, 0)$ where $1 \geq m \geq 0$. Equivalently, we have a cost measure with $c = 1 - m$ and $h = 1/2$. In particular, the search for a longest common subsequence for a pair of words uses the match-mismatch-gap system $(1, 0, 0)$, which implies $c = 1$ and $h = 1/2$.²²

²² The converse of the longest common subsequence problem is the problem of shortest supersequence. Given: A set of sequences over the same alphabet. Find: A shortest sequence that contains each of the given sequences as a subsequence. This problem is NP-complete [435].
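The three standard systems can be checked mechanically. The following Python sketch assumes the conversion $c = M - m$, $h = M/2 - g$ between a match-mismatch-gap system $(M, m, g)$ and a cost measure $(c, h)$; the function name `cost_from_score` is ours and serves only as an illustration.

```python
from fractions import Fraction


def cost_from_score(M, m, g):
    """Convert a match-mismatch-gap system (M, m, g) into a cost measure (c, h).

    Assumed conversion: c = M - m for a substitution, h = M/2 - g for an indel.
    Fractions keep values such as 5/2 exact.
    """
    M, m, g = Fraction(M), Fraction(m), Fraction(g)
    c = M - m
    h = M / 2 - g
    return c, h


# I.   Levenshtein distance: (2, 1, 0)
# II.  standard system:      (1, -1, -2)
# III. longest common subsequence: (1, 0, 0)
for system in [(2, 1, 0), (1, -1, -2), (1, 0, 0)]:
    c, h = cost_from_score(*system)
    print(system, "-> c =", c, ", h =", h)
```

Exact rational arithmetic is used only so that the half-integer indel costs are not blurred by rounding.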
How can we find the similarity of or the distance between two words? Clearly, the consideration of all possible alignments does not make sense, since there are too many; see 3.2.3. Observe that we cannot change the order of the letters in the words. This fact suggests that a dynamic programming approach will be useful. A dynamic programming algorithm finds the solution by first breaking the original problem into smaller subproblems and then solving all these subproblems, storing each intermediate solution in a table along with a score, and finally choosing the sequence of solutions that yields the highest score. The goal is to maximize the total score for the alignment. In order to do this, the number of high-scoring residue pairs must be maximized and the number of gaps and low-scoring pairs must be minimized.²³

Due to the widespread applications of the problem, a solution and several basic variants were discovered and published in literature catering to diverse disciplines. It is usual to credit Needleman and Wunsch [319] for creating in 1970 the algorithm for finding the similarity, and Sellers [392] for describing in 1974 the method to compute the distance. Both are designed to produce an optimal measure of the minimum number of changes required to convert one given word into another given word, and may be viewed as an extension of the original Hamming sequence metric. In 1981 Smith, Waterman and Fitch [402] proved the equivalence of both techniques. Two years later they discussed optimal sequence alignments on an important example; see [164].

Let $w$ and $w'$ be two words over $A$ with length $m$ and $n$, respectively. The algorithms use an $(m + 1) \times (n + 1)$ matrix, and determine the values of this matrix in the following way:
Algorithm 5.2.9 Let $w = a[1]a[2] \ldots a[m]$ and $w' = b[1]b[2] \ldots b[n]$ be two sequences in $A^*$, equipped with a scoring system $(p, g)$. Then we find the similarity sim(w, w') = sim[m, n] by the following procedure.
1. for i := 0 to m do sim[i, 0] := i · g;
2. for j := 0 to n do sim[0, j] := j · g;
3. for i := 1 to m do for j := 1 to n do sim[i, j] := max{sim[i - 1, j] + g, sim[i - 1, j - 1] + p[i, j], sim[i, j - 1] + g}

²³ Recall that we used a dynamic programming technique to find a shortest path in a network. And indeed, we can frame the task of finding an optimal alignment as such a problem, compare [447]. But it turns out to be easy to reduce the running time by choosing a better algorithm.
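The three steps of the procedure translate almost literally into code. The following Python sketch is one possible rendering; it reads p[i, j] as the score p(a[i], b[j]) of the two letters, and the helper `match_mismatch` as well as the function names are our own illustrative choices.

```python
def match_mismatch(M, m):
    """Score function of a match-mismatch system: M for equal letters, m otherwise."""
    return lambda a, b: M if a == b else m


def similarity(w, v, p, g):
    """Global similarity sim(w, v) for a score function p and a gap score g."""
    m, n = len(w), len(v)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # step 1: first column, only gaps
        sim[i][0] = i * g
    for j in range(n + 1):          # step 2: first row, only gaps
        sim[0][j] = j * g
    for i in range(1, m + 1):       # step 3: fill the table row by row
        for j in range(1, n + 1):
            sim[i][j] = max(
                sim[i - 1][j] + g,                          # w[i-1] opposite a gap
                sim[i - 1][j - 1] + p(w[i - 1], v[j - 1]),  # two opposing letters
                sim[i][j - 1] + g,                          # v[j-1] opposite a gap
            )
    return sim[m][n]


p = match_mismatch(1, -1)
print(similarity("ACGT", "AGT", p, -2))
```

The strings are indexed from 0 in Python, which is why the letters a[i] and b[j] appear as w[i - 1] and v[j - 1].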
An alignment of two words $w$ and $w'$ is called an optimal alignment if its score equals sim(w, w'). The algorithm, as stated above, only computes the similarity of the words. For the explicit construction of an optimal alignment, the algorithm has to be supplemented by a backtracking procedure. This alignment corresponding to the similarity may well not be unique; but all such alignments can be found by "backtracking" from the cell sim[m, n] to the cell sim[0, 0] in all possible ways.

Dually, we have an algorithm to compute the distance between two words:
Algorithm 5.2.10 Let $w = a[1]a[2] \ldots a[m]$ and $w' = b[1]b[2] \ldots b[n]$ be two sequences in $A^*$, equipped with a cost measure $(c, h)$. Then we find the distance ρ(w, w') = ρ[m, n] by the following procedure.
1. for i := 0 to m do ρ[i, 0] := i · h;
2. for j := 0 to n do ρ[0, j] := j · h;
3. for i := 1 to m do for j := 1 to n do ρ[i, j] := min{ρ[i - 1, j] + h, ρ[i - 1, j - 1] + c[i, j], ρ[i, j - 1] + h}

Obviously, in both cases, the algorithms run in quadratic time:

Observation 5.2.11 Let $w$ and $w'$ be two words over the same alphabet $A$. Let a scoring system or a cost measure be given for $A$. Then the quantities sim(w, w') and ρ(w, w') can be determined in $O(|w| \cdot |w'|)$ time.
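Since both procedures differ only in the choice of max or min, a single table-filling routine covers them. The following Python sketch does this and then checks, for the standard system $(1, -1, -2)$, that similarity and distance add up to $\frac{M}{2}(m + n)$. The check rests on the assumption that the cost measure is obtained as $c(a, b) = M - p(a, b)$ and $h = M/2 - g$; all function names are ours.

```python
from fractions import Fraction


def dp_value(w, v, pair, gap, best):
    """Fill the (m+1) x (n+1) table; best is max (similarity) or min (distance)."""
    m, n = len(w), len(v)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        t[i][0] = i * gap
    for j in range(n + 1):
        t[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t[i][j] = best(
                t[i - 1][j] + gap,
                t[i - 1][j - 1] + pair(w[i - 1], v[j - 1]),
                t[i][j - 1] + gap,
            )
    return t[m][n]


# Standard match-mismatch-gap system and the cost measure assumed to belong to it.
M, mis, g = Fraction(1), Fraction(-1), Fraction(-2)
c, h = M - mis, M / 2 - g


def sim(w, v):
    return dp_value(w, v, lambda a, b: M if a == b else mis, g, max)


def rho(w, v):
    return dp_value(w, v, lambda a, b: 0 if a == b else c, h, min)


w, v = "ACGT", "AGT"
print(sim(w, v), rho(w, v), sim(w, v) + rho(w, v) == M / 2 * (len(w) + len(v)))
```

Both calls touch every cell of the table exactly once, which is the quadratic running time stated in Observation 5.2.11.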
Note that this method to determine the similarity of two sequences is relatively fast but still too slow for most practical work, where the length of the sequences and the number of sequences to be compared are very large. Consequently,
there are heuristic methods which are more efficient for "similarity-searching" an entry in a collection of sequences.²⁴ The similarity-based approach is more general than that of distance, since the distance-based approach is restricted to global comparisons only; it is not suitable for local ones. Here, a local alignment between two sequences $w$ and $w'$ is an alignment between a subsequence of $w$ and a subsequence of $w'$. Our algorithm 5.2.9 can be adapted to find the highest scoring local alignment between two sequences:
Algorithm 5.2.12 Let $w = a[1]a[2] \ldots a[m]$ and $w' = b[1]b[2] \ldots b[n]$ be two sequences in $A^*$, equipped with a scoring system $(p, g)$. Then, compute the local alignment scores as follows:
1. for i := 0 to m do sim[i, 0] := 0;
2. for j := 0 to n do sim[0, j] := 0;
3. for i := 1 to m do for j := 1 to n do sim[i, j] := max{sim[i - 1, j] + g, sim[i - 1, j - 1] + p[i, j], sim[i, j - 1] + g, 0}
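The local variant changes only the boundary values and adds the alternative 0 to the maximum. The following Python sketch illustrates this; it returns the largest table entry as the score of a best local alignment, and it again reads p[i, j] as the score of the letters a[i] and b[j]. The function name is ours.

```python
def local_similarity(w, v, p, g):
    """Score of a highest scoring local alignment of w and v."""
    m, n = len(w), len(v)
    sim = [[0] * (n + 1) for _ in range(m + 1)]   # steps 1 and 2: borders stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim[i][j] = max(
                sim[i - 1][j] + g,
                sim[i - 1][j - 1] + p(w[i - 1], v[j - 1]),
                sim[i][j - 1] + g,
                0,                                 # restart the alignment here
            )
            best = max(best, sim[i][j])            # track the maximum entry
    return best


score = lambda a, b: 1 if a == b else -1
print(local_similarity("XXACGTXX", "YYACGTYY", score, -2))
```

In the example the shared block ACGT is found although the surrounding letters do not match at all, which a global alignment would have to pay for.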
In the end, it suffices to find the maximum entry in the whole array sim: this will be the score of an optimal local alignment. For this algorithm and derivations of our basic technique compare [394].

With similarities we can penalize gaps depending on their lengths. This cannot be done with metrics. This is an important observation, since if two aligned sequences are for functional protein coding genes, then any gaps would be expected to have lengths that were multiples of three, to preserve the reading frame of the gene; and for ribosomal genes there may be aspects of the secondary structure that can be used to evaluate the plausibility of the various gaps introduced in an alignment.

In any case we assume that for a cost measure $(c, h)$ the equality $c(a, a) = 0$ holds for all letters $a$. On the other hand, there are scoring systems $(p, g)$ conceivable in which for different letters $a$ and $b$ we have $p(a, a) \neq p(b, b)$. The PAM (Point Accepted Mutation) series of score matrices are frequently used for protein alignments [13] and [124]. Each entry in a PAM matrix gives the logarithm of the ratio of the frequency at which a pair of residues is observed in pairwise comparisons of homologous proteins to the frequency expected due to chance alone.²⁵ For a generalized scoring system, the derived dissimilarity need not satisfy the triangle inequality.

²⁴ In particular, the well-known BLAST method runs in linear time, that is $O(|w| + |w'|)$; compare [394].
5.2.4 Multiple Alignments
In the context of molecular biology, multiple sequence comparison is the most critical cutting-edge tool for extracting and representing biologically important commonalities from a set of sequences. It plays an essential role in two related areas: finding highly conserved subregions among a collection of sequences; and inferring the evolutionary history of some species from their associated sequences.

One central technique for multiple sequence comparison involves multiple alignment. Here, a (global) multiple alignment of $n > 2$ sequences $w_1, \ldots, w_n$ is a natural generalization of the alignment of two sequences. That means that we insert gap characters (called dummies) into, or at either end of, each of the sequences to produce a new collection of elongated sequences that obeys these rules:
(i) All elongated sequences have the same length, $l$;
(ii) There is no position at which all the elongated sequences have a dummy.
Then the sequences are arrayed in a matrix of $n$ rows and $l$ columns, where
$$\max_{i=1,\ldots,n} |w_i| \leq l \leq \sum_{i=1}^{n} |w_i|.$$

²⁵ Amino acids that regularly replace each other have a positive score, while amino acids that rarely replace each other have a negative score.
Consequently, there are only finitely many multiple alignments for a collection of sequences. Furthermore, the elongated sequences in a multiple alignment are as similar as possible according to some predefined scoring system, cost measure or length of a network.

Although the notion of a multiple alignment is easily extended from two to many sequences, the score or the cost of a multiple alignment is not easily generalized. There is no function that has been universally accepted for multiple alignment as distance or similarity has been for pairwise alignment.

The essence of the first idea is to extend the dynamic programming technique 5.2.10 from pairwise alignment to the alignment of $n > 2$ sequences. A cost measure $(c, h)$ for an alphabet $A$ to compare two sequences can also be written as a function $f : (A \cup \{-\})^2 \to \mathbb{R}$, where $-$ is the "dummy" symbol, $- \notin A$, and
$$f(a, b) = \begin{cases} c(a, b) & : a, b \in A \\ h & : \text{exactly one of } a, b \text{ equals } - \end{cases}$$
($f(-, -)$ is not defined.) $A \cup \{-\}$ is called the extended alphabet, and such a function $f$, extended to $n \geq 2$ values, is called a generalized cost measure. More precisely: A generalized cost measure is a function $f : (A \cup \{-\})^n \to \mathbb{R}_{\geq 0}$ which satisfies the following conditions:

(i) $f$ is non-negative: $f(a_1, \ldots, a_n) \geq 0$;
(ii) $f(a, \ldots, a) = 0$ for each $a \in A$; $f(-, \ldots, -)$ is not defined;
(iii) $f(a_1, \ldots, a_n) > 0$ if $a_i = -$ holds for at least one index $i$;
(iv) $f$ is symmetric: $f(a_1, \ldots, a_n) = f(a_{\pi(1)}, \ldots, a_{\pi(n)})$ holds true for any permutation $\pi$.
With this in mind, we have

Algorithm 5.2.13 (Clote, Backofen [109], Waterman [447]) Let $A$ be an alphabet. Let $w = a[1]a[2] \ldots a[k]$, $w' = b[1]b[2] \ldots b[m]$ and $w'' = c[1]c[2] \ldots c[l]$ be three sequences in $(A \cup \{-\})^*$. Let $f : (A \cup \{-\})^3 \to \mathbb{R}$ be a generalized cost measure. We find the "generalized" distance $R(w, w', w'') = R[k, m, l]$ by the following procedure:
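A recursion of this kind can be sketched as follows. This is one standard formulation of dynamic programming for three sequences and not necessarily the procedure given in [109] or [447]: the last column of an alignment takes a letter from at least one of the three words, which yields seven cases per cell. The column cost `sum_of_pairs`, which adds up the pairwise costs and ignores a dummy opposing a dummy, is an assumption chosen for illustration, as are all names.

```python
from itertools import product


def sum_of_pairs(col, c, h):
    """Assumed generalized cost of one column: sum of the costs of its pairs."""
    total = 0
    for x in range(len(col)):
        for y in range(x + 1, len(col)):
            a, b = col[x], col[y]
            if a == "-" and b == "-":
                continue                      # dummy opposite dummy is dropped
            total += h if "-" in (a, b) else c(a, b)
    return total


def distance3(w1, w2, w3, f):
    """Generalized distance R[k][m][l] of three words for a column cost f."""
    k, m, l = len(w1), len(w2), len(w3)
    inf = float("inf")
    R = [[[inf] * (l + 1) for _ in range(m + 1)] for _ in range(k + 1)]
    R[0][0][0] = 0
    # every admissible last column consumes a letter of at least one word
    moves = [e for e in product((0, 1), repeat=3) if any(e)]
    for i in range(k + 1):
        for j in range(m + 1):
            for t in range(l + 1):
                for d1, d2, d3 in moves:
                    if i < d1 or j < d2 or t < d3:
                        continue
                    col = (w1[i - 1] if d1 else "-",
                           w2[j - 1] if d2 else "-",
                           w3[t - 1] if d3 else "-")
                    cand = R[i - d1][j - d2][t - d3] + f(col)
                    R[i][j][t] = min(R[i][j][t], cand)
    return R[k][m][l]


f = lambda col: sum_of_pairs(col, lambda a, b: 0 if a == b else 1, 1)
print(distance3("AC", "A", "AC", f))
```

The table has $(k + 1)(m + 1)(l + 1)$ cells and each cell inspects seven predecessors, so the running time is proportional to the product of the word lengths.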
Applied to the case with n sequences, we have the following strict generalisation of 5.2.11:
Observation 5.2.14 Let $N = \{w_i : i = 1, \ldots, n\}$ be a set of words over the same alphabet $A$. Let a generalized cost measure be given for $A$. Then the quantity $R(w_1, \ldots, w_n)$ can be determined in $O(\prod_{i=1}^{n} |w_i|)$ time.

Another approach is to use single pairwise alignments. Given a multiple alignment $\mathcal{M}$ for the sequences $w_1, \ldots, w_n$, the induced pairwise alignment of two sequences $w_i$ and $w_j$ is obtained from $\mathcal{M}$ by
1. removing all rows except the two rows for $w_i$ and $w_j$;
2. removing columns consisting of a dummy opposing another dummy.

To find the cost, or the score, we use the cost measures, or the scoring systems, respectively, in the standard manner. Then, we define the cost (score) of $\mathcal{M}$ by summing up the distances (similarities) of several pairs of induced alignments. This can be described in graph-theoretical terms: Let $N = \{w_1, \ldots, w_n\}$ be a set of sequences from the same phylogenetic space. Then define the generalized distance of a graph alignment $G = (V, E)$, $N \subseteq V$, by
$$\sum_{vv' \in E} \rho(v, v').$$
A multiple alignment of a collection of words is called an optimal multiple alignment if its generalized distance is minimal among all multiple alignments
of these words. Such an alignment may well not be unique. Some specific examples of generalized scoring systems are of interest:

The sum-of-pairs or complete alignment, which is the sum of all pairs cost. That means we consider the complete graph $G = (N, \binom{N}{2})$. This definition is mathematically natural but not biologically intuitive; in particular, evolutionary relationships are ignored. This formulation of an optimal multiple alignment for a set of sequences has been shown to be NP-hard; see Wang and Jiang [442].
The tree alignment.²⁶ Here we are near evolutionary trees: Given a set $N$ of $n$ sequences and a partially labelled tree $T = (V, E)$ with $n$ leaves, where each leaf is associated with a given sequence²⁷, we want to reconstruct a sequence for each internal node to minimize the length of $T$. A complication, however, is that the alignment may change depending upon the tree on which the sequences are aligned. This is not a simple issue, since most of the phylogenetic studies align the sequences first, then compute a phylogeny based on that alignment. One solution to this dilemma is to infer both the alignment and the tree at the same time, so that the "optimal" alignment and the phylogenetic tree are obtained together. We will discuss this approach below; an overview is given by Jiang and Wang [243]. This formulation of an optimal multiple alignment for a set of sequences has also been shown to be NP-hard; see Wang and Jiang [442].²⁸ A heuristic approach has been created by Schwikowski and Vingron [390].

The more specific star alignment, in which it is assumed that the underlying tree is a star. This implies that all sequences share a common ancestor. Restricting the "topology" makes this approach much more tractable, but nevertheless it too is not solvable in polynomial time. If we pick one of the given sequences as the internal vertex of the star, we can find an optimal alignment in
time [394]. And to find the center sequence we can compute all $O(n^2)$ optimal pairwise alignments and select as the center the sequence $w_c$ that minimizes
$$\sum_{i=1}^{n} \rho(w_c, w_i).$$

²⁶ Note that this term is used to mean several different things in the literature.
²⁷ Later we will call such a tree an N-tree.
²⁸ Moreover, they show that there is no polynomial time approximation scheme for the problem, unless P = NP.
For a broader discussion of the relationship between multiple alignment and phylogeny construction, compare Vingron [437].

The generalized distance for these applications can be very different: Let $A = \{r, y\}$. Consider the costs for one column consisting of $m_1$ instances of the letter $r$ and $m_2$ instances of the letter $y$, where $m_1 + m_2 = n$, and using the length function
$$\rho(a, b) = \begin{cases} 0 & : a = b \\ 1 & : \text{otherwise} \end{cases}$$
for $a, b \in \{r, y\}$. It is easy to check that the complete alignment has length $m_1 \cdot m_2$, that there is a star alignment of length $\min\{m_1, m_2\}$, and that there is a tree alignment of length
$$\begin{cases} 0 & : m_1 = 0 \text{ or } m_2 = 0 \\ 1 & : \text{otherwise.} \end{cases}$$
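The three lengths of this example are quickly verified by a small computation. The following Python sketch assumes the 0/1 length function on single letters; the star is centred at whichever letter is cheaper, and the tree is realized as a path through the sorted column so that equal letters are neighbours. The function names are ours.

```python
from itertools import combinations


def d(a, b):
    """Assumed length function on letters: 0 for equal letters, 1 otherwise."""
    return 0 if a == b else 1


def complete_length(col):
    """Sum over all pairs of the column (complete graph)."""
    return sum(d(a, b) for a, b in combinations(col, 2))


def star_length(col):
    """Best star: try each occurring letter as the centre."""
    return min(sum(d(centre, a) for a in col) for centre in set(col))


def tree_length(col):
    """One good tree: a path that visits equal letters consecutively."""
    s = sorted(col)
    return sum(d(a, b) for a, b in zip(s, s[1:]))


col = "rrryy"          # m1 = 3 instances of r, m2 = 2 instances of y
print(complete_length(col), star_length(col), tree_length(col))
```

For a column with both letters present the three values are $m_1 m_2$, $\min\{m_1, m_2\}$ and 1, so the choice of the underlying graph changes the cost of the same column considerably.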
Surveys of multiple sequence comparison methods are given in [75], [148] and [438].

In any case, the alignment array can be summarized in a single sequence called a consensus sequence, which is frequently added at the end of the alignment. It is common in computational molecular biology to compute a multiple alignment for a set of sequences, and then represent those sequences by the consensus sequence derived from the alignment. The consensus sequence consists of letters that summarize the letters of the alignment in each column. A simple way to calculate a consensus sequence is to use the so-called majority rule (MR), which chooses the most frequently occurring letter in each column. We distinguish between two rules: The normal rule uses the alphabet $A \cup \{-\}$. The restricted rule uses only the alphabet $A$. An example compares the word for SCHOOL in different languages:
Language
German                     -   S   C   H        U        -        L   E
English                    -   S   C   H        O        O        L   -
French                     E   -   C   -        O        -        L   E
Italian                    -   S   C   -        U        O        L   A
Consensus, MR              -   S   C   H or -   O or U   O or -   L   E
Consensus, restricted MR   E   S   C   H        O or U   O        L   E
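The majority rule is easily mechanized. In the following Python sketch the four rows are one reading of the SCHOOL example (SCHULE, SCHOOL, ECOLE, SCUOLA aligned in eight columns), which is an assumption where the printed array is ambiguous; ties are reported as a list of all most frequent letters. The function name is ours.

```python
from collections import Counter

ROWS = [
    "-SCHU-LE",   # German
    "-SCHOOL-",   # English
    "E-C-O-LE",   # French
    "-SC-UOLA",   # Italian
]


def majority_rule(rows, restricted=False):
    """Return, for each column, the sorted list of most frequent letters.

    The normal rule counts the dummy '-' like a letter; the restricted rule
    ignores it. A column never consists of dummies only, so the restricted
    count is never empty.
    """
    consensus = []
    for column in zip(*rows):
        counts = Counter(column)
        if restricted:
            counts.pop("-", None)
        top = max(counts.values())
        consensus.append(sorted(a for a, k in counts.items() if k == top))
    return consensus


for rule in (False, True):
    print(" ".join(" or ".join(letters) for letters in majority_rule(ROWS, rule)))
```

Returning all tied letters, rather than picking one, keeps the ambiguity of columns such as the one containing U, O, O, U visible.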
More generally, assuming that there is a cost measure $(c, h)$, written as a generalized cost function $f : (A \cup \{-\})^2 \to \mathbb{R}$, we define the consensus sequence as follows: Given a multiple alignment $\mathcal{M} = (a_{ji})$ of a set $N$ of $n$ sequences, the consensus letter of column $i$ of $\mathcal{M}$ is the letter $a$ that minimizes
$$\sum_{j=1}^{n} f(a, a_{ji}).$$
(If we allow $a = -$, then we have to define also $f(-, -)$, in general by setting $f(-, -) = h$.) The consensus sequence derived from $\mathcal{M}$ is the concatenation of the consensus letters for each column of $\mathcal{M}$. Using the generalized cost measure defined by
$$f(a, b) = \begin{cases} 0 & : a = b \\ 1 & : \text{otherwise} \end{cases}$$
gives the majority rule.

It is easy to find the consensus sequence for a given multiple alignment. The following problem in a phylogenetic space, given by an alphabet $A$ and a generalized cost measure $f : (A \cup \{-\})^2 \to \mathbb{R}$, is not so simple:
The Consensus Sequence Problem
Given: A set $N = \{w_1, \ldots, w_n\}$ of sequences.
Find: A multiple alignment $\mathcal{M} = (a_{ji})_{j=1,\ldots,n;\ i=1,\ldots,l}$ for $N$, and a consensus sequence $w = a_1 \ldots a_l$ such that
$$\sum_{i=1}^{l} \sum_{j=1}^{n} f(a_i, a_{ji})$$
is minimal.

For a broader discussion of this problem, compare Gusfield [198]; for the consequences of this observation for molecular evolution see Eigen [153].
5.2.5 Steiner's Problem in phylogenetic spaces: The question
Until now we have vaguely defined the maximum parsimony problem as the problem of reconstructing the evolutionary history with the fewest mutations. Phylogeny construction is a prominent application of the notion of a Steiner Minimal Tree: one of the first formal versions of phylogeny construction interpreted the ancestral sequences as Steiner points in a hypercube, namely in $\{a, c, g, t\}^d$. We are given a set of aligned sequences and a tree topology, where the leaves are labelled with the given sequences. For any assignment of sequences to the internal vertices of the tree, the length of the tree is defined as the number of mismatches between the pairs of sequences incident to each edge. A most parsimonious assignment of sequences is one that minimizes the total length. An algorithm for its solution that is linear in the number and length of the sequences was given by Fitch [162] in 1971. In 1975 Sankoff [378] generalized this approach to handle assumed tree topologies and unaligned sequences. We will discuss this technique below, and in more depth, in the last chapter of the book.

Equipped with the proper terminology, we can now give a precise definition of the maximum parsimony problem: Consider a phylogenetic space $\mathcal{A}$ over the alphabet $A$ with the scoring system $(p, g)$ generating the similarity function sim, and the equivalent cost measure $(c, h)$ generating the metric $\rho$. In $\mathcal{A}$, the length of a tree $T = (V, E)$ is given by
$$L(\mathcal{A})(T) = \sum_{vv' \in E} \rho(v, v').$$
The metric $\rho$ may be a pseudometric, in which case we call $L$ a pseudo-length. The most important principle in molecular evolution, namely that the degree of similarity between genes reflects the strength of the evolutionary relationship between them, gives rise to the following observation: Let $N$ be a finite set of points (sequences, words) in a phylogenetic space $\mathcal{A}$. A most parsimonious tree is an SMT for $N$ in $\mathcal{A}$.

An SMT for $N$ must exist and it is only necessary to search the Steiner points in the set
$$Q = \{w \in \mathcal{A} : \rho(v, w) \leq L(\mathcal{A})(\text{MST for } N)\}, \qquad (5.47)$$
where $v$ is a point in $N$. The set $Q$ is bounded and consequently finite. So Steiner's Problem can be solved simply by enumerating all subsets $N' \subseteq Q \setminus N$
with at most $|N| - 2$ points, computing an MST for $N \cup N'$, and keeping the smallest one. Although such an approach leads to a finite algorithm, it is not a very efficient one. We will discuss this more rigorously in the next chapter.

The notion of an SMT subsumes both "pure" trees and (multiple) alignments. The formulation as Steiner's Problem gives hints on how to handle the problem, but the standard methods do not readily translate, since the underlying "sequence space graph" is too large for the application of the general methods described in the literature. This does not make these methods worthless, but special properties and alternate algorithms are needed.
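The enumeration just described can be tried out on a very small instance. The following Python sketch makes a simplifying assumption: instead of a general phylogenetic space it uses words of a fixed length $d$ with the Hamming distance, that is, the hypercube setting mentioned at the beginning of this section. The MST is computed with Prim's method; all names are ours, and the running time is exponential in $d$.

```python
from itertools import combinations, product


def hamming(u, v):
    """Number of positions in which two words of equal length differ."""
    return sum(a != b for a, b in zip(u, v))


def mst_length(points):
    """Length of a minimum spanning tree (Prim) for distinct words."""
    points = list(points)
    in_tree, rest, total = {points[0]}, set(points[1:]), 0
    while rest:
        dist, nxt = min((hamming(u, v), v) for u in in_tree for v in rest)
        total += dist
        in_tree.add(nxt)
        rest.remove(nxt)
    return total


def smt_length(given, alphabet):
    """Brute force: try every set of at most |N| - 2 Steiner points from Q."""
    given = list(given)
    bound = mst_length(given)
    v = given[0]
    words = ("".join(t) for t in product(alphabet, repeat=len(v)))
    Q = [w for w in words if hamming(v, w) <= bound and w not in given]
    best = bound
    for k in range(1, max(len(given) - 2, 0) + 1):
        for extra in combinations(Q, k):
            best = min(best, mst_length(given + list(extra)))
    return best


N = ["001", "010", "100"]
print(mst_length(N), smt_length(N, "01"))
```

In the example the additional point 000 shortens the tree from length 4 to length 3, which shows that an MST for $N$ alone need not be a shortest tree.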
5.2.6 A dynamic programming approach for finding an SMT
How hard is it to find an SMT in a phylogenetic space? We will now give a result that is a direct generalization of our approach in 5.2.10 for the calculation of the distance between two words. We start with a given (combinatorial) structure of a graph and will compute its length in a phylogenetic space, and the location of the Steiner points there.

Let $T$ be an arbitrary tree with given vertices $N = \{v_1, \ldots, v_n\}$, and let $\{w_1, \ldots, w_k\}$ be (mobile) Steiner points. Further, let us consider the extended alphabet $A_0 = A \cup \{-\}$, $- \notin A$, and let $\mathbb{B} = \{0, 1\}$. If $x \in A_0$ and $\varepsilon \in \mathbb{B}$, then let
Recall that the set $\mathbb{B}^n$ is endowed with the partial order $\leq_H$. For $x \in \mathbb{B}^n$ we define
$$\mathbb{B}^n_x = \{\varepsilon \in \mathbb{B}^n : \varepsilon <_H x\}. \qquad (5.48)$$
We write $\mathbb{B}^n_1 = \mathbb{B}^n_x$ for $x = (1, \ldots, 1)$. Note that $\mathbb{B}^n_x = \emptyset$ if and only if $x = (0, \ldots, 0)$, and $\mathbb{B}^n_x \subseteq \mathbb{B}^n_1$ for any $x$.

Let $x = x_1 \ldots x_n$ be a word of length $n$ over $A_0$. We define the indicator $\chi(x) = (\chi_1, \ldots, \chi_n) \in \mathbb{B}^n$ as follows: $\chi_i = 0$ if and only if $x_i = -$.

Theorem 5.2.15 (C., Ivanov, Tuzhilin [105], [106]) Let $x_1, \ldots, x_n \in A_0$ be letters not equal to $-$ simultaneously, and let $v_1, \ldots, v_n \in A^*$ be words such that if $x_i = -$, then $v_i = \lambda$. Set $\chi = \chi(x_1, \ldots, x_n)$. Then for any tree $T$ with given points $\{v_1, \ldots, v_n\}$, the equality
$$L(A^*)(T \text{ for } \{v_1 x_1, \ldots, v_n x_n\}) = \min_{\varepsilon \in \mathbb{B}^n_\chi} \left[ L(A^*)(T \text{ for } \{v_1 x_1^{\varepsilon_1}, \ldots, v_n x_n^{\varepsilon_n}\}) + L(A_0)(T \text{ for } \{x_1^{1-\varepsilon_1}, \ldots, x_n^{1-\varepsilon_n}\}) \right]$$
holds, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$.
Theorem 5.2.15 gives an opportunity to calculate the (pseudo-)length of an SMT of a given type with a given set of points. Also it permits the location of the (mobile) Steiner points of this tree to be found. 5.2.15 immediately implies an algorithm which finds local minimal trees of a given combinatorial structure $T$ for a given set $N = \{v_1, \ldots, v_n\} \subseteq A^*$. The algorithm consists of sequentially filling in a table of size
$$\prod_{i=1}^{n} (|v_i| + 1).$$

Observation 5.2.16 The algorithm 5.2.15 consumes time of the order
5.2.7 The Perfect Phylogeny problem
We introduce a character-based approach to reconstructing evolutionary history. The input is a set of attributes called characters that objects may possess. The basic assumptions regarding characters are:
The characters being considered are "meaningful" in the context of phylogenetic tree reconstruction.
The characters can be inherited independently from one another.
All observed states for a given character should have evolved from one "original state" of a common ancestor of the objects.²⁹

²⁹ Characters that obey this assumption are called homologous.
Note that character in this context does not refer to a member of an alphabet; for simplicity we will use natural numbers for character states.
A taxon $v$ over a set $C$ of $m$ characters is a vector $v \in \mathbb{N}^m$. $c(v)$ is the state of $v$ on character $c$ or the state of $c$ for $v$. $A_c$ is the set of allowed states for $c(v)$, assuming that $A_c = \{0, \ldots, r_c - 1\}$ for some integer $r_c \geq 2$. Let $N = \{v_1, \ldots, v_n\}$ be a set of $n$ taxa, represented by an $n \times m$ character-state matrix $M = (f_{ij})$, where $f_{ij}$ is the state of taxon $v_i$ on character $j$. A tree $T = (V, E)$ with $N$ as the set of leaves is called an N-tree. It should represent the phylogeny for $N$, with internal vertices (which may also be labelled) representing hypothetical ancestors to the given taxa. We are interested in (rooted) trees $T$ with the following properties:
(i) Each of the taxa labels exactly one leaf of $T$, and vice versa;
(ii) Each of the characters labels exactly one edge of $T$, but not vice versa; and
(iii) For any taxon $v$, the characters that label the edges along the unique path from the root to $v$ describe the character states of $v$.
A character $c$ is called convex in an N-tree if for every $f \in A_c$, the set of vertices $\{v \in V : c(v) = f\}$ induces a subtree of $T$. An N-tree $T$ is called a perfect phylogeny if every $c \in C$ is convex in $T$. The interpretation of such a tree for $M$ is that it gives an estimate of the evolutionary history of the taxa, based on the following biological assumptions.
(i) The root of the tree represents an ancestral taxon that has none of the present $m$ characters.
(ii) Each of the characters changes from one state to another state exactly once and never changes back to the zero state.³⁰
The Perfect Phylogeny Problem
Given: A set of taxa on a set of characters, represented by a character-state matrix.
Determine: Whether a perfect phylogeny exists.

Later we will discuss the solution of this problem; for now we examine its relationship to Steiner's Problem. Observe that in any N-tree $T$, for every character $c$ the number of edges $vv'$ where $c(v) \neq c(v')$ is at least $r_c - 1$, assuming that each state of $c$ is exhibited by some taxon. Furthermore, if $T$ is perfect, it must have an internal labelling with exactly $r_c - 1$ edges where $c(v) \neq c(v')$. Thus, we have that the perfect phylogeny problem is a specific case of Steiner's Problem:

Theorem 5.2.17 (Fernández-Baca [160]) A set $N$ of taxa on a set of characters has a perfect phylogeny if and only if
$$L_H(\text{SMT for the leaves } N) = \sum_{c \in C} (r_c - 1) \qquad (5.50)$$
holds true, where $L_H$ denotes the length derived from the Hamming distance. Thus, the perfect phylogeny problem can be viewed as the question: Does Steiner's Problem have a solution whose length is exactly equal to the obvious lower bound?

³⁰ Hence any taxon below that edge definitely has that character.
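For a fixed tree the comparison with the lower bound can be carried out directly. The following Python sketch does not solve the perfect phylogeny problem, which asks for the existence of some tree; it only tests one given rooted binary tree. It uses the counting step of Fitch's method to obtain the minimum number of state changes per character and compares the total with $\sum_c (r_c - 1)$, where $r_c$ is taken as the number of states actually observed. The representation of trees as nested pairs and all names are our assumptions.

```python
def fitch_length(tree, states):
    """Minimum number of changes of one character on a rooted binary tree.

    tree is a leaf name (str) or a pair (left, right);
    states maps each leaf name to its state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common
        changes += 1               # the two subtrees force one state change
        return left | right

    visit(tree)
    return changes


def meets_lower_bound(tree, matrix):
    """True if the Hamming length of the tree equals the sum of (r_c - 1)."""
    m = len(next(iter(matrix.values())))
    length = bound = 0
    for c in range(m):
        column = {taxon: row[c] for taxon, row in matrix.items()}
        length += fitch_length(tree, column)
        bound += len(set(column.values())) - 1
    return length == bound


matrix = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 0)}
print(meets_lower_bound((("a", "b"), ("c", "d")), matrix))
print(meets_lower_bound((("a", "c"), ("b", "d")), matrix))
```

The first tree attains the bound and the second does not, so for this small matrix the answer depends on the chosen topology, while the theorem is concerned with the best one.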
5.3 APPLICATIONS AND RELATED QUESTIONS
Trees are widely used to represent evolutionary relationships. We find such relationships in many applications of the study of evolutionary history.
5.3.1 Biology: Taxonomy and Classification
Naming is classifying.
Brian Everitt

One of the most basic abilities of living creatures is the grouping of similar objects to produce a classification. This has been a preoccupation since the very first biological investigations. The theory and practice of classifying organisms is generally known as taxonomy. In 1737 Linnaeus published his work Genera Plantarum; he wrote:
All the real knowledge which we possess, depends on methods by which we distinguish the similar from the dissimilar. The greater number of natural distinctions this method comprehends the clearer becomes our idea of things. The more numerous the objects which employ our attention the more difficult it becomes to form such a method and the more necessary.
... For we must not join in the same genus the horse and the swine, though both species had been one hoof'd nor separate in different genera the goat, the reindeer and the elk, tho' they differ in the form of their horns. We ought therefore by attentive and diligent observation to determine the limits of the genera, since they cannot be determined a priori. This is the great work, the important labour, for should the genera be confused, all would be confusion.

In other words, taxonomy is necessary, but must be done carefully. In the book The System of Nature Linnaeus invented a system still in use today. He gave every species two Latinized names; the first for the group it belongs to, the genus; and the second for the particular organism itself. Today we divide life into
(i) Domain³¹;
(ii) Kingdom;
(iii) Phylum;
(iv) Class;
(v) Order;
(vi) Family;
(vii) Genus;
(viii) Species.

More or less all these groups are artificial, insofar as their members are categorized according to agreed-upon levels of similarity rather than precise definitions. The exception is species, which are defined as a group of individual organisms that are able to interbreed and produce fertile offspring. Linnaeus' purpose was not evolutionary, but rather to provide a set of universal names. However, it turned out that the hierarchical nature of his system has considerable similarity with the modern phylogenetic view.

Classification has played a central role in other fields too. In particular, the classification of the elements in the periodic table, given by Mendeleyev 150 years ago, has had a profound impact on the understanding of the structure of atoms. Another example in astronomy is the classification of stars in the Hertzsprung-Russell plot, which has strongly affected theories of stellar evolution. Or, consider that the two aircraft that are closest at any time instant will have the largest likelihood of collision with each other: we are interested in how far apart, or in other words how dissimilar, these objects are. More examples are given in [156].³² The general question is:

³¹ There are three domains. The first two, Bacteria and Archaea, are made up of many microscopic single-celled organisms. The third domain, Eukarya, is diverse.
The Problem of Classification Given: A collection of objects, each of which is described by a set of characters or variables. Derive: A useful (whatever that means) division into a hierarchy of classes.
5.3.2 Biology: The evolutionary history
Molecular sequences contain a variety of different historical signals. The high level history of life is ideally organized and displayed as a (rooted) tree. The extant species are represented at the leaves of the tree, and each internal vertex represents a point when the history of two sets of species diverged. Each such node also represents a common ancestor of those species contained in the subtree rooted at that node.³³ Here, roughly speaking, we have the following relationship:

³² To radically simplify, in these cases, human beings and behaviour may be classified into classes named low, medium and high.
³³ And may be an extinct species.
Level
In taxonomy           OTU = operational taxonomic unit   HTU = hypothetical taxonomic unit
Species/genes         extant                             extinct
Placement in time     existing units                     ancestors
Classification        single individuals                 class of individuals
Vertex in the tree    leaf                               internal vertex
Vertex in the SMT     given point                        Steiner point
Clearly, this is only a first step in our discussion. This approach must be modified when considering the evolution of viruses or genes, but it remains the dominant point of view.

Our goal is to reconstruct phylogenetic trees from molecular data. An established principle is parsimony. Here, parsimony guides the search for an explanation of the given sequences towards scenarios that require the least number of "evolutionary events", such as mutations, insertions and deletions in the DNA sequence, perhaps weighted by a scoring system. In a graph that has (biological) sequences as vertices, edges represent evolutionary operations that modify sequences. Hendy [217] notes that there are several difficulties to overcome in building such a tree:
(i) The rate of substitutions is too slow to identify short branches in the evolutionary tree;
(ii) The rate of substitutions is too fast if we examine divergences close to the origin of life (perhaps $3 \cdot 10^9$ years ago);³⁴
(iii) Today we have the DNA world, but the RNA world was an essential step in the evolution of the modern world;
(iv) The genetic data examined gives the history of genes, which usually, but not always, reflects the history of the species;
(v) Not all subwords of a sequence are of the same importance, but the length function $\rho$ assigns equal weight to all;
(vi) There is no "continuity" in the sequences. That is, the value of the similarity of two words may be very high, but the "meaning" of the sequences (e.g. the function of the proteins they express) can be very different.

³⁴ For the average rates of substitutions in various organisms and genomes compare [191] and [331].
In other words: What are the expected limits of the process of reconstructing evolutionary trees from sequence data?35 For a treatment of several of the major transitions that occurred during the history of life see Maynard Smith and Szathmáry [299], [300].
5.3.3
The common ancestors and looking for LUCA
The underlying principle of phylogenetics is to try to group living entities according to their level of similarity. In this context we assume that the more similar two entities are, the closer they are to their common ancestor. It is a central tenet of modern evolutionary biology that all "living things" trace back to a single common ancestor. Humans and other mammals are descended from shrew-like creatures that lived more than 150 Mya (million years ago); mammals, birds, reptiles and fish share as ancestors aquatic worms that lived 600 Mya; all plants and animals are derived from bacteria-like organisms that originated more than 3000 Mya. If we go back far enough, humans, frogs, bacteria and slime moulds share a common ancestor. Then in the series of species from the origin of life up till today there must be a last universal common ancestor (LUCA).36 Note that this proposition does not assert that life arose just once, but that all starting points except one went extinct.37
To find the LUCA for a set of species, or a set of populations, or a collection of genes is a very difficult task.38 To find LUCA for species is discussed in [458]. Eigen [152] found the LUCA for genes, which is an RNA molecule.
More generally, let $N$ be the set of extant species (genes) and let $N^+$ be the set of all past and present species (genes). Then we consider the binary operation $* : N^+ \times N^+ \to N^+$ defined by
$v * v' = \text{most recent common ancestor of } v \text{ and } v'$,    (5.51)
35 Later we will see that there are also several mathematical difficulties in finding such trees.
36 Also called the most recent common ancestor.
37 For more facts about early (molecular) evolution see Eigen [153].
38 In particular, find this entity for all humans! A centaur is not a common ancestor of human and horse.
for $v, v' \in N^+$, whereby $v * v = v$. It is not hard to see that this operation is well defined and makes the set $N^+$ into a commutative semigroup. Moreover, define for $v \in N^+$ the set $N(v)$ as the set of extant species (genes) descended from $v$. Then, we have to assume that for two species (genes) $v$ and $v'$, either the two sets $N(v)$ and $N(v')$ are disjoint or one is contained in the other. Note the following:
Observation 5.3.1 The conditions
(i) $N(v) \cap N(v') \neq \emptyset$, and
(ii) $N(v) \subseteq N(v')$ or $N(v') \subseteq N(v)$,
are equivalent.
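The semigroup properties of the operation $*$ and the nesting of the sets $N(v)$ can be checked mechanically on a small example. The following is only a minimal sketch: the genealogy `PARENT` and all function names are hypothetical and chosen for illustration, and the genealogy is assumed to be a rooted tree.

```python
# Hypothetical toy genealogy (an assumption for illustration): child -> parent.
PARENT = {
    "root": None,
    "a": "root", "b": "root",
    "human": "a", "chimp": "a",
    "frog": "b",
}

def ancestors(v):
    """Return the path v, parent(v), ..., root as a list."""
    path = []
    while v is not None:
        path.append(v)
        v = PARENT[v]
    return path

def mrca(v, w):
    """The operation v * w: most recent common ancestor of v and w."""
    seen = set(ancestors(v))
    for u in ancestors(w):          # walk upwards from w until we meet v's lineage
        if u in seen:
            return u
    raise ValueError("no common ancestor")

def extant_descendants(v):
    """N(v): the leaves of the genealogy that descend from v (v itself if it is a leaf)."""
    leaves = {x for x in PARENT if x not in PARENT.values()}
    return {x for x in leaves if v in ancestors(x)}
```

In this toy genealogy `mrca("human", "chimp")` is `"a"` and `mrca("human", "frog")` is `"root"`; the operation is idempotent, commutative and associative, and any two sets $N(v)$, $N(v')$ are either disjoint or nested.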
5.3.4
Biology: Taxonomy and Diversity
The theory of evolution is concerned with the extraordinary diversity of life on Earth. The diversity of the living world is staggering: more than 2 million existing species of plants and animals have been named and described, and many more remain to be discovered, up to 10 times this number according to some estimates. What is impressive is not just the numbers but also the incredible heterogeneity. These virtually infinite variations of life are the fruit of the evolutionary process.
Taxonomy, the classification of organisms, is the first aspect in any view of life. Each phylogenetic tree is also a classification, but not vice versa.39 The classification of animals and plants played an important role as a basis for Darwin's theory of evolution. Moreover, taxonomy is necessary to describe the diversity of living organisms. The diversity of genomes is twofold:
The presence of numerous species on Earth; and
The polymorphism within each species.
39 We will discuss this further in the next chapter.
There are many reasons why knowledge of biodiversity is necessary, compare [187], [287] and [428].40 There are several subquestions:
(i) How many species are there?
(ii) How many go extinct, both in the past and in the present? How many are lost every year?
(iii) How long did species typically survive?
(iv) How much of evolutionary history is knowable?
For the idea of using evolutionary history for describing biodiversity see Schleifer and Horn [383].
5.3.5
The prehistoric past of mankind
So the Lord God caused a deep sleep to fall upon the man, and while he slept took one of his ribs and closed up its place with flesh; and the rib which the Lord God had taken from the man he made into a woman and brought her to the man. ... The man called his wife's name Eve, because she was the mother of all living.
The Holy Bible
Clearly, it is of great interest to understand the evolutionary past of mankind, to specify the location of the human branch of the tree of life. This is one of the biggest questions in evolutionary biology. Darwin in 1871, for instance, claimed that the African apes are man's closest relatives, and suggested that man's evolutionary origins were to be found in Africa. In other words, the commonly held view was that humans were phylogenetically distinct from the great apes (chimpanzees, gorillas and orang-utans), being placed in different taxonomic families; and that this split occurred at least 15 Mya. These conclusions were based on fossils.
Genetic studies of human prehistory started 100 years ago considering blood groups. By 1964, knowing much more about blood groups and their worldwide distributions, Cavalli-Sforza and Edwards constructed the first family tree of human species. In 1967, Sarich and Wilson [380] measured the extent of immunological cross-reaction in the protein serum albumin between various primates. The results were striking: humans, chimpanzees and gorillas were genetically equidistant and clearly distinct from the orang-utan. Furthermore, Sarich and Wilson estimated that human, chimpanzee and gorilla separated only 5 Mya.41 For more facts about this question compare Ayala [25], Bandelt et al. [31], Page and Holmes [331], and Pesole et al. [343].
Today we say that the proper story of human evolution began 4 Mya with a group of apes in the African jungle. Many have attempted to describe this story, but in general have done so with little success.42 There are two different models:
The multiregional model posits the evolution of Homo sapiens from a convergence of various distinct hominid lines in different geographic regions.
The Out of Africa model posits the evolution of a lineage of hominids who left Africa not more than 1 Mya.
The breakthrough came with a publication in Nature in 1987 [64] by the late Wilson (once again), and two of his students, Cann and Stoneking, entitled "Mitochondrial DNA and human evolution". They used mother-only genes, known technically as mitochondrial DNA.43 Wilson and his colleagues examined the mother-only genes in 134 individuals from around the world. They found remarkable similarities as well as differences in all the recipes. The centrepiece of the article was a diagram which bears a superficial resemblance to a tree. It contains a hypothetical female common ancestor of all extant humans, called Eve, or in more scientific terms Mitochondrial Eve (mtEve). The existence of a mtEve is a consequence of our understanding and description of evolution, compare Korner [263], [264], and is directly related to our notions of a common ancestor.
40 In particular, there is no successful vaccine to prevent or halt HIV infection. In part, this is because of the high genetic diversity of HIV. For this specific case see Dress and Wetzel [132]. Here, the main question is the prediction of the winning strain (or strains).
For an estimation of the age of mtEve compare Kimmel and Axelrod [260]. To conclude, it seems most likely that anatomically modern humans evolved in Africa around 200 kya (kilo years ago) and then spread around the world.44 In 1996 Sykes and his team published an analysis of Palaeolithic and Neolithic ancestors in Europe, based on mother-only genes in 821 individuals in Europe and the Middle East. They distinguished seven lineages. Harking back to Wilson's mtEve, Sykes called the last shared grandmothers in each cluster of mother-genes "the seven daughters of Eve"; compare [419]. For a nice discussion of this subject see also [24], [35], [63], [69], [70], [327], [387] and [461].
41 The work of Sarich and Wilson was one of the first examples of molecular systematics, that is the use of gene and protein sequences to reconstruct the evolutionary history. It changed the perspective on human origins and opened the "molecules versus morphology" debate.
42 Sykes [419] gives a history of these projects and shows a way out of this impasse.
43 The egg-shaped powerhouses inside your living cells, the mitochondria, have their own independent genes, and you inherited them exclusively from your mother.
5.3.6
Historical linguistics and stemmatology
If we possessed a perfect pedigree of mankind, a genealogical arrangement of the races of man would afford the best classification of the various languages now spoken throughout the world: and if all extinct languages, and all intermediate and slowly changing dialects, were to be included such an arrangement would be the only possible one.
Charles Darwin
Language is the defining characteristic of humans. Languages, like genes, provide vital clues about human prehistory. Starting in 1957 Chomsky led a revolution in linguistic theory. According to Chomsky, all of the roughly 6000 human languages that exist today have the same underlying universal grammar, which is the product of special circuitry in the brain, a language organ. The evolutionary trajectory of human languages required at least two steps:
1. A small number of phonemes are used to generate a large number of words; and
2. A large number of words are used to produce an "almost unlimited" number of sentences, by logically combining words using a finite number of grammatical rules.45
44 As reported in [244], Templeton later obtained several distinct trees, similar to Wilson's tree, and most of them support a non-African hypothesis. But the "Out of Africa" hypothesis is also supported by several other observations.
45 Note what Searle in [67] remarked: "I say that syntax by itself is not constitutive of nor sufficient for semantics." In the same sense, the genome is much more than the sum of its genes.
The first, and simpler, step is the core of studying the evolution of languages, compare Crystal [118] and Nowak [323]. For a classification of languages see Comrie et al. [112]. Cavalli-Sforza [68], [69], [70] discusses the consequences of such an evolutionary approach to languages.46 For a treatment of several of the major transitions that occurred during the history of languages see Maynard Smith and Szathmáry [299], [300]. A short description of the history of languages is given by Janson [241].
In general, in historical linguistics, the following types of characters occur:
(a) Lexical characters, or word meanings;
(b) Morphological characters, or grammatical features;
(c) Phonological characters, or sound changes.
For these concepts see Warnow, Ringe and Taylor [445]; for its application to studying the evolution of the Indo-European family of languages see Bonet et al. [49].
A simple, but instructive example is to compare the word for ISLAND in different languages, written in a suitably chosen multiple alignment:
[Table: multiple alignment of the word for ISLAND in German, English, French, Latin and Italian, with a consensus row; the aligned letters did not survive in the source.]
"By all means let's agree that the faculty of language evolved in a biological manner", Comrie said. The evolution of languages is in several aspects fundamentally different from the evolution of genes:
(a) The evolution of languages depends on historical and cultural environments.
(b) The evolution of genes is very slow in relation to that of languages.
(c) There are many exchanges between languages.
Consequently, in general, a network of languages will not be tree-like.
46 In particular, there is the question of whether a "proto-language" existed. See Ross [366].
We now consider the written versions of languages. Trees may represent the way in which different versions of a manuscript arose through successive copying; such a tree is called a stemma. Curiously, one of the first mathematical papers about phylogenetic trees, by Buneman [59], dealt not with biology but rather with reconstructing the copying history of manuscripts. Mink [309] wrote:
The same data as used for creating the new printed Editio Critica Maior of the New Testament, commencing with Catholic Letters, allows a genealogical analysis of the witnesses. The objective is to establish a comprehensive theory of the structure of the tradition. Because the tradition of the New Testament is highly contaminated this theory has to handle the problem of contamination, and also the problem of accidental rise of variants, and must be able to be verified at any passage of the text. Where there are variants, the witnesses have a relation that can be described by a local stemma of the different readings. These local stemmata allow or restrict relations among witnesses in a global stemma, which must be in harmony with the total of the local stemmata. In the first phase, local stemmata were established only at places where the development of the variants is very clear. The coherencies within each attestation were analysed. ... Then the local stemmata must be revised in the light of the total of the genealogical data included in them. Now an analysis of genealogical coherence is possible and may help to find local stemmata for passages unsolved so far. Finally, the global stemma (or stemmata) mirroring all the relations of the local stemmata will be established by combining optimal substemmata, each containing a witness and its immediate ancestor, to produce the simplest possible tree.
AN ANALYSIS OF STEINER'S PROBLEM IN PHYLOGENETIC SPACES
In chapter 2 we gave an analysis of Steiner's Problem in metric spaces. Later we discussed the specific case of the Euclidean plane. Now, we will apply our knowledge to analyse Steiner's Problem in phylogenetic spaces. In the present chapter we will introduce and describe several concepts which
1. Require our full attention, since they show us difficulties which need to be managed; and
2. Help to create methods for handling shortest trees.
6.1
DIFFICULTIES
6.1.1
The complexity of tree building algorithms
Recall that a phylogenetic space is a discrete metric space. Consequently, Steiner's Problem in phylogenetic spaces deals with finite sets only. So, in principle, the problem can be solved by the following algorithm:
Algorithm 6.1.1 Let $N$ be a finite set of points in a (generalized) phylogenetic space $\mathcal{A} = (A^*, \rho)$. Then perform the following steps to find an SMT:
1. Compute an MST $T_0$ for $N$;
2. Describe a Steiner hull $Q$ for $N$;1
3. Enumerate all subsets $N'$ of $Q \setminus N$ with $|N'| \le |N| - 2$; for each $N'$ compute an MST for $N \cup N'$; keep the smallest one.
This algorithm is certainly not very efficient. Its complexity depends exponentially on the size of the Steiner hull, and we will see that this number is not small. In the chapter before we gave another algorithm for constructing an SMT for $N = \{w_1, \ldots, w_n\}$ in a phylogenetic space. The algorithm is a generalization of a dynamic programming approach and has the order
which means it can only be applied if the number and the lengths of the sequences are very small. In the next chapter we will describe a third exact method to find an SMT in a phylogenetic space. We will see that its complexity is also exponential in the number of given points.
In phylogenetic spaces all known exact algorithms need exponential time, frequently more time than that taken by the evolutionary processes themselves. Penny [338]: The real evolution runs faster than the calculation can follow it. But nature
Performs many computations in parallel; and
Does not check all possibilities.
1 This is a subset of $\{w \in A^* : \rho(v, w) \le L(\mathcal{A})(T_0)\}$, $v \in N$. Since $\mathcal{A}$ is a discrete metric space, $Q$ is a finite set.
6.1.2
Steiner's Problem in sequence spaces is NP-complete
Consider a specific space, the $d$-dimensional hypercube $Q^d$. It can also be viewed as the graph whose set of vertices consists of all binary vectors of size $d$, with an edge joining two vectors if and only if they differ in exactly one coordinate. In this case, Steiner's Problem is to find the smallest possible number of edges in any subtree of the hypercube that spans the set of given points.
The relationship between Steiner's Problem and phylogenetic trees is given by the following consideration: We consider a word of 0s and 1s as a description of some individual, perhaps a genetic sequence in which each entry may take on one of two possible values. Then a set of taxa may be viewed as a set of points in $Q^d$. A (rooted) SMT for such a set $N$ is then a possible explanation of how these taxa are related and how they evolved from a common ancestor (namely, the root). Here, each edge of the SMT represents an evolutionary change in exactly one of the $d$ entries.
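For very small instances the enumeration idea of Algorithm 6.1.1 can be tried out directly in the hypercube. The following is a minimal sketch under stated assumptions: binary words are tuples of 0s and 1s with the Hamming distance, the whole hypercube is used in place of a Steiner hull (a crude but valid choice only for tiny $d$), and the function names are hypothetical.

```python
from itertools import combinations, product

def hamming(u, v):
    """Number of coordinates in which the binary words u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def mst_length(points):
    """Length of a minimum spanning tree under the Hamming distance (Prim's algorithm)."""
    points = list(points)
    if len(points) < 2:
        return 0
    # best[p] = distance from p to the part of the tree built so far
    best = {p: hamming(points[0], p) for p in points[1:]}
    total = 0
    while best:
        p = min(best, key=best.get)
        total += best.pop(p)
        for q in best:
            best[q] = min(best[q], hamming(p, q))
    return total

def smt_length_bruteforce(given, d):
    """Length of an SMT for `given` in the d-cube, by trying all sets of extra points."""
    given = set(given)
    # Assumption: the full cube minus the given points serves as the candidate set.
    candidates = [w for w in product((0, 1), repeat=d) if w not in given]
    best = mst_length(given)                      # step 1: MST without extra points
    for k in range(1, max(len(given) - 2, 0) + 1):
        for extra in combinations(candidates, k): # step 3: at most |N| - 2 extra points
            best = min(best, mst_length(given | set(extra)))
    return best
```

For the three words 100, 010 and 001 an MST has length 4, while adding the word 000 as a Steiner point gives a tree of length 3. The running time grows exponentially with the number of candidate points, which is the difficulty this section describes.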
Theorem 6.1.2 (Foulds, Graham [169]) Steiner's Problem in sequence spaces is $\mathcal{NP}$-complete.
On the other hand, there are several facts about the length of SMTs in hypercubes. Let $L(d, k) = \max\{L(\text{SMT for } N) : N \subseteq \{0,1\}^d, |N| = k\}$. Then, of course, $L(d, k) \le 2^d - 1$ for any positive integer $k$, and $L(d, 2) = d$.
Theorem 6.1.3 (Miller, Perkel [308]) Consider SMTs in hypercubes.
[Parts (a) to (c), which give $L(d, k)$ for small $k$ (among them $L(d, 4)$ and $L(d, 5)$, equation (6.3)), and the asymptotic estimate did not survive in the source.]
Moreover, the problem to decide whether $L(\text{SMT for } N) \le b$ is $\mathcal{NP}$-complete.
6.2
MORE ABOUT TREES
Phylogenetic trees provide a standard representation of evolutionary relationships in biology and related sciences. However, from a mathematical perspective, it is natural to consider these relationships in a more general setting.
6.2.1
The number of graphs and networks
Often we are interested to know how many instances of a certain mathematical object may exist. This type of problem is called an enumeration or counting problem. For instance:
How many non-isomorphic graphs are there with $n$ vertices?
How many non-isomorphic connected graphs are there with $n$ vertices?
How many non-isomorphic trees are there with $n$ vertices?
Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are said to be isomorphic if there exists a one-to-one, onto mapping $f : V_1 \to V_2$ such that $vv' \in E_1$ if and only if $f(v)f(v') \in E_2$.2
A graph $G = (V, E)$ with $n$ vertices is called a labelled graph if a bijective mapping from $V$ onto the set $\{1, \ldots, n\}$ of integers is given.3 We will also consider partially labelled graphs, where an injective mapping from a subset $V'$ of the vertices into the set $\{1, \ldots, n'\}$, $n' = |V'|$, is given.
In counting problems on graphs the word "different" is of utmost importance and must be clearly understood. If the graphs are labelled, all graphs are counted. On the other hand, in the case of unlabelled graphs the word "different" means non-isomorphic, and each set of isomorphic graphs is counted as one. As an example let us consider the problem of counting all graphs with $n$ vertices. Such a graph has at most $\binom{n}{2} = n(n-1)/2$ edges. Hence,
Theorem 6.2.1 The number graph(n) of labelled graphs with $n$ vertices equals
$\mathrm{graph}(n) = 2^{n(n-1)/2}$.
Many of these graphs, however, are isomorphic. Consequently, the number of unlabelled graphs with $n$ vertices is much smaller than that given by the theorem. On the other hand, by considering all $n!$ labellings, we find that the number $g(n)$ of all non-isomorphic graphs obeys $g(n) \cdot n! \ge \mathrm{graph}(n)$. Hence,
Corollary 6.2.2 The number of non-isomorphic graphs with $n$ vertices is at least $2^{n(n-1)/2}/n!$.
Often we have no exact formula for counting the number of combinatorial objects of some kind, but we can describe its asymptotic behavior. Then we use the following notation: Let $f$ and $g$ be functions from the positive integers to the real numbers, then
(i) The function $g(n)$ is said to be growing faster than $f(n)$, denoted $f(n) = o(g(n))$, if $\lim_{n \to \infty} f(n)/g(n) = 0$.
(ii) The function $g(n)$ is said to approximate $f(n)$, denoted $f(n) \approx g(n)$, if $\lim_{n \to \infty} f(n)/g(n) = 1$.
2 It is interesting to attempt to characterize the concept of isomorphism. Isomorphic graphs have (i) the same number of vertices, of edges, of components; (ii) an equal number of vertices of any given degree; (iii) for each integer $k$, the same number of paths of length $k$; and (iv) for each integer $k$, the same number of cycles of length $k$. However, these properties are necessary but not sufficient criteria for isomorphism. It is strange, but the computational complexity to verify whether two graphs are isomorphic is still unknown: No polynomially bounded algorithm is known; on the other hand it has not been proved that this problem is in $\mathcal{NPC}$. Maybe this problem is a member of $\mathcal{NPI}$. A monograph on isomorphism detection is given in [223]. There is a quadratic time algorithm which decides whether two trees are isomorphic; see [433].
3 Or onto another set of $n$ distinguished names.
This notation allows us to concentrate on the dominating term in an expression describing a lower or upper bound and to ignore any multiplicative constants.4,5
Theorem 6.2.3 Denote by conn(n) the number of connected graphs with $n$ labelled vertices. It holds that
Proof. We show that the sum
is the number of disconnected graphs: A proper component of a graph has at least one and at most $n - 1$ vertices. Let $i$ be the number of vertices outside of such a component and let $n - i$ be the number of vertices inside, $1 \le i \le n - 1$. For a fixed number $i$ there are
$\mathrm{graph}(i) = 2^{i(i-1)/2}$
different graphs with $i$ vertices. We can choose
vertices. (6.11) and (6.12) together imply the assertion. □
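The counting argument of the proof can be turned into a short computation. The sketch below uses one standard recurrence, in which the component containing a fixed vertex has $n - i$ vertices and the remaining $i$ vertices carry an arbitrary graph; that this is the form displayed in Theorem 6.2.3 is an assumption, and the function names are hypothetical. A brute-force count over all labelled graphs is included as a cross-check for small $n$.

```python
from functools import lru_cache
from itertools import combinations, product
from math import comb

def graph_count(n):
    """Number of labelled graphs on n vertices: every one of the n(n-1)/2 pairs is an edge or not."""
    return 2 ** (n * (n - 1) // 2)

@lru_cache(maxsize=None)
def conn_count(n):
    """Number of connected labelled graphs on n vertices (assumed standard recurrence)."""
    disconnected = sum(
        comb(n - 1, i)        # choose the i vertices outside the component of a fixed vertex
        * graph_count(i)      # any graph on those i vertices
        * conn_count(n - i)   # a connected graph on the remaining n - i vertices
        for i in range(1, n)
    )
    return graph_count(n) - disconnected

def is_connected(n, edges):
    """Depth-first search from vertex 0 over the vertex set {0, ..., n-1}."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def conn_bruteforce(n):
    """Count connected labelled graphs by enumerating all edge subsets (small n only)."""
    pairs = list(combinations(range(n), 2))
    return sum(
        is_connected(n, [e for e, keep in zip(pairs, mask) if keep])
        for mask in product((0, 1), repeat=len(pairs))
    )
```

Already for moderate $n$ the ratio `conn_count(n) / graph_count(n)` is close to 1, in line with the statement that almost all graphs are connected.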
Unfortunately, an explicit formula for the function conn is unknown, but its values can be tabulated:
[Table: number $n$ of vertices against the number conn(n) of connected graphs; the entries did not survive in the source.]
4 A broader discussion of the growth of functions can be found in Aigner [3].
5 For instance, we have Stirling's approximation to find that the number of non-isomorphic graphs is at least exponential in the number $n$ of vertices.
It seems that the function conn increases exponentially, and indeed, asymptotically, the number of connected graphs is the same as the number of all graphs:
Remark 6.2.4 Almost all graphs are connected; that means $\lim_{n \to \infty} \mathrm{conn}(n)/\mathrm{graph}(n) = 1$.
For these numbers and other facts about counting on graphs see Harary and Palmer [209]. For an introduction to the "random theory of graphs" see Bollobás [47].
6.2.2
The number of trees
Steiner's Problem is that of "Shortest Connectivity". We saw that the length minimality criterion forces the network to be without cycles. Thus we are interested in trees. Clearly, we have to distinguish between labelled and unlabelled trees.
In phylogenetics we search for a tree interconnecting a set $N$ of "living entities" (species, genes, sequences, words; roughly speaking: names). Such a partially labelled tree is usually called an $N$-tree, which means:
The tree has exactly $|N|$ leaves, each labelled by a different element of $N$;
All internal vertices are unlabelled;
The degree of each internal vertex is at least 3.
Sometimes we accept an exception, namely that exactly one internal vertex is marked, and is permitted to have degree 2. Then this vertex is called the root of the tree, and such a tree is called a rooted $N$-tree. Similarly, and more generally, we define a tree for $N$ as a tree $T = (V, E)$ with the property that each vertex of degree at most two is labelled by one element of $N$.6 We will write this as $N \subseteq V$. Each $N$-tree is also a tree for $N$.
As noted above Kirchhoff and Cayley introduced the concept of a tree, and Cayley enumerated all labelled trees in 1889 [73]. Since that time, enumerative methods for counting various classes of graphs, including trees, have been developed, but are still far from completely scientific. The main originator of this field was Pólya [349].
It is not the purpose of this chapter to provide a complete survey of counting methods for trees. We will focus on the counting of specific classes of trees, which are important in our investigations about phylogeny. We start counting with the number of different labelled trees7 and we will describe this number in terms of the vertex degrees. Let $T = (V, E)$ be a tree with $n$ vertices $v_1, \ldots, v_n$, and let $g_i = g(v_i)$ be the degree of each vertex $v_i$. Then, obviously, each of the numbers $g_i$ is a positive integer, and, in view of 1.2.1 and 1.2.4,
$\sum_{i=1}^{n} g_i = 2n - 2$.    (6.14)
Conversely, by an induction argument, we find that this equality is also sufficient for the existence of a tree on $n$ vertices with the predetermined degrees $g_1, \ldots, g_n$. Moreover, the number of different trees increases exponentially, but not faster:
Theorem 6.2.5 Let $g_1, \ldots, g_n$ be a sequence of positive integers and denote by $t(n, g_1, \ldots, g_n)$ the number of different labelled trees $T = (\{v_1, \ldots, v_n\}, E)$ of $n$ vertices with the degree sequence
$g_T(v_i) = g_i$    (6.15)
for $i = 1, \ldots, n$. Then,
$t(n, g_1, \ldots, g_n) = \frac{(n-2)!}{(g_1 - 1)! \cdots (g_n - 1)!}$
if (6.14) holds, and $t(n, g_1, \ldots, g_n) = 0$ otherwise.
For a proof compare [38] or [296]. We are interested in several consequences of this observation. At first, summing up over all degree sequences satisfying (6.14), we have one of the most beautiful formulas in enumerative combinatorics:
6 In the literature these trees are often called S-trees, but for us the letter S is occupied for metric spaces. In the present book the set of given points is usually denoted by $N$, where $N$ means a set of "names". In other words, an $N$-tree is a named tree.
7 With most enumeration problems, counting the number of unlabelled things is harder than counting the number of labelled things. So it is with trees.
Theorem 6.2.6 (Cayley's tree formula, [73]) The number of different labelled trees with $n$ vertices equals $n^{n-2}$. More generally, the number of different labelled forests with $n$ vertices and $c$ components, in which $c$ specified vertices lie in different components, is $c \cdot n^{n-c-1}$.
Prüfer [357] established a bijection between trees and sequences of $n - 2$ integers between 1 and $n$, providing a constructive proof of Cayley's result. This bijection can then be exploited to give algorithms for systematically generating labelled trees.8
Now we may estimate the number of non-isomorphic trees. Let $t(n)$ be the number of non-isomorphic trees with $n$ vertices. By considering all $n!$ labellings, we have $n! \cdot t(n) \ge n^{n-2}$. Hence,
$t(n) \ge n^{n-2}/n!$.
On the other hand, we will see later that $t(n) \le C_{n-1}$, where $C_n$ denotes the $n$th Catalan number. Consequently,
All together we expect that
with $e < a < 4$ and the function $f(n)$ bounded by a low degree polynomial. And indeed, according to a difficult result of Pólya, compare [209] or [456], the number of unlabelled trees is asymptotically completely determined:
8 We will discuss this later more exactly.
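Theorem 6.2.7 Let $t(n)$ be the number of non-isomorphic trees with $n$ vertices. Then
$t(n) \approx c \cdot \frac{a^n}{n^{5/2}}$,
where $a = 2.9557\ldots$ and $c = 0.5349\ldots$.
Otter [329] found a generating formula for $t(\cdot)$ which implies:
[Table: number $n$ of vertices against the number $t(n)$ of trees; the entries did not survive in the source.]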
For these numbers and other facts about counting of trees see Carter et al. [65] and Hendy et al. [215].
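The bijection of Prüfer mentioned after Theorem 6.2.6 can be made concrete. The following is a minimal sketch of the usual decoding direction (sequence to tree), assuming the vertices are labelled $1, \ldots, n$; the function name is hypothetical. Enumerating all sequences of length $n - 2$ then generates every labelled tree exactly once.

```python
import heapq
from itertools import product

def prufer_to_tree(seq, n):
    """Decode a sequence of n-2 integers in 1..n into the edge set of a labelled tree."""
    degree = [1] * (n + 1)              # index 0 is unused
    for x in seq:
        degree[x] += 1                  # a vertex occurring k times has degree k + 1
    leaves = [v for v in range(1, n + 1) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for x in seq:
        leaf = heapq.heappop(leaves)    # smallest current leaf is joined to x
        edges.append((min(leaf, x), max(leaf, x)))
        degree[x] -= 1
        if degree[x] == 1:              # x has become a leaf of the remaining tree
            heapq.heappush(leaves, x)
    u, v = heapq.heappop(leaves), heapq.heappop(leaves)
    edges.append((u, v))                # the last two remaining vertices form the final edge
    return frozenset(edges)

def all_labelled_trees(n):
    """All labelled trees on the vertex set 1..n, one per sequence."""
    return {prufer_to_tree(seq, n) for seq in product(range(1, n + 1), repeat=n - 2)}
```

For example, the sequence (4, 4) with $n = 4$ decodes to the star with centre 4, and `len(all_labelled_trees(n))` agrees with Cayley's count $n^{n-2}$ for small $n$. The degree bookkeeping also reflects Theorem 6.2.5: the degree of a vertex is one more than its number of occurrences in the sequence.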
6.2.3
Binary trees
A tree in which each vertex has degree one or three is called a binary tree. Binary trees play an important role in the theory of evolution, since it is assumed that a phylogenetic tree is a "bifurcation" tree. This follows from the assumption that evolution is driven by bifurcation events.9
9 In practice phylogenetic trees are allowed to be multifurcating when the bifurcations are sufficiently close together.
Moreover, in the discussion of Steiner's Problem in Euclidean spaces we saw that full trees are essential to compute SMTs. But these are binary trees. Perhaps the most fundamental and difficult problem with inferring such trees is the very large number of possible trees. More precisely: a binary tree with $n$ leaves has exactly $n - 2$ internal vertices. In particular, a binary tree has an even number of vertices, namely $2n - 2$. With this in mind we have the following consequences of 6.2.5:
Theorem 6.2.8 The following holds true for the number of binary trees.
(a) The number of binary trees with $n$ labelled leaves and $n - 2$ labelled internal vertices is
$\frac{(2n-4)!}{2^{n-2}}$.
(b) (Cavalli-Sforza, Edwards [71]) The number of binary trees with $n$ labelled leaves and $n - 2$ unlabelled internal vertices (i.e. binary $N$-trees having $|N| = n$) is
$\frac{(2n-4)!}{2^{n-2}\,(n-2)!} = 1 \cdot 3 \cdot 5 \cdots (2n-5)$.
In particular, the number of binary $N$-trees with $n$ leaves grows rapidly with the number $n$.
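To get a feeling for this growth, the two counts can be evaluated directly. The sketch below assumes the closed forms obtained from Theorem 6.2.5 for the degree sequence with $n$ ones and $n - 2$ threes, namely $(2n-4)!/2^{n-2}$ with labelled internal vertices and $(2n-4)!/(2^{n-2}(n-2)!)$ with unlabelled ones; the function names are hypothetical.

```python
from math import factorial

def labelled_binary_trees(n):
    """Binary trees with n labelled leaves and n-2 labelled internal vertices."""
    # (2n-2) vertices in total; leaves contribute 0! and internal vertices 2! to the denominator.
    return factorial(2 * n - 4) // 2 ** (n - 2)

def binary_n_trees(n):
    """Binary N-trees with |N| = n: forget the (n-2)! labellings of the internal vertices."""
    return labelled_binary_trees(n) // factorial(n - 2)

def odd_product(n):
    """The product 1 * 3 * 5 * ... * (2n-5), for comparison."""
    result = 1
    for k in range(1, 2 * n - 4, 2):
        result *= k
    return result
```

For $n = 4$ this gives 3 binary $N$-trees, for $n = 5$ it gives 15, and for $n = 10$ already 2027025, which is why exhaustive search over tree shapes is feasible only for very few leaves.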
A helpful description of binary trees with labelled leaves is given by the following procedure: Let $T = (V, E)$ be a binary $N$-tree for $N = \{v_1, \ldots, v_n\}$.
1. If $n = 2$, then write $T$ as $(v_1 v_2)$; otherwise,
2. Let $v_i$ and $v_j$ be two leaves of $T$ which are adjacent to the same vertex $v$. Then
(i) Delete the leaves $v_i$ and $v_j$, and their incident edges;
(ii) Replace the vertex $v$ by $(v_i v_j)$, which is now a leaf;
(iii) Consider the new tree with $n - 1$ leaves and repeat the procedure.
Clearly, this procedure gives a simple written description of the tree, called the "bracket" or Newick format. But note that it is not unique; for example, for the one $N$-tree for $n = 3$ we have the descriptions $((v_1 v_2) v_3)$ and $(v_1 (v_2 v_3))$ and $((v_1 v_3) v_2)$.
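The procedure above can be sketched in a few lines. This is an illustration only: the tree is assumed to be given as a dictionary mapping each vertex name (a string) to the set of its neighbours, commas are inserted between the two entries of a bracket for readability, and the function name is hypothetical. Ties are broken by sorting, so one of the several possible descriptions is produced.

```python
def bracket_format(adj):
    """Bracket description of a binary tree given as {vertex: set of neighbours}."""
    adj = {v: set(nb) for v, nb in adj.items()}    # work on a copy
    label = {v: v for v in adj}                    # current written description of each vertex
    while len(adj) > 2:
        leaves = sorted(v for v in adj if len(adj[v]) == 1)
        first_leaf_at = {}
        for leaf in leaves:
            (v,) = adj[leaf]                       # the unique neighbour of a leaf
            if v in first_leaf_at:                 # two leaves adjacent to the same vertex v
                other = first_leaf_at[v]
                label[v] = "(" + label[other] + "," + label[leaf] + ")"
                for x in (other, leaf):            # delete both leaves; v is now a leaf
                    adj[v].discard(x)
                    del adj[x]
                break
            first_leaf_at[v] = leaf
        else:
            raise ValueError("not a binary tree: no two leaves share a neighbour")
    a, b = sorted(adj, key=lambda v: label[v])     # the case of two remaining leaves
    return "(" + label[a] + "," + label[b] + ")"
```

For the tree with three leaves `v1`, `v2`, `v3` joined to a centre, the sketch returns `((v1,v2),v3)`, one of the three equivalent descriptions mentioned above.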
6.2.4
Trees and splits
Let $N$ be a finite set. A split for $N$ is a bipartition, that is, a partition of $N$ into two non-empty sets. Clearly, a split is completely described by one of the subsets, since the other is its complement. Let $|N| = n$; then the number of splits for $N$ with $k$ (and $n - k$) elements equals
$\binom{n}{k}$,
but we count each split twice. Hence, the total number of splits for $N$ is
$\frac{1}{2} \sum_{k=1}^{n-1} \binom{n}{k} = 2^{n-1} - 1$.
Observation 6.2.9 The number of splits for a finite set of $n$ elements equals $2^{n-1} - 1$.
Let $T = (V, E)$ be an $N$-tree, $|N| \ge 3$. For an edge $e$ of $T$ the graph $G = (V, E \setminus \{e\})$ has exactly two components $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, and creates a split $S(e) = \{N_1, N_2\}$ of the set $N$ of leaves by setting
$N_i = N \cap V_i$, $i = 1, 2$.
This means in particular that for a split $S(e) = \{N_1, N_2\}$, each path from a vertex in $N_1$ to a vertex in $N_2$ contains the edge $e$. The collection
$S(T) = \{S(e) : e \in E\}$    (6.28)
denotes the set of all splits of $N$ induced by the tree $T$. The following result provides a fundamental equivalence between $N$-trees and a certain type of collection of splits of $N$. A pair $\{N_1, N_2\}$ and $\{M_1, M_2\}$ of splits for $N$ is called compatible if at least one of the sets $N_1 \cap M_1$, $N_1 \cap M_2$, $N_2 \cap M_1$ and $N_2 \cap M_2$ is the empty set. Then we have the following central theorem:
Theorem 6.2.10 (Buneman [59], compare [28]) Let $S$ be a collection of splits for the set $N$. Then there is an $N$-tree $T$ such that $S = S(T)$ if and only if the splits in $S$ are pairwise compatible. Moreover, if such a tree exists, then, up to isomorphism, $T$ is unique.
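The compatibility condition is easy to test by machine. The following minimal sketch represents a split as a `frozenset` of its two parts; this representation and all function names are assumptions made for illustration. It also enumerates all splits of a small set, which gives a check of Observation 6.2.9.

```python
from itertools import combinations

def split(part, ground):
    """The split {part, ground - part} of the set `ground`."""
    part = frozenset(part)
    other = frozenset(ground) - part
    if not part or not other:
        raise ValueError("both parts of a split must be non-empty")
    return frozenset((part, other))

def compatible(s, t):
    """True if at least one of the four pairwise intersections of the parts is empty."""
    (n1, n2), (m1, m2) = tuple(s), tuple(t)
    return any(not (a & b) for a in (n1, n2) for b in (m1, m2))

def pairwise_compatible(splits):
    """The condition of Theorem 6.2.10 on a collection of splits."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))

def all_splits(ground):
    """All splits of `ground`; the part containing the smallest element is enumerated."""
    ground = sorted(ground)
    first, rest = ground[0], ground[1:]
    result = set()
    for k in range(len(rest)):                 # that part must omit at least one element
        for extra in combinations(rest, k):
            result.add(split((first, *extra), ground))
    return result
```

On the set {a, b, c, d} the splits ab|cd and ac|bd are not compatible, so by the theorem no $N$-tree induces both, while ab|cd and a|bcd are compatible.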
6.2.5
Metric spaces of all trees
It is often helpful to have a measure of distance between two phylogenetic trees. To begin with, let $\mathcal{T}_n$ denote the set of all $N$-trees with $|N| = n$.
$\mathcal{T}_1$ and $\mathcal{T}_2$ each contain exactly one tree. $\mathcal{T}_3$ contains one, and $\mathcal{T}_4$ four trees. A tree in $\mathcal{T}_n$ has at least one and at most $n - 2$ internal vertices, and consequently, at least $n$ and at most $2n - 3$ edges. Using 6.2.6 we find that
On the other hand,
In other words, $\mathcal{T}_n$ will be a finite metric space with a number of elements which is exponential in $n$, but not more or less. We are interested in creating a metric between the trees in $\mathcal{T}_n$, which reflects the "difference" between the trees in the sense of different phylogeny. A commonly used measure of dissimilarity between two $N$-trees is Penny and Hendy's [340] method based on tree partitioning. It uses the binary operation $\Delta$ which is the symmetric difference between sets, defined as
$S_1 \Delta S_2 = (S_1 \setminus S_2) \cup (S_2 \setminus S_1)$
for sets $S_1, S_2$.10
10 Let $S_1, \ldots, S_k$ be a family of subsets of $U$. An element of $U$ is a member of $S_1 \Delta S_2 \Delta \cdots \Delta S_k$ if and only if it is contained in an odd number of the $S_i$'s. In particular, the symmetric difference of a set with itself is empty.
Lemma 6.2.11 $|S_1 \Delta S_2|$ is a metric.
It is sufficient to show the triangle inequality.
$S_1 \Delta S_2 \subseteq S_1 \Delta S_3 \cup S_3 \Delta S_2$.    (6.31)
Moreover, it holds that
$S_1 \Delta S_3 \cap S_3 \Delta S_2 = ((S_1 \cap S_2) \setminus S_3) \cup (S_3 \setminus (S_1 \cup S_2))$.    (6.32)
That means, if an element is in $S_1 \Delta S_3 \cap S_3 \Delta S_2$, then it cannot be in $S_1 \Delta S_2$.
We define the metric $\rho_S$ as follows: Let $T_1, T_2$ be two trees in $\mathcal{T}_n$, $n \geq 3$, with the induced split collections $S(T_1)$, $S(T_2)$, respectively. Then
$\rho_S(T_1, T_2) = |S(T_1) \Delta S(T_2)|$    (6.33)
is, in view of 6.2.11, a distance between $T_1$ and $T_2$, which is called the split metric.
Observation 6.2.12 $(\mathcal{T}_n, \rho_S)$ is a metric space.
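The split metric can be computed by collecting the splits of both trees and counting the symmetric difference. The following is a minimal sketch, assuming a tree is given as a list of edges together with its leaf set; the function names and the example trees are illustrative only.

```python
def splits_of_tree(edges, leaves):
    """Return the set S(T) of splits of the leaf set induced by the edges of a tree."""
    leaves = frozenset(leaves)
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    result = set()
    for u, v in edges:
        # Collect all vertices reachable from u without crossing the edge {u, v}.
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adjacency[x]:
                if y not in seen and {x, y} != {u, v}:
                    seen.add(y)
                    stack.append(y)
        side = leaves & seen
        result.add(frozenset({side, leaves - side}))
    return result

def split_metric(edges1, edges2, leaves):
    """Size of the symmetric difference of the two split collections."""
    return len(splits_of_tree(edges1, leaves) ^ splits_of_tree(edges2, leaves))

leaves = "abcd"
tree_ab_cd = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
tree_ac_bd = [("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"), ("d", "y")]
print(split_metric(tree_ab_cd, tree_ac_bd, leaves))
```

Each edge costs one traversal of the tree, so this sketch runs in time quadratic in the number of vertices, which is polynomial but not tuned for speed.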
Note that it is algorithmically easy, i.e. achievable in polynomial time, to compute the distance between two trees in $(\mathcal{T}_n, \rho_S)$. We call an edge of T an internal edge if it connects two internal vertices. However, all splits comprising a leaf on one hand and the rest of the tree on the other are not "phylogenetically informative" in the sense that all possible N-trees will contain those splits. Using an internal edge implies for a split that $|N_1|, |N_2| \geq 2$. Since each tree in $\mathcal{T}_n$, $n \geq 3$, contains at most n - 3 internal edges, we observe the following:
Observation 6.2.13 It holds for any two trees $T_1$ and $T_2$ in $(\mathcal{T}_n, \rho_S)$:
$\rho_S(T_1, T_2) \leq$ #internal edges in $T_1$ + #internal edges in $T_2$    (6.34)
$\leq 2n - 6$.    (6.35)
In particular, the diameter of the metric space $(\mathcal{T}_n, \rho_S)$ equals 2n - 6. Again we find a "strange" metric space: many elements and small diameter.^{11}
The metric space $(\mathcal{T}_n, \rho_S)$ is a natural object of study. Below, we will pose several optimization problems for certain real-valued functions, related to network design problems of "Shortest Connectivity".
6.2.6 Digraphs
For further discussions we introduce the concept of a digraph. A digraph or directed graph is a pair G = (V, E) consisting of a finite set V of vertices and a set $E \subseteq V \times V$ of (ordered) pairs of vertices, which we call arcs. Hence, a digraph G = (V, E) is essentially a relation over V. The terminology used in discussing digraphs is quite similar to that used for graphs. Moreover, we will understand each digraph also as a graph and we will use the graph-theoretical methods for digraphs too.
Let G = (V, E) be a digraph. For two vertices v and v' with $e = (v, v') \in E$ we say that v is the immediate ancestor of v', v' is the immediate successor of v and the arc e is directed from v to v'. The indegree $g^{in}(v)$ of v is the number of immediate ancestors of v and the outdegree $g^{out}(v)$ of v is the number of immediate successors of v. Obviously,
$g(v) = g^{in}(v) + g^{out}(v)$,    (6.36)
for each vertex v in a digraph. It is easy to see that
In general, we say that the vertex v is an ancestor for v', and the vertex v' is a successor for v if there are arcs $(v_i, v_{i+1})$, $i = 1, \ldots, k$, with $v = v_1$ and $v_{k+1} = v'$; roughly speaking, if there is a directed path from v to v'. A vertex of a digraph is both a successor and an ancestor of itself. A digraph is called strongly connected if for any pair v and v' of distinct vertices v is a successor and ancestor of v'.
The concept of digraphs is more complicated than the concept of graphs, since there are several new questions. Also there are many more digraphs than graphs: Let G = (V, E) be a graph, then there are $2^{|E|}$ ways to orientate G to a digraph. For a complete survey about digraphs see the monograph by Bang-Jensen and Gutin [32].
11 More facts about the geometry of the space of phylogenetic trees can be found in Billera, Holmes and Vogtmann [46].
6.2.7 Rooted trees
The most important point in a phylogenetic tree is its root. In a rooted tree exactly one distinguished vertex is marked as the root. For each of the labelled trees we have n rooted trees, because any of the n vertices can be made a root. Hence, as a consequence of Cayley's tree formula we find:
Corollary 6.2.14 The number of different rooted labelled trees with n vertices equals $n^{n-1}$.
A unique path leads from the root to any other vertex of the tree. Let w be the root and v be an arbitrary vertex in a rooted tree T = (V, E). The length of the path^{12} from w to v is called the level of v:
The depth of the tree itself is defined by
We may consider a rooted tree T = (V, E) as a digraph if we direct the edges $vv' \in E$ from v to v' if and only if level(v') = level(v) + 1, where w is the root of T. Then $g^{in}(w) = 0$ characterizes the root, and $g^{out}(v) = 0$ characterizes the leaves of T. In this sense we have an ancestor/successor-relation for the vertices of a rooted tree. In particular, the root is the common ancestor of all vertices of the tree. In other words, a rooted tree has a vertex identified as the root from which ultimately all other vertices descend.
For a rooted tree T = (V, E) a natural partial order $\leq_T$ on the set V of vertices is obtained by setting $v \leq_T v'$ if
- the path from the root of T to v' includes v, or, equivalently,
- v' is the successor of v, and v is the ancestor of v'.
Obviously, $v \leq_T v'$ implies $level(v) \leq level(v')$ (but not vice versa).
12 Remember that in this case the length is the number of edges in the path.
Let T = (V, E) be a rooted N-tree and let N' be a subset of N. We will refer to the unique vertex v of T that is the greatest lower bound of N' under the order $\leq_T$ as the last universal common ancestor of N' in T. That means
- v is an ancestor for each vertex in N', and
- level(v) = max{level(v') : v' ancestor for each member in N'}.
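The two conditions above translate into a short computation. The following is a minimal sketch, assuming the rooted tree is stored as an adjacency dictionary; the helper names and the example tree are illustrative assumptions, not notation from the text.

```python
from collections import deque

def levels_and_parents(adjacency, root):
    """Breadth-first search from the root: level = number of edges on the path from the root."""
    level, parent = {root: 0}, {root: None}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adjacency[v]:
            if u not in level:
                level[u] = level[v] + 1
                parent[u] = v
                queue.append(u)
    return level, parent

def last_common_ancestor(adjacency, root, subset):
    """Common ancestor of all vertices in a nonempty subset that has maximal level."""
    level, parent = levels_and_parents(adjacency, root)

    def ancestors(v):
        # A vertex counts as an ancestor of itself.
        chain = set()
        while v is not None:
            chain.add(v)
            v = parent[v]
        return chain

    common = set.intersection(*(ancestors(v) for v in subset))
    return max(common, key=level.get)

# Example tree with internal vertices x, y, z and leaves a, ..., e, rooted at y.
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("y", "z"), ("d", "z"), ("e", "z")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)
print(last_common_ancestor(adjacency, "y", {"a", "b"}))
```

The common ancestors of a set always lie on one path from the root, so the vertex of maximal level among them is unique.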
A tree T is called a rooted binary tree if for its vertices
$g(v) = \begin{cases} 1 & \text{if } v \text{ is a leaf} \\ 2 & \text{if } v \text{ is the root} \\ 3 & \text{otherwise} \end{cases}$
holds. In other words, we create a rooted binary tree from a binary tree by choosing an edge and placing the root there. This procedure is called rooting a tree.
Rooted trees are representations for evolutionary relationships. For a rooted N-tree T we view the edges as being directed away from the root, and then regard T as describing the evolution of the set N of given "names" from a common (hypothetical) ancestral name; the other internal vertices of T correspond to further ancestral names.^{13,14}
Hence, the most important point in a phylogenetic tree is its root. The root is placed at this position to indicate that (i) it corresponds to the (theoretical) last universal common ancestor of everything in the tree; (ii) it gives directionality to evolution within the tree; and (iii) it identifies which groups of vertices are "true", given that the root does not lie within a group.
The question is: On which edge should the root be placed? There are three popular ways to find this position:
1. On the longest edge.^{15}
2. In the middle of the longest path between two leaves.
3. An "outgroup" can be added to the set of given points.
13 Unrooted phylogenetic trees are also biologically relevant since they are typically what tree reconstruction methods generate.
14 Rooting a tree has a strong relationship to the molecular clock; but especially, proteins evolve at different rates, making it difficult to relate the (evolutionary) distance to the historical time.
Now we count rooted trees. Together with 6.2.8 we have
Theorem 6.2.15 The number of rooted binary trees with n labelled leaves and unlabelled internal vertices (i.e. rooted binary N-trees having |N| = n) is
Moreover, together with 6.2.8(b), we have that the number of rooted binary trees with n labelled leaves and unlabelled internal vertices equals 2n - 3 times the number of binary trees with the same kind of vertices. Next, we will discuss the relationship between the number of leaves in a rooted binary tree and its depth. It is not hard to see that
Observation 6.2.16 Let T be a rooted binary tree of depth d. Then T has at least d + 1 and at most $2^d$ leaves.
Conversely, the depth of such a tree with n leaves lies between $\Omega(\log n)$ and $O(n)$.
6.2.8 Generating graphs and trees
To deal with graphs it is often necessary to generate these structures algorithmically. This question is closely related to the problem of counting graphs.^{16} We will use these interrelations to create methods which generate specific types of graphs. To generate all labelled graphs is not hard: remember that a labelled graph is completely described by its adjacency matrix.
15 This approach of course requires that there is a length-function for the graph.
16 And the problem of storing a graph in a computer, compare [290].
Furthermore,
Observation 6.2.17 There is a one-to-one correspondence between labelled graphs with n vertices and n x n symmetric binary matrices with all entries on the leading diagonal equal to 0.
Hence, we have the following optimal generating technique:
Algorithm 6.2.18 Let n be an integer greater than 1. The following procedure generates all labelled graphs with n vertices:
1. Determine b := n(n - 1)/2;
2. Initialize a = (0, ..., 0) in $\mathbb{B}^b$;
3. Assuming a is the upper half of an n x n matrix A; complete the matrix by setting $a_{ji} = a_{ij}$ and $a_{ii} = 0$, yielding the adjacency matrix for a graph; Set a := a + 1 (in $\mathbb{B}^b$).
For more facts about generating all graphs see Nagler [318].
Remember that the number of labelled trees with n vertices equals $n^{n-2}$. We have given different proofs for this result, but the one presented here, due to Prüfer, is considered among the most elegant. The strategy of the proof is to establish a one-to-one correspondence between the labelled tree and the Prüfer code, which is a sequence of length n - 2 of integers between 1 and n, with repetitions allowed; in other words, a member of $\{1, \ldots, n\}^{n-2}$. Algorithmically this coding is described by
Algorithm 6.2.19 Let $T = (V = \{v_1, \ldots, v_n\}, E)$ be a labelled tree. Then the Prüfer code for T can be constructed by performing the following steps:
1. Initialize T to be the given tree;
2. For i = 1 to n - 2 do
   Let v be the leaf with the smallest label;
   Let $s_i$ be the label of the only neighbour of v;
   $T := T[V \setminus \{v\}]$;
3. The code is $(s_1, \ldots, s_{n-2})$.
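The encoding steps can be written out as follows. This is a minimal sketch, assuming the vertices are labelled 1, ..., n and the tree is passed as an edge list; the function name `prufer_encode` is an illustrative choice.

```python
def prufer_encode(n, edges):
    """Prüfer code of a labelled tree on the vertices 1..n, n >= 2."""
    neighbours = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    code = []
    for _ in range(n - 2):
        # The leaf with the smallest label among the remaining vertices.
        leaf = min(v for v in neighbours if len(neighbours[v]) == 1)
        (s,) = neighbours[leaf]          # its only neighbour
        code.append(s)
        neighbours[s].discard(leaf)      # delete the leaf from the tree
        del neighbours[leaf]
    return code

path = [(1, 2), (2, 3), (3, 4)]
star = [(1, 2), (1, 3), (1, 4)]
print(prufer_encode(4, path), prufer_encode(4, star))
```

Searching for the smallest leaf by a full scan makes this sketch quadratic in n; a priority queue would be the usual refinement.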
We will now use the correspondence between Prüfer codes and labelled trees to generate trees. We first note that the following decoding procedure maps a given Prüfer code to a labelled tree:
Algorithm 6.2.20 A Prüfer code P is given. Then a labelled tree T = (V, E) can be constructed by performing the following steps:
1. Initialize the list P as the input;
2. Initialize the list V as 1, ..., n;
3. Initialize T as the forest of isolated vertices on V;
4. For i = 1 to n - 2 do
   Let k be the smallest number in list V that is not in list P;
   Let j be the first number in list P;
   Add an edge joining the vertices labelled k and j;
   Remove k from list V;
   Remove the first occurrence of j from list P;
5. Add an edge joining the vertices labelled with the two remaining numbers in the list V.
It is not hard to see that the decoding procedure 6.2.20 is the inverse of the encoding procedure 6.2.19. Combining all these considerations gives the following:
Algorithm 6.2.21 Let n be an integer with n > 2. Then the following procedure generates all trees with n labelled vertices:
1. Generate, by simple counting, all Prüfer codes in $\{1, \ldots, n\}^{n-2}$;
2. For each code apply 6.2.20.
This procedure consumes $n^{n-2} \cdot O(n) = O(n^{n-1})$ time, since 6.2.20 runs in linear time. Hence, it is an effective technique.
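Decoding and the generation of all labelled trees can be sketched together. The code below is a minimal illustration, assuming vertex labels 1, ..., n; the function names are hypothetical, and the simple list operations make each decoding quadratic rather than linear in n.

```python
from itertools import product

def prufer_decode(code, n):
    """Rebuild the edge list of the labelled tree on 1..n with the given Prüfer code."""
    remaining_code = list(code)
    vertices = list(range(1, n + 1))
    edges = []
    for _ in range(n - 2):
        # Smallest vertex still available that does not occur in the remaining code.
        k = min(v for v in vertices if v not in remaining_code)
        j = remaining_code.pop(0)
        edges.append((k, j))
        vertices.remove(k)
    edges.append(tuple(vertices))        # join the two remaining vertices
    return edges

def all_labelled_trees(n):
    """Yield every labelled tree on n vertices exactly once, one per Prüfer code."""
    for code in product(range(1, n + 1), repeat=n - 2):
        yield prufer_decode(code, n)

trees = list(all_labelled_trees(4))
print(len(trees), prufer_decode([2, 3], 4))
```

Counting the distinct edge sets produced for n = 4 and n = 5 gives 16 and 125, in agreement with the formula $n^{n-2}$.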
Remember that counting only partially labelled trees is fundamentally harder, and so it is with generating.
The simple process which we use to prove observation 2.4.17 is also useful to generate all binary N-trees: Let $N = \{v_1, \ldots, v_n\}$. There is a single N-tree with |N| = 3. The fourth leaf $v_4$ can be connected to any of the three edges. This leads to three N-trees with |N| = 4, each with five edges. Then, for each tree add the fifth leaf to any of these edges, and so on. Note that to use this procedure to generate all N-trees, we have to generate the set
We will describe a nonoptimal technique, involving drawing a tree in the plane: Let n > 1 be an integer. A planar code w (with respect to n) is a sequence in $\mathbb{B}^{2(n-1)}$ with the following properties:
(i) In each prefix of w the number of 1s is at least the number of 0s; in particular, the first letter in w must be 1;
(ii) The number of 1s in w equals the number of 0s; in particular, the last letter in w must be 0.
Algorithm 6.2.22 Let w be a planar code with respect to n. Then draw a tree by the following procedure:
1. Put a vertex as the origin;
2. Read w letter by letter and if you see a 1 then draw a new edge to a new vertex; if you see a 0 then move back by one edge toward the origin.
Thus the tree is described by its planar code. Hence, after generating all planar codes, we can generate all unlabelled trees with n vertices. The number of planar codes is the Catalan number (compare [14] or [296]), which gives an upper bound for the number of non-isomorphic trees. Note that the planar code is far from optimal; every unlabelled tree has many different codes. For instance all the codes 11010010, 10110100, 11101000, 10101100, 11011000 and 11100100 generate the same tree. The table below summarizes our methods.
Generating all      Optimal   Running time w.r.t. number of trees
Labelled trees      Yes       Linear
Binary N-trees      Yes       Exponential
Unlabelled trees    No        Exponential

In the above "optimal" means that the algorithm generates each tree exactly once. Fliege [165], Lee, Lee, Wong [279] and Winter [462] describe several other methods to generate trees and full trees.
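The drawing procedure for planar codes, and the claim that six different codes describe the same unlabelled tree, can be checked mechanically. This is a minimal sketch: `decode_planar_code` follows the two steps of the drawing procedure, while `canonical_form` is an added helper (not from the text) that compares trees up to isomorphism by rooting them at their centre.

```python
def decode_planar_code(word):
    """Edge list of the tree drawn from a planar code; vertex 0 is the origin."""
    edges, path, next_vertex = [], [0], 1
    for letter in word:
        if letter == "1":
            edges.append((path[-1], next_vertex))   # new edge to a new vertex
            path.append(next_vertex)
            next_vertex += 1
        else:
            path.pop()                              # move back toward the origin
    return edges

def canonical_form(edges):
    """Bracket string that is equal for two trees exactly when they are isomorphic."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Strip leaves layer by layer until only the one or two centre vertices remain.
    degree = {v: len(nb) for v, nb in adjacency.items()}
    layer = [v for v, d in degree.items() if d == 1]
    remaining = len(adjacency)
    while remaining > 2:
        remaining -= len(layer)
        next_layer = []
        for leaf in layer:
            for u in adjacency[leaf]:
                degree[u] -= 1
                if degree[u] == 1:
                    next_layer.append(u)
        layer = next_layer

    def encode(v, parent):
        children = sorted(encode(u, v) for u in adjacency[v] if u != parent)
        return "(" + "".join(children) + ")"

    return min(encode(centre, None) for centre in layer)

codes = ["11010010", "10110100", "11101000", "10101100", "11011000", "11100100"]
forms = {canonical_form(decode_planar_code(w)) for w in codes}
print(len(forms))
```

All six codes collapse to a single canonical form, whereas the code 11110000 of a path on five vertices yields a different one.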
6.3 CLUSTER ANALYSIS
Evolution implies that many different species have a common ancestor and that all forms of life probably stem from the same remote beginnings. Once these relationships are understood, they are summarized by grouping species into collections of related organisms, called taxa. We will describe the structures underlying these relationships.
6.3.1 Classifications
A classification is the formal naming of a group of individuals. In the sense of set theory a classification C of a (finite) set N of individuals is given by a collection of subsets of N satisfying
(i) $\emptyset \notin C$;
(ii) $N \in C$;
(iii) $\{v\} \in C$ for any $v \in N$; and
(iv) For any two members N' and N'' of C it holds that $N' \cap N'' \in \{N', N'', \emptyset\}$.
In other words, any two sets in C are disjoint or one is contained in the other (see 5.3.1).
A member of a classification is called a class or a cluster of N. Let T be an N-tree rooted by the vertex w. Then we create a collection C of classes for the set N in the following way:
1. For each leaf v of T put {v} in C; Mark the vertex v;
2. Let $v \neq w$ be an unmarked vertex adjacent to exactly one other unmarked vertex. All other neighbors $v_1, \ldots, v_k$ of v are marked and belong to classes $N_1, \ldots, N_k$ in C, respectively. Then
   - Put $\bigcup_{i=1}^{k} N_i$ in C, and
   - Mark v;
3. Mark w; Put N in C.
Conversely, if we have a collection C of classes of the set N with the properties that $\{v\} \in C$ for each element $v \in N$ and $N \in C$, we can form a tree T by:
1. Each class of C is a vertex of T;
2. Two vertices $N_1$ and $N_2$ are adjacent if and only if
   - $N_1 \cap N_2 \in \{N_1, N_2\}$, and
   - there is no class N' such that $N_j \cap N' \in \{N_j, N'\}$ for j = 1, 2.
   (That means, $N_1$ must be the maximal proper subset of $N_2$ or vice versa.)
Summing up all these observations, we have the following fundamental equivalence between classifications and rooted trees.
Observation 6.3.1 There is a one-to-one correspondence between the collection of classifications for a set N and the collection of rooted N-trees.
In other words, classifications for a set N and rooted N-trees contain essentially the same information. The classification C = C(T) which is induced by the tree T is called the content of T. In view of this observation, each evolutionary tree implies a classification of the given names. But we saw that such a classification is not applicable in practice, since the depth of the tree lies between $\Omega(\log n)$ and $O(n)$ for n = |N| and is obviously too big. Taxonomists are interested in trees with a constant depth. In particular Linnaeus' system has depth 8. Hence, in such systems the trees are not binary.
6.3.1 can be viewed as the rooted analogue of 6.2.10. We need to describe equivalences between the families of rooted N-trees and N-trees, and corresponding equivalences between classifications on N and collections of pairwise compatible N-splits. The following proposition describes the desired equivalences. The proof is an application of 6.2.10 and 6.3.1.
Observation 6.3.2 (Semple and Steel [393]) Let N be a finite set. C is a classification for N if and only if the collection
is a set of pairwise compatible splits on N; and vice versa.
For instance, consider the set N = {a, b, c, d, e}. Coming from the (binary) N-tree (((ab)c)(de)) we have the split system
S = {{a, bcde}, {b, acde}, {c, abde}, {d, abce}, {e, abcd}, {ab, cde}, {abc, de}}.    (6.44)
Using each of the three internal vertices as a root gives the following classifications:
$C_1$ = {a, b, c, d, e, ed, ced} ∪ {N}    (6.45)
$C_2$ = {a, b, c, d, e, ab, ed} ∪ {N}    (6.46)
$C_3$ = {a, b, c, d, e, ab, abc} ∪ {N}.    (6.47)
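The three classifications of this example can be reproduced by collecting, for every vertex, the set of leaves below it. The following is a minimal sketch, assuming the internal vertices of the tree (((ab)c)(de)) are called x, y and z; these names and the function `classification` are illustrative.

```python
def classification(adjacency, root, leaves):
    """Clusters of the tree rooted at `root`: for each vertex, the leaves below it."""
    clusters = set()

    def below(v, parent):
        if v in leaves:
            cluster = frozenset([v])
        else:
            parts = [below(u, v) for u in adjacency[v] if u != parent]
            cluster = frozenset().union(*parts)
        clusters.add(cluster)
        return cluster

    below(root, None)
    return clusters

# The N-tree (((ab)c)(de)): x joins a and b, y joins x and c, z joins d and e.
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("y", "z"), ("d", "z"), ("e", "z")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)
leaves = set("abcde")

def as_words(clusters):
    """Write each cluster as a sorted word, e.g. {'d', 'e'} -> 'de'."""
    return sorted("".join(sorted(c)) for c in clusters)

for root in "xyz":
    print(root, as_words(classification(adjacency, root, leaves)))
```

Rooting at x, y and z gives the nontrivial clusters {de, cde}, {ab, de} and {ab, abc} respectively, matching the three classifications listed above.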
With 6.3.1 in mind, we have several considerations. Firstly we determine the maximal number of sets in a classification. Let T = (V, E) be a rooted N-tree with |N| = n, that has k internal vertices each of degree greater than 2, and a root w. Then 1.2.5 says that $k \leq n - 2$. Consequently,
Observation 6.3.3 Let C be a classification for a set with n elements. Then
$|C| \leq 2n - 1$.    (6.48)
Secondly, we find a metric space for rooted trees. This measure $\rho_C$ of tree differences is similar to $\rho_S$; it can be calculated easily; and the fact that it counts the different classes in the corresponding (hierarchical) classifications is an indication of its biological relevance; see Hendy, Little and Penny [215], and Robinson and Foulds [363].
Recall that a rooted tree T can be directed so that each edge is directed away from the root. Then for each edge e of T = (V, E) let C(e) be the set of the marks of the vertices below e in the tree. C(e) is called the content of e, and
$\mathcal{R}_n$ denotes the set of all rooted N-trees with n labelled leaves. Consider two trees $T_1$ and $T_2$ in $\mathcal{R}_n$ with contents $C(T_1)$ and $C(T_2)$, respectively. Then
$\rho_C(T_1, T_2) = |C(T_1) \Delta C(T_2)|$    (6.50)
is a distance between $T_1$ and $T_2$, and is called the classification metric. Similar to 6.2.12, and using 6.2.11, we have
Observation 6.3.4 $(\mathcal{R}_n, \rho_C)$ is a metric space.
If $T_1, T_2 \in \mathcal{R}_n$, then we find that
and so $\rho_C(T_1, T_2)$ can be computed by comparing the contents of the internal edges of $T_1$ and $T_2$ only.
6.3.2 Ultrametric spaces
Ultrametrics are a well-known class of dissimilarities. Their importance in the domain of classification stems from their characteristic tree representation: by a rooted tree with all the leaves equidistant from the root.
Let $(X, \rho)$ be a metric space. The set
$B_r(z) = \{x \in X : \rho(x, z) \leq r\}$    (6.52)
is called the (closed) ball with center $z \in X$ and radius $r \geq 0$. One has that a ball is nonempty, since $z \in B_r(z)$; moreover, $B_0(z) = \{z\}$. The collection
$B(X) = \{B_r(z) : z \in X, r \geq 0\}$    (6.53)
is called the ball family of the space $(X, \rho)$. Recall that we called a metric space ultrametric if
$\rho(x, y) \leq \max\{\rho(x, z), \rho(z, y)\}$    (6.54)
for any points x, y, z in X. In view of 3.4.7 this condition states that for all $x, y, z \in X$ of the three distances $\rho(x, y)$, $\rho(x, z)$, $\rho(y, z)$ two are equal and not less than the third. Moreover,
Lemma 6.3.5 Let $(X, \rho)$ be an ultrametric space and let B(X) be its ball family. If two members of B(X) intersect then one of these balls is contained in the other.
Proof. Let $B = B_r(z)$, $B' = B_{r'}(z') \in B(X)$ with $z_0 \in B \cap B'$. Without loss of generality, we assume that $r \leq r'$ and consequently
$\rho(z, z') \leq \max\{\rho(z, z_0), \rho(z', z_0)\} \leq \max\{r, r'\} = r'$,    (6.55)
such that $z \in B'$. Moreover, for an arbitrary $x \in B$ we find
$\rho(x, z') \leq \max\{\rho(x, z), \rho(z', z)\} \leq \max\{r, r'\} = r'$,    (6.56)
which implies $x \in B'$.
Theorem 6.3.6 Let N be a finite set equipped with an ultrametric $\rho$. Then the ball family B(N) is a classification.
First of all each member of B(N) is a nonempty set. For each $v \in N$ we have $\{v\} = B_0(v)$. On the other hand, $N = B_t(v_0)$, where $t = \max\{\rho(v, v') : v, v' \in N\}$ and $v_0 \in N$. Then 6.3.5 completes the proof.
In view of the considerations above we find that a finite ultrametric space is "tree-like", which we also saw earlier in the fact that such spaces are metric spaces in which a solution of Steiner's Problem is always an SMT.
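The statement that the balls of a finite ultrametric space form a classification can be checked on small examples. The following is a minimal sketch; the function names and both example spaces are illustrative assumptions. Only the radii that occur as distances are tried, since a ball changes only at those values.

```python
from itertools import combinations

def ball_family(points, rho):
    """All closed balls B_r(z) of a finite metric space."""
    radii = {0} | {rho(x, y) for x, y in combinations(points, 2)}
    return {frozenset(x for x in points if rho(x, z) <= r)
            for z in points for r in radii}

def is_classification(family, points):
    """Check the defining properties: no empty set, N itself, all singletons,
    and any two members are disjoint or nested."""
    points = frozenset(points)
    if frozenset() in family or points not in family:
        return False
    if any(frozenset([v]) not in family for v in points):
        return False
    return all(a & b in (a, b, frozenset()) for a, b in combinations(family, 2))

# An ultrametric on {a, b, c}: the two largest distances in the triangle are equal.
table = {frozenset("ab"): 1, frozenset("ac"): 2, frozenset("bc"): 2}
def ultra(x, y):
    return 0 if x == y else table[frozenset((x, y))]

# Three points on a line with the usual distance: a metric, but not an ultrametric.
def line(x, y):
    return abs(x - y)

print(is_classification(ball_family("abc", ultra), "abc"))
print(is_classification(ball_family([0, 1, 2], line), [0, 1, 2]))
```

For the points on a line the balls {0, 1} and {1, 2} overlap without being nested, which is exactly what the lemma on intersecting balls excludes for ultrametric spaces.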
6.3.3 Hierarchical classifications
Let T = (V, E) be a tree. Rooting T creates a hierarchy for the vertices: For each integer k between 0 and depth(T) there is a set $V_k$ of vertices, namely
$V_k = \{v \in V : level(v) = k\}$,    (6.57)
which is a partition of V such that for a vertex from $V_k$, the path to the root must pass through a vertex of any of the sets $V_{k-1}, V_{k-2}, \ldots, V_1, V_0$. Note that in applications this is not only a simple fact; probably also each vertex has a meaning. In this sense consider the following "word game" [408], [410]: Let A be an alphabet and let W be a finite language over A, called a dictionary. For each pair w, w' in W find a chain $w = w_1 \to w_2 \to \ldots \to w_k = w'$ such that
(i) $w_{i+1}$ is transformed by a single edit operation from $w_i$, i = 1, ..., k - 1;
(ii) $w_i \in W$, i = 1, ..., k.
As an example, in the English language, consider w = SHIP and w' = DOCK. One solution is SHIP → SLIP → SLOP → SLOT → SOOT → LOOT → LOOK → LOCK → DOCK.^{17,18}
We consider a rooted N-tree T. Let d be the depth and let N be the set of leaves of the tree T. Let k be an integer between 0 and d. For any two leaves v and v' of T we define the relation $v \sim_k v'$ if there is a path from v to v' in T containing only vertices of a level k or higher. It is easy to see that $\sim_k$ is an equivalence relation for any number k. $\mathcal{N}(k)$ denotes the family of the equivalence classes. Then we have a series $\mathcal{N}(0), \mathcal{N}(1), \ldots, \mathcal{N}(d)$ of partitions of N with
(i) $\{N\} = \mathcal{N}(0)$ and $\mathcal{N}(d) = \{\{v\} : v \in N\}$;
(ii) For k = 0, ..., d - 1 the class $\mathcal{N}(k + 1)$ is finer than $\mathcal{N}(k)$.^{19}
The first set $\mathcal{N}(0)$ consists of a group of ancestors, the last $\mathcal{N}(d)$ consists of individual leaves. Overall, we separate the "individuals" of N into successively finer groupings.^{20}
17 We do not demand that the number of steps from w to w' equals the Levenshtein distance $\rho_L(w, w')$!
18 German readers should try this with w = HELD and w' = STEG.
19 The inclusion is strict, since all internal vertices of T are of degree at least three.
20 A nice illustration of this point of view is given by Gould and Keeton [188]:
6.3.4 Pair grouping
We create a procedure of successive fusions of n = |N| individuals into groups. These methods are well-known in cluster analysis. The related rooted trees are usually called dendrograms, compare [136]. The general idea of the algorithm is to repeatedly merge pairs of sets,^{21} and so the technique is called a pair group method (PGM).
Algorithm 6.3.7 Let $N_1 = \{v_1\}, \ldots, N_n = \{v_n\}$ be a family of sets each containing a single element. Then do
1. Find the nearest pair of distinct sets, say $N_i$ and $N_j$;
2. Merge $N_i$ and $N_j$ to form N'; Compute a new distance, or similarity, from N' to each of the other sets; Decrement the number of sets by one;
3. If the number of sets equals one then STOP, else go to 1.
Obviously, this is a very general approach, and we have to specify several facts more precisely:
(i) What does "nearest" mean? We will use this term in the sense of shortest distance, or maximal similarity.
(ii) How can we compute the new distance? We will discuss this step later.
(iii) We only merge two sets at a time. Can we generalize this for more?
We will discuss these questions in the next chapter, as a heuristic approach to finding shortest trees.

Biological    Postal
Domain        Old/New World
Kingdom       Country
Phylum        State/Province
Class         City
Order         Street
Family        Number
Genus         Last name
Species       First name

21 Only pairs are considered in view of the bifurcation assumption of evolutionary processes.
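The pair group method can be sketched in a few lines once the open choices are fixed. The sketch below assumes that "nearest" means smallest distance and, purely as an illustrative assumption, that the distance between two sets is the average distance over all cross pairs; the text leaves this update rule open. The function name and the sample points are hypothetical.

```python
from itertools import combinations

def pair_group_method(points, rho):
    """Merge the two nearest sets until one set remains; return the merged sets in order."""
    sets = [frozenset([p]) for p in points]

    def distance(a, b):
        # Assumed update rule: average distance over all pairs (x, y) with x in a, y in b.
        return sum(rho(x, y) for x in a for y in b) / (len(a) * len(b))

    merges = []
    while len(sets) > 1:
        a, b = min(combinations(sets, 2), key=lambda pair: distance(*pair))
        sets = [s for s in sets if s not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

points = [0, 1, 5, 7, 20]
merges = pair_group_method(points, lambda x, y: abs(x - y))
for merged in merges:
    print(sorted(merged))
```

Together with the singletons, the merged sets form a classification of the given points, so the merge history is one way to read off a dendrogram.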
6.4 SPANNING TREES
6.4.1 The number of spanning trees
Let G = (V, E) be a graph. A subgraph G' = (V, E') is called a spanning tree of G if G' is a tree. If G' is a spanning tree of G, then G itself must be connected. Conversely, if G = (V, E) is a connected graph, then G contains a subgraph G' = (V, E') minimal with respect to the property that G' is connected. The graph G' is a spanning tree of G. Hence, a graph is connected if and only if it contains a spanning tree.
In some situations it is necessary to be able to generate a complete list of all the spanning trees of a graph. This may be the case when, for example, the "best" tree needs to be chosen, but the criterion to be used for deciding what tree is the "best" is very complex. Hence, we are interested in the number of spanning trees for a graph.
Observation 6.4.1 Consider (connected) graphs with n vertices and m edges.
(a) (Kelmans [255]) The number of spanning trees is at most
(b) (Cayley [%?]) If the graph is complete, that is 2m = n(n - 1), equality holds, i.e. the number of spanning trees is exactly
$n^{n-2}$.    (6.59)
Kapoor and Ramesh [249] present an algorithm for enumerating all spanning trees of a graph G having complexity O(t(G) + n + m), where t(G) is the number of trees.
It will be helpful to associate the following matrix to a graph: Let G = (V, E) be a graph and assume that the vertices are labelled, i.e. $V = \{v_1, \ldots, v_n\}$. Then we define the matrix of admittance M(G) by
$M(G) = (m_{ij})_{i,j=1,\ldots,n}$ with $m_{ij} = \begin{cases} g_G(i) & \text{if } i = j \\ -1 & \text{if the vertices } v_i \text{ and } v_j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$
That is
$M(G) = \mathrm{diag}(g_G(1), \ldots, g_G(n)) - A(G)$,    (6.60)
where $\mathrm{diag}(g_G(1), \ldots, g_G(n))$ is the matrix which has the degrees of the graph on the diagonal and all other elements equal to zero, and A(G) is the adjacency matrix.
Theorem 6.4.2 (Kirchhoff, compare [38]) Let G be a graph with the labelled vertices 1, ..., n. Then the number of spanning trees of G is the determinant of the matrix obtained from the matrix of admittance M(G) by deleting the i'th row and the i'th column for some i between 1 and n.^{22}
Another method to count the number of spanning trees of a graph is given by the following recursive procedure: Let G = (V, E) be a graph and let e be an edge of G. G - e denotes the graph after deleting the edge e, and G/e denotes the contraction of G on e, that is the graph obtained from G by deleting e and then amalgamating its endvertices. We then have
Theorem 6.4.3 (Zykov; compare [194] or [476]) Let G be a graph and denote the number of its spanning trees by t(G). Then
$t(G) = t(G - e) + t(G/e)$,    (6.61)
where $e \in E$.
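Both counting methods can be compared on small graphs. The following is a minimal sketch, assuming vertices labelled 0, ..., n - 1 and graphs given as edge lists; the function names are illustrative. In the recursive version the contraction keeps parallel edges and discards loops, which is an assumption about how the amalgamation is to be read.

```python
from fractions import Fraction

def spanning_trees_kirchhoff(n, edges):
    """Determinant of the admittance matrix after deleting the last row and column."""
    m = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        m[u][u] += 1
        m[v][v] += 1
        m[u][v] -= 1
        m[v][u] -= 1
    a = [row[:-1] for row in m[:-1]]
    det = Fraction(1)
    for col in range(n - 1):             # exact Gaussian elimination
        pivot = next((r for r in range(col, n - 1) if a[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n - 1):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return int(det)

def spanning_trees_recursive(n, edges):
    """Deletion-contraction: t(G) = t(G - e) + t(G / e)."""
    edges = [(u, v) for u, v in edges if u != v]   # loops never lie in a tree
    if n == 1:
        return 1
    if not edges:
        return 0                                   # several vertices, no edges
    (u, v), rest = edges[0], edges[1:]
    # Contract e = {u, v}: every occurrence of v is replaced by u.
    merged = [(u if x == v else x, u if y == v else y) for x, y in rest]
    return spanning_trees_recursive(n, rest) + spanning_trees_recursive(n - 1, merged)

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(spanning_trees_kirchhoff(4, k4), spanning_trees_recursive(4, k4))
```

The determinant needs polynomially many arithmetic operations, while the recursion may branch once per edge and is therefore only practical for very small graphs.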
6.4.2 Generating all spanning trees
In general, methods to generate all spanning trees use the following metric of all spanning trees of a graph: Let G = (V, E) be a connected graph. Moreover, let $T_1 = (V, E_1)$ and $T_2 = (V, E_2)$ be two spanning trees of G. Then
defines a distance.
22 In particular, the value of this determinant is independent of the choice of the number i. Clearly the determinant of M(G) itself equals 0.
Observation 6.4.4 $\rho$ is a metric of the set of all spanning trees of a graph.^{23}
It is sufficient to prove the triangle inequality, $\rho(T_1, T_2) \leq \rho(T_1, T_3) + \rho(T_3, T_2)$. First it is easy to see that
Moreover
$(E_1 \setminus E_3) \cap (E_3 \setminus E_2) = \emptyset$.    (6.64)
If the distance $\rho(T_1, T_2) = 1$, i.e.
where $e_i \in E_i$, i = 1, 2, then $T_2$ could be derived from $T_1$ by removing $e_1$ and introducing $e_2$. Such a transformation is called an elementary tree transformation. The following then holds:
Theorem 6.4.5 (Christofides [82]) If $T_0$ and $T_k$ are spanning trees of a graph with $\rho(T_0, T_k) = k$, then $T_k$ can be obtained from $T_0$ by a sequence of k elementary tree transformations.
6.5 COUNTING THE ELEMENTS IN DISCRETE METRIC SPACES
Remember that sequence and phylogenetic spaces are discrete metric spaces, meaning that each bounded subset of points is finite. What do we know about the number of points in such sets?
Let W be a bounded set in a metric space $(X, \rho)$. We define the diameter of W as
$D(W) = \sup\{\rho(v, v') : v, v' \in W\}$    (6.63)
and the radius as
If W is a compact set we can define the diameter and the radius with the operators "max" and "min".^{24} The diameter and the radius are nonnegative reals with
23 This metric is defined similarly to that for $\mathcal{T}_n$, but is simpler, since we have a simpler set of trees.
We wish to estimate the number of points inside a bounded set as a function of radius and diameter. This is of interest for Steiner's Problem, since in view of 2.4.9 the Steiner hull is a bounded set, and, moreover,
Observation 6.5.1 Let N be a finite set of points in a metric space $(X, \rho)$. Then a Steiner hull W of N has radius
$R(W) \leq 2 \cdot D(N) \leq 2 \cdot L(X, \rho)(\text{MST for } N)$.    (6.68)
We start with a historical investigation of Euclidean spaces created by Gauß.
6.5.1 The geometry of numbers
Counting the number of integer points in a bounded set of points is a well-known question in the "Geometry of Numbers".
We find Gauß to be one of the first researchers in this area. He published in [180] a result addressing the question of how many lattice points n(r) occur within or on a circle of radius $\sqrt{r}$ and centered at the origin of the lattice $\mathbb{Z}^2$ in the plane, where r is a nonnegative integer:
[Table: r, $\sqrt{r}$, number n(r) of lattice points]
24 Note that in a discrete metric space each bounded set is finite, and hence, compact.
The evidence in the tables suggests that as r increases, the ratio n(r)/r gets closer and closer to $\pi$. And, indeed, it holds that
which implies the assertion. For a proof compare [326].
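The behaviour of the ratio n(r)/r is easy to observe numerically. The following is a minimal sketch; the function name `lattice_points` is an illustrative choice, and the count is done column by column with exact integer square roots.

```python
from math import isqrt, pi

def lattice_points(r):
    """Number of integer points (x, y) with x^2 + y^2 <= r."""
    bound = isqrt(r)
    # For a fixed x there are 2 * floor(sqrt(r - x^2)) + 1 admissible values of y.
    return sum(2 * isqrt(r - x * x) + 1 for x in range(-bound, bound + 1))

for r in (1, 2, 4, 5, 100, 10_000, 1_000_000):
    count = lattice_points(r)
    print(r, count, count / r, abs(count / r - pi))
```

The last column shrinks as r grows, in line with the limit discussed above.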
6.5.2 The number of words in sequence spaces
Consider the hypercube $(\mathbb{B}^d, \rho_H)$. For a word $v \in \mathbb{B}^d$ we define the Hamming weight wt(v) as the number of times the digit "1" occurs in v. Clearly, $wt(v) \leq d$. Moreover,
$\rho_H(v, w) = wt(v + w) = wt(v - w)$    (6.70)
holds true for any two words v and w in $(\mathbb{B}^d, \rho_H)$.^{25} Conversely,
$\binom{d}{k}$ is just the number of ways that an unordered collection of k elements can be chosen from a set of d elements. Thus $\binom{d}{r}$ is the number of words in $(\mathbb{B}^d, \rho_H)$ with weight r, $0 \leq r \leq d$.
25 Remember that the operator + in $\mathbb{B}$ is defined by 0 + 0 = 1 + 1 = 0 and 0 + 1 = 1 + 0 = 1.
Lemma 6.5.2 Let v be a word in $(\mathbb{B}^d, \rho_H)$ and let r be an integer with $0 \leq r \leq d$. Then the number of words having distance at most r from v is precisely
$\sum_{i=0}^{r} \binom{d}{i}$.    (6.72)
For a proof and additional remarks see the textbooks by Hankerson et al. [208] and Schulz [388]. Using the following estimation of binomial numbers (compare [298])
we find
Theorem 6.5.3 Let W be a set of words in the hypercube $(\mathbb{B}^d, \rho_H)$ with radius r, r < d. Then
Now, let $(A^d, \rho_H)$ be a sequence space over an alphabet A with Hamming distance $\rho_H$.^{26} Similarly to 6.5.2 we have that for a word v in $(A^d, \rho_H)$ and an integer r with $0 \leq r \leq d$, the number of words of distance at most r from v is precisely
$\sum_{i=0}^{r} \binom{d}{i} (|A| - 1)^i$.    (6.75)
Consequently,
Theorem 6.5.4 Let W be a set of words in $(A^d, \rho_H)$ with radius r, r < d. Then
26 Remember that we assume that A contains at least two letters.
6.5.3
The number of words in phylogenetic spaces
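Let $\mathcal{A} = (A^*, \rho_L)$ be a phylogenetic space with Levenshtein distance. Let $W$ be a bounded set of words in $\mathcal{A}$ with $t = D(W)$. In view of (5.7) the set $W$ is finite, with
$$t \ge \rho_L(w_o, w) \ge \big|\, |w_o| - |w| \,\big| \qquad (6.77)$$
for all words $w$ in $W$, where $w_o$ is a fixed word in $W$ and $|w|$ denotes the length of the word $w$. Equivalently,
$$|w_o| - t \le |w| \le |w_o| + t.$$
Hence,
$$W \subseteq \bigcup_{i = \max\{0, |w_o| - t\}}^{|w_o| + t} A^i.$$
Using (5.1) and (5.2) this implies
Thus,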
Theorem 6.5.5 Let $W$ be a bounded set of words in a phylogenetic space $\mathcal{A}$ with diameter $t$. Then
where $w \in W$.
6.5.4
The complexity of enumeration problems
Sometimes we need the number of solutions to a problem. Enumeration problems provide natural candidates for the type of problems that might be intractable even if $P = NP$. Many such problems appear to be quite difficult. Clearly,
Observation 6.5.6 An enumeration problem associated with an $NP$-complete problem is $NP$-hard.
Moreover, some enumeration problems seem to be even harder than the corresponding existence problems. On the basis of such observations we have the class of the #P-complete problems (read: number-P-complete), which is designed to reflect the difficulty of enumeration; see Garey and Johnson [179]. For instance, the following problems, which are of interest to us, are #P-complete:
Counting the number of distinct perfect matchings in a given bipartite graph [432].
Computing the volume of a convex polytope [150].
Determining the number of lattice points in a convex polytope [34], [451].
Counting the number of trees with a given number of vertices [242].
On the other hand, some nontrivial enumeration problems can be solved in polynomial time. Consider the following problem:
The number of spanning trees
Given: A graph $G$.
Question: How many distinct spanning trees are there for $G$?
This question can be solved in polynomial time using Kirchhoff's theorem, see 6.4.2. Counting spanning trees is one of the few enumeration problems known to have a polynomially bounded time algorithm.
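Kirchhoff's theorem expresses the number of spanning trees as the determinant of the graph Laplacian with one row and the corresponding column deleted. The following sketch is one possible way to evaluate it; the function name, the edge-list input format and the use of exact rational elimination are choices made here for illustration.

```python
from fractions import Fraction


def count_spanning_trees(n, edges):
    """Number of spanning trees of a graph with vertices 0..n-1,
    given as a list of undirected edges (u, v)."""
    if n == 1:
        return 1
    # Laplacian: degrees on the diagonal, minus edge multiplicities elsewhere.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    # Delete the last row and column, then compute the determinant exactly.
    m = [row[:-1] for row in lap[:-1]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is not connected
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)
```

Gaussian elimination needs a cubic number of arithmetic operations in the number of vertices, in line with the polynomial bound claimed above.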
6.6
FERMAT'S PROBLEM IN SEVERAL DISCRETE METRIC SPACES
Recall Fermat's Problem, which is to find a point that minimizes the sum of the distances to a finite number of given points in a metric space. We are interested in Fermat's Problem because it is the local version of Steiner's Problem. More precisely, for a given finite set $N$ of points in a metric space $(X, \rho)$, Fermat's Problem is to determine a point $q$ of $(X, \rho)$ such that the so-called Fermat function
$$F_N(q) = \sum_{v \in N} \rho(v, q)$$
is minimal. The desired point $q$ is called the Torricelli point for $N$ (in the space $(X, \rho)$).
It is well known that solutions to Fermat's problem depend fundamentally on the way in which distances in the space are determined. Consequently, there are many metric spaces to be considered. We will see that the solution methods and the complexity of solving the problem will be fundamentally different; there is no common technique for determining solutions to Fermat's Problem in each space. We will consider several specific discrete metric spaces.
I. In the sequence space $(A^d, \rho_H)$ with Hamming distance a Torricelli point for a set $N$ of $n$ given words can be found by the so-called majority rule:
Theorem 6.6.1 For each coordinate of the Torricelli point, we choose the letter of $A$ which appears most frequently in this coordinate of the given points. This simple strategy consumes $O(n \cdot d)$ time.
As an example consider the following English words of length 5:
[Table: the given words, written letter by letter, together with their consensus word obtained by the majority rule.]
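The majority rule of Theorem 6.6.1 can be sketched in a few lines. The function names below are hypothetical, and the five words used in the usage note are illustrative inputs chosen here; they are not claimed to be the entries of the table above.

```python
from collections import Counter


def hamming_distance(v, w):
    """Number of positions in which the words v and w differ."""
    return sum(a != b for a, b in zip(v, w))


def torricelli_point(words):
    """Majority rule: in every coordinate take a most frequent letter.
    Ties are broken in favour of the letter seen first."""
    if len({len(w) for w in words}) != 1:
        raise ValueError("all words must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*words)
    )


def fermat_function(q, words):
    """Sum of the Hamming distances from q to the given words."""
    return sum(hamming_distance(q, w) for w in words)
```

For the illustrative input `["MELON", "MONEY", "HONEY", "SWEET", "COOKY"]`, the rule selects the letters M, O, N, E, Y coordinate by coordinate. Each coordinate is handled independently, which is why a single pass over the $n \cdot d$ letters suffices.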
II. We transform a sequence space into a connected graph. Then Fermat's Problem becomes the so-called 1-Median problem, which can be solved in polynomially bounded time. Now, assume that the length function $f : E \to \mathbb{R}$ for the graph $G = (V, E)$ is always equal to one. Then the length of a shortest path between the vertices $v$ and $v'$ is the number of edges in this path. Thus, the Fermat function $F_V$ is, up to the factor $1/|V|$, the average distance to all vertices. A vertex attaining the minimum value of $F_V$ is called a median. It is well known that in a tree there are either one or two adjacent medians; see Zykov [476]. Zelinka [474] shows that a vertex $v$ of a tree $T = (V, E)$ is a median of $T$ if and only if for any neighbour $w$ of $v$ the largest subtree that includes $w$ but does not include $v$ contains at most $|V|/2$ vertices. This implies a linear time algorithm for identifying medians.
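The characterization of medians by branch sizes can be turned into a linear-time procedure. The sketch below is one possible realization; the adjacency-dictionary input format and the name `tree_medians` are assumptions made here.

```python
def tree_medians(adjacency):
    """Medians of a tree given as a dict: vertex -> list of neighbours.
    A vertex is a median iff every branch hanging off it has
    at most |V|/2 vertices."""
    n = len(adjacency)
    root = next(iter(adjacency))
    parent = {root: None}
    order = [root]
    for v in order:  # breadth-first order; the list grows while we iterate
        for w in adjacency[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)
    # Subtree sizes, accumulated from the leaves towards the root.
    size = {v: 1 for v in adjacency}
    for v in reversed(order):
        if parent[v] is not None:
            size[parent[v]] += size[v]
    medians = []
    for v in adjacency:
        # Branches through the children, plus the branch through the parent.
        branches = [size[w] for w in adjacency[v] if parent.get(w) == v]
        if parent[v] is not None:
            branches.append(n - size[v])
        if all(2 * b <= n for b in branches):
            medians.append(v)
    return medians
```

Every vertex and edge is visited a constant number of times, so the running time is linear in the size of the tree.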
III. In accordance with 5.2.15 the next corollary solves Fermat's Problem for a finite set of given points in a phylogenetic space.
Using this corollary seems difficult. Instead of pursuing this further, we discuss several other strategies for solving Fermat's Problem.
IV. Recall that $(\mathcal{T}_n, \rho_S)$ denotes the metric space of all $N$-trees, $|N| = n$, equipped with the split metric. For a collection $\mathcal{N} = \{T_1, \ldots, T_k\}$ of $N$-trees the Torricelli point $T_M$ is called a median tree for $\mathcal{N}$. That means that $T_M$ minimizes the function
$$F(T) = \sum_{i=1}^{k} \rho_S(T, T_i).$$
A tree minimizing the function
$$G(T) = \max_{i=1,\ldots,k} \rho_S(T, T_i) \qquad (6.83)$$
is called a center tree for $\mathcal{N}$. Similarly, we define the median and center tree for the metric space $(\mathcal{R}_n, \rho_C)$ of all rooted $N$-trees equipped with the classification metric.
Median and center trees are specific kinds of so-called consensus trees, which are trees summarizing information common to a collection of trees. For more facts about the median function on classifications see McMorris and Powers [30%]. We will discuss it in the next chapter.
TREE BUILDING ALGORITHMS
Evolution implies that many different species have a common ancestor and that all forms of life probably stem from the same remote beginnings. Hence, one of the tasks evolution sets for biologists is to discover the relationships among the species alive today and to trace the ancestors from which they descended. Trees are widely used to represent presumed historical relationships among a group of related biological entities. Here we use our concept of "Shortest Connectivity" and search for a most parsimonious tree. The "central dogma" is:
A most parsimonious tree is an SMT in a phylogenetic space.
But this task is more complicated than it seems at first glance; Gould [188] wrote: When systematists, also known as taxonomists, set out to reconstruct the phylogeny (evolutionary history) of a group of species that they think are related, they have before them the species living today and the fossil record. To reconstruct a phylogenetic history as closely as possible, they must make inferences based on observational and experimental data. The difficulty is that what can be measured is similarity, whereas the goal is to determine relatedness.
Footnote: The fact that all life in the world today uses a similar genetic code is one hint that this view is correct.
Note that the definition of similarity cannot be the problem of the mathematical analysis. This is, in any case, the task of the biological sciences. But mathematics can help to check that the choice was not false. The phylogenetic analysis of a family of (related) nucleic acid or protein sequences is the determination of how the family might have been derived during evolution. Evolutionary relationships among the sequences are depicted by placing the sequences on the leaves of a tree. The branching relationships on the internal vertices of the tree then reflect which sequences are related. Starting with a set of known present-day objects, a phylogenetic tree may be constructed by first assigning each object a leaf of the tree and then assigning ancestral and unknown objects to the internal nodes. Summarizing all these considerations, Penny [337] subdivided the process of determining phylogenies into six steps as follows:
1. Collecting data;
2. Selecting a biological model;
3. Deciding on an optimality criterion;
4. Generating all optimal networks;
5. Finding ancestral states;
6. Determining which point in the tree represents the root.
We discussed most of these points in the preceding chapters and will now focus on creating methods to build the trees. The key question is the reconstruction of the network based on contemporary data. This data may come in one of several forms:
(a) A set of sequences in a phylogenetic space.
(b) A matrix of distances.
(c) A multiple alignment with the sequences corresponding to leaves.
(d) A character state matrix.
(e) Subtrees.
7.1
TREE BUILDING METHODS - AN OVERVIEW
The mathematical theory behind phylogenetic programs is interesting in that it comes from several areas of pure and applied mathematics. Currently there are many different algorithms used for construction of phylogenetic trees; four major categories of methods for inferring relationships are named by Fitch [163]:
Distance: Minimize the square of the deviation between original and reconstructed distances, or the length of the tree with no reconstructed distances less than observed distances.
Maximum Parsimony: Minimize the number of substitutions/indels required to account for sequences at the tips of the tree.
Maximum Likelihood: Maximize the probability of the data given the tree. More precisely, for a model of substitutions select a tree $T$ and substitution parameters on the edges of $T$, and calculate the probability (called the likelihood) of obtaining the observed data.
Invariants: Count the site patterns; adding and subtracting combinations of these leads to values whose expectation is zero for all but the correct tree.
Although the underlying concepts are relatively simple, the mathematics is complex. Overviews of tree making algorithms are given by Clote and Backofen [109], Hall [205], Mount [317], Penny and Hendy [341], and Swofford and Olsen [418].
We will consider methods which are related to our approach of "Shortest Connectivity". Here, we focus on input data that is either a collection of sequences, or a matrix of distances between each pair of taxa (names). The output is a graph, almost always in the form of a phylogenetic tree, and sometimes a network. For other approaches to making phylogenetic trees based on sequence data compare Bandelt and Dress [29], Dress et al. [131], Dress and Krüger [130], Hendy [214], Hendy, Penny and Steel [21P], and Strimmer and von Haeseler [413].
We will distinguish between exact and heuristic/approximation techniques.
7.1.1
Exact Methods
Until now we have discussed three exact methods to construct an SMT for a finite set of sequences in a phylogenetic space:
An extension of a dynamic programming approach, described in 5.2.15.
An enumeration strategy, described in 6.1.1.
A multiple alignment-based strategy, described in chapter 5.
The first two approaches consume exponential time. The third method is simpler, but the complexity is in the method for creating the alignment. Below, we present another exact method, also using exponential time (which we cannot expect to avoid, since the problem is $NP$-complete, compare 6.1.2).
7.1.2
Heuristics
We have seen several times that it is hard to find an SMT, which is a tree with minimal length among all trees. Consequently, we are interested in deriving effective heuristics for Steiner's Problem. By an effective heuristic we mean an algorithm which finds a tree interconnecting a set $N$ of $n$ points, runs in time bounded by a low-degree polynomial in $n$, and whose output tree has a length that does not exceed that of an SMT for $N$ by a great deal. Until now we have considered the following such methods:
An MST as a heuristic: We can find a tree interconnecting a set of $n$ points in a metric space in quadratic time with a length at most twice the length of an SMT. But: (i) The performance is bad; (ii) In general, the result of such a heuristic is not an $N$-tree.
While a $k$-SMT can be used as a heuristic, the time required for construction is a large polynomial of the number of points unless $k$ is very small, but then the performance cannot be good. Thus the $k$-SMT is more a generalization of the SMT than a heuristic.
The iterated 1-Steiner heuristic (see section 4.2). This heuristic is based on routines that solve the 1-Steiner problem.
We will discuss several approaches designed for phylogenetic spaces.
7.2
MAXIMUM PARSIMONY METHOD
The maximum parsimony method is a popular technique for reconstructing phylogenetic trees from sequences or states of characters. The principle of Maximum Parsimony involves the identification of a combinatorial structure that requires the smallest number of evolutionary changes. This is an application of Ockham's razor, according to which the best hypothesis is the one requiring the smallest number of assumptions. It means that among all possible structures we seek one which satisfies only one, namely the condition of length minimization. While the validity of parsimony has been debated, it can be justified on biological grounds, see Page and Holmes [331]. But note several problems with this point of view:
The amount of change recovered by parsimony is, by definition, the smallest possible amount that is consistent with the data. The actual amount of evolutionary change may have been somewhat larger.
We may find more than one minimal-length tree interconnecting the given entities.
Parsimony is widely used in practice, as attested by the popularity of software such as PAUP, which stands for "Phylogenetic analysis using parsimony"; see Hall [203] and Swofford [417].
A well-known method to compute an SMT in a sequence space generalizes theorem 6.6.1 to a dynamic programming algorithm for finding the location of the internal vertices in a given $N$-tree:
Footnote: Empirical results show an excellent improvement for this heuristic in the "classical" metric spaces, namely the Euclidean plane and the plane with rectilinear distance. Is this also true in phylogenetic spaces?
Footnote: See our discussion in the beginning of the fifth chapter.
Algorithm 7.2.1 (Fitch [162]) Let $N$ be a set of $n$ sequences in a sequence space $(A^d, \rho_H)$: $N = \{v_k = v_{k,1} \ldots v_{k,d} : k = 1, \ldots, n\}$, and let a binary $N$-tree $T = (V, E)$ be given. Then do:
1. For each position $i = 1, \ldots, d$ do
   1. Mark each leaf $v_k$ with $\{v_{k,i}\}$; $L_i := 0$;
   2. Until all vertices are marked do
      Find an unmarked vertex which is adjacent to two marked vertices with the marks $N_1$ and $N_2$;
      Mark the unmarked vertex with
      (a) $N_1 \cap N_2$ if $N_1 \cap N_2 \neq \emptyset$; otherwise
      (b) $N_1 \cup N_2$ and $L_i := L_i + 1$;
2. $L(T) := \sum_{i=1}^{d} L_i$.
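The marking procedure of 7.2.1 can be sketched as a recursion over a rooted version of the tree. This is a simplified illustration under stated assumptions: the tree is given as nested pairs of leaf names (that is, rooted at an arbitrarily chosen edge), and `fitch_length` and the dictionary `sequences` are names introduced here.

```python
def fitch_length(tree, sequences):
    """Length of a binary tree under the Hamming distance.

    tree      -- nested 2-tuples whose leaves are keys of `sequences`
    sequences -- dict: leaf name -> word (all words of equal length)
    """
    d = len(next(iter(sequences.values())))
    total = 0
    for i in range(d):
        changes = 0  # plays the role of L_i

        def mark(node):
            nonlocal changes
            if not isinstance(node, tuple):
                return {sequences[node][i]}  # a leaf is marked with its letter
            left, right = (mark(child) for child in node)
            common = left & right
            if common:
                return common                # case (a): intersection
            changes += 1                     # case (b): union, one change
            return left | right

        mark(tree)
        total += changes
    return total
```

With the leaves `a = "A"`, `b = "A"`, `c = "C"`, `d = "C"`, the tree `(("a", "b"), ("c", "d"))` needs one change, while `(("a", "c"), ("b", "d"))` needs two.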
The correctness of Fitch's algorithm is proven by Hartigan [213]. In particular, it is shown that the final answer is independent of the vertices chosen when moving through the tree. The algorithm computes the length of the tree. Since a binary $N$-tree has $2n - 2$ vertices, it uses $O(n)$ time for each position and hence $O(d \cdot n)$ time to find the length. On the other hand, there are exponentially many binary trees. Hence,
Observation 7.2.2 The Fitch algorithm 7.2.1 uses linear time to find the length of a given binary $N$-tree in a sequence space. Applying 7.2.1 for all binary $N$-trees finds an SMT for a finite set of given points in a sequence space in exponential time.
After applying 7.2.1 we have marks for all the internal vertices in the tree. However, some marks have more than one letter and hence are ambiguous. There are several methods for choosing which one of the possible states yields the most parsimonious reconstruction; the simplest one is Farris' method: go back up the tree assigning to any internal vertex that is ambiguous the intersection of its mark with that of its immediate ancestor. However, as the number of possible trees increases rapidly with the number of given sequences, it is virtually impossible to employ an exhaustive search when the number of given sequences is not small. Fortunately, there exist shortcut algorithms for identifying all shortest trees that do not require exhaustive enumeration, and work for larger sets of sequences. One such algorithm is the branch-and-bound method by Hendy and Penny [216], described briefly below:
1. Guess a "good tree" $T_0$ using a heuristic; $L_0 := L(T_0)$; Let $X$ be the set of all binary $N$-trees;
2. (Iteration:)
   1. Partition $X$ into a small number of subsets $X_1, X_2, \ldots, X_k$;
   2. For $i := 1, \ldots, k$ do
      - Find a length $L(X_i)$ such that $L(T) \ge L(X_i)$ for all $T \in X_i$;
      - If $L(X_i) \le L_0$ then iterate (return to 1. with $X = X_i$).
The observation 5.2.1 suggests that the method given by 7.2.1 can be extended to find the location of Steiner points in phylogenetic spaces $(A^*, \rho_L)$. And indeed, Sankoff gives a dynamic programming algorithm for tree alignment. He merges the high-dimensional version of the dynamic programming algorithm for pairwise alignment with the Fitch algorithm:
Observation 7.2.3 (Sankoff [378]) Let $N$ be a set of $n$ words in the phylogenetic space $(A^*, \rho_L)$. Let a binary $N$-tree $T$ be given. Then the location of the Steiner points in $T$ can be reduced to $(2d)^n$ applications of 7.2.1, where $d = \max\{|v| : v \in N\}$.
7.3
THE PERFECT PHYLOGENY PROBLEM
Now, we consider character state data. Recall the perfect phylogeny problem.
Given: A set $N$ of $n$ taxa on a set $C$ of characters, represented by an $n \times m$ character-state matrix $M$.
Determine: Whether a perfect phylogeny exists. And, if so, construct one.
The following observation is not a surprise:
Theorem 7.3.1 (Steel [406]) The perfect phylogeny problem is $NP$-complete.
We will now restrict ourselves to the binary case, that is, we allow a character to take exactly two states: $M$ is a 0-1-matrix. Here, we will see that the problem can be solved efficiently.
For the following algorithm it will be convenient to first reorder the columns of $M$. Consider each column as a binary number; sort these $m$ numbers into decreasing order, placing the largest number in column 1. Let $\overline{M}$ denote the reordered matrix $M$. From this point on, each character will be named by the column it occupies in $\overline{M}$. Hence, a character $j$ will be to the right of character $i$ in $\overline{M}$ if and only if $i < j$. For any column $k$ of $\overline{M}$, let $O_k$ be the set of taxa with a 1 in column $k$, that is, the taxa that have character $k$. Clearly, if $O_k$ strictly contains $O_j$, then column (character) $k$ must be to the left of column $j$ in the matrix $\overline{M}$. The major fact and the basis for an efficient solution of the perfect phylogeny problem is
Theorem 7.3.2 The matrix $M$ has a phylogenetic tree if and only if for every pair of columns $i$ and $j$, either $O_i$ and $O_j$ are disjoint or one contains the other.
This theorem is intuitively clear, and a complete proof is given in [198] and [391]. To make this technique clearer, Gusfield [198] furnishes the following small example: Let $M$ be the matrix
In view of 7.3.2 we create the following algorithm:
Algorithm 7.3.3 Given a character-state matrix $M$ for $n$ taxa and $m$ binary characters, we find a perfect phylogeny, if it exists, by using the following procedure:
1. Create the reordered matrix $\overline{M}$;
2. For each row $\overline{M}_i$ of $\overline{M}$ do
   Construct the string consisting of the characters, in sorted (increasing) order, that $\overline{M}_i$ possesses;
3. Build the keyword tree $T$ for the constructed strings;
4. Test whether $T$ is a perfect phylogeny for $M$.
It is shown that this algorithm can be implemented in $O(m \cdot n)$ time; see [198]. In view of the independence of the characters and the necessity of the knowledge of all $m \cdot n$ values, no algorithm can be (asymptotically) faster.
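The criterion of Theorem 7.3.2 can be tested directly by comparing the sets $O_k$ pairwise. The sketch below does exactly this; it is a simple quadratic-in-$m$ check written for illustration, not the $O(m \cdot n)$ keyword-tree procedure of 7.3.3, and the function names are assumptions made here.

```python
def sort_columns(matrix):
    """Reorder the columns of a 0-1 matrix so that, read as binary numbers
    (top row = most significant bit), they are in decreasing order."""
    columns = sorted(zip(*matrix), reverse=True)
    return [list(row) for row in zip(*columns)]


def has_perfect_phylogeny(matrix):
    """Pairwise test: every two sets O_i, O_j are disjoint or nested."""
    if not matrix:
        return True
    m = len(matrix[0])
    taxa_sets = [
        frozenset(t for t, row in enumerate(matrix) if row[k] == 1)
        for k in range(m)
    ]
    for i in range(m):
        for j in range(i + 1, m):
            a, b = taxa_sets[i], taxa_sets[j]
            if a & b and not (a <= b or b <= a):
                return False
    return True
```

The pairwise test does not depend on the column order, so `sort_columns` is only needed when the keyword tree itself is to be built.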
7.4
PAIR GROUP METHODS
Most heuristic methods use classification techniques. Remember our approach in the chapter before, where we created a procedure involving successive fusions of $n = |N|$ individuals into groups. The algorithm attempts to cluster the data by repeatedly grouping the closest elements. Noting that phylogenetic trees are binary trees, we "connect" exactly two elements. Thus this clustering technique is called the pair group method (PGM), and implies a heuristic for constructing shortest trees. Here, we have to find the nearest pair and to determine the new distances. We start with the distance matrix $D = D(N, \rho) = (d_{ij})_{i,j=1,\ldots,n}$ for $N = \{v_1, \ldots, v_n\}$ in the space $(X, \rho)$, defined by
$$d_{ij} = \rho(v_i, v_j).$$
Since the metric $\rho$ is symmetric and the distance between a point and itself equals 0, it is only necessary to compute $n(n-1)/2$ values of $D$.
After each step of the procedure we compute a new matrix whose entries are inter-point and inter-class distances; in other words, we convert the distance to a cluster-distance, that is, a function $d : \mathcal{N} \times \mathcal{N} \to \mathbb{R}$, where $\mathcal{N}$ denotes a classification of $N$. The specifics of these computations distinguish the methods. Consequently, our basic strategy will be linking the least distant pairs of taxa, followed by successively more distant taxa or classes of taxa. When two taxa are linked, they lose their individual identities and are subsequently referred to as a single class. The process is complete when the last two classes are merged into a single class containing all of the original taxa. When stated as an algorithm:
Algorithm 7.4.1 Let $N = \{v_1, \ldots, v_n\}$ be a set of $n$ points in a metric space $(X, \rho)$. Then:
1. Start with the classification $\mathcal{N} := \{\{v_1\}, \ldots, \{v_n\}\}$;
2. While $|\mathcal{N}| \neq 1$ do
   (i) Find the nearest pair of distinct sets in $\mathcal{N}$, say $N_i$ and $N_j$;
   (ii) Create the new set $N' = N_i \cup N_j$; $\mathcal{N} := \mathcal{N} \setminus \{N_i, N_j\} \cup \{N'\}$;
   (iii) Store the distances;
   (iv) Compute the new distance function $d$ and the new distance matrix $D$; that is, define $d(K, N')$ for all $K \in \mathcal{N}$.
Note that we start with a distance matrix with $n(n-1)/2$ parameters. Since a tree is defined by $n - 1$ parameters, we cut the number of parameters by the factor $n/2$. Thus we may lose some information. In a general sense, this is a local search strategy using a minimum-length optimality criterion. There is still room for evaluating options here.
7.4.1 starts with a matrix of size $n \times n$. The closest pair of sets are amalgamated, and the distances from this amalgam to each of the other sets are determined. These distances replace the distances from the two sets being joined in the matrix, producing a new distance matrix of size $(n-1) \times (n-1)$. Then 7.4.1 starts again; the process is repeated $n - 2$ times. Hence
Observation 7.4.2 Algorithm 7.4.1 uses cubic time.
The following is an input for which the algorithm fails: Let $N = \{v_1, \ldots, v_4\}$ and let $((v_1, v_2), (v_3, v_4))$ be an $N$-tree with the length-function equal to 3 for the edges adjacent to $v_1$ and $v_4$, and equal to 1 otherwise. Then the distance matrix is given by:
Algorithm 7.4.1 amalgamates $v_2$ and $v_3$, in violation of the fact that these vertices are separated in the original tree.
Since we always find a tree interconnecting $N$ which has a length no more than an MST for $N$, we have, together with 5.2.2:
Observation 7.4.3 Let $\mathcal{H}$ be a heuristic or approximation method for finding a tree for $N$ in the metric space $(X, \rho)$. Then
$$\frac{L(X, \rho)(\text{SMT for } N)}{L(X, \rho)(\text{tree given by } \mathcal{H} \text{ for } N)} \ge m(X, \rho) \ge \frac{1}{2}.$$
Are there circumstances in which the PGM works exactly? The answer is given by the following consequence of 6.3.6:
Remark 7.4.4 Algorithm 7.4.1 gives the correct tree if the distances form an ultrametric space.
Moreover, similarity and evolutionary relationships will only coincide exactly if the distances are ultrametric; compare [278]. That means that ultrametric distances will precisely fit a tree so that the distance between any two taxa is equal to the sum of the lengths of the edges joining them, and the tree can be rooted so that all of the taxa are equidistant from the root.
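Whether a given distance matrix is ultrametric can be checked with the three-point condition: among the three distances of any triple, the two largest must coincide. The following is a small sketch of that check; the name `is_ultrametric` and the tolerance parameter are assumptions made here.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-12):
    """Three-point condition on a symmetric distance matrix:
    d(x, z) <= max(d(x, y), d(y, z)) for all triples, which holds
    exactly when the two largest distances of every triple are equal."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        a, b, c = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if c - b > tol:
            return False
    return True
```

The test inspects all triples and therefore needs cubic time, the same order as Algorithm 7.4.1 itself.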
7.4.1
Linkage Clustering
One of the simplest agglomerative methods is linkage clustering. We distinguish between two kinds of such techniques: the single and the complete linkage clustering.
The main feature of single linkage clustering is that the distance between classes is defined as that between the closest pair of individuals, where only pairs consisting of one individual from each class are considered. Suppose we choose sets in $\mathcal{N}$, say $N_i$ and $N_j$, to amalgamate to form the new set $N' = N_i \cup N_j$. A new distance function (and matrix) is found by recalculating with $N'$ replacing $N_i$ and $N_j$, defined as follows: For all sets $K \in \mathcal{N} \setminus \{N'\}$ define
$$d(K, N') = \min\{d(K, N_i), d(K, N_j)\}.$$
The single linkage method is closely related to minimum spanning trees. This can be seen if we compare this technique with algorithm 1.2.8.
The complete linkage clustering method is the opposite of single linkage in the sense that the distance between classes is now defined as that between the most distant pair of individuals, one from each class. In other words, for all sets $K \in \mathcal{N} \setminus \{N'\}$ define
$$d(K, N') = \max\{d(K, N_i), d(K, N_j)\}.$$
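The pair group procedure 7.4.1 with a pluggable update rule can be sketched as follows. This is a minimal illustration, assuming the input is a symmetric distance matrix given as a list of lists; the function name `pair_group` and the `update` parameter are hypothetical. Passing `min` gives single linkage and passing `max` gives complete linkage.

```python
def pair_group(dist, update=min):
    """Repeatedly amalgamate the nearest pair of classes.

    dist   -- symmetric distance matrix (list of lists)
    update -- rule combining d(K, N_i) and d(K, N_j) into d(K, N')
    Returns the list of merges as (class_i, class_j, distance).
    """
    clusters = {i: frozenset([i]) for i in range(len(dist))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        # (i) nearest pair of distinct classes
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        merges.append((clusters[i], clusters[j], dij))
        # (iv) distances from every other class K to the new class
        for k in clusters:
            if k not in (i, j):
                d[(k, next_id)] = update(
                    d[tuple(sorted((k, i)))], d[tuple(sorted((k, j)))]
                )
        d = {p: val for p, val in d.items() if i not in p and j not in p}
        # (ii) replace the two classes by their union
        new_class = clusters.pop(i) | clusters.pop(j)
        clusters[next_id] = new_class
        next_id += 1
    return merges
```

Keeping the update rule as a parameter mirrors the remark in the text that the specifics of this computation are what distinguish the methods.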
7.4.2
Simple joining
One of the most popular methods is the nearest neighbor technique.
For all sets $K \in \mathcal{N} \setminus \{N'\}$ define
Theorem 7.4.5 If the distance function $d$ in the procedure 7.4.1 comes from a metric then $d$ is a dissimilarity.
7.4.3
UPGMA and WPGMA
Another specific variant of our PGM algorithm is the unweighted pair group method with arithmetic mean (UPGMA). It is the most commonly used clustering method. Here, the last step of 7.4.1 becomes: For all sets $K \in \mathcal{N} \setminus \{N'\}$ define
$$d(K, N') = \frac{1}{2}\left(d(K, N_i) + d(K, N_j)\right).$$
A modification of UPGMA is the weighted pair group method with arithmetic mean (WPGMA), which weighs the sets of $\mathcal{N}$ by their size: For all sets $K \in \mathcal{N} \setminus \{N'\}$ define
$$d(K, N') = \frac{|N_i| \cdot d(K, N_i) + |N_j| \cdot d(K, N_j)}{|N_i| + |N_j|}.$$
Theorem 7.4.6 Let the distance function $d$ in the procedure 7.4.1 be changed as follows:
$$d^{new}(K, N') = \alpha \cdot d^{old}(K, N_i) + \beta \cdot d^{old}(K, N_j), \qquad (7.10)$$
whereby $\alpha, \beta \ge 0$ and $\alpha + \beta = 1$. If $d^{old}$ is a metric, then $d^{new}$ is also a metric.
Proof. Only the triangle inequality needs to be verified. Consider for $K, K' \in \mathcal{N} \setminus \{N'\}$:
$$d^{new}(K, N') + d^{new}(N', K') = \alpha \cdot d^{old}(K, N_i) + \beta \cdot d^{old}(K, N_j) + \alpha \cdot d^{old}(K', N_i) + \beta \cdot d^{old}(K', N_j)$$
$$\ge \alpha \cdot d^{old}(K, K') + \beta \cdot d^{old}(K, K') = (\alpha + \beta) \cdot d^{old}(K, K') = d^{old}(K, K') = d^{new}(K, K'),$$
and similarly
$$d^{new}(K, K') + d^{new}(K', N') = \alpha \left(d^{old}(K, K') + d^{old}(K', N_i)\right) + \beta \left(d^{old}(K, K') + d^{old}(K', N_j)\right) \ge d^{new}(K, N').$$
Footnote: Strangely, sometimes in referenced works, our method UPGMA is called WPGMA and vice versa.
As an example consider $N = \{v_1, \ldots, v_5\}$ and the (ultrametric) distance matrix given by
UPGMA creates the classification
The related unrooted $N$-tree is
which is, in view of 7.4.4, the correct tree, and of length 12.
7.5
STEINERIZATION
Let $N$ be a finite set of points in a phylogenetic space. We convert an MST for $N$ into a shorter tree through a sequence of "local Steinerizations" that introduce new (Steiner) points in order to reduce the length of the tree. Note that once added, a Steiner point cannot be removed. Consequently, we have a monotonic iterative process. Clearly, it is necessary to explain the Steinerization step more carefully.
Algorithm 7.5.1 Let $N$ be a finite set of points in the metric space $(X, \rho)$. Then
1. Find an MST $T = (V, E)$ for $N$;
2. If there are three vertices $v_1, v_2, v_3$ of $T$ which form a path of length 2, without loss of generality $v_1v_3, v_2v_3 \in E$, then find a point $q$ that minimizes the Fermat function for these vertices;
   If $\rho(v_1, q) + \rho(v_2, q) + \rho(v_3, q) < \rho(v_1, v_3) + \rho(v_2, v_3)$ then do
   $V := V \cup \{q\}$; $E := E \setminus \{v_1v_3, v_2v_3\} \cup \{v_1q, v_2q, v_3q\}$;
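For a sequence space with Hamming distance, the local Steinerization step can be sketched as below. This is an illustrative simplification under several assumptions made here: the MST is computed with Prim's algorithm, the Torricelli point of three words is obtained by the majority rule of Theorem 6.6.1, vertices are identified by their index in the list `words`, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))


def minimum_spanning_tree(words):
    """Prim's algorithm on the complete graph; edges are index pairs."""
    in_tree, edges = {0}, set()
    while len(in_tree) < len(words):
        u, v = min(
            ((a, b) for a in in_tree for b in range(len(words)) if b not in in_tree),
            key=lambda e: hamming(words[e[0]], words[e[1]]),
        )
        in_tree.add(v)
        edges.add(frozenset((u, v)))
    return edges


def majority(a, b, c):
    """Torricelli point of three words under the Hamming distance."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(a, b, c))


def tree_length(words, edges):
    return sum(hamming(words[u], words[v]) for u, v in map(tuple, edges))


def steinerize(words):
    """Start from an MST and apply local Steinerizations while they help."""
    words = list(words)
    edges = minimum_spanning_tree(words)
    improved = True
    while improved:
        improved = False
        for e1, e2 in combinations(list(edges), 2):
            shared = e1 & e2
            if len(shared) != 1:
                continue  # the two edges do not form a path of length 2
            (v3,) = shared            # middle vertex of the path
            (v1,) = e1 - shared
            (v2,) = e2 - shared
            q = majority(words[v1], words[v2], words[v3])
            old = hamming(words[v1], words[v3]) + hamming(words[v2], words[v3])
            new = sum(hamming(words[v], q) for v in (v1, v2, v3))
            if new < old:
                words.append(q)       # q becomes a new (Steiner) vertex
                qi = len(words) - 1
                edges -= {e1, e2}
                edges |= {frozenset((v, qi)) for v in (v1, v2, v3)}
                improved = True
                break
    return words, edges
```

Every accepted step lowers the integer tree length by at least one, so the loop terminates. For the words `AAB`, `ABA`, `BAA` the MST has length 4, and one Steinerization with the point `AAA` reduces the length to 3.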
Observation 7.5.2 The algorithm produces a tree for $N$ in which each Steiner point has degree three.
The first step needs $O(n^2 \log n)$ time, where $n = |N|$. $G$ is a graph with at most $O(n^3)$ triangles; in each triangle, we find a Torricelli point in constant time and consequently, the third step needs only linear time. Hence,
Theorem 7.5.3 The algorithm runs in cubic time.
To make this technique clearer, Foulds [168] furnishes the following small artificial example: Let $N = \{w_1, \ldots, w_5\}$ be a set of sequences over the alphabet $\{a, c, g, u\}$ given by
[Table: the five sequences $w_1, \ldots, w_5$.]
Footnote: This idea was applied to Steiner's Problem in the Euclidean plane for the first time in Smith et al. [400].
We use the Hamming distance. Applying an MST algorithm we can obtain (among other MSTs) a tree $T$ for $N$ of length 12. Introducing new sequences may reduce the length. This is brought about using Steinerization as follows: Suppose we have a substitution at the column $k$ between the letters $x$ and $y$, which appears on two edges $ww'$, $w'w''$ of $T$ which are incident to the same vertex $w'$. The tree $T$ can be modified by introducing the new sequence $v$, a Steiner point, which differs from the sequence $w'$ only at position $k$. Let $T = (V, E)$; then
$$T^{new} = (V \cup \{v\}, E \setminus \{ww', w'w''\} \cup \{vw, vw', vw''\}), \qquad (7.13)$$
and we find that the tree $T^{new}$ is shorter than $T$ by one. This process can be used repeatedly to reduce the length. In our example we consider three positions 1, 2 and 7. At the end we find a tree with two Steiner points.
The tree is rooted at $w_1$ and of length 10.
To determine the relative error (the length of a tree constructed by 7.5.1 divided by the length of an MST) first we note
Lemma 7.5.4 (C. [92]) Let $N$ be a finite set of points. Let $q$ denote a Torricelli point for $N$. Then
Then
Theorem 7.5.5 Let $N$ be a finite set of points. Let $T$ be an MST for $N$ and let $T'$ be a tree determined by the algorithm 7.5.1. Then
Proof.
where $N(q)$ is the set of all neighbors of $q$ in $T'$ and $G' = (N, \{vv' : v, v' \in N\})$, the subgraph of $T'$ induced by $N$, is a forest for $N$. Let $T_q$ be an MST for $N(q)$. Using the preceding lemma we continue:
Since $G' \cup \bigcup_{q} T_q$ is a tree interconnecting $N$ without Steiner points, the assertion follows.
(3.16) said that the Steiner ratio of the phylogenetic and sequence spaces equals 1/2, at least approximately. It follows that algorithm 7.5.1 is not sufficient to find an SMT in general. The preceding lemma says that it is necessary to consider subsets of four elements of the given set.
A related method, for creating the so-called phylogenetic median-joining network, is given by Bandelt et al. [30]. The algorithm begins with an MSN. Aiming at parsimony, it subsequently adds a few consensus sequences (as a kind of Steiner points) of three mutually close sequences. This procedure is then repeated.
7.6
HANDLING MORE THAN ONE TREE
T K O observations give reasons for considering more t,l~arione tree at the same time. So far in this book lve h a ~ ae s s u r n ~ dt,liat thc cvolutionar!- relationships among sequences8 are best represented bj- a tree. In d i e r words. we are assuming that a tree models reality. Hon-ever, the actual evolut,ionary history may be not be In particular tree-like, in n-hidl case analyses t h t assume a tree may be seriously mislcadirig. Thcrc are lirriitatioris in always forcing dat,a onto a standard phylogenetic tree. Processes such as parallel mutation. hj.bridization, recombination, gene conversion arid lat,eral gene t,ransfer violat,e a tree-hased evolutionary model. fundamental problem in biological classification is t,lie question of how best, t,o combine int,o onc phj.loge~letictree a rollection of phylogenetic trees t,liat, classify the same or overlapping sets of taxa. n ' h ~ r ithe sets of t a s a are the same, t,his problem is known as the consensus trep problen~;t,he rriorr gcneral situation. where the tree classify possibly difFerent. though 01-erlapping. sets of taxa. has become ltnown as the supertrec problem. S l o r ~ o ~ . oin r : biology tlicre are ex;i~nplesof a rct,urn to a n aucestral state and convergent evolution. Such events are rare. hut c a ~ i n o the ignored. In other n-ords, if n:e look for " T h e Great Darwin Tree": n.e n-ill find a rietxork n-itli s e v ~ r a cyc1t.s. l But the nurnher of (elementary) cycles n-ill he srnall in relation t,o tlic number of vertices."
⁸ and genes, species, organisms, ...
⁹ This idea of a symbiotic evolution was first given by Margulis [293], [294]; compare also [129].

Consider the following mathematical example: Let $X = \{v_1, \ldots, v_n\}$ be a set of species. Assume that the set $P = \{p_1, \ldots, p_k\}$ of proteins is given; $v_i^j$ denotes the specific form of the protein $p_j$ in the species $v_i$, $j = 1, \ldots, k$ and $i = 1, \ldots, n$. Then define

$$X^{(j)} = \{v_1^j, \ldots, v_n^j\}. \tag{7.16}$$

Let $T^{(j)}$ be a phylogenetic tree for $X^{(j)}$, $j = 1, \ldots, k$. Assuming that each set $X^{(j)}$ represents the set $X$, we have that

$$G = \bigcup_{j=1}^{k} T^{(j)}$$

is a graph interconnecting the set $X$. But, in general, we may not assume that $G$ will be a tree.

This fact was found in a real test of the theory of evolution by Penny et al. [339]. After comparing phylogenetic trees constructed from five different protein sequences, they concluded that: (1) it is possible to make falsifiable predictions from the hypothesis that species have been linked in the past by an evolutionary tree and (2) there is strong support from these five sequences for the theory of evolution. There may be exceptions where different sequences will lead to different trees as, for example, in the serial symbiosis theory. Also, in pre-cellular evolution a network with circuits may be a better model than a tree.
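How far the union of several trees is from being a tree can be measured by its number of independent cycles, $|E| - |V| + c$, where $c$ is the number of connected components. The following sketch computes this number for two small trees on the same four leaves. It rests on an assumption not fixed by the text: inner vertices of different trees are identified only if they carry the same label. The name `cycle_rank` and the vertex labels are hypothetical.

```python
def cycle_rank(vertices, edges):
    """Return |E| - |V| + c for an undirected graph without loops."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edge_set = {frozenset(e) for e in edges}  # identical edges are merged
    components = len(parent)
    for edge in edge_set:
        u, w = tuple(edge)
        root_u, root_w = find(u), find(w)
        if root_u != root_w:
            parent[root_u] = root_w
            components -= 1
    return len(edge_set) - len(parent) + components


# Two trees on the leaves a, b, c, d with different groupings:
# tree_1 separates {a, b} from {c, d}, tree_2 separates {a, c} from {b, d}.
tree_1 = [("a", "u1"), ("b", "u1"), ("u1", "w1"), ("c", "w1"), ("d", "w1")]
tree_2 = [("a", "u2"), ("c", "u2"), ("u2", "w2"), ("b", "w2"), ("d", "w2")]

vertices_1 = {v for e in tree_1 for v in e}
vertices_all = {v for e in tree_1 + tree_2 for v in e}

rank_single = cycle_rank(vertices_1, tree_1)
rank_union = cycle_rank(vertices_all, tree_1 + tree_2)
```

A single tree has cycle rank 0, whereas the union of the two conflicting trees contains cycles, so it is a network rather than a tree.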
7.6.1
Consensus trees
A consensus tree summarises information common to two or more trees. In other words:

A phylogenetic tree summarizes phylogenetic information;
A consensus tree summarizes the information in a set of trees.

Here, we have two additional observations: (a) we can combine heterogeneous data, and (b) we can find hidden phylogenetic information. For instance, Cavalli-Sforza [68] compares the species tree and the tree of languages for human populations. This gives many hints for the prehistoric development of mankind.

Consensus trees are helpful to taxonomists: when they have completed a classification, it may be that with data from another source the classification is different, or that by using a different clustering method a new classification results. The taxonomist may wish to form an overall classification which takes account of the information shared in each classification, however it is obtained.
The Consensus Tree Problem
Given: A collection $T_i$ of $X$-trees, $i = 1, \ldots, m$.
Find: An $X$-tree summarizing the phylogenetic information of all $T_i$.
There are many ways to combine $X$-trees into a single tree; see Semple and Steel [393].¹⁰ The methods differ in what aspect of tree information they use, and how frequently that information must be shared among the trees to be included in the consensus. The most commonly used are the strict consensus and the majority-rule consensus trees. Suppose that $T_1, \ldots, T_m$ are $X$-trees. Each of the trees has the same leaves, namely the members of $X$. We are interested in an $X$-tree $T$ described by one of the following methods.
I. The strict consensus tree includes only those splits that occur in all the trees. That means

$$S(T) = \bigcap_{i=1}^{m} S(T_i). \tag{7.18}$$
In these considerations, we use an immediate application of 6.2.10 which provides a partial order on $\mathcal{T}_n$: we define $T_1 \leq T_2$ for $T_1, T_2 \in \mathcal{T}_n$ precisely if $S(T_1) \subseteq S(T_2)$. This relation has the following nice properties: first, the maximal elements of $(\mathcal{T}_n, \leq)$ are the binary $X$-trees. And secondly:
Observation 7.6.1 The strict consensus tree is the (unique) greatest lower bound for $T_1, \ldots, T_m$ in $\mathcal{T}_n$ under $\leq$, namely the tree $T$ with the property (7.18).
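The intersection in (7.18) can be sketched in a few lines. The encoding is an assumption of this sketch: a tree is written as nested tuples of leaf labels, and only the non-trivial splits are collected, since the trivial splits (a single leaf against the rest) occur in every $X$-tree. The names `leaf_set`, `splits` and `strict_consensus_splits` are hypothetical.

```python
def leaf_set(tree):
    """All leaf labels of a tree given as nested tuples."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaf_set(child) for child in tree))
    return frozenset([tree])


def splits(tree):
    """Non-trivial splits {A, X minus A} induced by the edges of the tree."""
    taxa = leaf_set(tree)
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        rest = taxa - below
        if len(below) >= 2 and len(rest) >= 2:
            found.add(frozenset([below, rest]))
        return below

    visit(tree)
    return found


def strict_consensus_splits(trees):
    """Splits shared by all given trees, as in (7.18)."""
    return set.intersection(*(splits(t) for t in trees))


tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = ((("a", "b"), "d"), ("c", "e"))
shared = strict_consensus_splits([tree_1, tree_2])
```

Here only the split separating `a, b` from `c, d, e` survives, so the split set of the strict consensus is contained in the split set of each input tree, in line with the partial order described above.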
II. We can relax the requirement that a split of $T$ occur in all trees, and instead retain those splits occurring in a majority of the trees.

Algorithm 7.6.2 For each of the $X$-trees $T_1, \ldots, T_m$, mark the vertices inductively as follows:
1. Mark the leaf $v$ with $\{v\}$;

¹⁰ For simple but fundamental limitations on consensus tree methods compare Steel, Dress and Böcker [407].
For rooted $X$-trees we have another nice characterisation of majority-rule consensus trees:
7.6.2
Supertrees
Here the task is to combine (rooted) phylogenetic trees with overlapping sets of leaves into a single supertree. More precisely:
The Supertree Problem
Given: A collection $T_i$ of $X_i$-trees, $i = 1, \ldots, k$, each representing a phylogenetic tree for $X_i$.
Find: A phylogenetic $X$-tree for $X = \bigcup_{i=1}^{k} X_i$.
Each of the original trees can be regarded as a sample taken from the supertree. The supertree approach is an example of the so-called Divide and Conquer method:
1. Divide the set of sequences into groups;
2. Find a phylogenetic tree for each group;
3. Combine the trees into one.

There are many ways to combine phylogenetic trees into a single tree; see Bryant, Semple, Steel [58], and Semple and Steel [393]. For simple but fundamental limitations on supertree methods compare Steel et al. [407].
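The remark that each original tree is a sample taken from the supertree can be checked mechanically for rooted trees. The sketch below is not a supertree construction. It assumes rooted trees written as nested tuples and uses the fact that a rooted tree is determined by its clusters (the leaf sets below its vertices); restricting a tree to a leaf subset then amounts to intersecting every cluster with that subset. The names `leaves`, `clusters` and `restricts_to` are hypothetical.

```python
def leaves(tree):
    """All leaf labels of a rooted tree given as nested tuples."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaves(child) for child in tree))
    return frozenset([tree])


def clusters(tree, keep=None):
    """Clusters with at least two leaves, optionally restricted to `keep`."""
    found = set()

    def visit(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(visit(child) for child in node))
        else:
            below = frozenset([node])
            if keep is not None:
                below = below & keep  # drop leaves outside the sample
        if len(below) >= 2:
            found.add(below)
        return below

    visit(tree)
    return found


def restricts_to(supertree, tree):
    """True if the supertree, restricted to the leaves of `tree`, equals `tree`."""
    return clusters(supertree, leaves(tree)) == clusters(tree)


tree_1 = (("a", "b"), "c")
tree_2 = (("b", "c"), "d")
candidate = ((("a", "b"), "c"), "d")
other = (("a", "b"), ("c", "d"))
```

With these inputs `candidate` restricts to both `tree_1` and `tree_2`, while `other` agrees with `tree_1` but groups `c` with `d` and therefore contradicts `tree_2`.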
REFERENCES
[1] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[2] R.K. Ahuja. Flows and Paths. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 283-310. John Wiley and Sons, 1997.
[4] M. Aigner and G.M. Ziegler. Proofs from The Book. Springer, 1989.
[5] S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, 1989.
[6] V. Akman and W.R. Franklin. On the question "Is $\sum \sqrt{a_i} \leq L$?". Bulletin of the European Association for Theoretical Computer Science (EATCS), 28:16-20, 1986.
[7] J. Albrecht and D. Cieslik. The Steiner ratio of finite dimensional $L_p$ spaces. In D.Z. Du, J.M. Smith, and J.H. Rubinstein, editors, Advances in Steiner Trees, pages 1-13. Kluwer Academic Publishers, 2000.
[8] J. Albrecht and D. Cieslik. The Steiner ratio of $L_p$-planes. In D.Z. Du and P.M. Pardalos, editors, Handbook of Combinatorial Optimization, volume A, pages 573-590. Kluwer Academic Publishers, 2000.
[9] J. Albrecht and D. Cieslik. The Steiner Ratio of $L_p^3$. In Generalized Convexity and Generalized Monotonicity, number 502 in Lecture Notes in Economics and Mathematical Systems, pages 73-87. Springer-Verlag, 2001.
[10] J. Albrecht and D. Cieslik. The Steiner Ratio of three-dimensional $l_p$-spaces is essentially greater than 0.5. Congressus Numerantium, 153:179-185, 2001.
[11] P.S. Alexandrov. Die Hilbertschen Probleme. Akademische Verlagsgesellschaft Geest & Portig, Leipzig, 1979.
[12] I. Althöfer. Spanners of Graphs with Arbitrary Edge Length. Preprint 891024, Universität Bielefeld, Germany, 1989.
[13] S.F. Altschul. A Protein Alignment Scoring System Sensitive at All Evolutionary Distances. J. Molecular Evolution, 36:290-300, 1993.
[14] I. Anderson. A First Course in Discrete Mathematics. Springer, 2001.
[15] T. Andreae and H.-J. Bandelt. Performance guarantees for approximation algorithms depending on parametrized triangle inequalities. SIAM J. Disc. Math., 8:1-16, 1995.
[16] Y.P. Aneja. An integer linear programming approach to the Steiner Problem in graphs. Networks, 10:167-178, 1980.
[17] S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric problems. In Proc. 37th Annu. IEEE Sympos. Found. Comput. Sci., pages 2-11, 1996.
[18] S. Arora. Polynomial Time Approximation Schemes for Euclidean Traveling Salesman and Other Geometric Problems. J. of ACM, 45:753-782, 1998.
[19] E. Artin. Galoissche Theorie. Teubner Verlagsgesellschaft, Leipzig, 1965.
[20] T.K. Attwood and D.J. Parry-Smith. Introduction to bioinformatics. Prentice Hall, 1999.
[21] V. Auletta and M. Parente. Better Algorithms for Minimum Weight Vertex-Connectivity Problems. In STACS 97, number 1200 in Lecture Notes in Computer Science, pages 547-558. Springer-Verlag, 1997.
[22] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. Complexity and Approximation. Springer, 1999.
[23] G. Ausiello, A. D'Atri, and M. Moscarini. Cordial properties on graphs and minimal conceptional connections in semantic data models. Journal of Computer and Systems Sciences, 33:179-202, 1986.
[24] F.J. Ayala. The myth of Eve - molecular biology and human origins. Science, 270:1930-1936, 1995.
[25] F.J. Ayala and A.A. Escalante. The evolution of human populations: A molecular perspective. Mol. Phyl. Evol., 5:188-201, 1996.
[26] C. Bajaj. The Algebraic Degree of Geometric Optimization Problems. Discrete and Computational Geometry, 3:171-191, 1988.
[27] A. Balakrishnan, T.L. Magnanti, and P. Mirchandani. Network Design. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 311-334. John Wiley and Sons, 1997.
[28] H.-J. Bandelt and A. Dress. Reconstructing the shape of a tree from observed dissimilarity data. Advances in Applied Mathematics, 7:309-343, 1986.
[29] H.-J. Bandelt and A. Dress. Split Decomposition: A New and Useful Approach to Phylogenetic Analysis of Distance Data. Mol. Phyl. and Evol., 1:242-252, 1992.
[30] H.-J. Bandelt, P. Forster, and A. Röhl. Median-Joining Networks for Inferring Intraspecific Phylogenies. Mol. Biol. Evol., 16:37-48, 1999.
[31] H.-J. Bandelt, P. Forster, B.C. Sykes, and M.B. Richards. Mitochondrial Portraits of Human Populations Using Median Networks. Genetics, 141:743-753, 1995.
[32] J. Bang-Jensen and G. Gutin. Digraphs. Springer, 2002.
[33] M. Baronti, E. Casini, and P.L. Papini. Equilateral sets and their central points. Rendiconti di Matematica, 13:133-148, 1993.
[34] B. Barvinok. Lattice Points and Lattice Polytopes. In J.E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, pages 133-152. CRC Press, 1997.
[35] M. Baur and G. Ziegler. Die Odyssee des Menschen. Ullstein, 2001.
[36] J.E. Beasley and A. Lucena. Branch and Cut Algorithms. In J.E. Beasley, editor, Advances in Linear and Integer Programming, pages 187-221. Oxford Science Publications, 1996.
[37] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[38] C. Berge. Graphs. Elsevier Science Publishers, 1985.
[39] C. Berge and A. Ghouila-Houri. Programmes, Jeux et Réseaux de Transport. Paris, 1962.
[40] P. Berman and V. Ramaiyer. Improved Approximations for the Steiner Tree Problem. J. of Algorithms, 17:381-408, 1994.
[41] M. Bern. Triangulations. In J.E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, pages 413-428. CRC Press, 1997.
[42] M. Bern and D. Eppstein. Approximation algorithms for geometric problems. In D.S. Hochbaum, editor, Approximation Algorithms for NP-hard Problems, pages 296-345. PWS Publishing Company, 1997.
[43] M.W. Bern. Faster Exact Algorithms for Steiner Trees in Planar Networks. Networks, 20:109-120, 1990.
[44] M.W. Bern and R.L. Graham. Das Problem des kürzesten Netzwerks. Spektrum der Wissenschaft, pages 78-84, März 1989.
[45] M.W. Bern and R.L. Graham. The Shortest Network Problem. Scientific American, 260:84-89, 1989.
[46] L.J. Billera, S.P. Holmes, and K. Vogtmann. Geometry of the Space of Phylogenetic Trees. Advances in Applied Mathematics, 27:733-767, 2001.
[48] V. Boltyanski, H. Martini, and V. Soltan. Geometric Methods and Optimization Problems. Kluwer Academic Publishers, 1999.
[49] M. Bonet, C. Phillips, T. Warnow, and S. Yooseph. Inferring Evolutionary Trees from Polymorphic Characters, and an Analysis of the Indo-European Family of Languages. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 37:13-55, 1997.
[50] K. Bopp. Ueber das kürzeste Verbindungssystem zwischen vier Punkten. PhD thesis, Universität Göttingen, 1879. Dissertation.
[51] A. Borchers and D.Z. Du. The k-Steiner Ratio in Graphs. SIAM J. Computing, 26:857-869, 1997.
[52] A. Borchers, D.Z. Du, B. Gao, and P. Wan. The k-Steiner Ratio in the Rectilinear Plane. J. of Algorithms, 29:1-17, 1998.
[54] P.J. Bowler. Evolution: The history of an idea. Univ. Calif. Press, 1984.
[55] M. Brazil. Steiner Minimum Trees in Uniform Orientation Metrics. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 1-28. Kluwer Academic Publishers, 2001.
[56] M. Brazil, J.H. Rubinstein, D.A. Thomas, J.F. Weng, and N.C. Wormald. Shortest networks on spheres. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 40:453-461, 1998.
[57] J.R. Brown. Philosophy of mathematics. Routledge, 1999.
[58] D. Bryant, C. Semple, and M. Steel. Combining Evolutionary Trees with Ancestral Divergence Dates. Preprint UCDMS 2002/14, University of Canterbury, Christchurch, New Zealand, 2002.
[59] P. Buneman. The recovery of trees from measures of dissimilarity. In F.R. Hodson, D.G. Kendall, and P. Tautu, editors, Mathematics in the Archaeological and Historical Sciences, pages 387-395. Edinburgh University Press, 1971.
[60] R.E. Burkard, T. Dudás, and T. Maier. Cut and patch Steiner trees for ladders. Discrete Mathematics, 161:53-61, 1996.
[61] H. Busemann. The Geometry of Geodesics. New York, 1955.
[62] G.-R. Cai and Y.-G. Sun. The Minimum Augmentation of any Graph to a k-Edge-Connected Graph. Networks, 19:151-172, 1989.
[63] R.L. Cann and A.C. Wilson. Models of human evolution. Science, 217:303-304, 1982.
[65] M. Carter, M. Hendy, D. Penny, L.A. Székely, and N.C. Wormald. On the distribution of lengths of evolutionary trees. SIAM J. Disc. Math., 3:38-47, 1990.
[66] J.L. Casti. Five More Golden Rules. John Wiley & Sons, 2000.
[67] J.L. Casti. Paradigms Regained. Abacus, 2001.
[68] L.L. Cavalli-Sforza. Stammbäume von Völkern und Sprachen. In B. Streit, editor, Evolution des Menschen, pages 118-123. Spektrum Akademischer Verlag, 1995.
[69] L.L. Cavalli-Sforza. Gene, Völker und Sprachen. Carl Hanser Verlag, 1999.
[70] L.L. Cavalli-Sforza and F. Cavalli-Sforza. Verschieden und doch gleich.
[71] L.L. Cavalli-Sforza and A.W.F. Edwards. Phylogenetic analysis: models and estimation procedures. Evolution, 21:550-570, 1967.
[72] L.L. Cavalli-Sforza, P. Menozzi, and A. Piazza. The History and Geography of Human Genes. Princeton University Press, 1994.
[73] A. Cayley. A theorem on trees. Quart. Math., 23:376-378, 1889.
The> Ferrndt-Problem In [74] G.D Chakeimn ant1 h1.A Ghandch;iri lIlnl~o~~~sl
-
An Algoritlmic Xppioath. Ken- Yolk.
[83] F.R.K. Chung and E.N. Gilbert. Steiner Trees for the regular simplex. Bull. of the Inst. of Math. Ac. Sinica, 4:313-325, 1976.
[84] F.R.K. Chung and R.L. Graham. Steiner trees for ladders. Ann. Discrete Math. Amer. Math. Soc., 2:173-200, 1978.
[85] F.R.K. Chung and R.L. Graham. A new bound for Euclidean Steiner Minimal Trees. Ann. N.Y. Acad. Sci., 440:328-346, 1985.
[86] F.R.K. Chung and F.K. Hwang. A lower bound for the Steiner Tree Problem. SIAM J. Appl. Math., 34:27-36, 1978.
[87] D. Cieslik. Über Bäume minimaler Länge in der normierten Ebene. PhD thesis, Ernst-Moritz-Arndt Universität Greifswald, 1982. Dissertation A.
[88] D. Cieslik. The Fermat-Steiner-Weber-Problem in Minkowski Spaces. optimization, 19:485-489, 1988.
[89] D. Cieslik. Kürzeste Bäume in der Ebene. Math. Semesterberichte, Bd. XXXVI:268-269, 1989.
[90] D. Cieslik. The Vertex-Degrees of Steiner Minimal Trees in Minkowski planes. In R. Bodendieck and R. Henn, editors, Topics in Combinatorics and Graph Theory, pages 201-206. Physica-Verlag, Heidelberg, 1990.
[91] D. Cieslik. The 1-Steiner-Minimal-Tree-Problem in Minkowski-spaces. optimization, 22:291-296, 1991.
[92] D. Cieslik. Steiner Minimal Trees. Kluwer Academic Publishers, 1998.
[93] D. Cieslik. Using Hadwiger Numbers in Network Design. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 40:39-78, 1998.
[94] D. Cieslik. k-Steiner-minimal-trees in metric spaces. Discrete Mathematics, 208/209:119-124, 1999.
[95] D. Cieslik. Using Dvoretzky's Theorem in Network Design. Journal of Geometry, 65:7-8, 1999.
[96] D. Cieslik. The vertex degrees of minimum spanning trees. European J. of Operational Research, 125:278-282, 2000.
[97] D. Cieslik. A monotone iterative procedure to approximate trees of minimal length in metric spaces. Nonlinear Analysis, 47:2817-2828, 2001.
[98] D. Cieslik. Network Design Problems. In C.A. Floudas and P. Pardalos, editors, Encyclopedia of Optimization, volume 4, pages 1-7. Kluwer Academic Publishers, 2001.
[99] D. Cieslik. The Steiner Ratio. Kluwer Academic Publishers, 2001.
[100] D. Cieslik. The Steiner ratio of high-dimensional Banach-Minkowski spaces. Discrete Applied Mathematics, 138:29-34, 2004.
[101] D. Cieslik, A. Dress, and W. Fitch. Steiner's Problem in Double Trees. Appl. Math. Letters, 15:855-860, 2002.
[102] D. Cieslik, A. Dress, K.T. Huber, and V. Moulton. Embedding Complexity and Discrete Optimization I: A New Divide and Conquer Approach to Discrete Optimization. Annals of Combinatorics, 6:257-273, 2002.
[103] D. Cieslik, A. Dress, K.T. Huber, and V. Moulton. Embedding Complexity and Discrete Optimization II: A Dynamical Programming Approach to the Steiner-Tree Problem. Annals of Combinatorics, 6:275-283, 2002.
[104] D. Cieslik, A. Dress, K.T. Huber, and V. Moulton. Connectivity Calculus. Appl. Math. Letters, 16:395-399, 2003.
[105] D. Cieslik, A. Ivanov, and A. Tuzhilin. Minimal Trees in Phylogenetic Spaces. Preprint 3/2001, Universität Greifswald, 2001.
[106] D. Cieslik, A. Ivanov, and A. Tuzhilin. Melzak's Algorithm for Phylogenetic Spaces (Russian). Vestnik Moskov. Univ. Ser. I. Mat. Mech., 3:22-28, 2002.
[107] D. Cieslik and J. Linhart. Steiner Minimal Trees in $L_p^2$. Discrete Mathematics, 155:39-48, 1996.
[108] D. Cieslik and H.J. Schmidt. Ein Minimalproblem in der Ebene und im Raum. alpha, 18:121-122, 1984.
[109] P. Clote and R. Backofen. Computational Molecular Biology. John Wiley & Sons, 2000.
[110] E.J. Cockayne. On the efficiency of the algorithm for Steiner Minimal Trees. SIAM J. Appl. Math., 18:150-159, 1970.
[111] C.J. Colbourn and L.K. Stewart. Permutation graphs: connected domination and Steiner trees. Discrete Mathematics, 86:179-189, 1990.
[112] B. Comrie, S. Matthews, and M. Polinsky. The Atlas of Languages. Quarto Publishing Plc., 1996.
[113] M. Conger. Energy-minimizing networks in $\mathbb{R}^n$. Master's thesis, Williams College, 1989.
[114] J.H. Conway and R.K. Guy. The Book of Numbers. Springer, 1996.
[116] R. Courant and H. Robbins. What is Mathematics. Oxford University Press, New York, 1941.
[117] R. Courant, H. Robbins, and I. Stewart. What is Mathematics. Oxford University Press, New York, 1996.
[118] D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 1987.
[119] D. Cvetković, V. Dimitrijević, and M. Milosavljević. Variations on the Travelling Salesman Theme. Libra Produkt, Beograd, 1996.
[120] C. Darwin. The Origin of Species. London, 1859.
[121] P. Davies. The Mind of God. Penguin Books, 1992.
[122] P. Davies. The Fifth Miracle. Penguin, 1998.
[123] P.J. Davis and R. Hersh. The Mathematical Experience. Birkhäuser, 1981.
[124] M.O. Dayhoff. Atlas of Protein Sequence and Structure. Technical Report 5, National Biomedical Research Foundation, Washington, D.C., 1978.
[125] E.W. Dijkstra. A note on two problems in connection with graphs. Numer. Math., 1:269-271, 1959.
[126] W. Domschke. Logistik: Transport. Oldenbourg, 1981.
[128] W. Domschke and A. Drexl. Logistik: Standorte. Oldenbourg, 1982.
[129] W.F. Doolittle. Stammbaum des Lebens. Spektrum der Wissenschaft, pages 32-37, April 2000.
[130] A. Dress and M. Krüger. Parsimonious Phylogenetic Trees in Metric Spaces and Simulated Annealing. Advances in Appl. Math., 8:8-37, 1987.
[131] A. Dress, A. von Haeseler, and M. Krueger. Reconstructing Phylogenetic Trees using Variants of the "Four-Point-Condition". Studien zur Klassifikation, 17:299-305, 1986.
[132] A. Dress and R. Wetzel. The Human Organism - a Place to Thrive for the Immuno-Deficiency Virus. In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy, editors, New Approaches in Classification and Data Analysis, pages 636-643. Springer Verlag, 1994.
[133] S.E. Dreyfus and R.A. Wagner. The Steiner Problem in Graphs. Networks, 1:195-207, 1972.
[134] Z. Drezner and H.W. Hamacher, editors. Facility Location. Springer, 2002.
[135] Z. Drezner, K. Klamroth, A. Schöbel, and G.O. Wesolowsky. The Weber Problem. In Z. Drezner and H.W. Hamacher, editors, Facility Location, pages 1-36. Springer, 2002.
[136] D.Z. Du. On component-size bounded Steiner trees. Discrete Applied Mathematics, 60:131-140, 1995.
[137] D.Z. Du, B. Gao, R.L. Graham, Z.C. Liu, and P.J. Wan. Minimum Steiner Trees in Normed Planes. Discrete and Computational Geometry, 9:351-370, 1993.
[138] D.Z. Du and F.K. Hwang. A new bound for the Steiner Ratio. Trans. Am. Math. Soc., 278:137-148, 1983.
[139] D.Z. Du and F.K. Hwang. An Approach for Proving Lower Bounds: Solution of Gilbert-Pollak's conjecture on Steiner ratio. In Proc. of the 31st Ann. Symp. on Foundations of Computer Science, St. Louis, 1990.
[140] D.Z. Du and F.K. Hwang. A proof of the Gilbert-Pollak conjecture on the Steiner ratio. Algorithmica, 7:121-136, 1992.
[141] D.Z. Du, F.K. Hwang, and J.F. Weng. Steiner Minimal Trees on regular polygons. Discrete and Computational Geometry, 2:65-87, 1987.
[142] D.Z. Du, F.K. Hwang, and E.Y. Yao. The Steiner ratio conjecture is true for five points. J. Combin. Theory, Ser. A, 38:230-240, 1985.
[143] D.Z. Du, J.M. Smith, and J.H. Rubinstein, editors. Advances in Steiner Trees. Kluwer Academic Publishers, 2000.
[144] D.Z. Du and W.D. Smith. Disproofs of generalized Gilbert-Pollak conjecture on the Steiner ratio in three or more dimensions. J. of Combinatorial Theory, A, 74:115-130, 1996.
[145] D.Z. Du, E.Y. Yao, and F.K. Hwang. A Short Proof of a Result of Pollak on Steiner Minimal Trees. J. Combin. Theory, Ser. A, 32:396-400, 1982.
[146] D.Z. Du, Y.F. Zhang, and Q. Feng. On better heuristics for Euclidean Steiner minimum trees. In Proc. 32nd FOCS, pages 431-439, 1991.
[147] X. Du, D.Z. Du, B. Gao, and L. Qü. A simple proof for a result of Ollerenshaw on Steiner Trees. In D.Z. Du and J. Sun, editors, Advances in Optimization and Approximation, pages 68-71. Kluwer Academic Publishers, 1994.
[148] D. Durand. A New Look at Tree Models for Multiple Sequence Alignment. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 47:65-84, 1999.
[149] R. Durier. The Fermat-Weber Problem and Inner Product Spaces. Journal of Approximation Theory, 78:161-173, 1994.
[150] M.E. Dyer and A.M. Frieze. The complexity of computing the volume of a polyhedron. SIAM J. Computing, 17:967-974, 1988.
[152] M. Eigen. Das Urgen. Nova Acta Leopoldina 243/52. Deutsche Akademie der Naturforscher Leopoldina, 1980.
[153] M. Eigen. Stufen zum Leben. Serie Piper, 1992.
[154] H.M. Enzensberger. Drawbridge Up. A K Peters, Natick, Massachusetts, 1999.
[155] D. Eppstein. Spanning Trees and Spanners. In J.R. Sack and J. Urrutia, editors, Handbook of Computational Geometry, pages 423-461. Elsevier Science B.V., 1999.
[156] B.S. Everitt. Cluster Analysis. Arnold, 1993.
[157] G.F. Fagnano. Problemata quaedam ad methodum maximorum et minimorum spectantia. Nova Acta Eruditorum, pages 281-303, 1775.
[158] J.S. Farris and A.G. Kluge. Parsimony and History. Syst. Biology, 46:215-218, 1997.
[159] P. Fermat. Abhandlungen über Maxima und Minima. Ostwalds Klassiker der exakten Wissenschaften, 1934.
[160] D. Fernández-Baca. The Perfect Phylogeny Problem. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 203-234. Kluwer Academic Publishers, 2001.
[161] W. Fitch and E. Margoliash. Construction of Phylogenetic Trees. Science, 155:279-284, 1967.
[162] W.M. Fitch. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Zoology, 20:406-416, 1971.
[163] W.M. Fitch. An Introduction to Molecular Biology for Mathematicians and Computer Programmers. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 47:1-31, 1999.
[164] W.M. Fitch and T.F. Smith. Optimal sequence alignments. Proc. Natl. Acad. Sci. USA, 80:1382-1386, 1983.
[165] S. Fliege. Ein Baumgenerator in Sparse-Matrix-Technik. In Rechnergestützter Schaltungsentwurf, pages 15-17. Berlin, 1975. NTG Fachberichte, Band 52.
[167] L.R. Foulds. Maximum Savings in the Steiner Problem in Phylogeny. J. Theor. Biology, 107:471-474, 1984.
[168] L.R. Foulds. Graph Theory Applications. Springer, 1994.
[169] L.R. Foulds and R.L. Graham. The Steiner Problem in Phylogeny is NP-complete. Advances in Appl. Math., 3:43-49, 1982.
[170] L.R. Foulds, M.D. Hendy, and D. Penny. A graph theoretic approach to the development of minimal phylogenetic trees. J. Mol. Evol., 13:127-149, 1979.
[171] M.L. Fredman and R.E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34:596-615, 1987.
[172] H.N. Gabow, Z. Galil, T. Spencer, and R.E. Tarjan. Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and Directed Graphs. Combinatorica, 6:109-122, 1986.
[173] H.N. Gabow and R.E. Tarjan. Efficient algorithms for a family of matroid intersection problems. J. of Algorithms, 5:80-131, 1984.
[174] B. Gao, D.Z. Du, and R.L. Graham. A tight lower bound for the Steiner ratio in Minkowski planes.
[177] M.R. Garey, R.L. Graham, and D.S. Johnson. The complexity of computing Steiner Minimal Trees. SIAM J. Appl. Math., 32:835-859, 1977.
[178] M.R. Garey and D.S. Johnson. The rectilinear Steiner Tree Problem is NP-complete. SIAM J. Appl. Math., 32:826-834, 1977.
[179] M.R. Garey and D.S. Johnson. Computers and Intractability. San Francisco, 1979.
[180] C.F. Gauß. In Werke. Kgl. Gesellschaft der Wissenschaften, Göttingen, 1917.
[181] C.F. Gauß. Briefwechsel Gauß-Schuhmacher. In Werke Bd. X, 1, pages 459-468. Kgl. Gesellschaft der Wissenschaften, Göttingen, 1917.
[182] B. Gavish. Topological Design of Centralized Computer Networks - Formulations and Algorithms. Networks, 12:355-377, 1982.
[183] G. Georgakopoulos and C.H. Papadimitriou. The 1-Steiner-Problem. J. of Algorithms, 8:122-130, 1987.
[184] A. Gierer. Die gedachte Natur: Ursprünge der modernen Wissenschaft. Rowohlt, 1998.
[185] E.N. Gilbert. Gray codes and paths on the n-cube. Bell System Tech. J., 37:815-826, 1958.
[186] E.N. Gilbert and H.O. Pollak. Steiner Minimal Trees. SIAM J. Appl. Math., 16:1-29, 1968.
[187] M. Glaubrecht. Die ganze Welt ist eine Insel. Hirzel Verlag, 2002.
[188] J.L. Gould and W.T. Keeton. Biological Sciences. W.W. Norton and Company.
[189] R.L. Graham and P. Hell. On the History of the Minimum Spanning Tree Problem. Ann. Hist. Comp., 7:43-57, 1985.
[190] R.L. Graham and F.K. Hwang. A remark on Steiner Minimal Trees. Bull. of the Inst. of Math. Ac. Sinica, 4:177-182, 1976.
[191] D. Graur and W.-H. Li. Fundamentals of Molecular Evolution. Sinauer Associates, Inc., 1999.
[192] H. Groemer. Abschätzungen für die Anzahl der konvexen Körper, die einen konvexen Körper berühren. Monatshefte für Mathematik, 65:74-81, 1961.
[193] C. Gröpl, S. Hougardy, T. Nierhoff, and H.J. Prömel. Approximation Algorithms for the Steiner Tree Problem in Graphs. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 235-279. Kluwer Academic Publishers, 2001.
[194] J. Gross and J. Yellen. Graph Theory and its Applications. CRC Press, 1999.
[195] M. Grötschel and C.L. Monma. Integer Polyhedra arising from certain Network Design Problems with Connectivity Constraints. SIAM J. Discrete Math., 3:502-523, 1990.
[196] B. Grünbaum. On a conjecture of H. Hadwiger. Pacific J. Math., 11:215-219, 1961.
[197] M. Guan. Graphic programming using odd and even points. Chinese Math., 1:273-277, 1962.
[198] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[199] G. Gutin and A.P. Punnen, editors. The Traveling Salesman Problem and its Variations. Kluwer Academic Publishers, 2002.
[200] I. Hacking. What Mathematics Has Done to Some and Only Some Philosophers. In T. Smiley, editor, Mathematics and Necessity, pages 83-138. Oxford University Press, 2000.
[201] H. Hadwiger. Über Treffanzahlen bei translationsgleichen Eikörpern. Arch. Math., 8:212-213, 1957.
[202] A.v. Haeseler and D. Liebers. Molekulare Evolution. S. Fischer Verlag, 2003.
[203] S.L. Hakimi. Steiner's Problem in Graphs and its Implications. Networks, 1:113-133, 1971.
[204] S.L. Hakimi and S.S. Yau. Distance matrix of a graph and its realizability. Quart. Appl. Math., 22:305-317, 1964.
[205] B.G. Hall. Phylogenetic Trees Made Easy. Sinauer Associates, Sunderland, MA, 2001.
[207] M. Hanan. On Steiner's Problem with rectilinear distance. SIAM J. Appl. Math., 14:255-265, 1966.
[208] D.R. Hankerson, D.G. Hoffman, D.A. Leonard, C.C. Lindner, K.T. Phelps, C.A. Rodger, and J.R. Wall. Coding Theory and Cryptography. Marcel Dekker, 2000.
[209] F. Harary and E.M. Palmer. Graphical Enumeration. Academic Press, 1973.
[210] D. Harel. Computers Ltd.: what they really can't do. Oxford University Press, 2000.
[211] D. Harel. Das Affenpuzzle und weitere bad news aus der Computerwelt. Springer, 2002.
[212] F.C. Harris. Steiner Minimal Trees: Their Computational Past, Present, and Future. JCMCC, 30:195-220, 1999.
[213] J.A. Hartigan. Minimum mutation fits to a given tree. Biometrics, 29:53-65, 1973.
[214] M. Hendy. Hadamard Methods in Phylogenetics. Preprint 13/2000, Universität Greifswald, 2000.
[215] M. Hendy, C.H.C. Little, and D. Penny. Comparing trees with pendant vertices labelled. SIAM J. Appl. Math., 44:1054-1065, 1984.
[216] M. Hendy and D. Penny. Branch and bound algorithms to determine minimal evolutionary trees. Math. Biosci., 59:277-290, 1982.
[217] M.D. Hendy, 2000. Private communication.
[218] M.D. Hendy, D. Penny, and M.A. Steel. A discrete Fourier analysis for evolutionary trees. Proc. Natl. Acad. Sci. USA, 91:3339-3343, 1994.
[219] S. Hildebrandt and A. Tromba. The Parsimonious Universe. Springer, 1996.
[220] J.M. Ho, D.T. Lee, C.H. Chang, and C.K. Wong. Minimum diameter spanning trees and related problems. SIAM J. Computing, 20:987-997, 1991.
[221] D. Hochbaum. Approximation Algorithms for NP-hard Problems. PWS Publishing Company, 1997.
[222] R.W. Hockney and C.R. Jesshope. Parallel Computers 2. Adam Hilger, 1988.
[223] C. Hoffmann. Graph-theoretic algorithms and graph isomorphism. Number 136 in Lecture Notes in Computer Science. Springer-Verlag, 1982.
[224] R. Horst, P. Pardalos, and N.V. Thoai. Introduction to Global Optimization. Kluwer Academic Publishers, 1995.
[225] D.F. Hsu, X.D. Hu, and Y. Kajitani. On shortest k-edge connected Steiner networks with rectilinear distance. In D.Z. Du and P.M. Pardalos, editors, Minimax and Applications, pages 119-127. Kluwer Academic Publishers, 1995.
[226] T.C. Hu and M.T. Shing. Combinatorial Algorithms. Dover Publications, 1982.
[227] U. Huckenbeck. Extremal paths in graphs: Foundations, search strategies, and related topics.
[228] F.K. Hwang. On Steiner Minimal Trees with rectilinear distance. SIAM J. Appl. Math., 30:104-114, 1976.
[229] F.K. Hwang. A linear time algorithm for full Steiner Trees. Oper. Res. Letters, 4:235-237, 1986.
[230] F.K. Hwang and D.S. Richards. Steiner Tree Problems. Networks, 22:55-89, 1992.
[231] F.K. Hwang, D.S. Richards, and P. Winter. The Steiner Tree Problem. North-Holland, 1992.
[232] F.K. Hwang and Y.C. Yao. Comments on Bern's Probabilistic Results on Rectilinear Steiner Trees. Algorithmica, 5:391-398, 1990.
[233] A. Illgen. Das Verhalten von Abstiegsverfahren an einer Singularität des Gradienten. Optimization, 10:39-55, 1979.
[234] C. Isenberg. Minimum-Wege-Strukturen. alpha, 20:121-123, 1986.
[235] A.O. Ivanov and A.A. Tuzhilin. Branching Solutions of One-Dimensional Variational Problems. World Scientific, 2000.
[236] A.O. Ivanov and A.A. Tuzhilin. Differential calculus on the space of Steiner minimal trees in Riemannian manifolds. Sbornik: Mathematics, 192:823-841, 2001.
[237] A.O. Ivanov, I.V. Ptitsyna, and A.A. Tuzhilin. Classification of Closed Minimal Networks on Flat Two-Dimensional Tori. Russian Acad. Sci. Sb. Math., 77:391-423, 1994.
[238] A.O. Ivanov and A.A. Tuzhilin. Minimal Networks - The Steiner Problem and Its Generalizations. CRC Press, Boca Raton, 1994.
[239] A.O. Ivanov and A.A. Tuzhilin. Extreme Networks. Acta Appl. Math., 66:251-317, 2001.
[240] A.O. Ivanov and A.A. Tuzhilin. Theory of extremal networks (Russian). Preprint, Moscow State University, 2003.
[241] T. Janson. Speak - A Short History of Languages. Oxford University Press, 2002.
[242] M.R. Jerrum. Counting trees in a graph is #P-complete. Inf. Proc. Letters, 51:111-116, 1994.
[243] T. Jiang and L. Wang. Computing Shortest Networks with Fixed Topologies. In D.Z. Du, J.M. Smith, and J.H. Rubinstein, editors, Advances in Steiner Trees, pages 39-62. Kluwer Academic Publishers, 2000.
[244] D. Johanson, L. Johanson, and B. Edgar. Ancestors: In Search of Human Origins. Villard Books, 1994.
[245] M. Jünger, G. Reinelt, and G. Rinaldi. The Traveling Salesman Problem. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 199-221. John Wiley and Sons, 1997.
[246] D. Jungnickel. Graphen, Netzwerke und Algorithmen. BI Wissenschaftsverlag, Mannheim, 1994.
[247] A.B. Kahng and G. Robins. A new class of iterative Steiner tree heuristics with good performance. IEEE Trans. Comp. Aided Design, 11:893-902, 1992.
[248] M. Kanehisa. Post-genome Informatics. Oxford University Press, 2000.
[249] S. Kapoor and H. Ramesh. Algorithms for Enumerating All Spanning Trees of undirected and weighted Graphs. SIAM J. Comp., 24:237-265, 1995.
[250] O. Kariv and S.L. Hakimi. An Algorithmic Approach to Network Location Problems: The p-Medians. SIAM J. Appl. Math., 37:539-560, 1979.
[251] R.M. Karp. Reducibility among combinatorial problems. In R.E. Miller and J.W. Thatcher, editors, Complexity of Computer Computations, pages 85-103. New York, 1972.
[252] R.M. Karp. Probabilistic analysis of partitioning algorithms for the travelling salesman problem. Math. Operations Res., 2:209-224, 1977.
[253] J. Katajainen. Bucketing and Filtering in Computational Geometry. PhD thesis, University of Turku, Department of Computer Science, 1987. Academic Dissertation.
[254] J. Katajainen. The Region Approach for Computing Relative Neighbourhood Graphs in the L_p metric. Computing, 40:147-161, 1988.
[255] A.K. Kelmans. On properties of the characteristic polynomial of a graph. In Kibernetiku - na Sluzbu Kommunizmu. Gosenergoizdat, Moscow, 1967.
[256] B.N. Khoury and P.M. Pardalos. An exact Branch and Bound Algorithm for the Steiner Problem in Graphs. Number 939 in Lecture Notes in Computer Science, pages 582-590. Springer-Verlag, 1993.
[257] B.N. Khoury, P.M. Pardalos, and D.W. Hearn. Equivalent formulations for the Steiner Problem in graphs. In D.Z. Du and P.M. Pardalos, editors, Network Optimization Problems, pages 111-123. World Scientific Publishing Co., 1993.
[258] S. Khuller. Approximation algorithms for finding highly connected subgraphs. In D.S. Hochbaum, editor, Approximation Algorithms for NP-hard Problems, pages 236-265. PWS Publishing Company, 1997.
[259] S. Khuller, B. Raghavachari, and N. Young. Low degree spanning trees of small weight. SIAM J. Computing, 25:355-368, 1996.
[260] M. Kimmel and D.E. Axelrod. Branching Processes in Biology. Springer, 2002.
[262] W. Köhnen. Metrische Räume. Academia Verlag Richarz, Sankt Augustin, 1988.
[263] T.W. I
[265] B. Korte, H.J. Prömel, and A. Steger. Steiner Trees in VLSI-Layout. In Paths, Flows and VLSI-Layout. Springer, 1989.
[266] B. Korte and J. Vygen. Combinatorial Optimization. Springer, 2000.
[267] L. I
[280] D.H. Lee. Low Cost Drainage Networks. Networks, 6:351-371, 1976.
[281] K. Leichtweiss. Konvexe Mengen. Deutscher Verlag der Wissenschaften, Berlin, 1980.
[282] T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Teubner-John Wiley & Sons, 1990.
[284] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Dokl., 10:707-710, 1966.
[285] D.W. Litwhiler and A.A. Aly. Steiner's Problem and Fagnano's result on the sphere. Math. Progr., 18:286-290, 1980.
[286] Z.C. Liu and D.Z. Du. On Steiner Minimal Trees with L_p Distance. Algorithmica, 7:179-192, 1992.
[287] B. Lomborg. The sceptical environmentalist. Cambridge University Press, 2001.
[288] M. Lothaire. Combinatorics on Words. Cambridge University Press, 1997.
[289] L. Lovász, J. Pelikán, and K. Vesztergombi. Kombinatorik. Teubner Verlagsgesellschaft, 1977.
[290] L. Lovász, J. Pelikán, and K. Vesztergombi. Discrete Mathematics. Springer, 2003.
[291] R.F. Love and J.G. Morris. Modelling inter-city road distances by mathematical functions. J. Oper. Res. Soc., 23:61-71, 1972.
[292] R.F. Love, J.G. Morris, and G.O. Wesolowsky. Facilities Location - Models and Methods. North-Holland, 1989.
[293] L. Margulis. Symbiotic Planet. Weidenfeld and Nicolson/Orion Publishing, 1998.
[295] T. Margush and F.R. McMorris. Consensus n-trees. Bulletin of Math. Biology, 43:239-244, 1981.
[296] G.E. Martin. Counting: The Art of Enumerative Combinatorics. Springer, 2001.
[297] H. Martini, K.J. Swanepoel, and G. Weiß. The Geometry of Minkowski Spaces - A Survey. Expositiones Mathematicae, 19:97-112, 2001.
[298] J. Matoušek and J. Nešetřil. Diskrete Mathematik. Springer, 2002.
[299] J. Maynard Smith and E. Szathmáry. The major transitions in evolution. W.H. Freeman, 1995.
[300] J. Maynard Smith and E. Szathmáry. Evolution. Spektrum, 1996.
[301] E. Mayr. What Evolution Is. Basic Books, New York, 2001.
[302] F.R. McMorris and R.C. Powers. The Median Function on Weak
On the problem of Steiner. Canad. Math. Bull., 4:143-148.
[306] D. Melzer. S-konvexe Optimierungsaufgaben und "Large Region Location". Wiss. Zeitschrift der Humboldt Universität Berlin, Mathematisch-Naturwissenschaftliche Reihe, XXX:387-391, 1981.
[307] A. Migdalas. Mathematical programming techniques for analysis and design of communication and transportation networks. PhD thesis, Linköping, 1988. Linköping studies in Science and Technology.
[308] Z. Miller and M. Perkel. The Steiner Problem in the Hypercube. Networks, 22:1-10, 1992.
[309] G. Mink. Editing and Genealogical Studies: the New Testament. Literary and Linguistic Computing, 15:51-56, 2000.
[310] H. Minkowski. Geometrie der Zahlen. Teubner Verlagsgesellschaft, Leipzig, 1910.
[311] R.H. Möhring and D. Wagner. Combinatorial Topics in VLSI Design. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 429-444. John Wiley and Sons, 1997.
[312] W.N. Molodschi. Studies to philosophical problems of mathematics (Russian). Moscow, 1969.
[313] F. Morgan. Minimal surfaces, crystals, shortest networks, and undergraduate research. Math. Intelligencer, 14:37-44, 1992.
[314] F. Morgan. The Math Chat Book. The Mathematical Association of America, 2000.
[315] F. Morgan, C. French, and S. Greenleaf. Wulff Clusters in R^2. The Journal of Geometric Analysis, 8:97-115, 1998.
[316] P. Morrison and P. Morrison. Powers of Ten. Scientific American Books, 1982.
[317] D.W. Mount. Bioinformatics. Cold Spring Harbor Laboratory Press, 2001.
[318] G. Nägler and F. Stopp. Graphen und Anwendungen. Teubner, 1996.
[319] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Molecular Biology, 48:443-453, 1970.
[320] O. Neumann. Bemerkungen zum Steiner-Weber-Problem. Jena, 1986.
[321] B.K. Nielsen, P. Winter, and M. Zachariasen. An Exact Algorithm for the Uniformly-Oriented Steiner Tree Problem. In Proceedings of the 10th European Symposium on Algorithms, number 2461 in Lecture Notes in Computer Science, pages 760-772. Springer-Verlag, 2002.
[322] R. Novak, J. Rugelj, and G. Kandus. Steiner Tree Based Distributed Multicast Routing. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 327-352. Kluwer Academic Publishers, 2001.
[323] M.A. Nowak, D.C. Krakauer, and A. Dress. An error limit for the evolution of language. Proc. R. Soc. Lond., 266:2131-2136, 1999.
[324] A. Odlyzko and N.J.A. Sloane. New bounds on the number of unit spheres that can touch a unit sphere in n dimensions. J. Comb. Theory, A, 26:210-214, 1979.
[325] A. Okabe, B. Boots, and K. Sugihara. Spatial Tessellations - Concepts and Applications of Voronoi Diagrams. John Wiley & Sons, 1992.
[326] C.D. Olds, A. Lax, and G. Davidoff. The Geometry of Numbers. The Mathematical Association of America, 2000.
[327] S. Olson. Mapping Human History - Discovering the Past Through Our Genes. Houghton Mifflin Company, 2002.
[328] J. Oprea. The Mathematics of Soap Films: Explorations with Maple. American Mathematical Society, Providence, 2000.
[329] R. Otter. The Number of Trees. Ann. Math., 49:583-599, 1948.
[330] T. Ottmann and P. Widmayer. Algorithmen und Datenstrukturen. Bibliographisches Institut (BI), Mannheim, 1990.
[331] R.D.M. Page and E.C. Holmes. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, 1998.
[332] F.P. Palermo. A network minimization problem. IBM J. Res. Dev., 5:335-337, 1966.
[333] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization. Prentice-Hall, 1982.
[334] C.H. Papadimitriou and U.V. Vazirani. On Two Geometric Problems Related to the Traveling Salesman Problem. J. of Algorithms, 5:231-246, 1984.
[335] P.M. Pardalos and D.Z. Du, editors. Network Design: Connectivity and Facilities Location, volume 40 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 1998.
[336] E. Pennisi. Modernizing the Tree of Life. Science, 300:1692-1697, 2003.
[337] D. Penny. Criteria for optimizing phylogenetic trees and the problem of determining the root of a tree. J. Molecular Evolution, 8:95-116, 1976.
[338] D. Penny, 2001. Private communication.
[339] D. Penny, L.R. Foulds, and M.D. Hendy. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature, 297:197-200, 1982.
[340] D. Penny and M. Hendy. Testing methods of evolutionary tree construction. Cladistics, 1:266-272, 1983.
[341] D. Penny and M. Hendy. Phylogenetics: Parsimony and Distance Methods. In D.J. Balding et al., editor, Handbook of Statistical Genetics, pages . John Wiley & Sons, Ltd., 2000.
[342] D. Penny, M.D. Hendy, and A. Poole. Testing fundamental evolutionary hypotheses. J. Theor. Biology, 223:377-385, 2003.
[343] G. Pesole, E. Sbisá, G. Preparata, and C. Saccone. The Evolution of the Mitochondrial D-Loop Region and the Origin of Modern Man. Mol. Biol. Evol., 9:587-598, 1992.
[344] M. Pigliucci. Denying Evolution. Sinauer Associates, Inc., 2002.
[345] F. Plastria. Continuous Location Problems. In Z. Drezner, editor, Facility Location, pages 225-262. Springer, 1995.
[346] F. Plastria. Continuous Covering Location Problems. In Z. Drezner and H.W. Hamacher, editors, Facility Location, pages 37-79. Springer, 2002.
[347] A.V. Pogorelov. Hilbert's Fourth Problem. John Wiley and Sons, New York, 1979.
[348] H.O. Pollak. Some remarks on the Steiner Problem. J. Combin. Theory, Ser. A, 24:278-295, 1978.
[349] G. Polya. Kombinatorische Anzahlbestimmungen für Gruppen, Graphen und chemische Verbindungen. Acta Math., 68:143-254, 1937.
[350] K.R. Popper. Unended Quest: An Intellectual Autobiography. Fontana, 1976.
[351] F.P. Preparata and M.I. Shamos. Computational Geometry. Springer, 1988.
[352] R.C. Prim. Shortest communication networks and some generalizations. Bell Syst. Techn. J., 31:1398-1401, 1957.
[353] E. Prisner. Distance Approximating Spanning Trees. In STACS 97, number 1200 in Lecture Notes in Computer Science, pages 499-510. Springer-Verlag, 1997.
[354] H.J. Prömel. Graphentheorie. In G. Walz, editor, Faszination Mathematik, pages 184-194. Spektrum, 2003.
[355] H.J. Prömel and A. Steger. The Steiner Tree Problem. Vieweg, 2002.
[356] J.S. Provan. Convexity and the Steiner Tree Problem. Networks, 18:55-72, 1988.
[357] H. Prüfer. Ein neuer Beweis eines Satzes über Permutationen. Arch. Math. Phys., 27:742-744, 1918.
[358] G. Păun, G. Rozenberg, and A. Salomaa. DNA computing. Springer, 1998.
[359] B. Raghavachari. Algorithms for finding low degree structures. In D.S. Hochbaum, editor, Approximation Algorithms for NP-hard Problems, pages 266-293. PWS Publishing Company, 1997.
[360] P. Rechenberg. Was ist Informatik. Carl Hanser Verlag, 1994.
[361] S. Raghavan and T.L. Magnanti. Network Connectivity. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 335-354. John Wiley and Sons, 1997.
[362] G. Robins and J.S. Salowe. Low-Degree Minimum Spanning Trees. Discrete Comput. Geometry, 14:151-165, 1995.
[363] D. Robinson and L.R. Foulds. Comparison of phylogenetic trees. Math. Biosci., 53:131-147, 1981.
[364] S. Rolewicz. Metric Linear Spaces. PWN-Polish Scientific Publishers, Warszawa, 1972.
[365] C. Roos and T. Terlaky. Advances in Linear Optimization. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 93-114. John Wiley and Sons, 1997.
[366] P.E. Ross. Streit um Wörter. In B. Streit, editor, Evolution des Menschen, pages 126-135. Spektrum Akademischer Verlag, 1993.
[367] B. Rothfarb. Optimal Design of Offshore Natural-Gas Pipeline Systems. Oper. Res., 18:992-1020, 1970.
[368] J.H. Rubinstein and D.A. Thomas. The Steiner Ratio conjecture for six points. J. Combin. Theory, Ser. A, 58:34-77, 1991.
[369] J.H. Rubinstein, D.A. Thomas, and J.F. Weng. Degree-Five Steiner Points Cannot Reduce Network Costs for Planar Sets. Networks, 22:531-537, 1992.
[370] J.H. Rubinstein and J.F. Weng. Compression theorems and Steiner Ratios on Spheres. J. of Combinatorial Optimization, 1:67-78, 1997.
[371] B. Russell. A History of Western Philosophy. George Allen & Unwin, 1945.
[372] B. Russell. The Problems of Philosophy. The Home University Library, 1946.
[373] B. Russell. Philosophie des Abendlandes. Europa Verlag, 1950.
[374] H. Sachs. Einige Gedanken zur Geschichte und zur Entwicklung der Graphentheorie. Mitteilungen der Mathematischen Gesellschaft in Hamburg, 11:623-641, 1989.
[375] S. Sahni and T. Gonzalez. P-complete approximation problems. J. of ACM, 23:555-565, 1976.
[376] J.S. Salowe. On Euclidean Spanner Graphs with Small Degree. In 8th Annual Symposium Computational Geometry, pages 186-191. Berlin, 1992.
[377] J.S. Salowe and D.M. Warme. Thirty-five-point rectilinear Steiner Minimal Trees in a day. Networks, 25:69-87, 1995.
[378] D. Sankoff. Minimal Mutation Trees of Sequences. SIAM J. Appl. Math., 28:33-42, 1975.
[379] D. Sankoff and P. Rousseau. Locating the vertices of a Steiner Tree in an arbitrary metric space. Math. Progr., 9:240-246, 1975.
[380] V.M. Sarich and A.C. Wilson. Immunological time scale for Hominoid evolution. Science, 158:1200-1203, 1967.
[381] J.J. Schäffer. Geometry of Spheres in Normed Spaces. Marcel Dekker, 1976.
[383] K.-H. Schleifer and M. Horn. Mikrobielle Vielfalt - die unsichtbare Biodiversität. Biologie heute, 6:1-3, 2000.
[384] A. Schöbel. Locating Lines and Hyperplanes. Kluwer Academic Publishers, 1999.
[385] U. Schöning and R. Pruim. Gems of Theoretical Computer Science. Springer, 1998.
[387] F. Schrenk, T.G. Bromage, and H. Kaessmann. Die Frühzeit des Menschen. In Wohin die Reise geht, pages 94-101. Verband Deutscher Biologen, 2002.
[389] H. Schupp. Optimieren. Bibliographisches Institut (BI), Mannheim, 1992.
[390] B. Schwikowski and M. Vingron. The Deferred Path Heuristic for the Generalized Tree Alignment Problem. J. of Computational Biology, 4:415-431, 1997.
[391] C.J. Scriba and P. Schreiber. 5000 Jahre Geometrie. Springer, 2000.
[392] P. Sellers. On the theory and computation of evolutionary distances. SIAM J. Appl. Math., 26:787-793, 1974.
[393] C. Semple and M. Steel. Phylogenetics. Oxford University Press, 2003.
[394] J. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.
[395] M.L. Shore. The Steiner Problem in Graphs and its Application to Phylogeny. Master's thesis, Massey University, 1979.
[396] M.L. Shore, L.R. Foulds, and P.B. Gibbons. An algorithm for the Steiner problem in graphs. Networks, 12:323-333, 1982.
[397] D. Skorin-Kapov. On Cost Allocation in Steiner Tree Networks. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 353-376. Kluwer Academic Publishers, 2001.
[398] J.M. Smith. Generalized Steiner network problems in engineering design. In Design optimization, pages 119-161, 1985.
[399] J.M. Smith. Steiner Minimal Trees in E^3: Theory, Algorithms, and Applications. In D.Z. Du and P.M. Pardalos, editors, Handbook of Combinatorial Optimization, volume 2, pages 397-470. Kluwer Academic Publishers, 1998.
[400] J.M. Smith, D.T. Lee, and J.S. Liebman. An O(n log n) Heuristic for Steiner Minimal Tree Problem on the Euclidean Metric. Networks, 11:23-39, 1981.
[401] J.M. Smith and J.S. Liebman. Steiner Trees, Steiner Circuits and the Interference Problem in Building Design. Eng. Opt., 4:13-36, 1979.
[402] T.F. Smith, M.S. Waterman, and W.M. Fitch. Comparative Biosequence Metrics. J. Molecular Evolution, 18:38-46, 1981.
[403] W.D. Smith. How to find Steiner Minimal Trees in Euclidean d-Space. Algorithmica, 7:137-177, 1992.
[404] W.D. Smith and P.W. Shor. Steiner Tree Problems. Algorithmica, 7:329-332, 1992.
[405] W.D. Smith and J.M. Smith. On the Steiner Ratio in 3-space. J. of Combinatorial Theory, A, 65:301-332, 1995.
[406] M. Steel. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9:91-116, 1992.
[407] M. Steel, A.W.M. Dress, and S. Böcker. Simple but Fundamental Limitations on Supertree and Consensus Tree Methods. Syst. Biology, 49:363-368, 2000.
[408] I. Stewart. Nature's Numbers. Basic Books, 1993.
[409] I. Stewart. Galois Theory. Chapman and Hall, 1998.
[410] I. Stewart. Die Zahlen der Natur. Spektrum, 2001.
[411] J. Stillwell. Geometry of Surfaces. Springer, 1992.
[413] K. Strimmer and A. von Haeseler. Quartet Puzzling: A Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies. Mol. Biol. Evol., 13:964-969, 1996.
[414] K. Swanepoel. Gaps in Convex Disc Packings with Application to 1-Steiner Minimum Trees. Monatshefte für Mathematik, 1999.
[415] K. Swanepoel. Vertex Degrees of Steiner Minimal Trees in L_p^d and other Smooth Minkowski Spaces. Discrete and Computational Geometry, 21:437-447, 1999.
[416] K.J. Swanepoel. The local Steiner problem in normed planes. Networks, 36:104-113, 2000.
[417] D.L. Swofford. PAUP*: Phylogenetic Analysis Using Parsimony and Other Methods (software). Sinauer Associates, Sunderland, MA, 2000.
[418] D.L. Swofford and G.J. Olsen. Phylogeny Reconstruction. In D.M. Hillis and C. Moritz, editors, Molecular Systematics, pages 411-501. Sinauer Associates, 1990.
[419] B. Sykes. The Seven Daughters of Eve. Bantam Press, 2001.
[420] J. Sylvester. A question in the geometry of situation. Quarterly Journal of Mathematics, 1:79, 1857.
[421] G.G. Szpiro. Kepler's Conjecture. John Wiley and Sons, 2003.
[422] P. Tannenbaum and R. Arnold. Excursions in Modern Mathematics. Prentice Hall, 2001.
[423] R.E. Tarjan. Data Structures and Network Algorithms. SIAM, Philadelphia, 1983.
[424] R.E. Tarjan. Efficient Algorithms for Network Optimization. In Proceedings of the International Congress of Mathematicians, pages 1619-1633. Warszawa, 1983.
[425] D. Thomas and J.F. Weng. Polynomial Time Algorithms for the Rectilinear Steiner Tree Problem. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 403-426. Kluwer Academic Publishers, 2001.
[426] A.C. Thompson. Minkowski Geometry. Cambridge University Press, 1995.
[427] D. Trietsch. Interconnecting Networks in the Plane. Networks, 20:93-108, 1990.
[428] M. Türkay, K. Bachmann, R. I h z e l h a c h, and E. Stackebrandt. Biodiversität - die Vielfalt in der wir leben. In Wohin die Reise geht, pages 72-83. Verband Deutscher Biologen, 2002.
[429] S. Ueno, Y. Kajitani, and H. Wada. Minimum Augmentation of a Tree to a K-Edge-Connected Graph. Networks, 18:19-25, 1988.
[430] A. Underwood. A Modified Melzak Procedure for Computing Node-Weighted Steiner Trees. Networks, 27:73-79, 1996.
[431] F.A. Valentine. Convex Sets. McGraw-Hill, 1964.
[432] L.G. Valiant. The complexity of computing the permanent. Report CSR 14-77, University of Edinburgh, 1977.
[433] G. Valiente. Algorithms on Trees and Graphs. Springer, 2002.
[434] K.R. Varadarajan. A divide-and-conquer algorithm for min-cost perfect matching in the plane. In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science, pages 320-329, 1998.
[435] V.V. Vazirani. Approximation Algorithms. Springer, 2001.
[436] M. Vingron. Algorithms for the comparison and reconstruction of evolutionary relationships between DNA and genomic sequences. Internet note, 1999.
[437] M. Vingron. Sequence Alignment and Phylogeny Construction. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 47:53-63, 1999.
[438] M. Vingron, H.-P. Lenhof, and P. Mutzel. Computational Molecular Biology. In M. Dell'Amico, F. Maffioli, and S. Martello, editors, Annotated Bibliographies in Combinatorial Optimization, pages 445-471. John Wiley and Sons, 1997.
[439] S. Voß. Steiner-Probleme in Graphen. Anton Hain, Frankfurt a.M., 1990.
[440] J.A. Wald and C.J. Colbourn. Steiner Trees, Partial 2-Trees, and Minimum IFI Networks. Networks, 13:159-167, 1983.
[441] H. Walther. Anwendungen der Graphentheorie. Deutscher Verlag der Wissenschaften, Berlin, 1979.
[442] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. J. of Computational Biology, 1:337-348, 1994.
[443] P.D. Ward and D. Brownlee. Rare Earth. Springer, 2000.
[444] D.M. Warme. Spanning Trees in Hypergraphs with Applications to Steiner Trees. PhD thesis, University of Virginia, 1998.
[445] T. Warnow, D. Ringe, and A. Taylor. A character based method for reconstructing evolutionary history for natural languages. Tech report, Institute for Research in Cognitive Science, 1993.
[446] M.S. Waterman. Sequence Alignments. In M.S. Waterman, editor, Mathematical Methods for DNA-Sequencing, pages 53-92. CRC Press, 1989.
[447] M.S. Waterman. Applications of Combinatorics to Molecular Biology. In R.L. Graham, M. Grötschel, and L. Lovász, editors, Handbook of Combinatorics, pages 1983-2001. Elsevier Science B.V., 1993.
[448] M.S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995.
[449] A. Weber. Ueber den Standort der Industrien. Tübingen, 1909.
[450] T.P. Weber. Darwin und die Anstifter. DuMont, 2000.
[451] D. Welsh. Approximate Counting. Lecture Note Series of the London Math. Society, 241:287-324, 1997.
[452] J.F. Weng. Steiner Polygons in the Steiner Problem. Geometriae Dedicata, 52:119-127, 1994.
[453] J.F. Weng. A new model of generalized Steiner Trees and 3-coordinate Systems. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 40:413-424, 1998.
[455] G.O. Wesolowsky. The Weber Problem: History and Perspectives. Location Science, 1:5-23, 1993.
[456] F.J. Wetuchnowski. Graphen und Netze. In S.W. Jablonski and O.B. Lupanow, editors, Diskrete Mathematik und mathematische Fragen der Kybernetik, pages 145-197. Akademie-Verlag, Berlin, 1980.
[457] K. White, M. Farber, and W. Pulleyblank. Steiner Trees, Connected Domination and Strongly Chordal Graphs. Networks, 15:109-124, 1985.
[458] J. Whitfield. Born in a watery commune. Nature, 427:674-676, 2004.
[459] P. Widmayer. Fast Approximation Algorithms for Steiner's Problem in Graphs. PhD thesis, Universität Karlsruhe, 1987. Habilitationsschrift.
[461] A.C. Wilson and R.L. Cann. Afrikanischer Ursprung des modernen Menschen. In B. Streit, editor, Evolution des Menschen, pages 86-93. Spektrum Akademischer Verlag, 1995.
[462] P. Winter. An Algorithm for the Steiner Problem in the Euclidean Plane. Networks, 15:323-345, 1985.
[463] P. Winter. Generalized Steiner Problem in Series-Parallel Networks. J. of Algorithms, 7:549-566, 1986.
[464] P. Winter. Steiner Problems in Networks: A Survey. Networks, 17:129-167, 1987.
[465] B.Y. Wu and K.-M. Chao. Spanning Trees and Optimization Problems. Chapman and Hall, 2004.
[466] X. Du, X. Hu, and X. Jia. On Shortest k-edge-connected Steiner Networks in Metric Spaces. Journal of Combinatorial Optimization, 4:99-107, 2000.
[467] G. Xue and C. Wang. The Euclidean facilities location problem. In D.Z. Du and J. Sun, editors, Advances in Optimization and Approximation, pages 313-331. Kluwer Academic Publishers, 1994.
[468] B.H. Yandell. The Honors Class - Hilbert's Problems and Their Solvers. A K Peters, Natick, Massachusetts, 2002.
[469] A.C. Yao. An O(|E| log log |V|) algorithm for finding minimum spanning trees. Inform. Process. Lett., 4:21-23, 1975.
[470] H. Yockey. Information Theory and Molecular Biology. Cambridge University Press, 1992.
[471] M. Zachariasen. The Rectilinear Steiner Tree Problem: A Tutorial. In X. Cheng and D.Z. Du, editors, Steiner Trees in Industry, pages 467-507. Kluwer Academic Publishers, 2001.
[472] N. Zadeh. Construction of Efficient Tree Networks: The Pipeline Problem. Networks, 3:1-31, 1973.
[473] A.Z. Zelikovsky. An 11/6-Approximation Algorithm for the Steiner problem on graphs. Ann. of Discrete Mathematics, 41:351-354, 1992.
[474] B. Zelinka. Medians and peripherians of trees. Arch. Math. (Brno), 4:87-95, 1968.
[475] C. Zong. Sphere Packings. Springer, 1999.
[476] A.A. Zykov. Theory of Finite Graphs (Russian). Novosibirsk, 1969.
INDEX
Achievement of the Steiner ratio, 78
Acyclic, 13
Adjacency matrix, 37
Adjacent, 12
Algorithm, 59
  nondeterministic, 66
Alignment, 141
  induced, 132
  local, 149
  multiple, 130
  optimal, 148
  pairwise, 141
Alphabet, 132
  binary, 132
  extended, 151
Ancestor, 183
  immediate, 185
Approximable, 75
Approximation, 74
Arc, 183
Asymptotic behavior, 175
Average-case performance, 63
Ball, 10, 37, 195
Ball family, 196
Banach-Minkowski space, 40
Banach-Wiener space, 40
Banach space, 40
Base pairs, 131
Bellman's principle, 47
Binary matrix, 37
Binary MST, 117
Binary search, 63
Binary tree, 13, 180
Bipartite, 53
Bisector, 71
Boolean matrix, 37
Boruvka's algorithm, 16
Bounded Degree Minimum Spanning Tree, 117
  generalized, 118
Bounded set, 37
Bp, 131
Bracket format, 181
Branch and bound, 52
Bridge, 14
Cayley's tree formula, 179
Center, 10
Center function, 9
Center tree, 208
Central dogma, 209
Central Dogma of Molecular Biology, 91, 132
Chain, 13
Character-state matrix, 159
Character, 126, 138
Chinese Postman Problem, 109
Chomsky hierarchy, 134
Chromosome, 131
Church's thesis, 60
Circle, 10
Class, 193
Classes of complexity, 62
Classification, 162, 192
Classification
  metric, 195
Cluster-distance, 218
Cluster, 193
Common ancestor, 164
  last universal, 164
  most recent, 164
Compatible splits, 182
Complete alignment, 153
Complete graph, 13
Complete linkage clustering, 220
Component, 14
Computational complexity, 63
Concatenation, 133
Connected component, 14
Connected graph, 13
Connected
  strongly, 185
Consensus letter, 155
Consensus sequence, 154
Consensus Sequence Problem, 155
Consensus tree, 227
Consensus
  majority, 228
  strict, 228
Construction with ruler and compass, 5
Content, 193
Convex polytope, 87
Cook's hypothesis, 67
Cost measure, 138
  generalized, 151
Counting problem, 174
Covering, 10
Cubic graph, 12
Cubic time, 62
Cycle, 12-13
Decomposition of a tree, 102
Degree, 12
Delaunay triangulation, 20, 71, 103
Dendrogram, 198
Depth of the tree, 186
Diameter, 201
Diameter of a graph, 49
Different and isomorphic, 175
Digraph, 185
Dijkstra's algorithm (minimum spanning tree), 68
Dijkstra's algorithm (shortest path), 47
Directed graph, 185
Direction of an arc, 185
Discrete metric, 11
Discrete metric space, 41
Dissimilarity, 32
Distance, 32
Distance graph, 46
Distance matrix, 217
Distance tree, 37
Diversity, 128, 165
Divide and Conquer method, 229
DNA, 131
Dominance region, 70
DT, 20, 71, 105
Dynamic programming algorithm, 147
Dynamic programming technique, 47
Edge, 12
Edit distance, 136
Elementary tree transformation, 201
Embedding of a graph in a metric space, 32
Empty circle condition, 105
Empty word, 133
Endvertex, 12
Enumeration problem, 174, 205
Error of an algorithm, 75
Euclidean metric, 42
Euclidean plane, 3, 42
Euler's formula, 53
Eulerian chain, 107
Eulerian cycle, 107
Eulerian graph, 107
Farris' method, 213
Fermat's Problem, 2, 206
Fermat function, 3, 206
    generalized, 85
Forest, 14
Full components, 102
Full tree, 25, 43
Fully polynomial approximation scheme, 75
Gap, 142
Gauss' question, 22
Generating graphs, 188
Generating trees, 189
Genome, 131
Geodesic curve, 42
Geometry of Numbers, 202
Gilbert-Pollak conjecture, 79
Graph, 11
Graph alignment, 132
Graph regular, 12
Greedy algorithm, 18
Greedy Tree, 84
GT, 84
Hadwiger number, 94
Hamiltonian cycle, 108
Hamiltonian graph, 108
Hamming-distance extended, 137
Hamming distance, 135
Hamming weight, 203
Heuristic, 74
Hierarchy, 197
Hierholzer's algorithm, 108
Hilbert's fourth problem, 40
Homologous, 138
Hypercube, 135, 173
Incident, 12
Indegree, 185
Indel, 139
Internal edge, 184
Internal vertex, 14
Intractable, 66
Isomorphic, 174
Isomorphism, 174
K-connected Steiner ratio, 116
K-connected, 114
K-edge-connected graph, 113
K-edge-connected Minimum Spanning Network, 116
K-edge-connected Steiner Minimal Network, 114
K-edge-MSN, 116
K-size Steiner ratio, 103
K-size SMT, 102
K-size tree, 102
K-SMT, 93
K-vertex-connected, 114
Kissing number, 94
    standard, 146
Match, 142
Matching, 109
Mathematics, 36
Matrix of admittance, 199
Maximum Parsimony, 128, 213
Maximum parsimony problem, 156
Maximum Parsimony Tree, 129
MCC, 10
MDST, 120
Median, 9, 207
Median function, 9
Median tree, 208
Melzak's algorithm, 23
Metaphysics, 35
Metric, 32
Metric closure, 46
Metric space, 31
Metric space of all spanning trees, 201
Metric space of N-trees, 183
Metric space of rooted N-trees, 195
Million years ago, 164
Minimum covering circle, 10
Minimum Diameter Spanning Tree, 119
Minimum Perfect Matching Problem, 110
Minimum spanning tree, 16
Minimum spanning tree problem, 11, 16
Minkowski functional, 39
Mismatch, 142
Mitochondrial Eve, 167
Monotonic iterative algorithm, 99
MR, 154
MST, 19
MtEve, 167
Multigraph, 107
Multiple alignment optimal, 152
Multiple connected graph, 113
Multiregional model, 167
Mya, 164
N-tree, 177
    rooted, 178
Neighbor, 12
Network, 11, 16, 46
Newick format, 181
Norm, 39
NP-complete, 66
NP-hard, 66
NP, 66
NPC, 66
NPI, 67
Number-P-complete, 206
Number of leaves in a tree, 15
Number of spanning trees, 199
Number of splits, 182
Number of words, 204
Ockham's razor, 129, 213
Optimal algorithm, 63
Oracle, 68
Order notations, 62
Order of growing, 61, 175
Out of Africa model, 167
Outdegree, 185
P-norm, 85
P, 65
Pair group method, 198, 217
Partial order, 64
Partially ordered set, 64
Path, 13, 13
Paths independent, 113
Perfect matching, 12, 109
Perfect phylogeny, 159
Perfect phylogeny problem, 139
Perfect phylogeny problem, 215
PGM, 198, 217
Phylogenetic space, 136
Phylogenetic tree, 123
Phylogenetic Tree Problem, 127
Phylogeny, 123, 124
Planar, 33
Planar code, 191
Polyhedral approach, 31
Polynomial algorithm, 63
Polynomially bounded, 63
Poset, 64
Prüfer's decoding, 190
Prüfer's encoding, 189
Prüfer code, 189
Prim's algorithm, 68
Problem of Classification, 162
Problem of minimal covering, 10
Problem of MST, 19
Protein, 131
Pseudo-length, 136
Pseudometric, 32
Quadratic time, 62
Radius, 202
RAM, 39
Random Access Machine, 59
Real-RAM, 60
Rectilinear Steiner Problem, 89
Regular, 12
Relative neighbor, 104
Relative neighborhood graph, 103
Relative Neighborhood Problem, 104
RNG, 103
Root, 178
Root of the tree, 178
Rooted binary tree, 187
Rooted N-tree, 178
Rooted tree, 186
Rooting a tree, 187
Sausage, 91
Score matrix, 142
Scoring system, 142
Searching, 65
Selection, 64
Sequence, 133
Sequence space, 133
Shortest path, 46
Shortest Path Problem, 46
Shortest supersequence, 146
Single linkage clustering, 220
Size of a full component, 102
SMT, 24, 33
Solution, 38
Sorting, 64
Spanner, 121
Spanning tree, 16, 199
Spanning tree problem, 20
Split, 182
Split metric, 184
Star, 15
Star alignment, 133
Steiner's Network Problem, 114
Steiner's Problem, 22
Steiner's Problem in Networks, 49
Steiner's Problem
    discrete version, 71
    restricted, 93
    weighted modification, 97
Steiner hull, 38
Steiner Minimal Tree, 24, 33
Steiner point, 24, 33
Steiner ratio, 76
Steiner ratio of sequence spaces, 138
Steiner ratio of the Phylogenetic space, 137
Stemma, 170
Stirling's approximation, 63
Stirling's formula, 63
Stirling's inequalities, 63
String, 133
Subgraph, 13
    induced, 13
Subspace, 32
Subtree, 13
Successor, 183
    immediate, 18.1
Sum-of-pairs alignment, 153
Supertree, 229
Supertree Problem, 229
Symmetric difference, 183
Taxa, 192
Taxon, 159
Taxonomy, 160
The Consensus Tree Problem, 227
Time complexity, 61
TM, 60
Torricelli point, 2, 207
Total length, 33
Traveling Salesman Problem, 110
Traveling Salesman Tour, 110
Tree, 14
Tree alignment, 133
Tree for N, 178
Tree labelled, 177
Triangle inequality, 32
    weaker form, 106
Triangulation, 106
TST, 110
Turing Machine, 60
Ultrametric, 196
Ultrametric space, 72
Union of graphs, 13
UPGMA, 221
Vertex, 12
VLSI layout, 89
Voronoi cell, 19, 70
Voronoi diagram, 19, 70
Weiszfeld algorithm, 7, 86
Word, 133
Word game, 197
Worst-case performance, 63
WPGMA, 221