PROCEEDINGS OF THE FIFTH WORKSHOP ON
ALGORITHM ENGINEERING AND EXPERIMENTS
PROCEEDINGS OF THE FIFTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS
Edited by Richard E. Ladner
Society for Industrial and Applied Mathematics Philadelphia
Proceedings of the Fifth Workshop on Algorithm Engineering and Experiments, Baltimore, MD, January 11, 2003.

The workshop was supported by the ACM Special Interest Group on Algorithms and Computation Theory and the Society for Industrial and Applied Mathematics.

Copyright © 2003 by the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Catalog Card Number: 2003103485

ISBN 0-89871-542-3
SIAM is a registered trademark.
CONTENTS
Preface

Implementing External Memory Algorithms and Data Structures (Abstract of Invited Talk)
Lars Arge

Open Problems from ALENEX 2003
Erik D. Demaine

The Cutting-Stock Approach to Bin Packing: Theory and Experiments
David L. Applegate, Luciana S. Buriol, Bernard L. Dillard, David S. Johnson, and Peter W. Shor

The Markov Chain Simulation Method for Generating Connected Power Law Random Graphs
Christos Gkantsidis, Milena Mihail, and Ellen Zegura

Finding the k Shortest Simple Paths: A New Algorithm and Its Implementation
John Hershberger, Matthew Maxel, and Subhash Suri

Efficient Exact Geometric Predicates for Delaunay Triangulations
Olivier Devillers and Sylvain Pion

Computing Core-Sets and Approximate Smallest Enclosing HyperSpheres in High Dimensions
Piyush Kumar, Joseph S. B. Mitchell, and E. Alper Yildirim

Interpolation over Light Fields with Applications in Computer Graphics
F. Betul Atalay and David M. Mount

Practical Construction of Metric t-Spanners
Gonzalo Navarro and Rodrigo Paredes

I/O-efficient Point Location Using Persistent B-Trees
Lars Arge, Andrew Danner, and Sha-Mayn Teh

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
Ranjan Sinha and Justin Zobel

Train Routing Algorithms: Concepts, Design Choices, and Practical Considerations
Luzi Anderegg, Stephan Eidenbenz, Martin Gantenbein, Christoph Stamm, David Scot Taylor, Birgitta Weber, and Peter Widmayer

On the Implementation of a Swap-Based Local Search Procedure for the p-Median Problem
Mauricio G. C. Resende and Renato F. Werneck

Fast Prefix Matching of Bounded Strings
Adam L. Buchsbaum, Glenn S. Fowler, Balachander Krishnamurthy, Kiem-Phong Vo, and Jia Wang

Author Index
Preface
The annual workshop on Algorithm Engineering and Experiments (ALENEX) provides a forum for the presentation of original research in the implementation and experimental evaluation of algorithms and data structures. ALENEX 2003 was the fifth workshop in this series. It was held in Baltimore, Maryland, on January 11, 2003. These proceedings collect extended versions of the 12 papers that were selected for presentation from a pool of 38 submissions.

We would like to thank the authors and reviewers who helped make ALENEX 2003 a success. We also thank our invited speaker, Lars Arge of Duke University. Special thanks go to SIAM for taking over the arrangements of the workshop and for publishing these proceedings for the first time. Thanks also go to ACM and SIGACT for supporting the electronic submission and electronic program committee meeting used by the workshop. Finally, thanks go to members of the Steering Committee who helped ease the work of the Program Committee.

January 2003
Richard E. Ladner
ALENEX 2003 Program Committee
David Bader, University of New Mexico
Michael Bender, State University of New York, Stony Brook
Gerth Brodal, University of Aarhus, Denmark
Larry Carter, University of California, San Diego
Edith Cohen, AT&T Labs
Tom Cormen, Dartmouth College
Erik Demaine, Massachusetts Institute of Technology
Sandy Irani, University of California, Irvine
Richard Ladner (Chair), University of Washington
ALENEX 2003 Steering Committee
Adam Buchsbaum, AT&T Labs
Roberto Battiti, University of Trento, Italy
Andrew V. Goldberg, Microsoft Research
Michael Goodrich, University of California, Irvine
David S. Johnson, AT&T Labs
Catherine C. McGeoch, Amherst College
David Mount, University of Maryland
Bernard M.E. Moret, University of New Mexico
Jack Snoeyink, University of North Carolina, Chapel Hill
Clifford Stein, Columbia University
Implementing External Memory Algorithms and Data Structures (Invited talk) Lars Arge*
Department of Computer Science, Duke University, Durham, NC 27708, USA

Many modern applications store and process datasets much larger than the main memory of even state-of-the-art high-end machines. In such cases, the Input/Output (or I/O) communication between fast internal memory and slow disks, rather than actual internal computation time, can become a major performance bottleneck. In the last decade, much attention has therefore been focused on the development of theoretically I/O-efficient algorithms and data structures [3, 13]. In this talk we discuss recent efforts at Duke University to investigate the practical merits of theoretically developed I/O-efficient algorithms. We describe the goals and architecture of the TPIE environment for efficient implementation of I/O-efficient algorithms [12, 10, 4], as well as some of the implementation projects conducted using the environment [9, 8, 2, 11, 5, 7, 6, 1], and discuss some of the experiences we have had and lessons we have learned in these projects. We especially discuss the TerraFlow system for efficient flow computation on massive grid-based terrain models, developed in collaboration with environmental researchers [5]. Finally we discuss how the implementation and experimentation work has supported educational efforts.
* Supported in part by the National Science Foundation through ESS grant EIA-9870734, RI grant EIA-9972879, CAREER grant CCR-9984099, ITR grant EIA-0112849, and U.S.-Germany Cooperative Research Program grant INT-0129182. Email: large@cs.duke.edu.

References
[1] P. K. Agarwal, L. Arge, and S. Govindarajan. CRB-tree: An optimal indexing scheme for 2d aggregate queries. In Proc. International Conference on Database Theory, 2003.
[2] P. K. Agarwal, L. Arge, O. Procopiuc, and J. S. Vitter. Bkd-tree: A dynamic scalable kd-tree. Manuscript, 2002.
[3] L. Arge. External memory data structures. In J. Abello, P. M. Pardalos, and M. G. C. Resende, editors, Handbook of Massive Data Sets, pages 313-358. Kluwer Academic Publishers, 2002.
[4] L. Arge, R. Barve, D. Hutchinson, O. Procopiuc, L. Toma, D. E. Vengroff, and R. Wickremesinghe. TPIE User Manual and Reference (edition 082902). Duke University, 2002. The manual and software distribution are available on the web at http://www.cs.duke.edu/TPIE/.
[5] L. Arge, J. Chase, P. Halpin, L. Toma, D. Urban, J. S. Vitter, and R. Wickremesinghe. Flow computation on massive grid terrains. GeoInformatica, 2003. To appear. Earlier version appeared in Proc. 10th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS'01).
[6] L. Arge, A. Danner, and S.-M. Teh. I/O-efficient point location using persistent B-trees. In Proc. Workshop on Algorithm Engineering and Experimentation, 2003.
[7] L. Arge, K. H. Hinrichs, J. Vahrenhold, and J. S. Vitter. Efficient bulk operations on dynamic R-trees. Algorithmica, 33(1):104-128, 2002.
[8] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, and J. S. Vitter. A unified approach for indexed and non-indexed spatial joins. In Proc. Conference on Extending Database Technology, pages 413-429, 1999.
[9] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter. Scalable sweeping-based spatial join. In Proc. International Conf. on Very Large Databases, pages 570-581, 1998.
[10] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient data structures using TPIE. In Proc. Annual European Symposium on Algorithms, pages 88-100, 2002.
[11] L. Arge, L. Toma, and J. S. Vitter. I/O-efficient algorithms for problems on grid-based terrains. In Proc. Workshop on Algorithm Engineering and Experimentation, 2000.
[12] D. E. Vengroff. A transparent parallel I/O environment. In Proc. DAGS Symposium on Parallel Computation, 1994.
[13] J. S. Vitter. External memory algorithms and data structures: Dealing with MASSIVE data. ACM Computing Surveys, 33(2):209-271, 2001.
Open Problems from ALENEX 2003
Erik D. Demaine*

The following is a list of the problems presented on January 11, 2003 at the open-problem session of the 5th Workshop on Algorithm Engineering and Experiments held in Baltimore, Maryland, with Richard Ladner as program chair.

Markov-Style Generation Algorithms
Catherine McGeoch, Amherst College

How do we know empirically when a Markov chain has converged and we can stop iterating? There are many Markov-style algorithms for generating random instances, e.g., random graphs that have k-colorings, random graphs that obey the triangle inequality, and random connected power-law graphs [GMZ03]. Ideally, we could prove theorems on the rate of convergence of these Markov processes. But often such theorems are not known, and we have to determine when to stop based on experiments. Many ad-hoc methods are reported in the literature; can we be more systematic in our choices of convergence rules? What are people's experiences, stories, and rules of thumb for empirically detecting convergence?

References
[GMZ03] Christos Gkantsidis, Milena Mihail, and Ellen Zegura. The Markov chain simulation method for generating connected power law random graphs. In Proceedings of the 5th Workshop on Algorithm Engineering and Experiments, 2003. To appear.

TSP Approximations in High-Dimensional Hamming Space
David Johnson, AT&T Labs - Research

Consider a Traveling Salesman Problem instance of 100,000-1,000,000 cities, each represented by a 0/1 vector in ~64,000-dimensional Hamming space with ~1,000 nonzero entries (so relatively sparse). In this situation, implementing even some of the simplest TSP heuristics is difficult, in particular because the full distance matrix is too big to be stored. For example, how can we efficiently compute the nearest-neighbor TSP heuristic in practice? For this algorithm, we need an efficient nearest-neighbor data structure that supports queries of, given a city, find the nearest city in Hamming space. One option is to try Ken Clarkson's nearest-neighbor code for general metrics. Another option, if no exact methods work well, is to try the plethora of efficient (1 + ε)-approximate nearest-neighbor data structures based on dimensionality reduction. See e.g. [Ind00]. Another interesting example is computing the 3-opt TSP heuristic. Here we need a data structure to query the k nearest neighbors of each city for k = 10 or 20.

* MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA 02139, USA.
References
[Ind00] Piotr Indyk. High-dimensional Computational Geometry. PhD Thesis, Stanford University, 2000.

Generating Synthetic Data
Adam Buchsbaum, AT&T Labs - Research

In a variety of experimental setups, real data is hard to come by or is too secret to be distributed. How can we best generate synthetic data that is "like" real data when we have access to only a little (seed) real data? How can we then test whether we've succeeded in generating "good" data? In particular, how can we empirically validate models of data sources? One option is just to check whether your implemented algorithms behave similarly on real vs. synthetic data, but this comparison may not capture everything you care about in the data. Solutions to these problems are likely application-dependent. In the particular application of interest here, the data consists of traces of traffic through network routers, in particular, the IP addresses of packets that pass through a particular router.

Recording Experiments
Erik Demaine, MIT

What should an electronic ALENEX proceedings look like? This question incurred a lot of discussion. Some of the ideas are as follows:
1. Include the code and methods/instructions for compiling the code.
2. Include the input (when possible) and the input generators.
3. Include scripts to run the code for various inputs and generate output tables.
4. Include the code/script/setup to plot the output data (when possible).
5. Include the results (in all detail) and a description of the experimental setup.
6. Mark certain results as reproducible when that's the case. Examples of exactly reproducible quantities include approximation factors, numbers of comparisons, and numbers of operations. Often these results are included in addition to setup-dependent results such as running times.
The idea is that these additions to the proceedings beyond the usual paper would give us better records of what was done, so that past experiments could later be revalidated on current machines, extended to more algorithms or inputs, or challenged with respect to implementation quality, experimental setup, etc.
The Cutting-Stock Approach to Bin Packing: Theory and Experiments
DAVID L. APPLEGATE*  LUCIANA S. BURIOL†  BERNARD L. DILLARD‡  DAVID S. JOHNSON§  PETER W. SHOR¶

Abstract
We report on an experimental study of the Gilmore-Gomory cutting-stock heuristic and related LP-based approaches to bin packing, as applied to instances generated according to discrete distributions. No polynomial running time bound is known to hold for the Gilmore-Gomory approach, and empirical operation counts suggest that no straightforward implementation can have average running time O(m^3), where m is the number of distinct item sizes. Our experiments suggest that by using dynamic programming to solve the unbounded knapsack problems that arise in this approach, we can robustly obtain average running times that are o(m^4) and feasible for m well in excess of 1,000. This makes a variant on the previously un-implemented asymptotic approximation scheme of Fernandez de la Vega and Lueker practical for arbitrarily large values of m and quite small values of ε. We also observed two interesting anomalies in our experimental results: (1) running time decreasing as the number n of items increases and (2) solution quality improving as running time is reduced and an approximation guarantee is weakened. We provide explanations for these phenomena and characterize the situations in which they occur.
1 Introduction
In the classical one-dimensional bin packing problem, we are given a list L = (a_1, ..., a_n) of items, a bin capacity B, and a size s(a_i) ∈ (0, B] for each item in the list. We wish to pack the items into a minimum number of bins of capacity B, i.e., to partition the items into a minimum number of subsets such that the sum of the sizes of the items in each subset is B or less.

* AT&T Labs, Room C224, 180 Park Avenue, Florham Park, NJ 07932, USA. Email: david@research.att.com.
† UNICAMP - Universidade Estadual de Campinas, DENSIS/FEEC, Rua Albert Einstein 400 - Caixa Postal 6101, Campinas - SP - Brazil. Email: buriol@densis.fee.unicamp.br. Work done while visiting AT&T Labs.
‡ Department of Mathematics, University of Maryland, College Park, MD 20742. Email: bld@math.umd.edu. Work done while visiting AT&T Labs.
§ AT&T Labs, Room C239, 180 Park Avenue, Florham Park, NJ 07932, USA. Email: dsj@research.att.com.
¶ AT&T Labs, Room C237, 180 Park Avenue, Florham Park, NJ 07932, USA. Email: shor@research.att.com.
This problem is NP-hard and has a long history of serving as a test bed for the study of new algorithmic techniques and forms of analysis. Much recent analysis (e.g., see [3, 4, 6]) has concerned the average case behavior of heuristics under discrete distributions. A discrete distribution F consists of a bin size B ∈ Z+, a sequence of positive integral sizes s_1 < s_2 < ... < s_m ≤ B, and an associated vector p_F = (p_1, p_2, ..., p_m) of rational probabilities such that Σ_{j=1}^m p_j = 1. In a list generated according to this distribution, the ith item a_i has size s(a_i) = s_j with probability p_j, chosen independently for each i ≥ 1. The above papers analyzed the asymptotic expected performance under such distributions for such classical bin packing heuristics as Best and First Fit (BF and FF), Best and First Fit Decreasing (BFD and FFD), and the new Sum-of-Squares heuristic of [6, 7]. Three of the above algorithms are online algorithms, and for these the order of the items in the list is significant. However, if we are allowed to do our packing offline, i.e., with foreknowledge of the entire list of items to be packed, then there is a much more compact representation for an instance generated according to a discrete distribution: simply give a list of pairs (s_i, n_i), 1 ≤ i ≤ m, where n_i is the number of items of size s_i. This is the way instances are represented in a well-known special case of bin packing, the one-dimensional cutting-stock problem, which has many industrial applications. For such problems, an approach using linear programming plus knapsack-based column generation, due to Gilmore and Gomory [14, 15], has for 40 years been the practitioner's method of choice because of its great effectiveness when m is small. The packings it produces cannot use more than OPT(L) + m bins (and typically use significantly fewer) and although the worst-case time bound for the original algorithm may well be exponential in m, in practice running time does not seem to be a problem. In this paper we examine the Gilmore-Gomory approach and some of its variants from an experimental point of view, in the context of instances generated according to discrete distributions. We do this both to
get a clearer idea of how the Gilmore-Gomory approach scales as m, n, and B grow (and how to adapt it to such situations), and also to gain perspective on the existing results for classical bin packing algorithms. Previous experiments with Gilmore-Gomory appeared primarily in the Operations Research literature and typically concentrated on instances with m ≤ 100, where the approach is known to work quite quickly [10, 11, 21]. The restriction of past studies to small m has two explanations: (1) most real-world cutting stock applications have m ≤ 100 and (2) for m this small, true optimization becomes possible via branch-and-bound, with Gilmore-Gomory providing both lower and, with rounding, upper bounds (e.g., see [10, 11]). Here we are interested in the value of the LP-based approaches as approximation algorithms and hence our main results go well beyond previous studies. We consider instances with m as large as is computationally feasible (which in certain cases can mean m = 50,000 or more). This will enable us to pose plausible hypotheses about how running time and solution quality typically scale with instance parameters.

In Section 2 we describe the original Gilmore-Gomory approach and survey the relevant literature. In Section 3 we describe an alternative linear programming formulation for computing the Gilmore-Gomory bound using a flow-based model, independently proposed by Valério de Carvalho [8, 9] and Csirik et al. [6, 7]. This approach can be implemented to run in time polynomial in m, log n, and B (a better bound than we have for Gilmore-Gomory) but to our knowledge has not previously been studied computationally. In Section 4 we discuss the key grouping technique introduced in the asymptotic fully-polynomial-time approximation scheme for bin packing of Fernandez de la Vega and Lueker [12], another algorithmic idea that does not appear to have previously been tested experimentally (and indeed could not have been tested without an efficient Gilmore-Gomory implementation or its equivalent). In Section 5 we describe the instance classes covered by our experiments and summarize what is known theoretically about the average case performance of classical bin packing heuristics for them. Our results and conclusions are presented in Section 6.

2 The Gilmore-Gomory Approach
The Gilmore-Gomory approach is based on the following integer programming formulation of the cutting stock problem. Suppose our instance is represented by the list L of size/quantity pairs (s_i, n_i), 1 ≤ i ≤ m. A nonnegative integer vector p = (p[1], p[2], ..., p[m]) is said to be a packing pattern if Σ_{i=1}^m p[i] s_i ≤ B. Suppose there are t distinct packing patterns p_1, ..., p_t for the
given set of item sizes. The integer program has a variable x_j for each pattern p_j, intended to represent the number of times that pattern is used, and asks us to minimize Σ_{j=1}^t x_j subject to the constraints

    Σ_{j=1}^t p_j[i] x_j ≥ n_i,   1 ≤ i ≤ m,
    x_j ≥ 0 and integral,         1 ≤ j ≤ t.
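To make the size of this formulation concrete, here is a small Python sketch (our own illustration, not from the paper; the instance data are made up) that enumerates every packing pattern for a toy instance and prints the resulting covering constraints. Even for modest m and B the number t of patterns grows rapidly, which motivates the column-generation strategy described below.

```python
def all_patterns(sizes, B):
    """Enumerate every nonnegative integer vector p with
    sum(p[i] * sizes[i]) <= B, i.e., every packing pattern."""
    m = len(sizes)

    def extend(i, remaining, partial):
        if i == m:
            yield tuple(partial)
            return
        for copies in range(remaining // sizes[i] + 1):
            yield from extend(i + 1, remaining - copies * sizes[i], partial + [copies])

    yield from extend(0, B, [])


if __name__ == "__main__":
    sizes = [3, 4, 5]      # illustrative item sizes s_i
    demands = [4, 2, 3]    # n_i: number of items of each size
    B = 10                 # bin capacity
    patterns = list(all_patterns(sizes, B))
    print(f"t = {len(patterns)} packing patterns for m = {len(sizes)}, B = {B}")
    # The IP: minimize sum_j x_j subject to, for each size i,
    # sum_j patterns[j][i] * x_j >= demands[i], with x_j >= 0 and integral.
    for i, (s, n) in enumerate(zip(sizes, demands)):
        coeffs = [p[i] for p in patterns]
        print(f"size {s}: sum_j p_j[{i}] * x_j >= {n}, coefficients {coeffs}")
```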
The solution value for the linear programming relaxation of this integer program, call it LP(L), is a lower bound on the optimal number of bins. Moreover, it is a very good one. For note that in a basic optimal solution there will be at most m non-zero variables, and hence at most m fractional variables. If one rounds each of these up to the nearest integer, one gets a packing of a superset of the items in the original instance that uses fewer than LP(L) + m bins. Thus an optimal packing of the original set of items can use no more bins and so OPT(L) ≤ LP(L) + m in the worst case, and we can get a packing at least this good. In practice, a better rounding procedure is the following one, recommended by Wäscher and Gau [21]: Round the fractional variables down and handle the unpacked items using FFD. It is easy to prove that this "round down" approach also satisfies an OPT(L) + m worst-case bound. Our experiments suggest that in practice its excess over OPT(L) is typically no more than 4% of m.

There is an apparent drawback to using the LP formulation, however: the number t of packing patterns can be exponential in m. The approach suggested by Gilmore and Gomory was to avoid listing all the patterns, and instead generate new patterns only when needed. Suppose one finds a basic optimal solution to the above LP restricted to some particular subset of the patterns. (In practice, a good starting set consists of the patterns induced by an FFD packing of the items.) Let y_i, 1 ≤ i ≤ m, be the dual variables associated with the solution. Then it is an easy observation that the current solution can only be improved if there is a packing pattern p' not in our subset such that Σ_{i=1}^m p'[i] y_i > 1, in which case adding the variable for such a pattern may improve the solution. (If no such pattern exists, our current solution is optimal.) In practice it pays to choose the pattern with the largest value of Σ_{i=1}^m p'[i] y_i [15]. Note that finding such a pattern is equivalent to solving an unbounded knapsack problem where B is the knapsack size, the s_i's are the item sizes, and the y_i's are the item values. We thus have the following procedure for solving the original LP.

1. Use FFD to generate an initial set of patterns P.
2. Solve the LP based on pattern set P.
3. While not done do the following:
   (a) Solve the unbounded knapsack problem induced by the current LP.
   (b) If the resulting pattern has value 1 or less, we are done.
   (c) Otherwise add the pattern to P and solve the resulting LP.
4. Derive a packing from the current LP solution by rounding down.

The original approach of [14, 15] did not solve the LP in step (3a) but simply performed a single pivot. However, the reduction in iterations obtained by actually solving the LP's more than pays for itself, and this is the approach taken by current implementations. There are still several potential computational bottlenecks here: (1) We have no subexponential bound on the number of iterations. (2) Even though modern simplex-based LP codes in practice seem to take time bounded by low order polynomials in the size of the LP, this is not a worst-case guarantee. (3) The unbounded knapsack problem is itself NP-hard.

Fortunately, there are ways to deal with this last problem. First, unbounded knapsack problems can be solved in (pseudo-polynomial) time O(mB) using dynamic programming. This approach was proposed in the first Gilmore-Gomory paper [14]. A second approach was suggested in the follow-up paper [15], where Gilmore and Gomory observed that dynamic programming could often be bested by a straightforward branch-and-bound algorithm (even though the worst-case running time for the latter is exponential rather than pseudo-polynomial). The current common wisdom [2, 21] is that the branch-and-bound approach is to be preferred, and this is indeed the case for the small values of m considered in previous studies. In this paper we study the effect of using relatively straightforward implementations of both approaches, and conclude that solving the knapsack problems is not typically the computational bottleneck.

Note that the times for both the LP and the knapsack problems are almost independent of the number of items n. Moreover, the initial set of patterns can be found in time O(m^2) by generating the FFD packing in size-by-size fashion rather than item-by-item. The same bound applies to our rounding procedure. Thus for any fixed discrete distribution, the Gilmore-Gomory approach should be asymptotically much faster than any of the classical online bin packing heuristics, all of which pack item-by-item and hence have running times that are Ω(n). Of course, in applications where the items must actually be individually assigned to bins, any algorithm must be Ω(n), but in many situations we are only looking for a packing plan or for bounds on the number of bins needed.
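To make the loop concrete, here is a short Python sketch of steps 2 and 3 with dynamic-programming pricing. This is our own illustration rather than the authors' C implementation: the LP solve is abstracted behind a caller-supplied solve_lp routine (assumed to return the optimum together with its dual values y_i), and for brevity the initial patterns are the trivial single-size patterns rather than the FFD-derived ones used in the paper.

```python
def best_pattern(sizes, values, B):
    """O(m*B) dynamic program for the unbounded knapsack problem:
    maximize sum(values[i] * p[i]) subject to sum(sizes[i] * p[i]) <= B
    with each p[i] a nonnegative integer.  Returns (best_value, pattern)."""
    m = len(sizes)
    best = [0.0] * (B + 1)   # best[c] = maximum value achievable with capacity c
    last = [-1] * (B + 1)    # item added last to reach best[c] (-1 if none)
    for c in range(1, B + 1):
        for i in range(m):
            if sizes[i] <= c and best[c - sizes[i]] + values[i] > best[c]:
                best[c] = best[c - sizes[i]] + values[i]
                last[c] = i
    pattern, c = [0] * m, B
    while c > 0 and last[c] != -1:   # walk back to recover the pattern itself
        pattern[last[c]] += 1
        c -= sizes[last[c]]
    return best[B], pattern


def gilmore_gomory_lp(sizes, demands, B, solve_lp, eps=1e-9):
    """Column generation for the pattern-based LP (steps 2-3 above).

    solve_lp(patterns, demands) is assumed to solve
        minimize sum_j x_j  s.t.  sum_j patterns[j][i] * x_j >= demands[i], x >= 0
    and to return (objective_value, x, duals); plug in any LP solver here."""
    m = len(sizes)
    # Trivial starting columns: pattern j repeats size j as often as it fits.
    patterns = [[(B // sizes[j]) if i == j else 0 for i in range(m)] for j in range(m)]
    while True:
        obj, x, duals = solve_lp(patterns, demands)        # step 2 / step 3(c)
        value, pattern = best_pattern(sizes, duals, B)     # step 3(a): pricing
        if value <= 1.0 + eps:                             # step 3(b): no improving pattern
            return obj, patterns, x
        patterns.append(pattern)                           # add the new column and re-solve
```

The pricing call best_pattern(sizes, duals, B) is exactly the unbounded knapsack problem described above, and the test value <= 1 + eps mirrors the stopping rule in step 3(b).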
3 The Flow-Based Approach
An alternative approach to computing the LP bound, one that models the problem in a flow-oriented way, has recently been proposed independently by Valério de Carvalho [8, 9] and Csirik et al. [6, 7]. We follow the details of the latter formulation, although both embody the same basic idea. Let us view the packing process as placing items into bins one at a time. For each pair of an item size s_i and a bin level h, 0 ≤ h < B, we have a variable v(i, h), intended to represent the number of items of size s_i that are placed into bins whose prior contents totaled h. It is easy to see that the following linear program has the same optimal solution value as the one in the previous section: Minimize Σ_{i=1}^m v(i, 0), i.e., the total number of bins used, subject to

(3.2)   v(i, h) = 0   whenever h + s_i > B,
(3.3)   v(i, h) = 0   whenever 0 < h < s_i,
(3.4)   Σ_{h=0}^{B-1} v(i, h) = n_i,   1 ≤ i ≤ m,
(3.5)   Σ_{i=1}^m v(i, h) ≤ Σ_{k=1}^m v(k, h - s_k),   1 ≤ h ≤ B - 1,

where the value of v(k, h - s_k) when h - s_k < 0 is taken to be 0 by definition for all k. Constraints of type (3.2) say that no item can go into a bin that is too full to have room for it. Constraints of type (3.3) imply that the first item to go in any bin must be larger than the second. (This is not strictly necessary, but helps reduce the number of nonzeros in the coefficient matrix and thus speed up the code.) Constraints of type (3.4) say that all items must be packed. Constraints of type (3.5) say that bins with a given level are created at least as fast as they disappear. Solving the above LP does not directly yield a packing, but one can derive a corresponding set of packing patterns using a simple greedy procedure that will be described in the full paper. Surprisingly, this procedure obeys the same bound on the number of non-zero patterns as does the classical column-generation approach, even though here the LP has m + B constraints. In the full paper we prove the following:
THEOREM 3.1. The greedy procedure for extracting patterns from a solution to the flow-based LP runs in time O(mB) and finds a set C of patterns, |C| ≤ m, that provides an optimal solution to the pattern-based LP.

The flow-based approach has a theoretical advantage over the column-based approach in that it can be implemented to run in provably pseudo-polynomial time using the ellipsoid method. However, the LP involved is much bigger than the initial LP in the pattern-based approach, and it will have Θ(mB) nonzeros, whereas the latter will have O(m^2) (O(m) when the smallest item size exceeds cB for some fixed c > 0). Thus the pattern-based approach may well be faster in practice, even though Ω(m) distinct LP's may need to be solved.

4 Speedups by Grouping
A key idea introduced in the asymptotic approximation scheme for bin packing of [12] is that of grouping. Suppose the items in list L are ordered so that s(a_1) ≥ s(a_2) ≥ ... ≥ s(a_n), suppose g ≪ n, and let K = ⌈n/g⌉. Partition the items into groups G_k, 1 ≤ k ≤ K, where G_k = {a_{g(k-1)+i} : 1 ≤ i ≤ g} for k < K, and G_K = {a_i : g(K-1) < i ≤ n}. Now let L_1 be the list obtained from L by deleting the items of G_1 and rounding the size of each item in G_{k+1}, 1 ≤ k < K, up to s(a_{gk+1}), the largest size in that group. Each item of L_1 is then no larger than a distinct item of L, so OPT(L_1) ≤ OPT(L). On the other hand, there is also a one-to-one correspondence between items in L with items in L_2 = L_1 ∪ G_1 that are at least as large, so OPT(L) ≤ OPT(L_1) + g. Thus if we use one of the previous two LP-based approaches to pack L_1, replace each item of size s(a_{gk+1}) in L_1 by an item from G_{k+1} in L, and then place the items of G_1 into separate bins, we can derive a packing of L that uses at most OPT(L) + ⌈n/g⌉ + g bins. Varying g yields a tradeoff between running time and packing quality. Moreover we can get better packings in practice as follows: After computing the fractional LP solution for L_1, round down the fractional patterns of L_1, obtaining a packing of a subset of L_1, replace the items of size s(a_{gk+1}) in this packing by the largest items in G_{k+1}, and then pack the leftover items from L (including those from G_1) using FFD.

Note that by the construction of L_1, the solution of the LP for it will be a lower bound on LP(L). Thus running this approach provides both a packing and a certificate for how good it is, just as does the original Gilmore-Gomory algorithm.

5 Instance Classes
In this study we considered three distinct classes of discrete distributions for generating our test instances: (1) Discrete Uniform Distributions, (2) Near Uniform Sampled Distributions, and (3) Zipf's Law Sampled Distributions.

5.1 Discrete Uniform Distributions. These are distributions denoted by U{h,j,k}, 1 ≤ h ≤ j ≤ k, in which the bin size B = k and the item sizes are the integers s with h ≤ s ≤ j, all equally likely. Of particular interest is the special case where h = 1, which has been studied extensively from a theoretical point of view, e.g., see [3, 4, 6].

Let L_n(F) denote a random n-item list with item sizes chosen independently according to distribution F, let s(L) denote the lower bound on OPT(L) obtained by dividing the sum of the item sizes in L by the bin size, and let A(L) be the number of bins used in the packing of L generated by algorithm A. Define

    EW_n^A(F) = E[A(L_n(F)) - s(L_n(F))].

Then we know from [4] that EW_n^{OPT}(F) is O(1) for all U{1,j,k} with j < k - 1 and Θ(√n) for j = k - 1. The same holds if OPT is replaced by the online Sum-of-Squares algorithm of [6, 7] (denoted by SS in what follows). The experiments with classical bin packing algorithms reported in [3, 7] concentrated on discrete uniform distributions with k ≤ 100, the most thorough study being the one in [7] for U{1,j,100}, 2 ≤ j ≤ 99, and U{18,j,100}, 19 ≤ j ≤ 99. As we shall see, these present no serious challenge to our LP-based approaches. To examine questions of scaling, we also consider distributions that might arise if we were seeking better and better approximations to the continuous uniform distributions U(0,a], where item sizes are chosen uniformly from the real interval (0,a]. For example the continuous distribution U(0,0.4] can be viewed as the limit of the sequence U{1, 200h, 500h} as h → ∞.

5.2 Bounded Probability Sampled Distributions. These distributions were introduced in [10], expanding on an instance generator introduced in [13], and are themselves randomly generated. To get a distribution of type BS{h,j,k,m}, 1 ≤ h ≤ j ≤ k and m ≤ j - h + 1, we randomly choose m distinct sizes s such that h ≤ s ≤ j, and to each we randomly assign a weight w(s) ∈ [0.1, 0.9]. The probability associated with size s is then w(s) divided by the sum of all the weights. Values of the bin size B = k studied in [10] range up to 10,000, and values of m up to 100. To get an n-item instance of type BS{h,j,k,m}, we randomly generate a distribution of this type and then choose n
item sizes according to that distribution. We consider three general classes of these distributions in our scaling experiments, roughly of the form BS{1, B/2, B, m}, BS{B/6, B/2, B, m}, and BS{B/4, B/2, B, m}. The first sequence mirrors the discrete uniform distributions U{1, B/2, B}. The last two model the situation where there are no really small items, with the third generating instances like those in the famous 3-PARTITION problem. These last two are also interesting since they are unlike the standard test cases previously used for evaluating knapsack algorithms.

5.3 Zipf's Law Sampled Distributions. This is a new class, analogous to the previous one, but with weights distributed according to Zipf's Law. In a type ZS{h,j,k,m} distribution, m sizes are chosen as before. They are then randomly permuted as s_1, s_2, ..., s_m, and we set w(s_i) = 1/i, 1 ≤ i ≤ m. We tested sequences of ZS distributions mirroring the BS distributions above.

6 Results
Our experiments were performed on a Silicon Graphics Power Challenge with 196 MHz MIPS R10000 processors and 1 Megabyte 2nd level caches. This machine has 7.6 Gigabytes of main memory, shared by 28 of the above processors. The parallelism of the machine was exploited only for performing many individual experiments at the same time. The programs were written in C and compiled with the system compiler using -O3 code optimization and 32-bit word options. (Turning off optimization entirely caused the dynamic programming knapsack solutions to take 4 times as long, but had less of an effect on the other codes.) LP's were solved by calling CPLEX 6.5's primal simplex code. (CPLEX 6.5's dual and interior point codes were noncompetitive for our LP's.) The "presolve" option was turned on, which for some instances improved speed dramatically and never hurt much. In addition, for the pattern-based codes we set the undocumented CPLEX parameter FASTMIP to 1, which turns off matrix refactoring in successive calls and yields a substantial speedup. Instances were passed to CPLEX in sparse matrix format (another key to obtaining fast performance).

We should note here that the CPLEX running times reported by the operating system for our larger instances could vary substantially (by as much as a factor of 2) depending on the overall load of the machine. The times reported here are the faster ones obtained when the machine was relatively lightly loaded. Even so, the CPLEX times and the overall running times that include them are the least accurately reproducible statistics we report. Nevertheless, they still suffice to indicate rough trends in algorithmic performance.

Our branch-and-bound code for the unbounded knapsack was similar to the algorithm MTU1 of [18] in that we sped up the search for the next item size to add by keeping an auxiliary array that held for each item the smallest item with lower density y_i/s_i. As a further speed-up trick, we also kept an array which for each item gave the next smallest item with lower density. For simplicity in what follows, we shall let PDP and PBB denote the pattern-based approaches using dynamic programming and branch-and-bound respectively, and FLO denote the flow-based approach. A sampling of the data from our experiments is presented in Tables 1 through 10 and Figures 1 and 2. In the space remaining we will point out some of the more interesting highlights and summarize our main conclusions.

6.1 Discrete Uniform Distributions with k = 100. We ran the three LP-based codes plus FFD, BFD, and SS on five n-item lists generated according to the distributions U{1,j,100}, 2 ≤ j ≤ 99, and U{18,j,100}, 19 ≤ j ≤ 99, for n = 100, 1,000, 10,000, 100,000, and 1,000,000. For these instances the rounded-down LP solutions were almost always optimal, using just ⌈LP(L)⌉ bins, and never used more than ⌈LP(L)⌉ + 1. (No instance of bin packing has yet been discovered for which OPT > ⌈LP(L)⌉ + 1, which has led some to conjecture that this is the worst possible [16, 17, 20].) The value of rounding down rather than up is already clear here, as rounding up (when n = 1,000,000) yielded packings that averaged around ⌈LP⌉ + 12 for PBB and PDP, and ⌈LP⌉ + 16 for FLO. This difference can perhaps be explained by the observation that FLO on average had 45% more fractional patterns than the other two, something that makes more of a difference for rounding up than down.

Table 1 reports average running times for the first set of distributions as a function of n and j for PDP and FLO. (The times for PBB are here essentially the same as those for PDP.) The running times for all three codes were always under a second per instance, so in practice it wouldn't make much difference which code one chose even though FLO is substantially slower than the other two. However, one trend is worth remarking.

For the larger values of m, the average running times for PBB and PDP actually decrease as n increases, with more than a factor of 2 difference between the times for n = 100 and for n = 10,000 when 60 ≤ m ≤ 69. This is a reproducible phenomenon and we will examine possible explanations in the next section.

As to traditional bin packing heuristics, for these distributions both FFD and SS have bounded expected excess except for m = 99. However, while FFD is almost as good as our LP approaches, finding optimal solutions
almost as frequently, SS is much worse. For instance, for j = 90, its asymptotic average excess appears to be something like 50 bins. Both these classical heuristics perform much more poorly on some of the U{l8,j, 100} distributions. Many of these distributions have EW°PT(F] = 0(n), and for these FFD and SS can use as many as 1.1 times the optimal number of bins (a linear rather than an additive excess). 6.2 How can an algorithm take less time when n increases? In the previous section we observed that for discrete uniform distributions U{l,j, 100}, the running times for PBB and PDP decrease as n increases from 100 to 10,000. This phenomenon is not restricted to small bin sizes, as is shown in Table 2, the top half of which covers the distribution U{ 1,600,1000}, with n increasing by factors of roughly \/10 from m to 1000m. Here the total running time consistently decreases as n goes up, except for a slight increase on the first size increment. What is the explanation? It is clear from the data that the dominant factor in the running time decrease is the reduction in the number of iterations as n increases. But why does this happen? A first guess might be that this is due to numeric precision issues. CPLEX does its computations in floating point arithmetic with the default tolerance set at 10~6 and a maximum tolerance setting of 10~9. Thus, in order to guarantee termination, our code has to halt as soon as the current knapsack solution has value 1 + e, where e is the chosen CPLEX tolerance. Given that the FFD packings for this distribution are typically within 1 bin of optimal, the initial error gets closer to the tolerance as n increases, and so the code might be more likely to halt prematurely as n increases. This hypothesis unfortunately is unsupported by our data. For these instances, the smallest knapsack solutions that exceed 1 also exceed 1 + 10~4, and our pattern-based codes typically get the same solution value and same number of iterations whether e is set to 10 ~4 or 10~9. Moreover the solutions appear typically to be the true (infinite precision) optima. This was confirmed in limited tests with an infinite-precision Gilmore-Gomory implementation that combines our code with (1) the exact LP solver of Applegate and Still [1] (a research prototype that stores all numbers as rationals with arbitrary precision numerators and denominators) and (2) an exact dynamic programming knapsack code. Thus precision does not appear to be an issue, although for safety we set e = 10~9 in all our subsequent experiments. We suspect that the reduction in iterations as n increases is actually explained by the number of initial
patterns provided by the FFD packing. As reported in Table 2, when n = 600,000 the FFD supplied patterns are almost precisely what is needed for the final LP - only a few iterations are needed to complete the set. However, for small n far fewer patterns are generated. This means that more iterations are needed in order to generate the full set of patterns needed for the final LP. This phenomenon is enough to counterbalance the fact that for the smallest n we get fewer item sizes and hence smaller LP's. The latter effect dominates behavior for distributions where FFD is not so effective, as shown in the lower part of Table 2, which covers the bounded probability sampled distribution BS{1, 6000, 10000, 400}. Here the number of excess FFD bins, although small, appears to grow linearly with n, and the total PDP running time is essentially independent of n, except for the smallest value, where less than 60% of the sizes are present.

6.3 How Performance Scales. In the previous section we considered how performance scales with n. Our next set of experiments addressed the question of how performance scales with m and B, together and separately. Since we are interested mainly in trends, we typically tested just one instance for each combination of distribution, m, and B, but this was enough to support several general conclusions.

Tables 3 and 4 address the case in which both m and B are growing. (B must grow if m is to grow arbitrarily.) Table 3 covers the discrete uniform distributions U{1, 200h, 500h} for h = 1, 2, 4, 8, 16, 32. In light of the discussion in the last section, we chose a small fixed ratio of n to m (n = 2m) so as to filter out the effect of n and obtain instances yielding nontrivial numbers of iterations for PDP and PBB. Had we chosen n = 1000m, we could have solved much larger instances. For example, with this choice of n, PBB finds an optimal solution to an instance of U{1, 51200, 128000} in just 36 iterations and 303 seconds.

For the instances covered by Table 3, the rounded down LP solution was always an optimal packing, as indeed was the FFD packing used to generate the initial set of patterns. In fact, the FFD solution always equaled the size bound ⌈(Σ_{a∈L} s(a))/B⌉, so one could have concluded that the FFD packing was optimal without these computations. Nevertheless, it is interesting to observe how the running times for the LP-based codes scale, since, as remarked above, there are U{1,j,k} distributions for which FFD's expected excess grows linearly with n, and for these the LP-based algorithms would find better packings. The times reported for PDP are roughly consistent with the combinatorial counts. The number of arithmetic operations needed for solving
the knapsack problems using our dynamic programming code grows as Θ(mB) (and so greater increases here suggest that memory hierarchy effects are beginning to have an impact). The time for solving an LP might be expected to grow roughly as the number of columns (patterns) times the number of pivots. Using "iterations" as a reasonable surrogate for the number of patterns, we get a prediction of how the overall time for PDP should grow. Note that both iterations and pivots per LP are growing superlinearly, and so we would expect greater-than-cubic overall time, which is what we see (the times reported in successive rows go up by more than a factor of 8). The growth rate is still less than n^4, however. PBB is here faster than PDP since the knapsack time is comparatively negligible, although its advantage over PDP is limited by the fact that LP time has become the dominant factor by the time B = 16,000.

It is also worth noting that for an individual instance the number of pivots per LP can be highly variable, as illustrated in Figure 1. The difficulties of the LP's can also vary significantly between PBB and PDP, whose paths may diverge because of ties for the best knapsack solution. For the instance depicted in Figure 1 the average number of pivots under PBB was 18% lower than that for PDP, although the same irregularity manifested itself. The extremely high numbers of pivots for some of the LP's in the PDP run suggest that the danger of runaway LP-time cannot be ignored, no matter what our average-case projections say. FLO's running times are again not competitive, and in any case its much larger memory requirement rules out applying it to the largest instances.

Table 4 contains analogous results for bounded probability distributions in which the sizes sampled must lie in the intervals (0, B/2), (B/6, B/2), or (B/4, B/2). Once again, overall running times grow at a rate somewhere between n^3 and n^4 and LP time dominates dynamic programming time for the largest values of B. For the last set of distributions, however, LP time is exceeded by branch-and-bound knapsack solution time, which gets worse as the lower bound on the size interval increases. Indeed, for the (B/4, B/2) set of distributions, the time per branch-and-bound knapsack solution closely tracks the time needed for full exhaustive search, i.e., Θ(m^3) in this case, and PBB is slower than FLO for m as large as 16,000.

Another difference between the last two sets of distributions and the first lies in the "excess" of the rounded-down packing, i.e., the difference between the number of bins contained in that packing and the LP solution value. The first set of distributions behaves much
like the discrete uniform distributions it resembles, with typical excesses of less than one. For the latter two, the excesses grow with m, although they are typically between 3 and 4% of m, far below the proven upper bound of m itself. It remains to be seen whether the true optimum number of bins is closer to the LP lower bound or to the rounded-down upper bound.

Tables 5 and 6 cover experiments in which m was held fixed and B was allowed to grow. Here growth in dynamic programming time is expected, but note that branch-and-bound knapsack time also increases, perhaps because as B increases there are fewer ties and so more possibilities must be explored. Iterations also increase (perhaps because greater precision is now needed for an optimal solution), although pivots and seconds per LP remain relatively stable once a moderate value of B has been attained.

Table 7 shows the effect of increasing m while holding B fixed. Once again LP time eventually dominates dynamic programming time. In the (B/4, B/2) case, FLO time again comes to dominate PBB time, and is even gaining on PDP as m approaches its maximum possible value, but it is not clear that we will ever find a situation where it beats the latter. PBB does have one surprising advantage over PDP in the (B/4, B/2) case. As indicated in Table 8, the patterns generated by branch-and-bound knapsack solutions seem to be better in the context of the overall algorithm. PDP needs both more iterations and more pivots per iteration than does PBB. This doesn't hold for all distributions, but was seen often enough in our experiments to be suggestive. Table 9 provides more detailed information for the (B/6, B/2) case, illustrating the high variability in the branch-and-bound times, which not only can vary widely for the same value of m, but can actually decline as m increases. Figure 2 charts the evolution of LP time and branch-and-bound knapsack time during the run for one of the more troublesome instances. Note that here LP time is relatively well-behaved (in contrast to the situation charted in Figure 1), while branch-and-bound time now can vary widely depending on the stage of the overall computation.

6.4 Grouping. See Table 10. Here is a major surprise: For instances with n < 10,000 and m = 1,600, grouping not only yields running times that are orders of magnitude faster than those for the basic Gilmore-Gomory (g = 1) procedure, it also provides better packings. This is presumably because for this value of n and these values of g, the savings due to having far fewer patterns (and hence far fewer fractional patterns to round down) can outweigh the cost of having to separately pack the g largest items (which FFD does
fairly efficiently anyway). Even for n = 1,000,000, where g = 1 is now dominant in terms of solution quality, very good results can be obtained in very little time if n/g ∈ {100, 200}. Similar results hold for m = 3,200.
6.5 Zipf's Law Distributions. We do not have space here to present our results for ZS distributions, except to note that although they typically yielded similar behavior to that for the corresponding BS distributions, a few ZS instances caused more dramatic running time explosions than we have seen so far. In particular, for a million-item ZS{1667,4999,10000,2200} instance, the first 40 iterations of PBB (out of 7802) averaged over 24 minutes per knapsack solution and took roughly 60% of the total time.

6.6 Directions for Future Research. These preliminary results are based on straightforward implementations of the algorithmic components. Presumably we can improve performance by improving those components. One way to attack LP time, the major asymptotic bottleneck, would be to identify and remove unnecessary columns from the later LP's, rather than let the LP size grow linearly with iteration count. There are also more sophisticated knapsack algorithms to try, such as those of [18, 19]. Even a simple improvement to the dynamic programming code such as identifying and removing "dominated" items can have a major effect, and can be implemented by a relatively minor change in the inner loop of the code. Preliminary experiments suggest that this idea can often reduce dynamic programming time by a factor of 3 or more, as we shall illustrate in the full paper.
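To make the dominated-item idea concrete, the following sketch shows a self-contained pricing routine of the kind used in Gilmore-Gomory column generation. It is only an illustration, not the code evaluated above: the function name price_pattern and its interface are invented for this example, and the pricing problem is treated as an unbounded knapsack over the distinct sizes (the real codes also respect per-size demands).

def price_pattern(sizes, duals, B):
    """Return (value, pattern) maximizing total dual value of a pattern
    whose total size is at most B (unbounded-knapsack pricing)."""
    m = len(sizes)
    # Dominated-item filter: scanning items by increasing size, keep an item
    # only if its dual value beats every item of smaller or equal size kept
    # so far; dominated items can never improve an unbounded-knapsack pattern.
    order = sorted(range(m), key=lambda i: (sizes[i], -duals[i]))
    kept, best = [], float("-inf")
    for i in order:
        if duals[i] > best:
            kept.append(i)
            best = duals[i]
    # Dynamic program over capacities 0..B (the Theta(m*B) inner loop).
    value = [0.0] * (B + 1)
    back = [None] * (B + 1)          # last item added to reach this capacity
    for c in range(1, B + 1):
        for i in kept:
            if sizes[i] <= c and value[c - sizes[i]] + duals[i] > value[c]:
                value[c] = value[c - sizes[i]] + duals[i]
                back[c] = i
    # Recover the pattern (item index -> count) by backtracking.
    pattern, c = {}, B
    while c > 0 and back[c] is not None:
        i = back[c]
        pattern[i] = pattern.get(i, 0) + 1
        c -= sizes[i]
    return value[B], pattern

In a column-generation loop, duals would be the current LP dual values for the item sizes, and a returned pattern with value greater than 1 corresponds to a column with negative reduced cost. Dropping dominated sizes shrinks the inner loop over items, which is where the kind of factor-of-3 savings mentioned above would come from.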
References
[1] D. L. Applegate and C. Still. Personal communication, 2002.
[2] V. Chvatal. The cutting-stock problem. In Linear Programming, pages 195-212. W. H. Freeman and Company, New York, 1983.
[3] E. G. Coffman, Jr., C. Courcoubetis, M. R. Garey, D. S. Johnson, L. A. McGeoch, P. W. Shor, R. R. Weber, and M. Yannakakis. Fundamental discrepancies between average-case analyses under discrete and continuous distributions. In Proceedings 23rd Annual ACM Symposium on Theory of Computing, pages 230-240, New York, 1991. ACM Press.
[4] E. G. Coffman, Jr., C. Courcoubetis, M. R. Garey, D. S. Johnson, P. W. Shor, R. R. Weber, and M. Yannakakis. Bin packing with discrete item sizes, Part I: Perfect packing theorems and the average case behavior of optimal packings. SIAM J. Disc. Math., 13:384-402, 2000.
[5] E. G. Coffman, Jr., D. S. Johnson, L. A. McGeoch, P. W. Shor, and R. R. Weber. Bin packing with discrete
item sizes, Part III: Average case behavior of FFD and BFD. (In preparation.)
[6] J. Csirik, D. S. Johnson, C. Kenyon, J. B. Orlin, P. W. Shor, and R. R. Weber. On the sum-of-squares algorithm for bin packing. In Proceedings of the 32nd Annual ACM Symposium on the Theory of Computing, pages 208-217, New York, 2000. ACM.
[7] J. Csirik, D. S. Johnson, C. Kenyon, P. W. Shor, and R. R. Weber. A self organizing bin packing heuristic. In M. Goodrich and C. C. McGeoch, editors, Proceedings 1999 Workshop on Algorithm Engineering and Experimentation, pages 246-265, Berlin, 1999. Lecture Notes in Computer Science 1619, Springer-Verlag.
[8] J. M. Valerio de Carvalho. Exact solutions of bin-packing problems using column generation and branch and bound. Annals of Operations Research, 86:629-659, 1999.
[9] J. M. Valerio de Carvalho. LP models for bin packing and cutting stock problems. European Journal of Operational Research, 141(2):253-273, 2002.
[10] Z. Degraeve and M. Peeters. Optimal integer solutions to industrial cutting stock problems: Part 2: Benchmark results. INFORMS J. Comput., 2002. (To appear.)
[11] Z. Degraeve and L. Schrage. Optimal integer solutions to industrial cutting stock problems. INFORMS J. Comput., 11(4):406-419, 1999.
[12] W. Fernandez de la Vega and G. S. Lueker. Bin packing can be solved within 1+ε in linear time. Combinatorica, 1:349-355, 1981.
[13] T. Gau and G. Wascher. CUTGEN1: A problem generator for the standard one-dimensional cutting stock problem. European J. of Oper. Res., 84:572-579, 1995.
[14] P. C. Gilmore and R. E. Gomory. A linear programming approach to the cutting stock problem. Oper. Res., 9:849-859, 1961.
[15] P. C. Gilmore and R. E. Gomory. A linear programming approach to the cutting stock problem — Part II. Oper. Res., 11:863-888, 1963.
[16] O. Marcotte. The cutting stock problem and integer rounding. Math. Programming, 33:82-92, 1985.
[17] O. Marcotte. An instance of the cutting stock problem for which the rounding property does not hold. Oper. Res. Lett., 4:239-243, 1986.
[18] S. Martello and P. Toth. Knapsack Problems. John Wiley & Sons, Chichester, 1990.
[19] D. Pisinger. A minimal algorithm for the bounded knapsack problem. INFORMS J. Computing, 12:75-84, 2000.
[20] G. Scheithauer and J. Terno. Theoretical investigations on the modified integer round-up property for the one-dimensional cutting stock problem. Oper. Res. Lett., 20:93-100, 1997.
[21] G. Wascher and T. Gau. Heuristics for the integer one-dimensional cutting stock problem: A computational study. OR Spektrum, 18:131-144, 1996.
                 PDP                                FLO
   n        30≤j≤39   60≤j≤69   90≤j≤99      30≤j≤39   60≤j≤69   90≤j≤99
   10^2       .040      .082      .058         .144      .182      .173
   10^3       .033      .044      .050         .150      .206      .300
   10^4       .030      .034      .041         .150      .206      .342
   10^5       .029      .032      .035         .146      .203      .367
   10^6       .029      .032      .034         .147      .204      .356
Table 1: Average running times in seconds for discrete uniform distributions U{1, j, 100} as a function of j and n. Averages are taken over 5 samples for each value of j and n. Results for PBB are similar to those for PDP. Packings under all three approaches were almost always optimal.
U{1,600,1000}
n 600 1,897 6,000 18,974 60,000 189,737 600,000
Ave# sizes 374.7 573.7 600.0 600.0 600.0 600.0 600.0
Iters 730.7 599.7 157.0 77.3 46.7 25.0 7.3
#Pat FFD Final 170.3 901.0 394.0 993.7 633.7 790.7 800.0 877.3 881.3 928.0 885.7 910.7 909.0 916.3
Pivots /iter 18.7 18.7 34.1 51.0 58.7 18.8 12.5
Ave Sees LP KNP .01 .01 .02 .02 .03 .02 .04 .02 .04 .02 .01 .02 .00 .02
Tot Sees 17.4 19.6 7.2 4.3 2.6 .7 .2
Opt Val 180.9 571.3 1797.3 5686.5 18058.4 56941.8 180328.2
FFD Excess .8 .7 .7 .5 .6 .6 .5
Tot Sees 80 201 202 194 199 193 204
Opt Val 117 404 1177 3645 11727 38977 117241
FFD Excess 1 1 3 6 21 93 197
BS{1,6000,10000,400} n 400 1,264 4,000 12,649 40,000 126,491 400,000
Ave# sizes 231.0 355.3 394.7 400.0 400.0 400.0 400.0
Iters 904.3 1303.0 1069.0 994.3 989.0 998.0 1014.7
#Pat FFD Final 110.3 1014.7 259.7 1562.7 442.3 1511.3 519.7 1514.0 561.3 1550.3 565.7 1563.7 576.7 1591.3
Pivots /iter 27.9 45.0 55.4 57.8 58.0 58.0 58.7
Ave Sees LP KNP .02 .07 .05 .11 .06 .12 .07 .12 .07 .13 .07 .12 .07 .13
Table 2: Effect of increasing N (by factors of roughly √10) on PDP, averaged over three samples for each value of N. For the BS table, three distinct distributions were chosen and we generated one sample for each distribution and each value of N.
U{1, 200h, 500h}, h = 1, 2, 4, ..., 32, n = 2m
m 200 400 800 1600 3200 6400
B 500 1000 2000 4000 8000 16000
Iters 175 440 1011 2055 4667 10192
Pivots /iter 2.9 4.9 10.9 24.0 57.0 202.8
Avg knp sees PDP PBB .00 .00 .01 .00 .04 .02 .20 .00 .91 .01 3.78 .02
AveLP sees .00 .00 .01 .07 .52 4.21
PDP .8 6.7 57.5 565.6 6669.1 81497.7
Total sees PBB FLO .3 13 2.0 156 28.8 2167 194.4 38285 2415.3 — 41088.6 —
Table 3: Scaling behavior for LP-based codes. The number of distinct sizes in the instances was 87±1% of m and the number of initial patterns was 39±1% of m. Unless otherwise specified, entries are for PDP.
U{1,6400,16000}, n = 128,000
Figure 1: Number of Pivots for successive LP's under PDP plotted on a log scale. Over 15% of the total PDP running time was spent solving the 5 hardest LP's (out of 10,192). The possibility of such aberrant behavior means that the asymptotic running time projections derived from our data are unlikely to be worst-case guarantees.
BS{1, ⌈625k/2⌉−1, 625k, 100k}, k = 1, 2, 4, 8, 16, 32, n = 1,000,000
m 100 200 400 800 1600 3200
B 625 1250 2500 5000 10000 20000
Iters 144 238 502 1044 2154 4617
Pivots /iter 9.7 17.9 32.2 69.3 166.0 385.4
Average LP sees .00 .01 .03 .14 1.11 10.39
Avg knp sees PDP PBB .00 .00 .01 .00 .03 .00 .12 .00 .64 .01 2.72 .01
PDP 1 4 30 281 3781 60530
PDP Total sees PBB FLO Excess .5 1 12 .6 2 145 .8 16 2353 .3 143 48620 .2 2898 — 38124 — 1.2
BS{⌊625k/6⌋+1, ⌈625k/2⌉−1, 625k, 100k}, k = 1, 2, 4, 8, 16, 32, n = 1,000,000
m 100 200 400 800 1600 3200
B 625 1250 2500 5000 10000 20000
Iters 184 375 840 1705 3730 7845
Pivots /iter 10.4 21.2 46.9 95.8 214.3 478.5
Average LP sees .00 .01 .05 .23 1.48 10.76
Avg knp sees PDP PBB .00 .00 .01 .00 .03 .04 .12 .51 .53 .46 2.34 5.02
PDP 1 7 60 597 7527 102730
Total sees PBB 1 4 63 1092 5847 92778
PDP FLO Excess 4 2.6 41 4.8 9.6 404 4523 18.0 37.7 — 76.2 —
BS{⌊625k/4⌋+1, ⌈625k/2⌉−1, 625k, 100k}, k = 1, 2, 4, 8, 16, 32, n = 1,000,000
m 100 200 400 800 1600 3200
B 625 1250 2500 5000 10000 20000
Iters 116 427 704 1422 3055 6957
Pivots /iter 5.3 17.8 29.3 52.5 119.8 265.9
Average LP sees .00 .01 .02 .08 .61 3.59
Avg knp sees PDP PBB .00 .00 .01 .03 .02 .17 .11 1.08 .47 8.07 2.16 67.73
Total sees PDP PBB FLO 0 1 2 7 14 12 33 107 101 274 1299 800 3314 20123 19415 40001 345830 —
PDP Excess 3.0 11.2 16.5 30.0 57.2 128.5
Table 4: Scaling behavior for bounded probability sampled distributions. Unless otherwise specified, entries are for PDP. Note that the amount by which the PDP packing exceeds the LP bound does not grow significantly with m in the first case, and is no more than 3% or 4% of m in the latter two.
BS{1, 625h−1, 1250h, 200}, n = 1,000,000
h I
2 4 8 16 32 64 128 256 512 1024
B 1,250 2,500 5,000 10,000 20,000 40,000 80,000 160,000 320,000 640,000 1,280,000
Iters 220 320 510 444 600 736 776 976 977 1081 1267
Pivots /iter 14.4 21.5 25.4 26.0 29.0 28.0 29.9 29.4 27.5 28.7 32.3
AveLP sees .01 .01 .02 .02 .02 .03 .05 .04 .03 .03 .04
Ave knp sees PDP PBB .01 .00 .02 .00 .03 .00 .06 .00 .13 .00 .28 .01 .58 .01 1.16 .03 2.88 .17 11.06 .19 41.49 .67
Total sees PDP PBB FLO 3 3 12 9 5 145 24 9 2353 36 10 — 93 14 — 229 23 — 485 27 — 1170 57 — 2834 205 — 11970 231 — 52532 894 —
PDP Excess .4 .5 .8 .3 .3 .4 1.0 .6 .8 .5 .8
Table 5: Effect of increasing the bin size B while m and N are fixed and other parameters remain proportionately the same (one instance for each value of h). Unless otherwise specified, column entries refer to PDP. The "Excess" values for all three algorithms are roughly the same.
BS{1250h + 1, 2500h – 1, 5000h, 1000}, n = 1,000,000
h 1 2 4 8 16 32
B 5,000 10,000 20,000 40,000 80,000 160,000
Iters 1778 2038 2299 2617 2985 3195
Pivots /iter 70.7 69.4 75.5 74.6 80.6 71.5
AveLP sees .14 .15 .19 .19 .23 .21
Ave knp sees PDP PBB .12 1.43 .34 1.80 .65 1.95 1.35 2.13 2.79 2.28 5.65 2.44
PDP 470 1002 1925 4044 9019 18732
Total sees PBB FLO 2100 1212 3346 3050 4108 8957 5243 29187 6383 — 7723 —
PDP Excess 36.9 37.8 37.4 38.5 38.9 32.8
Table 6: Effect of increasing the bin size B while m and N are fixed and other parameters remain proportionately the same (one instance for each value of h). Unless otherwise specified, the entries are for PDP, which has roughly the same number of pivots as PBB but averages about 18% more iterations than PBB (the ratio declining as B increases). The "Excess" values for all three algorithms are roughly the same.
BS{1,4999,10000, m}, n = 1,000,000 m 100 200 400 800 1600 3200
Iters 411 453 843 1454 2326 2166
Pivots /iter 12.0 29.3 56.2 98.3 157.4 212.6
Average LP sees .01 .02 .07 .30 1.65 4.66
Avg knp sees PDP PBB .03 .01 .06 .00 .13 .00 .29 .00 .63 .01 1.28 .01
PDP 15 37 169 859 5308 12872
PDP Total sees PBB FLO Excess 5 27131 1.2 .2 9 47749 .8 65 — .6 368 — 2443 — 1.2 .6 5684 —
BS{5001,9999,20000, m}, n = 1,000,000 m 100 200 400 800 1600 3200
Iters 203 481 1121 1864 3586 6957
Pivots /iter 6.3 13.3 27.6 57.2 116.2 265.9
Average LP sees .00 .01 .03 .11 .56 3.59
Avg knp sees PDP PBB .05 .00 .10 .02 .22 .19 .50 1.18 1.08 9.08 2.16 67.73
Total sees PDP PBB FLO 11 1 562 53 12 1197 281 219 2879 1131 2017 6745 5878 26819 19415 40001 345830 —
PDP Excess 3.7 6.3 13.8 31.5 63.9 128.5
Table 7: Effect of increasing the number of item sizes m while keeping the bin size fixed. Unless otherwise specified, the entries are for PDP.
BS{1, 4999, 10000, m}   Iters   Pivots/iter   LP secs
BS{5001, 9999, 20000, m} Pivots /iter LP sees Iters
100 200 400 800 1600 3200
.95 1.05 .98 1.04 1.01 .97
.99 1.01 1.00 1.01 .99 1.05
1.00 1.00 1.00 1.15 .95 1.00
1.16 1.15 1.11 1.18 1.28 1.42
1.15 1.14 1.05 1.07 .98 .96
1.00 1.00 1.50 1.10 1.10 1.18
Average
1.00
1.01
1.02
1.22
1.06
1.15
m
Table 8: Ratios of statistics for PDP to those for PBB. Note that for the BS{5001, 9999, 20000, m} distributions, the dynamic programming knapsack solver seems to be generating worse patterns than the branch-and-bound knapsack solver, leading to consistently more iterations and usually more pivots per iteration. This appears to be typical for BS{h, j, B, m} distributions when h is sufficiently greater than 0.
BS{1667, 4999, 10000, m}   Iters
Ave Pivots
m
poo FBB
PDF pgg
ppo PBB
PDF pg^
100 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400
262 659 1147 1477 1864 2352 2618 2902 3146 3433 3903 4221 4242
1.03 1.06 1.04 1.11 1.12 1.08 1.11 1.14 1.19 1.19 1.17 1.15 1.23
10.9 20.4 49.9 71.9 101.4 124.3 136.5 184.8 212.1 234.8 260.7 291.1 331.4
.98 1.15 1.00 1.00 .99 1.01 1.02 1.01 1.01 .98 1.08 1.02 1.03
Ave LP sees poo PDF PBB pgg .00 .01 .06 .13 .26 .44 .59 1.01 1.40 1.83 2.44 3.16 3.95
Avg knp sees PDP PBB
1.00 1.00 1.00 1.00 1.19 1.16 1.25 1.27 1.06 1.10 1.18 1.08 1.08
.03 .05 .11 .17 .24 .31 .38 .46 .53 .62 .70 .75 .81
.00 .01 .01 .02 .03 .75 2.67 .21 .46 .46 1.98 2.76 1.01
PDP 8 48 205 500 1138 2109 3240 5719 7527 10810 16437 20240 26482
Total sees PBB FLO 2 15 82 223 537 2805 8534 3567 5847 7886 17259 24973 21025
644 10833 58127 — — — — — — — — — —
PDP Excess 1.8 4.6 9.8 14.5 20.0 24.2 29.0 33.4 37.7 43.3 47.3 51.5 54.7
Table 9: Results for bounded probability distributions and n = 1,000,000 (one sample for each value of m). Excesses for PBB and the flow-based code (where available) were much the same.
Figure 2: LP and Branch-and-Bound running times (plotted on a log scale) for successive iterations when PBB is applied to a million-item instance of type BS{1667, 4999, 10000, m}. Note that LP times here are much more well-behaved than they were for the instance of U{1,6400,16000} covered in Figure 1. Now the potential for running time blow-up comes from the knapsack code, whose running time for U{1,6400,16000} was negligible. The situation can be even worse for Zipf Law distributions.
g 1 5 10 20
1 25 50 100 200
1 250 500 1000 2000
1 2500 5000 10000 20000
BS{1667, 4999, 10000, 1600}   #Sizes   Total Secs   Percent PDP Packing Excess   Percent LB Shortfall
1600 200 100 50 FFD 1600 400 200 100 50 FFD 1600 400 200 100 50 FFD 1600 400 200 100 50 FFD BF
ss
853 39 7 2 .02
7185 179 46 10 2 .2 7627 195 46 9 1 .2 7527 197 48 9 1 .2 134 305
g
BS{5001, 9999, 20000, 1600}   #Sizes   Total Secs   Percent PDP Packing Excess   Percent LB Shortfall
4.580 1.675 1.094 1.385 5.161
n = 1,000 1 .000 50 .531 .978 100 1.879 200 .150
1600 200 100 50 FFD
1018 55 13 2 .03
5.963 2.318 1.277 1.277 7.265
.000 .471 .923 1.744 3.147
.964 .392 .362 .542 1.084 4.908
n = 10, 000 .000 1 25 .256 50 .508 1.004 100 1.949 200 .000
1600 400 200 100 50 FFD
5681 269 57 10 2 .2
1.471 .440 .518 .569 1.085 7.451
.000 .222 .433 .813 1.556 3.506
.119 .137 .249 .516 1.108 5.044
n = 100, 000 1 .000 .253 250 500 .505 .982 1000 2000 1.888 .000
1600 400 200 100 50 FFD
6117 265 57 11 2 .2
.162 .136 .229 .460 .985 7.384
.000 .218 .432 .830 1.589 3.565
.011 .126 .250 .529 1.123 5.008 6.320 .478
n = l,000,000 1 .000 .251 2500 .502 5000 .976 10000 1.780 20000 .000 -
1600 400 200 100 50 FFD BF SS
5878 266 63 13 2 .2 271 518
.017 .110 .216 .444 .954 7.310 8.909 4.341
.000 .217 .435 .856 1.644 3.142 -
Table 10: Results for grouping. The "Percent Packing Excess" is the percent by which the number of bins used in the rounded down packing exceeds the LP lower bound, and the "Percent LB Shortfall" is the percent by which the fractional solution for the grouped instance falls short of LP(L). For comparison purposes, we include results for an O(m²) implementation of FFD and, in the case of n = 1,000,000, for O(nB) implementations of the online algorithms Best Fit (BF) and Sum-of-Squares (SS). The "Shortfall" entry for FFD gives the percent gap between LP(L) and the size-based lower bound.
The Markov Chain Simulation Method for Generating Connected Power Law Random Graphs*
Christos Gkantsidis†   Milena Mihail‡   Ellen Zegura§
*The first and second authors were funded by NSF ITR0220343; the third author was funded by NSF ANI-0081557. This work was also funded by a Georgia Tech Edenfield Faculty Fellowship.
†College of Computing, Georgia Institute of Technology, Atlanta, GA. email: [email protected]
‡College of Computing, Georgia Institute of Technology, Atlanta, GA. email: [email protected]
§College of Computing, Georgia Institute of Technology, Atlanta, GA. email: [email protected]

Abstract
Graph models for real-world complex networks such as the Internet, the WWW and biological networks are necessary for analytic and simulation-based studies of network protocols, algorithms, engineering and evolution. To date, all available data for such networks suggest heavy tailed statistics, most notably on the degrees of the underlying graphs. A practical way to generate network topologies that meet the observed data is the following degree-driven approach: First predict the degrees of the graph by extrapolation from the available data, and then construct a graph meeting the degree sequence and additional constraints, such as connectivity and randomness. Within the networking community, this is currently accepted as the most successful approach for modeling the inter-domain topology of the Internet. In this paper we propose a Markov chain simulation approach for generating a random connected graph with a given degree sequence. We introduce a novel heuristic to speed up the simulation of the Markov chain. We use metrics reminiscent of quality of service and congestion to evaluate the output graphs. We report experiments on degree sequences corresponding to real Internet topologies. All experimental results indicate that our method is efficient in practice, and superior to a previously used heuristic.

1 Introduction
There has been a recent surge of interest in complex real-world networks. These include the WWW [25, 33, 6, 9, 14, 27, 26], where a node corresponds to a Web page and there is an edge between two nodes if there is a hyperlink
between the corresponding pages, the Internet at the level of Autonomous Systems (a.k.a. inter-domain level) [16, 24, 29, 34, 10, 11, 36, 4], where a node corresponds to a distinct routing administration domain (such as a University, a corporation, or an ISP) and an edge represents direct exchange of traffic between the corresponding domains, and biological networks [20], where nodes correspond to genetic or metabolic building blocks (such as genes and proteins) and edges represent direct interactions between these blocks. Obtaining accurate graph models for such real-world networks is necessary for a variety of simulation-based studies. A very robust and persistent characteristic of complex networks, including the WWW, the Internet and biological networks, is that, while the average degree remains constant as the number of nodes has grown by at least one order of magnitude, there is no sharp concentration around the average degree and there are several vertices with very large degrees. Formally, the degree sequence follows heavy tailed statistics in the following sense: (a) The ith largest degree of the graph is proportional to i^{-α}, with α approaching 1 from below. (b) The frequency of the ith smallest degree of the graph is proportional to i^{-β}, with β approaching 3 from below (see [16] for detailed Internet measurements, see [6, 14, 27, 26] for WWW measurements). This is a sharp departure from the Erdos-Renyi random graph model where the degrees are exponentially distributed around the mean. Consequently, several papers have proposed plausible graph models, based on the notion of "preferential attachment" [6, 8, 26, 3, 13] and on the notion of multiobjective optimization [15, 4], for explaining this phenomenon. Despite the elegant principles of the above approaches, none of them predicts accurately all the observed measurements. In fact, none of these approaches attempts to explain the heavy tailed statistics on the high-end and the low-end of the degrees, (a) and (b) above, simultaneously, and there is further evidence that (a) and (b) cannot be captured by a single evolutionary principle ([1] argues that a Pareto distribution should result in β ≈ 1 + 1/α, which is not the case for the observed values of the parameters α and β mentioned above). On the other hand, graph models for complex
networks are often expected to pass strict performance requirements. For example, the networking community uses such graph models to simulate a wide range of network protocols [40, 16, 24, 30, 29, 34, 10, 11, 36, 4], and hence the accuracy of the underlying topology model is considered very important. Therefore, the following alternative degree-driven approach for generating network topology models has been adopted. First predict the degrees of the graph to be generated by extrapolation from available data, for example, according to (a) and (b) above, and then generate a graph that satisfies the target degree sequence and additional constraints, the first and most natural of which is connectivity. It has also been observed that connected graphs that satisfy the degree sequence and some further "randomness property" are good fits for real Internet topologies [36] (albeit, the "randomness property" is not quantified in [36]).

In the theory community the above degree-driven approach was first formalized in [2, 12], which especially addressed the connectivity issue by isolating ranges of the parameter β for which the resulting random graph has a giant connected component. In particular, for a target degree sequence d_1 ≥ d_2 ≥ ... ≥ d_n over vertices v_i, 1 ≤ i ≤ n, where d_i is the i-th largest degree and v_i is the vertex of degree d_i, [2] proposed to consider D = Σ_i d_i vertices by expanding each vertex v_i into d_i vertices, construct a random perfect matching of size D/2 over the D vertices, and consider a graph on the initial n vertices in the natural way: v_i is connected to v_j if and only if, in the random perfect matching, at least one of the d_i vertices that correspond to v_i is connected to one of the d_j vertices that correspond to v_j. [2] further proposed to eliminate self-loops and parallel edges, and consider the largest component of the resulting graph. The advantages of this approach are its implementational efficiency, and the guarantee of uniform sampling. However, the approach also has two drawbacks: it does not produce a graph that matches the degree sequence exactly, and the method gives small components of size Θ(log n). There is no known performance guarantee concerning how accurately the method of [2] approximates the target degree sequence.

In the networking community the same degree-driven approach is typified by the Inet topology generator [24], which is currently the method of choice. The implementation of Inet uses the following heuristic: it first predicts a degree sequence by using d_i proportional to i^{-α} for the highest 1% of the degrees, and frequency of the ith smallest degree proportional to i^{-β} for the remaining 99% of the vertices. It then constructs a connected graph that meets a predicted degree sequence by placing a spanning tree to guarantee connectivity, and tries to match
the remaining degrees "as much as possible" using a preferential connectivity heuristic. Again, there is no known performance guarantee on how well the method of [24] approximates the target degree sequence, or to what extent its graph approximates a graph sampled uniformly at random from the target degree sequence. In this paper we propose a Markov chain simulation approach for generating a random connected graph with a given degree sequence. In Section 2 we review the necessary graph theory to obtain an initial connected realization of the degree sequence. We point out that the underlying theory allows great flexibility in the produced output. In Section 3 we point out a Markov chain on the state space of all connected realizations of the target degree sequence. We note that, even though similar Markov chains were considered before without the connectivity requirement, the additional connectivity requirement needs a non-trivial theorem of [37] to result in a connected state space. This Markov chain requires a connectivity test in every simulation step. In Section 4 we introduce a novel speed up of the Markov chain which saves greatly on connectivity tests. For example, we can simulate 1 million steps of the speed-up process in the same time as a few thousand steps of the original process. Section 5 contains experimental results. We use metrics reminiscent of quality of service and congestion to evaluate the output graphs. We report experiments on degree sequences corresponding to real Internet topologies. All experimental results indicate that our method is efficient in practice, and superior to a previously used heuristic.
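As a concrete illustration of the kind of heavy-tailed degree sequences the degree-driven approach starts from, the following sketch draws a sequence whose degree frequencies decay as a discrete power law with exponent β, in the spirit of property (b) above. It is not the Inet prediction procedure; the function name power_law_degrees, the default exponent, and the parity patch are choices made only for this example. A sequence produced this way still has to be checked for realizability and connectivity, which is exactly what the constructions of the following sections address.

import random

def power_law_degrees(n, beta=2.5, d_min=1, d_max=None, seed=0):
    """Draw n degrees whose frequencies decay roughly as d**(-beta)."""
    rng = random.Random(seed)
    if d_max is None:
        d_max = n - 1
    support = list(range(d_min, d_max + 1))
    weights = [d ** (-beta) for d in support]
    degrees = rng.choices(support, weights=weights, k=n)
    if sum(degrees) % 2 == 1:        # a graph needs an even degree sum
        degrees[0] += 1
    return sorted(degrees, reverse=True)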
2 Markov Chain Initialization: Erdos-Gallai Conditions and the Havel-Hakimi Algorithm
In this Section we address the problem of constructing a connected graph that satisfies a given target degree sequence, if such a graph exists. We point out that such constructions follow from classical graph theory, and that they allow substantial flexibility in the generated output graph. We will use these constructions as initial states of the Markov chains of Sections 3 and 4. (In addition, these fundamental theoretical primitives can replace all ad-hoc heuristics of the current implementation of Inet [24].) Let n denote the number of nodes of the graph we wish to generate. Let v_i, 1 ≤ i ≤ n, denote the nodes, and let d_1 ≥ d_2 ≥ ... ≥ d_n denote the intended degrees of these nodes. We would like a simple, undirected, connected graph meeting the above degree sequence. A sequence of degrees d_1 ≥ d_2 ≥ ... ≥ d_n is called realizable if and only if there exists a simple graph whose nodes have precisely this sequence of degrees. A straightforward necessary condition for a degree sequence to be realizable is that
for each subset of the k highest degree nodes, the degrees of these nodes can be "absorbed" within the nodes and the outside degrees. Stated formally, for 1 ≤ k ≤ n−1:

(2.1)    Σ_{i=1}^{k} d_i ≤ k(k−1) + Σ_{i=k+1}^{n} min(k, d_i).

A necessary condition for the realization to be connected is that the graph contains a spanning tree, which means that:

Σ_{i=1}^{n} d_i ≥ 2(n−1).

The Erdos-Gallai theorem states that these necessary conditions are also sufficient [7, 32]. The proof is inductive and provides the following construction, known as the Havel-Hakimi algorithm [18, 19]. The algorithm is iterative and maintains the residual degrees of the vertices, where the residual degree is the difference between the current degree and the final degree of the vertex. In each iteration, it picks an arbitrary vertex v and adds edges from v to d_v vertices of highest residual degree, where d_v is the residual degree of v. The residual degrees of the latter d_v vertices are updated appropriately. The significance of connecting with the d_v highest degree vertices is that it ensures that condition (2.1) holds for the residual problem instance.

For example, the algorithm can start by connecting the highest degree vertex v_1 with d_1 other high degree vertices and obtain a residual degree sequence by reducing the degrees of these vertices by one, and repeat the same process until all degrees are satisfied (otherwise output "not realizable"). Alternatively, the algorithm can connect the lowest degree vertex v_n (resp. a randomly chosen vertex v_i) with the d_n (resp. d_i) highest degree vertices, reduce their degrees, and proceed as above.

Clearly the above algorithm runs in n iterations, each iteration invoking the degree of a vertex (and some book-keeping for maintaining residual degrees in sorted order). Thus the running time is very efficient, both in theory and in practice. In addition, since the sequence in which it picks vertices can be chosen, it provides the flexibility alluded to above. For example, when we start with higher degree vertices we get topologies that have very "dense cores", while when we start with low degree vertices we get topologies that have very "sparse cores". For further example, we may start from a highly clustered topology quantified by one or more sparse cuts. The Erdos-Gallai condition (2.1) allows for further flexibility, at the cost of additional tests for condition (2.1), and repeated efforts until condition (2.1) is satisfied. In particular, the d_v vertices can be chosen according to any criterion, provided that, after each iteration, we ensure that condition (2.1) is satisfied by the residual graph (this part was automatic in case maximum degree vertices are chosen). If not, the choice of the d_v vertices needs to be repeated. This observation indicates several ways in which the implementation of [24] can be improved; however, we shall refrain from such discussions since this is not the main focus of this paper.

Next, let us deal with the second requirement of obtaining a connected topology. If the graph constructed as described turns out to be unconnected, then one of the connected components must contain a cycle. Let (u, v) be any edge in a cycle and let (s, t) be an edge in a different connected component. Clearly, the graph does not have edges between the pairs u, s and v, t. By removing the edges (u, v) and (s, t), and inserting the edges (u, s) and (v, t), we merge these two components. Note that the resulting graph still satisfies the given degree sequence. Proceeding in this manner, we can get a connected topology.

3 A Markov Chain on Connected Graphs with Prescribed Degree Sequence
We now turn to the question of generating a random instance from the space of all possible connected graphs that realize a target degree sequence. In experiments, it has been observed that "random" such instances are good fits for several characteristics of complex network topologies [2, 36] (however, all these experiments fall short of guaranteeing that the generated instances are "correct" connected realizations of the target degree sequence).

For any sequence of integers that has a connected realization, consider the following Markov chain. Let G_t be the graph at time t. With probability 0.5, G_{t+1} will be G_t (this is a standard trick to avoid periodicities). With probability 0.5, G_{t+1} is determined by the following experiment. Pick two edges at random, say (u, v) and (x, y), with distinct endpoints. If (u, x) and (v, y) are not edges, then consider the graph G' obtained by removing the edges (u, v) and (x, y) and inserting the edges (u, x) and (v, y). Observe that G' still satisfies the given degree sequence. We further have to check whether G' is a connected graph. If it is connected, then we perform the switching operation and let G_{t+1} be G'. Otherwise we do not perform the switching operation and G_{t+1} remains G_t. It follows from a theorem of Taylor [7, 37] that, using the above switching operation, any connected graph can be transformed to any other connected graph satisfying the same degree sequence (we note that the proof of Taylor's theorem is somewhat more involved than the corresponding fact for realizations without the connectivity constraint; the
latter fact is straightforward). It now follows from standard Markov chain theory [31, 35] that this Markov chain converges to a unique stationary distribution, which is the uniform distribution over the state space of all connected realizations. This is because, by definition, all transitions have the same probability. Thus, if we simulate the Markov chain for an infinite number of steps, it will generate a graph with the given degree sequence uniformly at random. We would be interested in a Markov chain which is arbitrarily close to the uniform distribution after simulating a polynomial number of steps (see [35] for details). Similar questions have been considered elsewhere [35, 22, 21, 23] without the connectivity requirement. In particular, it is known that uniform generation of a simple graph with a given degree sequence d = d_1 ≥ d_2 ≥ ... ≥ d_n reduces to uniform generation of a perfect matching of the following graph M_d [28]: for each 1 ≤ i ≤ n, M_d contains a complete bipartite graph H_i = (L_i, R_i), where |R_i| = n−1 and |L_i| = n−1−d_i. The vertices of R_i are labeled so that there is a label for each 1 ≤ j ≤ n, j ≠ i. (To handle connectivity, one could, given a degree sequence d_1 ≥ d_2 ≥ ... ≥ d_n, introduce an additional vertex with degree n, thus forcing any realization of the new sequence to have the new vertex connected to every other vertex, and hence it is connected and the realizations are one-to-one.) However, no polynomial bound on the mixing time is known for the connected chain we use. We thus have to devise efficient stopping rules. We have used the following rule to decide if a particular run of the Markov chain has converged sufficiently: Consider one or more quantities of interest, and measure these
quantities every T steps. For example, one such quantity could be the diameter. In Section 5 we will consider further quantities that are related to quality of service (and use the average shortest path from a node to every other node as an indicator) and network congestion (and use the number of shortest paths through a node, or link, as an indicator). We may use the criterion of the quantities having converged as a stopping rule. However, the quantities under consideration may not converge, even under uniform sampling. For example, the diameter appears to deviate consistently from its mean (see Figure 1). Therefore, a better heuristic is to estimate the sample average (y_0 + y_T + ... + y_{kT}) / (k + 1), where y_{iT} is the metric under consideration at time iT. This method of sample averages was first considered in [5]. In addition, we will consider two (or more) separate runs of the Markov chain, where the initial points of each run are qualitatively different. For example, we may consider a "dense core" and a "sparse core" starting point, as mentioned in Section 2. We may then use as a stopping rule that the sample averages converge to the same number for the two separate runs of the Markov chain.
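For concreteness, here is a minimal sketch of the two ingredients described in Sections 2 and 3: a Havel-Hakimi style construction of an initial realization, and a single transition of the edge-swap Markov chain with its connectivity test. It is illustrative only; the dict-of-sets representation and function names are ours, the construction may still require the component-merging step of Section 2 if the initial graph is disconnected, and no attempt is made at the efficiency of the actual implementation.

import random
from collections import deque

def havel_hakimi(deg):
    """Connect each vertex, in turn, to the vertices of highest residual
    degree (the classical Havel-Hakimi rule); deg is assumed realizable."""
    n = len(deg)
    residual = list(deg)
    adj = {v: set() for v in range(n)}
    for v in sorted(range(n), key=lambda x: -deg[x]):
        # connect v to its residual[v] highest-residual non-neighbors
        candidates = sorted((u for u in range(n)
                             if u != v and u not in adj[v] and residual[u] > 0),
                            key=lambda u: -residual[u])
        need = residual[v]
        if len(candidates) < need:
            raise ValueError("degree sequence not realizable")
        for u in candidates[:need]:
            adj[v].add(u); adj[u].add(v)
            residual[u] -= 1
        residual[v] = 0
    return adj

def is_connected(adj):
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u); queue.append(u)
    return len(seen) == len(adj)

def mc_step(adj, rng):
    """One transition of the Markov chain of Section 3: with prob. 1/2 stay
    put; otherwise pick two random edges and swap their endpoints if the
    swap keeps the graph simple and connected."""
    if rng.random() < 0.5:
        return
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    if len(edges) < 2:
        return
    (u, v), (x, y) = rng.sample(edges, 2)
    if len({u, v, x, y}) < 4 or x in adj[u] or y in adj[v]:
        return
    adj[u].remove(v); adj[v].remove(u); adj[x].remove(y); adj[y].remove(x)
    adj[u].add(x); adj[x].add(u); adj[v].add(y); adj[y].add(v)
    if not is_connected(adj):            # undo the swap if it disconnects the graph
        adj[u].remove(x); adj[x].remove(u); adj[v].remove(y); adj[y].remove(v)
        adj[u].add(v); adj[v].add(u); adj[x].add(y); adj[y].add(x)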
4 Speed-Up of the Markov Chain Simulation
Notice that the main bottleneck in the implementation of the Markov chain of Section 3 is the connectivity test that needs to be repeated in every step. This connectivity test takes linear time, for example, using DFS. On the other hand, all other operations, namely picking two random edges and performing the swap, take time O(log n) (log n for the random choice, and constant time for the swap). In this Section we describe a process which maintains convergence to the uniform distribution over all connected realizations, but, in practice, performs much fewer connectivity tests. In particular, we consider the following process. Initially we have a connected realization of the target degree sequence, as mentioned in Sections 2 and 3. Let us call this G_0. We will also be maintaining a window W. This will be an estimate of how many steps of the Markov chain we can simulate without a connectivity test, and still have a reasonable probability of having a connected realization G_W after W steps. However, we do not require that every intermediate step between G_0 and G_W is connected. Initially the window is W = 1. The algorithm proceeds in stages, each stage consisting of W simulation steps without a connectivity test. In general, if after W simulation steps we ended in a connected realization, then we will accept this realization as the next state and we will increase the window for the next stage by one: W = W + 1. If after W simulation steps we ended in a non-connected realization, then we will return to the connected realization of the beginning of the current stage and we will decrease the window of the next stage to half its current size: W = ⌈W/2⌉.
Figure 1: The diameter at different stages of the Markov Chain. The left figure depicts the value of the diameter every 10 steps. Observe that the diameter does not converge to a single value. The right figure gives the sample average of the diameter. The underlying graph was constructed using the degree sequence of a 2002 AS topology. The topology generation process started from a sparse core.
This heuristic was inspired by the linear increase, multiplicative decrease of the TCP protocol. In this way, we hope to perform far fewer connectivity tests. Notice that the above process is not strictly a Markov chain. This is because the size of the window W can vary arbitrarily. In fact, we need to argue that its stationary distribution is the uniform distribution over the set of all connected realizations. To see this, partition the above process into stages P_1, P_2, ..., P_j, .... The starting state of stage P_j is a connected realization of the degree sequence. The window W = W(P_j) is fixed for stage P_j. The last state of stage P_j is the realization after W simulation steps and the final connectivity test. The starting state of stage P_{j+1} coincides with the final state of stage P_j. Now notice that, for each stage, W is fixed, and the process P_j is a Markov chain on the state space of connected realizations. In addition, the process P_j is symmetric, in the sense that the probability of ending at state X given that we started at state Y is the same as the probability of ending at state Y given that we started at state X. This follows from the symmetry of the initial Markov chain, which holds with or without connectivity tests (all transitions, with or without connectivity tests, have the same probability). We may now invoke the well known fact that aperiodic symmetric Markov chains converge to the uniform distribution [31, 35], and conclude that each one of the processes P_j has a unique stationary distribution, which is uniform. Therefore, the concatenation P_1, P_2, ..., P_j, ... also converges to
the uniform distribution. We will use the term fast Markov chain, or Markov chain speed up, to refer to the process with the sliding window W (as we said, strictly speaking, this is not a Markov chain, but a concatenation of Markov chains).
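The sliding-window process of this Section can be sketched as follows. Again this is an illustration rather than the actual code: all names are invented here, the "stay put with probability 1/2" step is omitted for brevity, and the helper functions duplicate those of the previous sketch so the fragment stays self-contained. Swaps inside a stage skip the connectivity test; the test at the end of the stage decides whether the stage is accepted (window grows by one) or rolled back (window roughly halves).

import copy, random
from collections import deque

def swap_once(adj, rng):
    """Degree-preserving double-edge swap, with no connectivity check."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    if len(edges) < 2:
        return
    (u, v), (x, y) = rng.sample(edges, 2)
    if len({u, v, x, y}) < 4 or x in adj[u] or y in adj[v]:
        return
    adj[u].discard(v); adj[v].discard(u); adj[x].discard(y); adj[y].discard(x)
    adj[u].add(x); adj[x].add(u); adj[v].add(y); adj[y].add(v)

def is_connected(adj):
    start = next(iter(adj)); seen = {start}; queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u); queue.append(u)
    return len(seen) == len(adj)

def fast_chain(adj, total_steps, seed=0):
    """Run the windowed process starting from a *connected* realization."""
    rng, W, done = random.Random(seed), 1, 0
    while done < total_steps:
        snapshot = copy.deepcopy(adj)          # state at the start of the stage
        for _ in range(W):
            swap_once(adj, rng)
        done += W                              # count attempted steps
        if is_connected(adj):
            W += 1                             # linear increase
        else:
            adj.clear(); adj.update(snapshot)  # roll back the whole stage
            W = (W + 1) // 2                   # ceil(W/2), multiplicative decrease
    return adj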
5 Evaluation
In this Section we evaluate the efficiency of the Markov chain of Section 3 with the speed up proposed in Section 4. Our main application focus is the case of Internet topologies at the level of Autonomous Systems. We use data made available by the National Laboratory for Applied Network Research [17] as well as Traceroute at Oregon [38]. These involve sparse power-law graphs (average degree is less than 4.5), whose size (number of nodes) has grown from 3K in November 1997 (when the data collection started) to approximately 14K today. The main results are as follows:
1. We observe convergence in less than 2 million steps for the following quantities of interest:
• Shortest paths from a node to every other node.
• Number of shortest paths through a node.
The corresponding running time is less than 1 thousand seconds on a Pentium III.
2. For the same running time, the Markov chain without the speed up runs for much fewer steps. Thus, the speed up indeed resulted in improvement.
3. The convergence time appears to scale slowly with
the size of the topology, which is encouraging when larger topologies will need to be generated.
4. The Markov chain method has the advantage over the method of [2] that it achieves the exact target degree sequence for any realizable sequence, while [2] differs from the target sequence even for power law sequences.
5. The Markov chain method has the following additional advantage. If we start from an extreme initial point, like a dense or sparse topology or a topology consisting of two well connected parts separated by a sparse cut, we can measure the parameters of interest at intermediate simulation points, and see how these parameters converge. Such simulations can be useful in stress tests of protocols.

More specifically, we have measured the following quantities:
1. For each node v, the average path from v is the average of the shortest path from v to every other node. In the networking context, this is an indication of the quality of service perceived by v. We will consider the quantity mean average path, where the mean is taken over all the nodes, as well as the variance and maximum of the average path.
2. For each node v, the maximum path from v is the maximum length shortest path from v to every other node. In the networking context, this is an indication of the worst case quality of service perceived by v. Again, we will consider the mean and maximum over all nodes of the graph. Notice that the maximum maximum path is the diameter.
3. For each link e, the link load of e is the number of shortest paths through e, when n² shortest paths from each node to every other node have been considered (we break ties at random). We normalize by dividing by n². In the networking context, this is an indication of the congestion of the link. We will consider the mean, variance, and max of the link load, over all links of the network.

In Tables 1 and 2 we indicate the convergence of the fast Markov chain for the degree sequence of the Internet inter-domain topology of June 2002. Table 1 corresponds to a dense initial topology, and Table 2 corresponds to a sparse initial topology. The last column corresponds to a Markov chain without the speed up for the same number of connectivity tests (thus approximately the same running time). We may observe both the convergence of the sped-up Markov chain, as well as the weaker behavior of the slow Markov chain: for example, look at the Max Average Path and the Max Max Path (diameter) of the slow Markov chain. We have repeated these experiments 10 times, and the results are almost identical.

In Table 3 we indicate the convergence of the sped-up Markov chain starting from a synthetic initial topology consisting of two parts separated by a sparse cut. We constructed this synthetic topology by taking two copies of the AS topology and simulated the sparse cut by adding a single edge to connect the two parts.

In Table 4, we compare the output of the Markov chain method to those of the power-law random graph (PLRG) method of [2], and to the results of the real Internet topology. It is clear that the Markov chain is a better fit. In addition, we have found that the PLRG method of [2] produces graphs whose highest degrees deviate from the real Internet topology.

In Tables 5 and 6 we indicate how the stopping time scaled, as the topology scaled from approximately 3K nodes to approximately 14K nodes. We consider the two metrics of average path and normalized average link load. We should note that the number of steps indicated is the minimum number of steps that we needed to take in order for the metrics under consideration to converge accurately. The first thing to notice is the robustness of the method. We have actually performed further simulations on parts of the network (such as Europe, or North America), and on projected degree sequences for the network in the next 5 years (as given by the Inet topology generator [24]), and we have found qualitatively similar results. The second important thing to notice is that the scaling of the convergence time from the 3K node topology to the 13K node topology is very mild. We find this encouraging evidence that the convergence times will be reasonably efficient as the topologies scale further.

6 Open Problems
A very challenging open problem in this area is the following: In addition to a target degree sequence, we may consider an underlying metric of distances. (In reality, these would be geographic distances for which data is available.) We would then want to construct a minimum cost connected realization of the degree sequence. Even without the connectivity requirement, for a degree sequence on n nodes, this would be a mincost perfect matching problem on O(n²) nodes along the reduction of Section 3. For n equal to several tens of thousands, all known exact mincost perfect matching algorithms are inefficient. Is there an efficient approximation [39]?
Is there a proof of rapid mixing for the Markov chains considered here? A proof would be interesting even for special cases like trees.
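Referring back to the quality-of-service and congestion metrics defined in Section 5, the following sketch shows one way to compute them by breadth-first search on an unweighted connected graph. It is purely illustrative: the names, the dict-of-sets graph representation, and the tie-breaking by charging one randomly chosen shortest path per source-destination pair are choices made here, not taken from the evaluated code.

import random
from collections import deque

def path_metrics(adj, seed=0):
    """Average path, max path, and normalized link load for an unweighted
    connected graph given as a dict of adjacency sets."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    avg_path, max_path = {}, {}
    load = {}                                    # edge (u, v) with u < v -> count
    for s in nodes:
        # BFS from s: distances and shortest-path predecessors.
        dist, pred = {s: 0}, {s: []}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    pred[u] = [v]
                    queue.append(u)
                elif dist[u] == dist[v] + 1:
                    pred[u].append(v)            # another shortest-path predecessor
        avg_path[s] = sum(dist.values()) / (n - 1)
        max_path[s] = max(dist.values())
        # Charge one randomly chosen shortest path per destination (ties at random).
        for t in nodes:
            if t == s:
                continue
            v = t
            while v != s:
                p = rng.choice(pred[v])
                e = (min(v, p), max(v, p))
                load[e] = load.get(e, 0) + 1
                v = p
    link_load = {e: c / (n * n) for e, c in load.items()}
    return avg_path, max_path, link_load

The diameter reported in the tables corresponds to the maximum of max_path over all nodes, and the mean, variance, and maximum of link_load correspond to the link-load rows.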
Property Average Path - Mean - Variance - Max Max Path - Mean - Variance - Max (diameter) Link Load - Mean - Variance - Max Connectivity Tests Running time (sec)
Initial
Fast MC 20K steps
FastMC 50K steps
FastMC 200K steps
Fast MC 0.5M steps
FastMC 1M steps
Slow MC ~28K steps
13.9920 601.5923 200.6095
3.5490 0.3791 9.1933
3.4404 0.2846 7.8317
3.4269 0.2766 7.4659
3.4277 0.2811 7.9392
3.4255 0.2795 7.9760
3.4671 0.3044 7.4373
205.8013 1.2605 213
9.3013 0.4480 15
7.9866 0.3735 12
7.5114 0.4104 11
7.9634 0.4089 12
7.9992 0.4070 12.1
7.4931 0.4184 11
4.94e-4 1.20e-5 7.97e-2
1.25e-4 2.72e-7 5.36e-2 966 16.07
1.22e-4 3.13e-7 5.55e-2 1598 27.32
1.21e-4 3.18e-7 5.51e-7 5389 102.44
1.21e-4 3.17e-7 5.49e-2 14,257 222.17
1.20e-4 3.18e-7 5.53e-2 27,878 434.18
1.22e-4 2.90e-7 5.41e-2 27,878 411.67
Table 1: Evolution of average path, maximum path and link load at various stages from a dense starting state. The thing to observe is that convergence appears to have occurred between 50K and 200K steps. The metrics for 1M steps appear indicative of steady state; they remain the same when running up to 5M steps. The underlying graph was constructed using the degree sequence of a 2002 AS topology with 14K nodes.
Property Average Path - Mean - Variance - Max Max Path - Mean - Variance - Max (diameter) Link Load - Mean - Variance - Max Connectivity Tests Running time (sec)
Initial
FastMC 20K steps
FastMC 50K steps
Fast MC 200K steps
FastMC 0.5M steps
Fast MC 1M steps
Slow MC ~ 12K steps
4.6661 0.8181 10.1008
3.3189 0.2156 6.8958
3.4254 0.2944 8.8231
3.4296 0.2837 8.0073
3.4270 0.2802 8.0158
3.4282 0.2826 8.2066
3.2726 0.1977 6.5994
10.1693 1.3050 14
6.9522 0.3279 10
8.8276 0.4605 13
8.0307 0.4101 12
8.0451 0.4141 12
8.2490 0.4013 12.3
6.8558 0.3355 10
1.64e-4 1.44e-6 5.95e-2
1.17e-4 3.08e-4 5.07e-2 93 2.47
1.21e-4 3.26e-7 5.55e-2 717 13.79
1.21e-4 3.28e-7 5.68e-2 4,165 64.82
1.21e-4 3.26e-7 5.64e-2 12,224 190.11
1.21e-4 3.26e-7 5.65e-2 25,736 397.47
1.16e-4 2.55e-7 4.50e-2 12,224 176.20
Table 2: Evolution of average path, maximum path and link load at various stages from a sparse starting state. The thing to observe is that convergence appears to have occurred between 50K and 200K steps. The metrics for 1M steps appear indicative; they remain the same when running up to 5M steps. The underlying graph was constructed using the degree sequence of a 2002 AS topology with 14K nodes.
Property Average Path - Mean - Variance - Max Max Path - Mean - Variance - Max (diameter) Link Load - Mean - Variance - Max Connectivity Tests Running time (sec)
Initial
FastMC 100 steps
FastMC 500 steps
FastMC IK steps
FastMC 10K steps
FastMC 50K steps
4.8030 0.4150 8.0807
4.1460 0.3843 7.1905
3.9607 0.3644 7.6485
3.8683 0.3350 7.1923
3.7916 0.3940 9.3363
3.7854 0.3866 8.3489
9.4455 0.5182 13
7.6680 0.4985 11
7.7422 0.4940 11
7.4654 0.3923 11
9.3413 0.5199 13
8.5807 0.4767 13
2.49e-4 1.61e-5 5.00e-l
2.15e-4 1.42e-6 1.26e-l 2 0.5
2.05e-4 4.73e-7 4.39e-2 10 0.5610
2.01e-4 3.42e-7 3.85e-2 15 0.7310
1.97e-4 1.78e-7 2.28e-2 340 5.61
1.96e-4 1.92e-7 2.13e-2 2477 39.17
Table 3: Convergence of a Markov Chain whose initial state is a graph composed of two separate components connected with a single link. Observe the high values of link load in the initial graph. This is natural since the graph has a bad cut. The steady state is reached after relatively few steps, on the order of 10K in this example, even though the starting point is an extremal case of a graph.
Property Path mean Path var Path max Load mean Load var Load max
Den se Final Initial 13.9920 3.4261 601.5923 0.2851 200.6095 8.2758 4.94e-4 1.21e-4 1.20e-5 3.17e-7 7.97e-2 5.58e-2
Sparse Initial Final 4.6661 3.4337 0.8181 0.2863 10.1008 8.3023 1.65e-4 1.21e-4 1.44e-6 3.30e-7 5.95e-2 5.67e-2
Internet 3.6316 0.3181 7.5005 1.28e-4 2.94e-7 5.26e-2
PLRG 3.8735 0.4738 10.2855 1.54e-4 1.21e-7 1.94e-2
Table 4: Comparison of the average path and the link load for the AS topology, the Markov Chain and the PLRG model. The MC converges to similar results for both dense and sparse initial points after simulating a sufficient number of steps. The Markov Chain approach gave results closer to the real AS Topology than the PLRG model. The topologies were generated from a 2002 AS topology degree sequence.
Graph 1997 1998 1999 2000 2001 2002
Nodes Links Internet 3055 5678 3.7676 3666 7229 3.7376 5053 10458 3.7148 7580 16153 3.6546 10915 22621 3.6350 13155 27041 3.6316
PLRG 3.9802 3.9009 3.9410 3.9020 3.8422 3.8735
Dense Initial Steps 4.0489 1000000 4.0732 1100000 5.0164 900000 5.7000 600000 7.0962 800000 6.1559 1100000
Final 3.5744 3.5335 3.5270 3.4717 3.4107 3.4255
Sparse Initial Steps Final 5.2917 1000000 3.6046 5.3540 800000 3.5554 5.5231 800000 3.5353 5.8082 1300000 3.4884 4.8178 900000 3.4169 4.6610 1000000 3.4282
Table 5: The average path for the AS topology and for synthetic topologies for various sizes obtained from snapshots of the AS topology between 1997 and 2002. The graphs generated starting from dense and sparse states converge to similar values after a sufficient number of steps. The required number of steps for convergence does not change dramatically as the size of the graph increases. This can be viewed as an indication that the method scales well with the size of the topology.
Graph 1997 1998 1999 2000 2001 2002
Nodes 3055 3666 5053 7580 10915 13155
Links 5678 7229 10458 16153 22621 27041
Internet PLRG 7.19e-4 8.62e-4 5.62e-4 6.57e-4 3.85e-4 4.54e-4 2.42e-4 2.88e-4 1.61e-4 1.91e-4 1.28e-4 1.54e-4
Initial l.Ole-1 8.56e-2 9.92e-2 8.50e-2 6.23e-2 4.94e-4
Dense Steps 1000000 1100000 900000 600000 800000 1000000
Final 6.82e-2 5.66e-2 3.66e-2 2.30e-2 1.15e-2 1.21e-4
Initial l.OOe-3 7.01e-4 3.69e-4 3.84e-4 7.85e-5 1.65e-4
Sparse Steps 1000000 800000 800000 1300000 900000 1000000
Final 6.83e-2 5.65e-2 3.68e-2 2.32e-2 1.16e-2 1.21e-4
Table 6: The maximum link load for the AS topology and for synthetic topologies for various sizes obtained from snapshots of the AS topology between 1997 and 2002. The graphs generated starting from dense and sparse states converge to similar values after a sufficient number of steps. The required number of steps for convergence does not change dramatically as the size of the graph increases. This can be viewed as an indication that the method scales well with the size of the topology.

References
[1] L. Adamic. Zipf, power-laws, and pareto, a ranking tutorial. http://www.hpl.hp.com/~shl/papers/ranking/, 2002.
[2] William Aiello, Fan R. K. Chung, and Linyuan Lu. A random graph model for power law graphs. In Proc. 41st Symposium on Foundations of Computer Science (FOCS), pages 171-180. IEEE, 2000.
[3] William Aiello, Fan R. K. Chung, and Linyuan Lu. Random evolution in massive graphs. In Proc. 42nd Symposium on Foundations of Computer Science (FOCS), pages 510-519. IEEE, 2001.
[4] D. Alderson, J. Doyle, and W. Willinger. Toward an optimization-driven framework for designing and generating realistic internet topologies. In HotNets, 2002.
[5] D. Aldous and J. Fill. Reversible Markov chains and random walks on graphs. Monograph, http://statwww.berkeley.edu/users/aldous/book.html, 2002.
[6] Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[7] Claude Berge. Graphs and Hypergraphs. North Holland Publishing Company, 1973.
[8] B. Bollobas, O. Riordan, J. Spencer, and G. Tusnady. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, 18(3):279-290, 2001.
[9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. 9th International World Wide Web Conference (WWW9)/Computer Networks, 33(1-6):309-320, 2000.
[10] Tian Bu and Don Towsley. On distinguishing between internet power law topology generators. In Proc. Infocom. IEEE, 2002.
[11] Q. Chen, H. Chang, R. Govindan, S. Jamin, S. Shenker, and W. Willinger. The origins of power-laws in internet topologies revisited. In Proc. Infocom. IEEE, 2002.
[12] F. R. K. Chung and L. Lu. Connected components in random graphs with given degree sequences. http://www.math.ucsd.edu/~fan, 2002.
[13] C. Cooper and A. Frieze. A general model for web graphs. In ESA, pages 500-511, 2001.
[14] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the web. In International Conference on Very Large Data Bases, pages 69-78, Rome, 2001.
[15] A. Fabrikant, E. Koutsoupias, and C.H. Papadimitriou. Heuristically optimized tradeoffs: A new paradigm for power laws in the internet. In ICALP, 2002.
[16] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. SigComm. ACM, 1999.
[17] National Laboratory for Applied Network Research. Route views archive. http://moat.nlanr.net/Routing/rawdata, 2002.
[18] S.L. Hakimi. On the realizability of a set of integers as degrees of the vertices of a graph. SIAM J. Appl. Math., 10, 1962.
[19] V. Havel. A remark on the existence of finite graphs. Casopis Pest. Mat., 80, 1955.
[20] H. Jeong, B. Tombor, R. Albert, Z. Oltvai, and A. Barabasi. The large-scale organization of metabolic networks. Nature, (407):651, 2000.
[21] M. Jerrum and A. Sinclair. Fast uniform generation of regular graphs. TCS, (73):91-100.
[22] M. Jerrum and A. Sinclair. Approximating the permanent. SIAM J. of Computing, 18:1149-1178, 1989.
[23] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial time approximation algorithm for the permanent of a matrix with non-negative entries. In Proc. Symposium on Theory of Computing (STOC), pages 712-721. ACM, 2001.
[24] C. Jin, Q. Chen, and S. Jamin. Inet: Internet topology generator. Technical Report CSE-TR-433-00, U. Michigan, Ann Arbor, 2000.
24
[25] John Kleinber. Authoritative sources in a hyperlinked environment. In 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). [26] R. Kumar, P. Raghavan, S. Rajagopalan, Sivakumar D., A. Tomkins, and E. Upfal. Stochastic models for the web graphs. In Proc. 41st Symposium on Foundations of Computer Science (FOGS), pages 5765. IEEE, 2000. [27] R. Kumar, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Trawling the web for emerging cybercommunities. In WWW8/Computer Networks, volume 31, pages 1481-1493, 1999. [28] L. Lovasz and M.D. Plummer. Matching Theory. Acedemic Press, New York, 1986. [29] A. Medina, I. Matta, and J. Byers. On the origin of power-laws in internet topologies. ACM Computer Communications Review, April 2000. [30] A. Medina, I. Matta, and J. Byers. Brite: Universal topology generation from a user's pespective. Technical Report BUCS-TR2001003, Boston University, 2001. Available at http://www.cs.bu.edu/brite/publications. [31] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[32] Erdos. P. and T. Gallai. Graphs with prescribed degrees of vertices. Mat. Lapok, 11, 1960. [33] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Stanford Digital Library Technologies Project, 1998. [34] C. Palmer and J. Steffan. Generating network topologies that obey power laws. In Globecom, 2000. [35] A. Sinclair. Algorithms for Random Generation and Counting: A Markov Chain Approach. SpringerVerlag, 1997. [36] H. Tagmunarunkit, R. Govindan, S. Jamin, S. Shenker, and W. Willinger. Network topology generators: Degree-based vs structural. In Proc. SigComm. ACM, 2002. [37] R. Taylor. Constrained switchings in graphs. SI AM J. Alg. DISC. METH., 3(1):115-121, 1982. [38] Traceroute.org. Public route server and looking glass site list, http://www.traceroute.org, 2002. [39] V.V. Vazirani. Approximation Algorithms. SpringerVerlag, 2001. [40] E. Zegura, K. Calvert, and S. Bhattacharjee. How to model an internetwork. In Proc. Infocom. IEEE, 1996.
25
F i n d i n g t h ek Sh o r t e s t Simp l e P a t h s:
A New Algorithm and its Implementation John Hershberger"
Matthew Maxel1"
Abstract We describe a new algorithm to enumerate the k shortest simple (loopless) paths in a directed graph and report on its implementation. Our algorithm is based on a replacement paths algorithm proposed recently by Hershberger and Suri [7], and can yield a factor O(n) improvement for this problem. But there is a caveat: the fast replacement paths subroutine is known to fail for some directed graphs. However, the failure is easily detected, and so our k shortest paths algorithm optimistically uses the fast subroutine, then switches to a slower but correct algorithm if a failure is detected. Thus the algorithm achieves its 0(n) speed advantage only when the optimism is justified. Our empirical results show that the replacement paths failure is a rare phenomenon, and the new algorithm outperforms the current best algorithms; the improvement can be substantial in large graphs. For instance, on GIS map data with about 5000 nodes and 12000 edges, our algorithm is 4-8 times faster. In synthetic graphs modeling wireless ad hoc networks, our algorithm is about 20 times faster.
1
Subhash Suri*
between the k shortest paths problem with and without the simplicity constraint. (As the figure shows, even in graphs with non-negative weights, although the shortest path is always simple, the subsequent paths can have cycles.) The k shortest paths problem in which paths are not required to be simple turns out to be significantly easier. An O(m + fcnlogn) time algorithm for this problem has been known since 1975 [3]; a recent improvement by Eppstein essentially achieves the optimal time of O(m + nlogn + k}—the algorithm computes an implicit representation of the paths, from which each path can be output in O(n) additional time [2].
Introduction
The k shortest paths problem is a natural and longstudied generalization of the shortest path problem, in which not one but several paths in increasing order of length are sought. Given a directed graph G with nonnegative edge weights, a positive integer fc, and two vertices s and £, the problem asks for the k shortest paths from s to t in increasing order of length. We require that the paths be simple (loop free). See Figure 1 for an example illustrating the difference * Mentor Graphics Corp, 8005 SW Boeckman Road, Wilsonville, OR 97070, [email protected]. t Department of Computer Science, University of California, Santa Barbara, CA 93106. ^Department of Computer Science, University of California, Santa Barbara, CA 93106, [email protected]. Supported in part by National Science Foundation grants CCR-9901958 and IIS0121562.
Figure 1: The difference between simple and nonsimple k shortest paths. The three simple shortest paths have lengths 6,20 and 21, respectively. Without the simplicity constraint, paths may use the cycles (a, &, a) and (d, e, d), giving shortest paths of lengths 6,8,10. The problem of determining the k shortest simple paths has proved to be more challenging. The problem was originally examined by Hoffman and Pavley [10], but nearly all early attempts to solve it led to exponential time algorithms [16]. The best result known to date is an algorithm by Yen [18, 19] (generalized by Lawler [12]), which using modern data structures can be implemented in O(kn(m+nlogn)) worstcase time. This algorithm essentially performs O(n) 26
finds the best candidate path for each possible branch point off previously chosen paths is also subject to this lower bound, even for k = 2. All known algorithms for the k simple shortest paths fall into this category.
single-source shortest path computations for each output path. In the case of undirected graphs, Katoh, Ibaraki, and Mine [11] improve Yen's algorithm to O(fc(ra+nlogn)) time. While Yen's asymptotic worstcase bound for enumerating k simple shortest paths in a directed graph remains unbeaten, several heuristic improvements to his algorithm have been proposed and implemented, as have other algorithms with the same worst-case bound. See for example [1, 6, 13, 14, 15]. In this paper we propose a new algorithm to find the k shortest simple paths in a directed graph. Our algorithm is based on the efficient replacement paths algorithm of Hershberger and Suri [7], which gives a 6(n) speedup over the naive algorithm for replacement paths. The efficient algorithm is known to fail for some directed graphs [8]. However, the failure is easy to detect, and so our k shortest paths algorithm optimistically tries the fast replacement paths subroutine, then switches to a slower but correct algorithm if a failure occurs. We have implemented our optimism-based k shortest paths algorithm, and our empirical results show that the replacement paths algorithm almost always succeeds—in our experiments, the replacement paths subroutine failed less than 1% of the time. Thus, we obtain a speedup over Yen's algorithm that is close to the best-case linear speedup predicted by theory. The algorithm is never much worse than Yen's, and on graphs where shortest paths have many edges, such as those derived from GIS data or locally-connected radio networks, the improvement is substantial. For instance, on GIS data representing road networks in the United States, our algorithm is about 20% faster for paths from Washington, D.C., to New York (relatively close cities), and about 8 times faster for paths from San Diego to Piercy, CA. Similarly, on synthetic models of wireless networks, our algorithm is 10 to 20 times faster. Random graphs tend to have small diameters, but even on those graphs, our algorithm usually outperforms Yen's algorithm. The fact that Yen's worst-case bound for directed graphs remains unbeaten after 30 years raises the possibility that no better algorithm exists. Indeed, Hershberger, Suri, and Bhosle recently have shown that in a slightly restricted version of the path comparison model, the replacement paths problem in a directed graph has complexity Q(ra-v/n), if m = O(n^/n) [9]. (Most known shortest path algorithms, including those by Dijkstra, Bellman-Ford and Floyd-Warshall, satisfy this model. The exceptions to this model are the matrix multiplication based algorithms by Fredman and others [4,17, 20].) The same lower bound construction shows that any k simple shortest paths algorithm that
2 Path Branching and Equivalence Classes In the k shortest paths problem we are given a directed graph G = (V, E), with n vertices and m edges. Each edge e € E has an associated non-negative weight c(e). A path in G is a sequence of edges, with the head of each edge connected to the tail of its successor at a common vertex. A path is simple if all its vertices are distinct. The total weight of a path in G is the sum of the weights of the edges on the path. The shortest path between two vertices s and t, denoted by path(s,t), is the path joining s to t, assuming one exists, that has minimum weight. The weight of path(s, t) is denoted by d(s,t). We begin with an informal description of the algorithm. Our algorithm generates k shortest paths in order of increasing length. Suppose we have generated the first i shortest paths, namely, the set 11, = {Pi, P2,..., Pi}. Let Ri denote the set of remaining paths that join s to t; this is the set of candidate paths for the remaining k — i shortest paths. In order to find the next shortest path in Ri efficiently, we partition this set into O(i) equivalence classes. These classes are intimately related to the branching structure of the first i paths, which we call the path branching structure 7J. This is a rooted tree that compactly encodes how the first i paths branch off from each other topologically, and it can be defined procedurally as follows. (N.B. The procedural definition is not part of our algorithm; it is just a convenient way to describe the path branching structure.) The subroutine to construct 71 is called with parameters (s,Ilj), where s is the fixed source and Hi = {Pi, P2,..., Pi} is the set of the first i shortest paths. We initialize 7* to the singleton root node s. Let (s, a, 6,..., u) be the longest subpath that is a common prefix of all the paths in 11,. We expand 7* by adding a node labeled u, and creating the branch (s,u). The node u is the child of s. The set of paths {Pi,P2,.. .,Pi} is defined to be the path bundle of (s,w) and denoted by B(s, u).1 We now partition the path bundle B(s,u) into sets Si, 62, • • • > Sa such that 1 Strictly speaking, the bundle notation should have the subscript i to indicate that it refers to a branch in tree 7^. We drop this subscript to keep the notation simple, since the choice of branching structure will always be clear from the context.
27
in 71 for each of the k shortest paths, which is why we distinguish the t nodes by their prefix paths (e.g., tp). Likewise, a pair of vertices u, v may correspond to multiple branches, each with its own branch path. The following lemma is straightforward. LEMMA 2.1. The path branching structure 7^ has O(i) nodes and exactly i leaves.
Figure 2: (i) Four shortest paths, PI, . . . , associated path branching structure.
(ii) The
all paths in Sj follow the same edge after w, and paths in different groups, Sj,Si, follow distinct edges. We then make recursive calls (u, Sj), for j = 1,2,..., a, to complete the construction of 7*. The recursion stops when \Sj\ = 1, i.e., 5^- contains a single path P. In this case we create a leaf node labeled tp, representing the target vertex parameterized by the path that reaches it, and associate the bundle {P} with the branch (u, tp). See Figure 2 for an example of a path branching structure. Remark: It is possible that the paths in the bundle jB(s,w) have no overlap, meaning u = s. In this case, the branch (s, u) is not created, and we go directly to the recursive calls. For any given branch (u, v), B(u, v) is exactly the set of paths that terminate at leaf descendants of v. The bundles are not maintained explicitly; rather, they are encoded in the path branching structure. Each branch (w, v) has a path in G associated with it. This path, denoted branchPath(u,v), is shared by all the paths in B(u,v). Its endpoints are the vertices of G corresponding to nodes u and v. The first edge of branchPath(u, v) is the lead edge of the path, denoted lead(u,v). A node u in 7* has a prefix path associated with it, denoted prefix? ath(u). The prefix path is the concatenation of the branch paths for all the branches from s to u in 7^. Thus a shortest path P is equal to prefixPath(tp). It is important to note the distinction between nodes in 7* and their corresponding vertices in G. A single vertex u € V may have up to k corresponding nodes in 7i, depending on how many different paths from s reach it. For example, vertex t has one node
The preceding lemma would apply to a branching structure built from any i simple paths from s to t— it does not depend on the paths of Hi being shortest paths. The following lemma characterizes the branches of Ti more precisely, and it does depend on the shortness of the paths. LEMMA 2.2. Let (u,v) be a branch in 7$. Let P be the shortest path from vertex u to vertex t in G that starts with lead(u, v) and is vertex-disjoint from prefixPath(u). Then branchPath(u,v) C P. Proof. Recall that the ith path Pi is the shortest path that is not an element of IIj-i. Therefore, Pi must branch off from each path in IL,_i. Let x be the vertex of PI farthest from s where Pi branches off from a path in Tii-i- Since Ti-\ contains exactly the paths in IL,_i, Pi branches off from some branchPath(-, •) of Ti-\ at x. Let e be the edge on Pi that follows x. Because Pi is the shortest simple path not in IL,_i, it must a fortiori be the shortest simple path that branches off from Ti-i at x and follows e. The simplicity requirement means that the part of Pi after x must be disjoint from all the vertices on the prefix path in 7i_i from s to x. However, the only other requirement on the part of Pi after e is that it be as short as possible; this implies that the part of the path after x contains no loops. More generally, for every edge e1 on Pi after x, the subpath optimality property of shortest paths implies that the part of Pi after e' is the shortest path from the head vertex of e' to t that avoids the vertices of the prefix of Pi up through the tail vertex of e'. Any branch (u, v) in 7^ is a subset of a branch (z,^.) that was created as a branch off Tj-\ for some j < i. (Branches are not moved by later branches, only subdivided.) The edge lead(u,v) belongs to branchPathfaitpj), and hence as noted in the preceding paragraph, the part of Pj after w, which contains branchPath(u,v), is the shortest path that starts with lead(u,v) and is vertex-disjoint from prefixPath(u). We associate the equivalence classes of candidate paths with the nodes and branches of 7^, as follows. Consider a non-leaf node u € 7^, and suppose that the 28
smallest of these O(i) heap entries. Once this path is chosen (and deleted from the heap), the path branching structure is modified, and so is the equivalence class partition. Computationally, this involves refining one equivalence class into at most four classes. For each class, we determine the minimum element, and then insert it into the heap. See Figure 3.
children of u are labeled vi,V2,..., va. We associate one equivalence class with each branch out of u, denoted C(w, Vj), and one with node u itself, denoted C(u). The class C(u,Vj) consists of those paths of Ri that overlap with the paths in prefixPath(vj) up to and including the lead edge lead(u, t^-), but this overlap does not extend to Vj. That is, each path of C(u,Vj) diverges from prefixPath(vj) somewhere strictly between u and Vj. The final set C(u) consists of those paths that overlap with each prefixPath(vj] up to w, but no farther. That is, these paths branch off at u, using an edge that is distinct from any of the lead edges lead(u, Vj), for j = 1,2,..., a. The equivalence partition associated with the path branching structure Ti is the collection of these sets over all branches and non-leaf nodes of Ti. For instance, consider node a in Figure 2. There are three equivalence classes associated with a: class C(a, c) includes those paths that share the subpath from s to a with PS, PI, and branch off somewhere strictly between a and c (assume that the subpath from a to c contains more than one edge). Similarly, the class C(a,6) contains paths that coincide with Pi,P2 until a, then branch off before b. Finally, the class C(a) contains paths that coincide with PI, ..., P± up to a, then branch off at a.
ALGORITHM k-SHORTESTPATHS
• Initialize the path branching structure T to contain a single node s, and put path(s, t) in the heap. There is one equivalence class C(s) initially, which corresponds to all the s-t paths. • Repeat the following steps k times. 1. Extract the minimum key from the heap. The key corresponds to some path P. 2. If P belongs to an equivalence class C(u) for some node u then (a) Add a new branch (w,tp) to T that represents the suffix of P after u. (b) Remove from C(u) the paths that share at least one edge with P after u and put all of them except P into the newly created equivalence class C7(w,tp).
LEMMA 2.3. Every path from stot that is not among the i shortest paths belongs to one of the equivalence classes associated with the nodes and branches of Ti. The number of equivalence classes associated with the path branching structure Ti is O(i}.
3. Else (P belongs to the equivalence class C(u,v) for some branch (u,v)) (a) Let w be the vertex where P branches off from branchPaih(u,v). (b) Insert a new node labeled w, and split the branch (u, v) into two new branches (u,w) and (w,v). Add a second branch (w,tp) that represents the suffix of P after w. (c) Redistribute the paths of C(u, v) \ P among the four new equivalence classes (7(u,ty), C(w,v), C(w,tp), and C(w), depending on where they branch from branchPath(u, v) and/or P. i. Paths branching off branchPath(u, v) before node w belong to C(u,w). ii. Paths branching off branchPath(w, v) after node w belong to C(w, v). Hi. Paths branching off P after node w belong to C(w,tp). iv. Paths branching off both P and branchPath(u, v) at node w belong to C(w).
Proof. Consider a path P different from the first i shortest paths. Suppose Pj, where 1 < j < i, is the path that shares the longest common prefix subpath with P. In Ti, let (w, v) be the first branch where P and Pj diverge. Then P belongs in the equivalence class associated with either u or (u, v). Finally, the total number of equivalence classes is O(i) because each node and branch of Ti has only one equivalence class associated with it, and there are O(i) nodes and branches in Ti. I We are now ready to describe our algorithm for enumerating the k shortest simple paths.
3
Computing the k Shortest Paths
We maintain a heap, which records the minimum path length from each of the O(i) equivalence classes. Clearly, the next shortest path from s to t is the
29
4. For each new or changed equivalence class (at most four), compute the shortest path from s to t that belongs to the class. Insert these paths into the heap.
Figure 3: Illustration of how the equivalence class partition and branching structure change with the addition of a new path. The left figure shows the structure before adding the new path. There is an equivalence class for each non-leaf node and branch of the tree. The right figure shows the portion of the structure that is affected if the newly added path came from the class belonging to branch (w, v). The classes corresponding to the node and three branches that are inside the shaded oval are modified.
4
The key step in our k shortest paths algorithm is the computation of the best path in each equivalence class, which can be formulated as a replacement paths problem. Let H = (V,E) be a directed graph with non-negative edge weights, and let x, y be two specified nodes. Let P = {^1,^2,- • • >Vm}> where v\ = x, and vm = y, be the shortest path from x to y in H. We want to compute the shortest path from x to y that does not include the edge (vi,Vi+i), for each i € {1,2,..., m — 1}. We call this the best replacement path for (uj,Vi+i); the reference to the source x and target y is implicit. A naive algorithm would require ra — 1 invocations of the single-source shortest path computation: run the shortest path algorithm in graph H-i, where H-i is the graph H with edge (^,1^+1) deleted. The following algorithm does batch computation to determine all the replacement paths in O(\E\ + |F[log|V|) time; as mentioned earlier, it can fail for some directed graphs, but the failure can easily be detected. This algorithm is a slight simplification of the one in [7]. For full details of the algorithm's data structures, refer to that paper.
ALGORITHM REPLACEMENT 1. In the graph H , let X be a shortest path tree from the source x to all the remaining nodes, and let Y be a shortest path tree from all the nodes to the target y. Observe that P, the shortest path from x to y, belongs to both X and Y.
LEMMA 3.1. Algorithm k-ShortestPaths correctly computes the ith shortest path, the branching structure Ti, and the equivalence class partition of the candidate paths Ri, for each i from I to fc.
2. For every edge e* = (i>t,fi+i) 6 P
The complexity of the algorithm described above is dominated by Step 4. Step 1 takes only O(logfc) time per iteration of the repeat loop, and Steps 2 and 3 take O(n) time for path manipulation. The redistribution of candidate paths among equivalence classes is conceptual—the hard work is computing the minimum element in each class in Step 4. In the following section, we discuss how to implement this step efficiently. Remark: Our algorithm is conceptually similar to those of Yen and Lawler. The main difference is that our algorithm partitions the candidate paths into equivalence classes determined by the path branching structure, and those algorithms do not. This additional structure together with the fast replacement paths subroutine (Section 4) is the key to our algorithm's efficiency.
The Replacement Paths Problem
(a) Let Xi = X\6i. Let E* be the set of all edges (a, 6) € E\ei such that a and 6 are in different components of Xi, with o in the same component as x.
(b) For every edge (o, 6) e Ei Let pathWeight(a, 6) = d(x, a) + c(a, b) + d(b,y). Observe that d(x,a) and d(b,y) can be computed in constant time from X and Y. (c) The replacement distance for Ci is the minimum of pathWeight(a, b) over all (a, b) £ Ei. The quantity pathWeight(a, b) is the total weight of the concatenation of path(x,a), (a, 6), and path(b,y). By sweeping over the edges of P from one end of P to the other while maintaining a priority queue on the 30
5
edges of Ei, with pathWeight(e) as the key of each edge e e Ei, the entire algorithm takes the same asymptotic time as one shortest path computation. Let us now consider how this algorithm may fail in some directed graphs. It is clear that pathWeight(e) is the length of the shortest path from x to y that uses edge e = (a, 6) € Ei, and hence the algorithm finds the shortest path that uses an edge of E^ However, this path may not be the path we seek, because the suffix path(b,t) may traverse 6*. A simple example of this pathological behavior is shown in Figure 4.
The Shortest Path in an Equivalence Class
We describe briefly how the replacement path subroutine is used to compute the shortest path in an equivalence class. Consider the four equivalence classes created in step (3), in which P branches off from branchPath(u, v) at a vertex w. First consider a branch's equivalence class. Let (a,c) be a branch in T, and choose 6 such that lead(a,c) = (a, 6). The paths in C(a,c) follow prefixPath(c) up through 6, then branch off strictly before c. Thus it suffices to find the shortest suffix starting at 6, ending at t, subject to the constraints that the suffix (1) is vertex-disjoint from preftxPath(a) and (2) branches off branchPath(a,c) before c. We determine this path using the replacement path problem hi a subgraph H of G, defined by deleting from G all the vertices on prefixPath(a), including a. The shortest path in the node's equivalence class C(w) is easier to find: We obtain a graph H by deleting from G all the vertices hi preftxPath(w) except «;, plus all the lead edges that leave from w. We compute the shortest path from w to t in H, then append it to prefixPath(w). If the next shortest path P belongs to a node equivalence class C(u) (step (2) of Algorithm kShortestPaths), then C(u) is modified and a new equivalence class C(u, tp) is created. We can find the shortest paths in C(w, tp} and C(u) as above. (In the latter case, we simply remove one more edge lead(u, tp) from H and recompute the shortest path from u to t.) Thus, the overall complexity of the k shortest paths algorithm is dominated by O(k) invocations of the replacement paths subroutine. In the optimistic case, this takes O(m + nlogn) time per invocation; in the pessimistic case, it takes O(n(m + nlogn)) time per invocation.
Figure 4: A directed graph for which the replacement paths algorithm fails. The shortest path from v to y goes through the edge e, which causes our algorithm to compute an incorrect path for the replacement edge candidate (x,v). The correct replacement path for e uses the second shortest path from v to y, which does not go through e. Define low(v) to be the smallest i such that path(v, y) contains vertex v$. (In the original paper [7], low(v) is replaced by an equivalent but more complicated quantity called minblock(v), for reasons specific to that paper.) We say that (a, 6) e Ei is the min-yielding cut edge for e» if (a, 6) has the minimum pathWeightQ over all cut edges in £+. We say that (a, 6) is valid if low(b) > i. The following theorem identifies when the replacement paths algorithm may fail.
6
THEOREM 4.1. The algorithm REPLACEMENT correctly computes the replacement path for ei if the minyielding cut edge for e± is valid.
Implementation and Empirical Results
6.1 Implementation
In undirected graphs, all min-yielding cut edges are valid. In directed graphs, exceptions can arise. However, an exception is easily detected—the low() labels for all the vertices can be computed by a preorder traversal of the shortest path tree Y, and so we certify each min-yielding cut edge in constant time. When an exception is detected, our algorithm falls back to the slower method of running separate shortest path algorithms for each failing 6j.
We have implemented our algorithm using Microsoft Visual C++ 6.0, running on a 1.5 Ghz Pentium IV machine. The implementation follows the pseudo-code in Section 3 and the more detailed algorithm description of the replacement paths algorithm in [7]. We list a few of the notable features of the implementation here: 1. The Fibonacci heap data structure is used in both Dijkstra's shortest path algorithm and our replace31
ment paths subroutine. Fibonacci heaps therefore contribute to the performance of both our k shortest paths algorithm and Yen's algorithm. We implemented the Fibonacci heap from scratch, based on Fredman and Tarjan's paper [5]. 2. Our graph data structure is designed to reduce memory allocation of small structures, since measurements showed it to be a significant cost. The chief components of the graph data structure are an array of nodes, an array of edges, and scratch space for use in creating subgraphs. We store two arrays of pointers to edges, one sorted by source node and one sorted by destination node. Each node gets access to its incident edges by pointing to the appropriate subsequences in these arrays. A primary operation for the k shortest paths algorithm is producing subgraphs efficiently. Since memory allocation/deallocation is relatively expensive and most of the information in a subgraph is the same as that hi the parent graph, a subgraph borrows structures from the parent graph, uses these structures to compute some path, and then returns them to the parent graph. Because a subgraph generally has nodes or edges eliminated, we maintain a separate array of edges as scratch space in the parent graph for use by the subgraph. 3. Our program sometimes chooses to use a naive algorithm instead of the replacement paths algorithm of Section 4. The naive algorithm deletes each shortest path edge in turn, and finds the shortest path from the source to the sink in this new subgraph. Because the replacement paths algorithm calculates two shortest path trees and also performs a priority queue operation for every graph edge, we estimated that each naive shortest path computation should take about 1/3 of the time of the whole replacement paths algorithm. Therefore, our k shortest paths implementation is a hybrid: it chooses whichever replacement paths subroutine is likely to be faster, using a threshold of 3 for the branch path length. Subsequent measurement suggested that a threshold closer to 5 might be more appropriate. See Figure 5—the crossover point between the two algorithms appears to be around five. Future work will explore this more fully.
Figure 5: Time to compute replacement paths by the naive algorithm (circles) and our algorithm (triangles). The small glyphs show the raw data; the large ones show the average time value for each shortest path edge count. Note the large variance in the runtime of the naive algorithm, and the essentially constant runtime of our algorithm. There are equal numbers of small circles and triangles for each x-value; the low y-variance of the triangles means some small triangles are hidden by the large ones. data structures (Fibonacci heap) and optimized memory management. Our test suite had three different kinds of experimental data: real GIS data for road networks in the United States, synthetic graphs modeling wireless networks, and random graphs. GIS Road Networks. We obtained data on major roads from the Defense Mapping Agency. These graphs represent the major roads in a latitude/longitude rectangle. The edge weights in these graphs are road lengths. The first graph contains the road system in the state of California, and the second contains the road system in the northeastern part of the U.S. The experiments show that in the California graph, for 250 shortest paths from San Diego to Piercy, our algorithm is about 8 times faster than Yen's. For a closer source-destination pair (Los Angeles, San Francisco), the speedup factor is about 4. Finally, when the source and destination are fairly close (Washington, D.C., and New York), the relative speed advantage is about 20%. Figure 6 summarizes the results of these experiments.
6.2 Experiments We compared our new algorithm with an implementation of Yen's k shortest paths algorithm. We implemented Yen's algorithm ourselves, using state of the art 32
Figure 7: Time to compute 100 shortest paths on neighborhood graphs, plotted versus the average number of edges hi all the paths. Each plot summarizes 100 trials on graphs with the same (n, m), but with the grid rectangle's aspect ratio varied to control the average shortest path length. Circles represent Yen's algorithm, triangles our algorithm. The charts for m = Sn are similar to those for m = 4n, and are omitted to save space.
Figure 6: Time to compute k shortest paths on GIS graphs. Circles represent Yen's algorithm; triangles represent our algorithm.
33
Geometric Neighborhood Graphs. We generated synthetic models of ad hoc wireless networks, by considering nodes in a rectangular grid. The aspect ratio of the rectangle was varied to create graphs of varying diameter. The source and target were chosen to be opposite corners of the grid. In each case, we considered two models of edges: in one case, all 8 neighbors were present, and their distances were chosen uniformly at random in [0,1000]; in the second case, 4 of the 8 neighbors were randomly selected. Our experiments, summarized in Figure 7, show that our new algorithm is faster by a factor of at least 4. In the large graphs (10000 nodes, 40000 edges), the speedup is around twenty fold. Random Graphs. The new algorithm achieves its theoretical potential most fully when the average shortest path has many edges. This is clear from the experiments on the GIS and neighborhood data. Random graphs, on the other hand, tend to have small diameter. (In particular, a random graph on n nodes has expected diameter O(log n).) These graphs, therefore, are not good models for systems with long average paths. Even in these graphs, our new algorithm does better than Yen's in most cases, although the speed advantage is not substantial, as expected. Each random graph is generated by selecting edges at random until the desired number of edges is generated. Edge weights are chosen uniformly at random in [0,1000]. We tried three random graph classes: (IK nodes, 10K edges), (5K nodes, 20K edges), and (10K nodes, 25K edges). We plot the time needed to generate 100 shortest paths between a random (s, i) pair, against the average number of edges in the 100 paths. See Figure 8.
6.3
Discussion
We can characterize our results according to the following broad generalizations.
Average Number of Edges in Shortest Paths.
The efficiency of the new algorithm derives mainly from performing batch computations when finding the best path in an equivalence class. The relative gain is proportional to the number of edges in the branch path where the replacement paths subroutine is applied. Thus, if the replacement paths subroutine works without failover, our algorithm is likely to deliver a speedup that grows linearly with the average number of edges in the k
Figure 8: Time to compute 100 shortest paths on random graphs, plotted versus the average number of edges in all the paths. Each plot summarizes 100 random (s, t) trials in random graphs with the same (n, m) parameters. Circles represent Yen's algorithm, triangles our algorithm. The x-coordinate of each glyph is the nearest integer to the average number of edges in all 100 paths. The small glyphs show the raw data; the large ones show the average time value for each shortest path edge count. Note the heavy concentration of the average path edge count around a mean value that increases with the graph size, probably logarithmically, and also note the large variance in the runtime of Yen's algorithm. There are equal numbers of small circles and triangles for each x-value; the low y-variance of the triangles means some small triangles are hidden by the large ones. 34
fast subroutine optimistically and switching to a slower algorithm when it fails works very well in practice. We are exploring several additional directions for further improvements in the algorithm's performance.
shortest paths. This advantage is minimized for random graphs, because the expected diameter of a random graph is very small. This is corroborated by the data in Figure 8. In geometric graphs, such as those obtained from GIS or ad hoc networks, shortest paths are much more likely to have many edges, and our algorithm has a corresponding advantage. This is borne out by Figures 6 and 7.
1. When should we switch to the naive replacement paths algorithm? Is (path length < 3) the right cutoff, or would a more sophisticated condition give better results? To help answer this question, we ran the k shortest paths algorithm on 140 different random and neighborhood graphs, measuring the runtime for each threshold value between 2 and 7. Figure 9 combines the results for all experiments. For each test case, we computed the minimum running time over the six threshold values. We then computed a normalized runtime for each of the threshold values by dividing the actual runtime by the minimum runtime. Figure 9 shows the average normalized runtime over all 140 test cases.
Replacement Path Failure. Our experiments show that the replacement paths algorithm rarely fails. When averaged over many thousands of runs, the replacement paths subroutine failed 1.2% of the time on random graphs, 0.5% on neighborhood graphs, and never on GIS graphs. Thus, in practice our A; shortest paths algorithm shows an asymptotic speedup over Yen's algorithm. It also exhibits far more consistency in its runtime. Contraindications. Yen's algorithm optimizes over the same set of candidate paths as our algorithm. If the average path length is small, our hybrid algorithm does essentially the same work as Yen's algorithm, running Dijkstra's algorithm repeatedly. In this case our algorithm is slightly less efficient than Yen's because of the extra bookkeeping needed to decide which subroutine to use, but the relative loss is only about 20% in speed.
Averaged normalized runtime emphasizes the importance of test cases for which the threshold matters. A test case for which the threshold choice makes little difference has little effect on the average normalized time, because all its normalized times will be near 1.0. A test case for which one threshold is clearly better will assign high normalized weights to the other thresholds, and hence will select against them strongly.
In other cases, the replacement paths algorithm may be beaten by repeated Dijkstras even when the shortest path length is greater than three. This seems to occur most often in dense random graphs where Dijkstra's algorithm can find one shortest path without building the whole shortest path tree; the replacement paths algorithm, on the other hand, always builds two complete shortest path trees.
7
Concluding Remarks and Future Work
We have presented a new algorithm for enumerating the k shortest simple paths in a directed graph and reported on its empirical performance. The new algorithm is an interesting mix of traditional worstcase analysis and optimistic engineering design. Our theoretical advantage comes from a new subroutine that can perform batch computation in a specialized equivalence class of paths. However, this subroutine is known to fail for some directed graphs. Nevertheless, our experiments show that the strategy of using this
Figure 9: Average normalized runtime for all test cases. This chart suggests that 5 and 6 are the best thresholds. They should give about a 2% improvement in runtime over the threshold of 3 that we used in the experiments of Section 6.2. We 35
[12] E. L. Lawler. A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18:401-405, 1972. [13] E. Martins and M. Pascoal. A new implementation of Yen's ranking loopless paths algorithm. Submited for publication, Universidade de Coimbra, Portugal, 2000. [14] E. Martins, M. Pascoal, and J. Santos. A new algorithm for ranking loopless paths. Technical report, Universidade de Coimbra, Portugal, 1997. [15] A. Perko. Implementation of algorithms for K shortest loopless paths. Networks, 16:149-160, 1986. [16] M. Pollack. The kth best route through a network. Operations Research, 9:578, 1961. [17] R. Seidel. On the all-pairs-shortest-path problem in unweighted undirected graphs. Journal of Computer and System Sciences, 51(3):400-403, 1995. [18] J. Y. Yen. Finding the K shortest loopless paths in a network. Management Science, 17:712-716, 1971. [19] J. Y. Yen. Another algorithm for finding the K shortest loopless network paths. In Proc. of 41st Mtg. Operations Research Society of America, volume 20, 1972. [20] U. Zwick. All pairs shortest paths using bridging sets and rectangular matrix multiplication. Journal of the ACM, 49(3):289-317, 2002.
plan further measurements to confirm this expectation. Note that the chart also shows that no single threshold is ideal for all the test cases: the best thresholds (5 and 6) give a runtime 2% worse than would be obtained by an optimal threshold choice for each experiment. 2. We have discovered an improved version of the algorithm that makes only two calls to the replacement paths subroutine after each new path is discovered. Currently, our algorithm makes three calls to the subroutine, plus one Dijkstra call. This change should improve the running time of our algorithm by about 40%.
References [1] A. Brander and M. Sinclair. A comparative study of jFf-shortest path algorithms. In Proc. of llth UK Performance Engineering Workshop, pages 370-379, 1995. [2] D. Eppstein. Finding the k shortest paths. SIAM J. Computing, 28(2):652-673, 1998. [3] B. L. Fox. fc-th shortest paths and applications to the probabilistic networks. In ORSA/TIMS Joint National Mtg., volume 23, page B263, 1975. [4] M. Fredman. New bounds on the complexity of the shortest path problem. SIAM Journal on Computing, 5:83-89, 1976. [5] M. Fredman and R. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34:596-615, 1987. [6] E. Hadjiconstantinou and N. Christofides. An efficient implementation of an algorithm for finding K shortest simple paths. Networks, 34(2):88-101, September 1999. [7] J. Hershberger and S. Suri. Vickrey prices and shortest paths: What is an edge worth? In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, pages 252-259, 2001. [8] J. Hershberger and S. Suri. Erratum to "Vickrey prices and shortest paths: What is an edge worth?". In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. [9] J. Hershberger, S. Suri, and A. Bhosle. On the difficulty of some shortest path problems. In Proc. 20th Sympos. Theoret. Aspects Comput. Sci., Lecture Notes Comput. Sci. Springer-Verlag, 2003. [10] W. Hoffman and R. Pavley. A method for the solution of the Nth best path problem. Journal of the ACM, 6:506-514, 1959. [11] N. Katoh, T. Ibaraki, and H. Mine. An efficient algorithm for k shortest simple paths. Networks, 12:411-427, 1982.
36
Efficient Exact Geometric Predicates for Delaunay Triangulations Olivier Devillers^
Sylvain Pion*
Abstract A time efficient implementation of the exact geometric computation paradigm relies on arithmetic niters which are used to speed up the exact computation of easy instances of the geometric predicates. Depending of what is called "easy instances", we usually classify filters as static or dynamic and also some in between categories often called semi-static. In this paper, we propose, hi the context of three dimensional Delaunay triangulations: — automatic tools for the writing of static and semistatic niters, — a new semi-static level of filtering called translation filter, — detailed benchmarks of the success rates of these filters and comparison with rounded arithmetic, long integer arithmetic and filters provided in Shewchuk's predicates [25]. Our method is general and can be applied to all geometric predicates on points that can be expressed as signs of polynomial expressions. This work is applied in the CGAL library [10]. 1 Introduction A geometric algorithm usually takes decisions based on some basic geometric questions called predicates. Numerical inaccuracy in the evaluation of geometric predicates is one of the main obstacles in implementing geometric algorithms robustly. Among the solutions proposed to solve this problem, the exact geometric computation paradigm is now recognized as an effective solution [27]. Computing the predicates exactly makes an algorithm robust, but also very slow if this is done by using some expensive exact arithmetic. The current way to speed up the algorithms is to use some rounded evaluation with certified error to answer safely and quickly the easy cases, and to use some expensive exact "Partially supported by the 1ST Programme of the EU as a Shared-cost RTD (FET Open) Project under Contract No IST2000-26473 (EGG - Effective Computational Geometry for Curves and Surfaces). tQlivier.DevillersQsophia.inria.fr INRIA SophiaAntipolis, BP 93, 06902 Sophia-Antipolis cedex, France. iSylvain.Pion6mpi-sb.mpg.de Max Planck Institut fur Informatik, Saarbriicken, Germany.
arithmetic only hi nearly degenerate situations. This approach, called arithmetic filtering, gives very good results hi practice [17, 7, 8]. Although what we propose is quite general, we focus now on the particular case which has been used to validate our ideas: the predicates for the three dimensional Delaunay triangulations. This work is implemented hi the CGAL library [10]. Many surface reconstruction algorithms [5, 1, 2] are based on Delaunay triangulations and we will take our point sets for benchmarking from that context. Predicates evaluation can take from 40% to almost 100% of the running tune depending of the kind of filters used, thus it is critical to optimize them. The predicates used in Delaunay algorithms are the orientation predicate which decides the orientation of four points and the in_sphere predicate which decides among five points if the fifth is inside the sphere passing through the four others. Like many other geometric predicates, those reduce to the evaluation of the sign of some polynomial P(x). A filter computes a rounded value for P(ar), and a certified bound for the rounding error. The filter is called static if the error is computed off-line based on hypotheses on the data, dynamic if the error is computed at run tune step by step in the evaluation of P(x) and semi-static if the error is computed at run-time by a simpler computation. In this paper, we propose, — to compute the off-line error in static and semi-static filter by an automatic analysis of the generic code of the predicate, — to use an almost static filter where the error bound is updated when the new data does not fit any longer the hypotheses, — a new semi-static level of filtering: the translation filter which starts by translating the data before the semi-static error computation and — detailed benchmarks on synthetic and real data providing evidence of the efficiency of our approach. The efficiency of static filters was proved by Devillers and Preparata [14] but their analysis was based on probabilistic hypotheses which does not usually apply to real data. Automatic code generation for exact predicates has been proposed [8, 23], but it was limited to dynamic filters [7] or static filters with strong hypotheses on the data [17]. Some existing techniques
37
using static filtering need hypotheses on the input co- be simplified to a 3x3 determinant and an initial set of ordinates, like a limited bit length [4], or requiring to subtractions: have fixed point values which may require truncation as a preprocessing step [10, 17]. Finally, we also compare running times with the simple floating point code (which is not robust), with a naive implementation of multi-precision arithmetic, and with the well known robust implementation of these with p'x — px — sx and so on for the y and z coordinates predicates by Jonathan Shewchuk [25]. and the points q and r. We use the C++ template mechanism hi order to 2 Our case study implement the orientation predicate generically, using 2.1 Algorithm Our purpose is to study the behavior only the algebraic formula above. This template can be of the predicates in the practical context of a real used with any type T which provides functions for the application, even if our results can be used for other subtraction, addition, multiplication and comparison. algorithms, we briefly present the one used in this paper. We will see how to use this code in different ways later. The algorithm used for the experiments is the Delaunay hierarchy [11], which uses few levels of Delaunay template triangulations of random samples. The triangulation is int updated incrementally inserting the points in a random orientation(T px, T py, T pz, T qx, T qy, T qz, T rx, T ry, T rz, T sx, T sy, T sz) order, when a new point is added, it is located using walking strategies [13] across the different levels of the { T psx=px-sx, psy=py-sy, psz=pz-sz; hierarchy, and then the triangulation is updated. The T qsx=qx-sx, qsy=qy-sy, qsz=qz-sz; location step uses the orientation predicate while the T rsx=rx-sx, rsy=ry-sy, rsz=rz-sz; update step relies on the in_sphere predicate. The randomized complexity of this algorithm is T ml = psx*qsy - psy*qsx; related to the expected size of the triangulation of a T m2 = psx*rsy - psy*rsx; sample, that is quadratic in the worst case, but subT m3 = qsx*rsy - qsy*rsx; quadratic with some realistic hypotheses [3, 16, 15], T det = ml*rsz - m2*qsz + m3*psz; practical inputs often give a linear output [19] and an if (det>0) return 1; O(nlogn) algorithmic complexity. if (det<0) return -1; The implementation is the one provided in return 0; CGAL [6, 26]. The design of CGAL allows to switch the predicates used by the algorithm and thus makes the experiments easy [21]. Similarly, the in_sphere predicate of 5 points *jP?<7> r ? s is the sign of a 5x5 determinant, which can 2.2 Predicates The two well-known predicates be simplified to the sign of the following 4x4 determineeded for Delaunay triangulation are: nant: — the orientation test, which tests the position of a point relative to the oriented plane defined by three other points, when they are not coplanar. — the in_sphere test, which, given four positively oriented points, decides whether a fifth point lies inside the circumscribing sphere of the four points, or not. We implement it similarly using a template funcIn this paper, we do not focus on degenerate cases. tion, the 4x4 determinant being computed using the Dealing with 4 coplanar or 5 cospherical points can be dynamic programming method (i.e. first compute the handled in the algorithm or by standard perturbation minors of rank 2, then use them to compute the mischemes. This can possibly involve other predicates (e.g. 
nors of rank 3, then use them in turn to compute the in-circle test for coplanar points) but these degenerate determinant). evaluations are rare enough to be neglected in the whole computation tune. 2.3 Data sets For the experiments, we have used the The orientation predicate of the four points following data sets (see Figures 1, 2, 3): p, q, r, s boils down, when using Cartesian coordinates, — (R5) — 500,000 random points uniformly distributed to the sign of the following 4x4 determinant, which can in a cube (the coordinates have been generated by the 38
drand48 C function). — (R20) — 2,000,000 random points uniformly distributed in a cube. — (E) — 500,000 random points almost uniformly distributed on the surface of an ellipsoid. — (M) — 525,296 points on the surface of a molecule. — (B) — 542,548 points on the surface of a Buddha statue (data from Stanford scanning repository). — (D) — 49,787 pouits on the surface of a dryer handle (data provided by Dassault Systemes). The way the scanning was done has produced a lot of coplanar points which exercises the robustness a lot. Experiments have all been performed on a Pentium III PC at 1 GHz, with 1 GB of memory, the compiler used is GCC 2.95.3. We have gathered some general data on the computation of the 3D Delaunay triangulations of these sets of points in Table 1.
Figure 1: Points on the surface of a molecule (M).
3 Simple floating point computation The naive method consists of using floating point arithmetic in order to evaluate the predicates. Practically, this means using the C++ built-in type double as T in the generic predicates described above. This does not give a guaranteed result due to roundoff errors, but is the most efficient method when it works. The triangulation algorithm happened to crash only on data set (D). Also note that even if it doesn't crash, the result may not be exactly the Delaunay triangulation of the points, so some mathematical properties may not be fulfilled, and this may have bad consequences on the later use of the triangulation. We first measured the number of times the predicates gave a wrong result when computed with floating point, by comparing its result with some exact computation (see Table 2), with the algorithm using the exact result to continue its path. Figure 2: Points on the surface of a Buddha statue (B). The orientation predicate is used for walking in the triangulation during the point location. Depending on the walking strategy [13] and the particular situation, orientation failures often have no consequences but it may cause a loop or return a wrong tetrahedron as a result, which could result hi an incorrect Delaunay triangulation. The reason why the in_sphere predicate fails more often is due to its larger algebraic complexity, which induces larger roundoff errors. The failure of the in_sphere predicate will definitely create a nonDelaunay triangulation after the insertion of the point. We also note that the random distribution does not incur any failure, it is due to a better conditioning Figure 3: Points on the surface of a dryer handle (D). of the computed determinants. This fact questions the relevance of theoretical studies based on random distributions. 39
II # points # tetrahedra # orientation calls # in.sphere calls MB of memory
R5 500,000 3,371,760 39,701,472 22,181,509 153
R20 2,000,000 13,504,330 166,623,287 89,033,404 574
E 500,000 3,241,972 56,340,962 13,945,582 148
M 525,296 3,588,527 44,280,741 23,697,290 161
B 542,548 3,864,194 55,201,542 25,073,049 172
D 49,787 321,148 3,810,257 1,924,558 25
Table 1: Informations on the computation of the Delaunay triangulations of the data sets. IH8 % of wrong results for orientation 0% % of wrong results for injsphere 0%
R20 0% 0%
E 0% 0%
M 0.0001% 0.005%
B 0.002% 0.0005%
D 0.17% 0.8%
Table 2: Wrong results given by the floating point evaluation of the predicates.
Running times for the computations can be found hi Table 6, they provide a lower limit, and the goal is to get as close as possible from this limit with exact methods. The percentage of tune spent in the predicates can be roughly evaluated to 40%.
over the previous naive exact method. We have used the implementation of this method provided by CGAL through the FilterecLexact functionality. It is still a very general answer to the filtering approach. When the interval arithmetic is not precise enough, we rely on HP-Float in order to decide the exact result. We can reuse the template version of the predicate over both the interval arithmetic number type, as well as MP_Float, and we use the C++ exception mechanism to notify filter failures : the comparison operators of the interval arithmetic class raise an exception in case of overlapping intervals. So we basically use the following code, which is independent of the particular content of the algebraic formula of the predicate:
4 Naive exact multi-precision arithmetic The easiest solution to evaluate exactly the predicates is to use an exact number type. Given that, in order to evaluate exactly the sign of a polynomial, it is enough to use multi-precision floating point arithmetic, which guarantees exact additions, subtractions and multiplications. These number types are provided by several libraries such as GMP [20]. CGAL also provides such a data type on its own (via the HP-Float class), int dynamic_filter_orientation( which is efficient enough for numbers of reasonable bit double px, double py, double pz, length which is the case in our predicates. Again, it is double qx, double qy, double qz, enough to pass the MP_Float type as the type T of the double rx, double ry, double rz, double sx, double sy, double sz) template functions described above. Here we notice that the naive exact method is try approximately 70 times slower (see Table 6) compared to floating point, which makes it hardly usable for most return orientation applications, at least if we use it naively. However, exact (px, py, pz, qx, qy, qz, evaluation is acceptable as the last stage of a filtered rx, ry, rz, sx, sy, sz); evaluation of the predicate, where performance doesn't matter so much. GMP does not give better result in that catch ( . . . ) context, the Delaunay triangulation of 500,000 random points with integer coordinates on 31 bits needs about return orientation<MP_Float> 2000 seconds, which is of the same order as the 3000 (px, py, pz, qx, qy, qz, rx, ry, rz, sx, sy, sz); seconds we get with MP_Float and very far from the running time using floating point arithmetic. 5
General dynamic filter based on interval Table 3 shows the number of times the interval arithmetic arithmetic is not able to decide the correct result for a As we will show, interval arithmetic [7, 24] allows us to achieve an already quite important improvement predicate, and thus needs to call a more precise version. For the randomly generated data sets, there is not a 40
single filter failure. For the other data sets, if we compare these numbers to the number of wrong results given by floating point (Table 2), we can see that this filter doesn't require an expensive evaluation too often, about three times the real failures of the floating point evaluation. Moreover, if we compare these numbers to the total number of calls, we obtain that the filter fails with a probability of 1/200,000 for the orientation predicate, and 1/13,000 for in_sphere, for the M data set, which is a high success rate. For the D data set, we obtain 1/254 and 1/77. Given that the exact computation with MP.Float is 70 times slower than floating point, its global cost at run tune is negligible for the M data set, since its called rarely, and what dominates is therefore the evaluation using interval arithmetic. Table 6 shows that the running time overhead compared to floating point is now approximately on the order of 3.4. This is far better than the previous naive exact multi-precision method, but still quite some overhead that we might want to remove. 6 Static filter variants The previous method gives a very low failure rate, which is good, but for the common case it is still more than 3 times slower compared to the pure floating point evaluation. The situation can be improved by using static filter techniques [17, 8]. This kind of filter may be used before the interval computation, but it usually needs hypotheses on the data such as a global upper bound on the coordinates. Given that we need to evaluate the sign of a known polynomial, if we know some bound b on the input coordinates, then we can derive a bound e(6) on the total round-off error that the floating point evaluation of this polynomial value will introduce at worst. At the end, if the value computed using floating point has a greater absolute value than e(6), then one can decide exactly the sign of the result. We explain in the next section how to compute e(6), but let us just give here the result for the orientation predicate: c(&) = 3.908 x 10~14 x 63. We can apply this remark in two different ways. First, if we know a unique global bound b on all the input point coordinates, then we need to compute e(&) only once for the whole algorithm, and e(6) can be considered a constant. This is traditionally called a static filter. Sometimes it is not possible to know a global bound, at compile time or even at run time, or it is simply inconvenient to find one for some dynamic algorithms. In that situation, we introduce the almost static filter in which the global bound b on the data is updated for each point added to the triangulation, and e(6) 41
This is relatively cheap, and completely amortized, since inserting a point in a triangulation costs hundreds of calls to predicates (see Table 1). Notice also that this approach departs from the classical filtering scheme where only the code of the predicates is modified; here we also need to change the point constructor in order to maintain b and ε(b).

Since b grows as points are inserted, b may be largely over-estimated for some specific instance of the predicate. Thus, if the almost static filter fails, we use a second stage which, at each call, computes a bound b' from the actual arguments of the predicate and computes ε(b') from it. This one is usually called a semi-static filter.

Table 6 shows that these methods behave far better than the interval arithmetic filter alone in terms of running time, on all data sets. We now take only 10% to 70% more time than the floating point version. On the other hand, Table 4 shows that these filters fail much more often than interval arithmetic: the static filter fails between 6% and 75% of the time for in_sphere, so we had better keep all stages to achieve the best overall running time. Here we also notice an important difference between orientation and in_sphere, the latter failing much more often, even on the random data sets. We explain this behavior by the fact that the points on which the predicates are called are on average closer to each other in the in_sphere case, due to the way the triangulation algorithm works.

Automatic error bound computation

Evaluating an error bound ε(b) by hand is tedious to do for each different predicate, but it is easy to compute automatically, given the polynomial expression by its template code, by following the rules of the IEEE 754 standard on floating point computations. First, we remark that the polynomials we manipulate are homogeneous of some degree d, which means that we can compute an error bound of the form ε(b) = ε(1) · b^d, where ε(1) is a constant, as it depends only on the way the polynomial is computed. Computing ε(b) from ε(1) is therefore quite cheap once b is known, and b is not very expensive to compute either, since it is the maximum of the absolute values of the input coordinates.

In order to compute ε(1), we wrote a class that allows us to compute it easily for any polynomial expression, and since it has the interface of a number type, we can directly pass it through the code of the template predicates, again via the C++ template mechanism. It automatically propagates the error bounds through each addition, subtraction and multiplication, and when a comparison is attempted, it just stops and prints the error bound ε(1). More precisely, we instantiate the predicate with a special number type used only for error bound computation. This number type has two fields: the upper bound field x.m and the error field x.e. The default constructor creates values with bound 1 and error 0, and addition and multiplication are overloaded such that

    (x+y).m = x.m + y.m
    (x+y).e = x.e + y.e + ulp((x+y).m)/2
    (x*y).m = x.m * y.m
    (x*y).e = x.e*y.m + y.e*x.m + ulp((x*y).m)/2

and the comparison operator is overloaded to print the current bound on the error (ulp is the "unit in the last place" function, which gives the value of the smallest bit of the mantissa of a floating point number).
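A minimal sketch of such an error-bound-computing number type is given below; it follows the rules stated above but is not the class actually used in the paper (the names are ours). The ulp of a double can be obtained portably via nextafter.

    #include <cmath>
    #include <cstdio>

    // Number type used only at predicate design time: it carries an upper
    // bound m on the magnitude of the value and a bound e on the accumulated
    // round-off error, following the propagation rules given in the text.
    struct Error_bound {
        double m, e;
        Error_bound() : m(1.0), e(0.0) {}                 // inputs: bound 1, error 0
    };

    inline double half_ulp(double x) {
        return (std::nextafter(x, HUGE_VAL) - x) / 2.0;   // ulp(x)/2
    }

    inline Error_bound operator+(Error_bound x, Error_bound y) {
        Error_bound r;
        r.m = x.m + y.m;
        r.e = x.e + y.e + half_ulp(r.m);
        return r;
    }
    // operator- propagates the same bounds as operator+.

    inline Error_bound operator*(Error_bound x, Error_bound y) {
        Error_bound r;
        r.m = x.m * y.m;
        r.e = x.e * y.m + y.e * x.m + half_ulp(r.m);
        return r;
    }

    inline bool operator<(Error_bound x, Error_bound) {
        std::printf("error bound eps(1) = %.17g\n", x.e); // report the bound here
        return false;
    }

Instantiating the generic orientation predicate with this type and triggering one comparison reports ε(1); multiplying it by b³ then gives ε(b) for the degree-3 orientation polynomial.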
          % of failures for orientation    % of failures for in_sphere
R5        0%                               0%
R20       0%                               0%
E         0%                               0%
M         0.0005%                          0.007%
B         0.005%                           0.001%
D         0.4%                             1.35%

Table 3: Filter failures for the interval arithmetic filter.
                                       R5     R20          E         M        B      D
orientation
  almost static filter failures        0%     0.000003%    0.001%    0.015%   0.1%   7.3%
  semi static filter failures          0%     0%           0.0002%   0.01%    0.1%   7.3%
in_sphere
  almost static filter failures        7%     37%          39%       16%      74%    17%
  semi static filter failures          0.6%   4.6%         20%       4%       37%    7.5%

Table 4: Static and semi-static filter failures.
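As an illustration of the almost static / semi-static stages discussed in Section 6, the following sketch checks the orientation sign against the bound ε(b') = 3.908 × 10⁻¹⁴ · b'³ quoted there and falls back to the dynamic filter otherwise. It is only a sketch under that stated bound; the helper names and the way b' is extracted from the twelve input coordinates are our own choices.

    #include <cmath>
    #include <algorithm>

    // Assumed to exist elsewhere: the double-precision determinant evaluation
    // and the interval + MP_Float fallback of Section 5.
    double orientation_det(double px, double py, double pz, double qx, double qy,
                           double qz, double rx, double ry, double rz,
                           double sx, double sy, double sz);
    int dynamic_filter_orientation(double, double, double, double, double, double,
                                   double, double, double, double, double, double);

    int semi_static_filter_orientation(
        double px, double py, double pz, double qx, double qy, double qz,
        double rx, double ry, double rz, double sx, double sy, double sz)
    {
        double det = orientation_det(px, py, pz, qx, qy, qz, rx, ry, rz, sx, sy, sz);
        // b' : bound on the magnitude of the coordinates of this particular call.
        const double c[12] = {px, py, pz, qx, qy, qz, rx, ry, rz, sx, sy, sz};
        double b = 0.0;
        for (int i = 0; i < 12; ++i) b = std::max(b, std::fabs(c[i]));
        double eps = 3.908e-14 * b * b * b;   // eps(b') for the degree-3 polynomial
        if (det >  eps) return  1;
        if (det < -eps) return -1;
        // Filter failure: resort to the interval + MP_Float stages.
        return dynamic_filter_orientation(px, py, pz, qx, qy, qz, rx, ry, rz, sx, sy, sz);
    }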
7  Translation filter

We now make the following observation: the predicate code begins by translating the input points so that point s goes to the origin. In the new frame, the algebraic expression is simplified, which makes the static error bound slightly better. But most importantly, the algorithm often calls the predicates with points which are close to each other and thus have small coordinates in the new frame. This makes a semi-static filter for this new, simplified predicate very efficient, since the bound b' is often small. Translating the data before applying the simplified predicate, without leaving the exact geometric computation paradigm, requires that the translation be done exactly. Fortunately, this is often the case for the same reason as previously: if the points are closer to the new origin than to the previous one, the translation is computed exactly by the floating point arithmetic.

This comes from the following property of floating point arithmetic: given two floating point numbers a and b, if they differ by at most a factor of two, i.e. a/2 ≤ b ≤ 2a when they are positive (and similarly when they are negative), then their subtraction is computed exactly. Testing that the initial subtractions have not introduced any round-off error must be done as a first step. We can quickly determine whether the subtraction c = a − b was exact by testing that the two following equalities hold: a == b+c and b == a−c. This can be shown using the IEEE 754 properties.

Table 5 shows that, among the semi-static filter failures for the in_sphere predicate, only a low percentage had inexact subtractions. The new, smaller error bound allowed us to conclude in all cases for (R5), (R20) and (E), in almost all cases for (M) and (B), and in 90% of the cases for the difficult point set (D). This shows that this filtering stage is also very effective.

8  Conclusion

Table 6 gives the running time of all the different combinations of filters we have tried, and shows their efficiency by a comparison with Shewchuk's predicates [25]. We have also considered general approaches which do not involve changing the code of the predicates at all: they still use some kind of dynamic filtering, but they pay a price for dynamic memory allocation on almost every arithmetic operation (since they store expression DAGs). These approaches are the Expr class of CORE [22] (a beta version of 1.6), the real class [9] of LEDA (version 4.2), and the Lazy_exact_nt<MP_Float> class of CGAL.

In this paper we gave a detailed analysis of the efficiency of various filter techniques to compute geometric
                                             R5      R20     E      M        B        D
semi static filter failures                  0.6%    4.6%    20%    4%       37%      7.5%
inexact diffs among semi static failures     0.06%   0.04%   21%    0.3%     2%       7%
translation filter failures                  0%      0%      0%     0.007%   0.003%   0.8%

Table 5: Statistics about the efficiency of the translation filter stage for the in_sphere predicate.
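The exactness test for the initial translation described in Section 7 is cheap; a minimal sketch under the IEEE 754 property quoted there:

    // Returns true if the double subtraction c = a - b was performed exactly,
    // using the two equality tests of Section 7 (valid under IEEE 754).
    inline bool subtraction_is_exact(double a, double b) {
        double c = a - b;
        return (a == b + c) && (b == a - c);
    }

    // The translation filter first checks that all coordinate differences of
    // the translation by point s pass this test before trusting the
    // semi-static bound computed in the translated frame.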
                                                                   R5      R20      E       M       B       D
double                                                             40.6    176.5    41.0    43.7    50.3    loops
MP_Float                                                           3,063   12,524   2,777   3,195   3,472   214
Interval + MP_Float                                                137.2   574.1    133.6   144.6   165.1   15.8
semi static + Interval + MP_Float                                  51.8    233.9    61.0    59.1    93.1    8.9
almost static + semi static + Interval + MP_Float                  44.4    210.1    55.0    52.0    87.2    8.0
almost static + translation + Interval + MP_Float                  43.6    198.8    52.2    48.8    64.3    7.5
almost static + semi static + translation + Interval + MP_Float    43.8    195.8    48.6    48.7    63.6    7.7
Shewchuk's predicates                                              57.9    249.2    57.5    62.8    71.7    7.2
CORE Expr                                                          570     2431     3520    1355    9600    173
LEDA real                                                          682     2784     640     742     850     125
Lazy_exact_nt                                                      705     2750     631     726     820     67

Table 6: Timings in seconds for the different computation methods of the triangulations.
predicates on points. We have introduced an almost static filter which reaches the efficiency of a static filter without the drawback of imposing hypotheses on the data. We also identified a new filtering step taking into account the initial translation which is usually performed in the predicates in order to simplify their algebraic expression. Our benchmarks show that our scheme is effective and compares favorably to previous methods for the case of 3D Delaunay triangulations.

We actually simplified the text of the paper by mentioning only a single bound b in the static filter variants, but we in fact experimented with per-coordinate bounds b_x, b_y, b_z, and this gave slightly better results, especially for data sets whose points can sometimes lie in a plane parallel to a coordinate plane, as this gives the semi-static filter a smaller error bound.

In the future, we plan to make these predicates directly available in CGAL, as well as to apply these methods to more predicates, such as those needed to compute 2D and 3D regular triangulations. Some work has already been done for predicates on circle arcs [12]. We also propose automatic tools based on C++ template techniques to compute the error bounds involved in all the proposed filters directly from the generic code of the predicate; this avoids painful code duplication and manual error bound computation. We have preferred this use of computer-aided predicate design to a complete code generation tool [18, 8] for its simplicity.
Acknowledgments

The authors would like to thank the developers of CGAL who made the 3D triangulation code available, as well as Monique Teillaud for helpful discussions.
References

[1] N. Amenta, M. Bern, and M. Kamvysselis. A new Voronoi-based surface reconstruction algorithm. In Proc. SIGGRAPH '98, Computer Graphics Proceedings, Annual Conference Series, pages 415-421, July 1998.
[2] N. Amenta, S. Choi, T. K. Dey, and N. Leekha. A simple algorithm for homeomorphic surface reconstruction. In Proc. 16th Annu. ACM Sympos. Comput. Geom., pages 213-222, 2000.
[3] D. Attali and J.-D. Boissonnat. Complexity of the Delaunay triangulation of points on polyhedral surfaces. In Proc. 7th ACM Symposium on Solid Modeling and Applications, 2002.
[4] F. Avnaim, J.-D. Boissonnat, O. Devillers, F. P. Preparata, and M. Yvinec. Evaluation of a new method to compute signs of determinants. In Proc. 11th Annu. ACM Sympos. Comput. Geom., pages C16-C17, 1995.
[5] J.-D. Boissonnat and F. Cazals. Smooth surface reconstruction via natural neighbour interpolation of distance functions. In Proc. 16th Annu. ACM Sympos. Comput. Geom., pages 223-232, 2000.
[6] J.-D. Boissonnat, O. Devillers, S. Pion, M. Teillaud, and M. Yvinec. Triangulations in CGAL. Comput. Geom. Theory Appl., 22:5-19, 2002.
[7] H. Brönnimann, C. Burnikel, and S. Pion. Interval arithmetic yields efficient dynamic filters for computational geometry. In Proc. 14th Annu. ACM Sympos. Comput. Geom., pages 165-174, 1998.
[8] C. Burnikel, S. Funke, and M. Seel. Exact geometric computation using cascading. Internat. J. Comput. Geom. Appl., 11:245-266, 2001.
[9] C. Burnikel, K. Mehlhorn, and S. Schirra. The LEDA class real number. Technical Report MPI-I-96-1-001, Max-Planck Institut Inform., Saarbrücken, Germany, Jan. 1996.
[10] The CGAL Manual, 2002. Release 2.4.
[11] O. Devillers. Improved incremental randomized Delaunay triangulation. In Proc. 14th Annu. ACM Sympos. Comput. Geom., pages 106-115, 1998.
[12] O. Devillers, A. Fronville, B. Mourrain, and M. Teillaud. Algebraic methods and arithmetic filtering for exact predicates on circle arcs. In Proc. 16th Annu. ACM Sympos. Comput. Geom., pages 139-147, 2000.
[13] O. Devillers, S. Pion, and M. Teillaud. Walking in a triangulation. In Proc. 17th Annu. ACM Sympos. Comput. Geom., pages 106-114, 2001.
[14] O. Devillers and F. P. Preparata. A probabilistic analysis of the power of arithmetic filters. Discrete Comput. Geom., 20:523-547, 1998.
[15] R. Dwyer. On the convex hull of random points in a polytope. J. Appl. Probab., 25(4):688-699, 1988.
[16] J. Erickson. Nice point sets can have nasty Delaunay triangulations. In Proc. 17th Annu. ACM Sympos. Comput. Geom., pages 96-105, 2001.
[17] S. Fortune and C. J. Van Wyk. Static analysis yields efficient exact integer arithmetic for computational geometry. ACM Trans. Graph., 15(3):223-248, July 1996.
[18] S. Fortune and C. Van Wyk. LN User Manual. AT&T Bell Laboratories, 1993.
[19] M. J. Golin and H.-S. Na. On the average complexity of 3D-Voronoi diagrams of random points on convex polytopes. In Proc. 12th Annu. ACM Sympos. Comput. Geom., pages 127-135, 2000.
[20] T. Granlund. GMP, the GNU multiple precision arithmetic library. http://www.swox.com/gmp/.
[21] S. Hert, M. Hoffmann, L. Kettner, S. Pion, and M. Seel. An adaptable and extensible geometry kernel. In Proc. Workshop on Algorithm Engineering, volume 2141 of Lecture Notes Comput. Sci., pages 79-90. Springer-Verlag, 2001.
[22] V. Karamcheti, C. Li, I. Pechtchanski, and C. Yap. A core library for robust numeric and geometric computation. In Proc. 15th ACM Symp. on Computational Geometry, 1999.
[23] A. Nanevski, G. Blelloch, and R. Harper. Automatic generation of staged geometric predicates. In International Conference on Functional Programming, Florence, Italy, 2001. Also Carnegie Mellon CS Tech Report CMU-CS-01-141.
[24] S. Pion. De la géométrie algorithmique au calcul géométrique. Thèse de doctorat en sciences, Université de Nice-Sophia Antipolis, France, 1999. TU-0619.
[25] J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete Comput. Geom., 18(3):305-363, 1997.
[26] M. Teillaud. Three dimensional triangulations in CGAL. In Abstracts 15th European Workshop Comput. Geom., pages 175-178. INRIA Sophia-Antipolis, 1999.
[27] C. Yap. Towards exact geometric computation. Comput. Geom. Theory Appl., 7(1):3-23, 1997.
Computing Core-Sets and Approximate Smallest Enclosing HyperSpheres in High Dimensions*

Piyush Kumar†
Joseph S. B. Mitchell‡        E. Alper Yıldırım§
Abstract

We study the minimum enclosing ball (MEB) problem for sets of points or balls in high dimensions. Using techniques of second-order cone programming and "core-sets", we have developed (1+ε)-approximation algorithms that perform well in practice, especially for very high dimensions, in addition to having provable guarantees. We prove the existence of core-sets of size O(1/ε), improving the previous bound of O(1/ε²), and we study empirically how the core-set size grows with dimension. We show that our algorithm, which is simple to implement, results in fast computation of nearly optimal solutions for point sets in much higher dimension than previously computable using exact techniques.
1  Introduction

We study the minimum enclosing ball (MEB) problem: Compute a ball of minimum radius enclosing a given set of objects (points, balls, etc.) in R^d. The MEB problem arises in a number of important applications, often requiring that it be solved in relatively high dimensions. Applications of MEB computation include gap tolerant classifiers [8] in Machine Learning, tuning Support Vector Machine parameters [10], Support Vector Clustering [4, 3], fast approximate farthest neighbor queries [17], k-center clustering [5], testing of radius clustering for k = 1 [2], the approximate 1-cylinder problem [5], computation of spatial hierarchies (e.g., sphere trees [18]), and other applications [13].

In this paper, we give improved time bounds for approximation algorithms for the MEB problem, applicable to an input set of points or balls in high dimensions. We prove a time bound of O(nd/ε + (1/ε^4.5) log(1/ε)), which is based on an improved bound of O(1/ε) on the size of "core-sets" as well as the use of second-order cone programming (SOCP) for solving subproblems. We have performed an experimental investigation to determine how the core-set size tends to behave in practice, for a variety of input distributions. We show that substantially larger instances of the MEB problem, both in terms of the number n of input points and the dimension d, can be solved (1+ε)-approximately, with very small
values of ε > 0, compared with the best known implementations of exact solvers. We also demonstrate that the sizes of the core-sets tend to be much smaller than the worst-case theoretical upper bounds.

Preliminaries. We let B_{c,r} denote a ball of radius r centered at a point c ∈ R^d. Given an input set S = {p_1, ..., p_n} of n objects in R^d, the minimum enclosing ball MEB(S) of S is the unique minimum-radius ball containing S. (Uniqueness follows from results of [14, 33]; if B_1 and B_2 are two different smallest enclosing balls for S, then one can construct a smaller ball containing B_1 ∩ B_2 and therefore containing S.) The center, c*, of MEB(S) is often called the 1-center of S, since it is the point of R^d that minimizes the maximum distance to points in S. We let r* denote the radius of MEB(S). A ball B_{c,(1+ε)r} is said to be a (1+ε)-approximation of MEB(S) if r ≤ r* and S ⊆ B_{c,(1+ε)r}. Throughout this paper, S will be either a set of points in R^d or a set of balls. We let n = |S|. Given ε > 0, a subset X ⊆ S is said to be a core-set of S if B_{c,(1+ε)r} ⊇ S, where B_{c,r} = MEB(X); in other words, X is a core-set if an expansion by a factor (1+ε) of its MEB contains S. Since X ⊆ S, r ≤ r*; thus, the ball B_{c,(1+ε)r} is a (1+ε)-approximation of MEB(S).

Related Work. For small dimension d, the MEB problem can be solved in O(n) time for n points using the fact that it is an LP-type problem [21, 14]. One of the best implementable solutions for computing the MEB exactly in moderately high dimensions is given by Gärtner and Schönherr [16]; the largest instance of MEB they solve is d = 300, n = 10000 (in about 20 minutes on their platform). In comparison, the largest instance we solve (1+ε)-approximately is d = 1000, n = 100000, ε = 10⁻³;

* The code associated with this paper can be downloaded from http://www.compgeom.com/meb/. This research was partially supported by a DARPA subcontract from HRL Laboratories and grants from Honda Fundamental Research Labs, NASA Ames Research (NAG2-1325), NSF (CCR-9732220, CCR-0098172), and Sandia National Labs.
† Stony Brook University, [email protected]. Part of this work was done while the author was visiting MPI-Saarbrücken.
‡ Stony Brook University, [email protected].
§ Stony Brook University, [email protected].
in this case the virtual memory was running low on the system¹. Another implementation of an exact solver is based on the algorithm of Gärtner [15]; this code is part of the CGAL² library. For large dimensions, our approximation algorithm is found to be much faster than this exact solver. We are not aware of other implementations of polynomial-time approximation schemes for the MEB problem.

Independently from our work, the MEB problem in high dimensions was also studied in [33]. The authors consider two approaches, one based on a reformulation as an unconstrained convex optimization problem and another based on a Second Order Cone Programming (SOCP) formulation. Similarly, four algorithms (including a randomized algorithm) are compared in [31] for the computation of the minimum enclosing circle of circles in the plane. Both studies reveal that solving MEB using a direct SOCP formulation suffers from memory problems as the dimension, d, and the number of points, n, increase. This is why we have worked to combine SOCP with core-sets in designing a practical MEB method.

In a forthcoming paper of Bădoiu and Clarkson [6], the authors have independently also obtained an upper bound of O(1/ε) on the size of core-sets and have, most recently [7], proved a worst-case tight upper bound of ⌈1/ε⌉. Note that the worst-case upper bound does not apply to our experiments, since in almost all our experiments the dimension d satisfies d < 1/ε. The worst-case upper bound of [6, 7] only applies to the case when d ≥ 1/ε. Our experimental results on a wide variety of input sets show that the core-set size is smaller than min(1/ε, d + 1) (see Figure 2).

Bădoiu et al. [5] introduced the notion of core-sets and their use in approximation algorithms for high-dimensional clustering problems. In particular, they give an O(dn/ε² + (1/ε^10) log(1/ε))-time (1+ε)-approximation algorithm based on their upper bound of O(1/ε²) on the size of core-sets; the upper bound on the core-set size is remarkable in that it does not depend on d. In comparison, our time bound (Theorem 3.2) is O(nd/ε + (1/ε^4.5) log(1/ε)).
Outline of paper. We first show in Section 2 how to use second-order cone programming to solve the MEB problem in O(√n d²(n + d) log(1/ε)) arithmetic operations. This algorithm is specially suited for problems in which n is small and d is large; thus, we study algorithms to compute core-sets in Section 3, in an effort to select a small subset X, a core-set, that is sufficient for approximation purposes. This section includes our proof of the new upper bound of O(1/ε) on the size of core-sets. Section 4 is devoted to a discussion of the experiments and of the results obtained with our implementation.

2  SOCP Formulation

The minimum enclosing ball (MEB) problem can be formulated as a second-order cone programming (SOCP) problem. SOCP can be viewed as an extension of linear programming in which the nonnegative orthant is replaced by the second-order cone (also called the "Lorenz cone" or the "quadratic cone"), defined as

    K = { (x, t) ∈ R^k × R : ‖x‖ ≤ t }.

Therefore, SOCP is essentially linear programming over an affine subset of products of second-order cones. Recently, SOCP has received a lot of attention from the optimization community due to its applications in a wide variety of areas (see, e.g., [20, 1]) and due also to the existence of very efficient algorithms to solve this class of optimization problems. In particular, any SOCP problem involving n second-order cones can be solved within any specified additive error ε > 0 in O(√n log(1/ε)) iterations by interior-point algorithms [22, 26]. The MEB problem can be formulated as an SOCP problem as

    min_{c,r} r    subject to    ‖c − c_i‖ ≤ r − r_i,    i = 1, ..., n,

where c_1, ..., c_n and r_1, ..., r_n are the centers and the radii of the input set S ⊂ R^d, respectively, and c and r are the center and the radius of the MEB, respectively. (Note that the formulation reduces to the usual MEB problem for point sets if r_i = 0 for i = 1, ..., n.) By introducing slack variables, the MEB problem can be reformulated in (dual) standard form with the constraints (y_i, s_i) ∈ K, i = 1, ..., n, where y denotes the n-dimensional vector whose components are given by y_1, ..., y_n. The Lagrangian dual is given in (primal) standard form in terms of a multiplier vector

1. This instance took approximately 3 hours to solve.
2. http://www.cgal.org
α := (α_1, ..., α_n)^T.
The most popular and effective interior-point methods are the primal-dual path-following algorithms (see, e.g., Nesterov and Todd [23, 24]). Such algorithms generate interior points for the primal and dual problems that follow the so-called central path, which converges to a primal-dual optimal solution in the limit. The major work per iteration is the solution of a linear system involving a (d+1) × (d+1) symmetric positive definite matrix (see, e.g., [1]). For the MEB problem, the matrix in question can be computed using O(nd²) basic arithmetic operations (flops), and its Cholesky factorization can be carried out in O(d³) flops. Therefore, the overall complexity of computing an approximation, with additive error at most ε, to the MEB problem with an interior-point method is O(√n d²(n + d) log(1/ε)). In practice, we stress that the number of iterations seems to be O(1), or very weakly dependent on n (see, for instance, the computational results with SDPT3 in [30]).

The worst-case complexity estimate reveals that the direct application of interior-point algorithms is not computationally feasible for large-scale instances of the MEB problem, due to excessive memory requirements. In [33], the largest instance solved by an interior-point solver consists of 1000 points in 2000 dimensions and requires over 13 hours on their platform. However, large-scale instances can still be handled by an interior-point algorithm if the number of points n can somehow be decreased. This can be achieved by a filtering approach, in which one eliminates points that are guaranteed to be in the interior of the MEB, or by selecting a subset of points, solving a smaller problem, and iterating until the computed MEB contains all the points. The latter approach is simply an extension of the well-known column generation approach initially developed for solving large-scale linear programs that have many fewer constraints than variables. The MEB problem formulated in the primal standard form as above precisely satisfies this property, since n ≫ d for instances of interest in this paper. We use the column generation approach to be able to solve large-scale MEB instances. The success of such an approach depends on the following factors:

▷ Initialization: The quality of the initial core-set is crucial, since a good approximation would lead to fewer updates. Furthermore, a small core-set with a good approximation would yield MEB instances with relatively few points that can efficiently be solved by an interior-point algorithm.

▷ Subproblems: The performance of a column generation approach is closely related to the efficiency with which each subproblem can be solved. We use the state-of-the-art interior-point solver SDPT3 [29] in our implementation.

▷ Core-set Updates: An effective approach should update the core-set in a way that will minimize the number of subsequent updates.

In the following sections, we describe our approach in more detail in light of these three factors.

3  Using Core-Sets for Approximating the MEB

We consider now the problem of computing the MEB of a set S = {B_1, B_2, ..., B_n} of n balls in R^d. One can consider the MEB of points to be the special case in which the radius of each ball is zero. We note that computing the MEB of balls is an LP-type problem [21, 14]; thus, for fixed d, it can be computed in O(n) time, where the constant of proportionality depends exponentially on d. Our goal is to establish the existence of small core-sets for the MEB of balls and then to use this fact, in conjunction with SOCP, to compute an approximate MEB of balls quickly, both in theory and in practice. We begin with a lemma that generalizes a similar result known for the MEB of points [5]:

LEMMA 3.1. Let B_{c,r} be the MEB of the set of balls S = {B_1, B_2, ..., B_n} in R^d, where n ≥ d + 1. Then any closed halfspace passing through c contains at least one point in B_i, for some i ∈ {1, ..., n}, at distance r from c.

Proof. We can assume that each ball of S touches ∂B_{c,r}; any ball strictly interior to B_{c,r} can be deleted without changing its optimality. Further, it is easy to see that there exists a subset S' ⊆ S, with |S'| ≤ d + 1, such that MEB(S') = MEB(S), and that c must lie inside the convex hull, Q, of the centers of the balls of S'; see Fischer [14]. Consider a halfspace, H, defined by a hyperplane through c. Since c ∈ Q, the halfspace H must contain a vertex of Q, say c', the center of a ball B_{c',r'}. Let p be the point where the ray cc' exits B_{c,r} and let q be the point where cc' exits B_{c',r'}. Then ‖cp‖ = r. By the triangle inequality, all points of B_{c',r'} \ {q} are at distance from c at most that of q; thus, since B_{c',r'} touches ∂B_{c,r}, we know that p = q. ∎

Our algorithm for computing MEB(S) for a set S of n points or balls begins with an enclosing ball of S based on an approximate diameter of S. If S is a set of points, one can compute a (1 − ε)-approximation of the diameter, δ, yielding a pair of points at distance at least (1 − ε)δ; however, the dependence on dimension d is exponential [9]. For our purposes, it suffices to obtain any constant factor approximation of the diameter δ,
so we choose to use the following simple O(dn)-time method, shown by Egecioglu and Kalantari [11] to yield a (1/√3)-approximate diameter of a set S of points: pick any p ∈ S; find a point q ∈ S that is furthest from p; find a point q' ∈ S that is furthest from q; output the pair (q, q'). It is easy to see that the same method applies to the case in which S is a set of balls, yielding again a (1/√3)-approximation. (Principal component analysis can be used to obtain the same approximation ratio for points, but does not readily generalize to the case of balls.)

Algorithm 1  Outputs a (1+ε)-approximation of MEB(S) and an O(1/ε²)-size core-set
Require: Input set of points S ⊂ R^d, parameter ε > 0, subset X_0 ⊆ S
 1: X ← X_0
 2: loop
 3:   Compute B_{c,r} = MEB(X).
 4:   if S ⊂ B_{c,(1+ε)r} then
 5:     Return B_{c,r}, X
 6:   else
 7:     p ← point q ∈ S maximizing ‖cq‖
 8:   end if
 9:   X ← X ∪ {p}
10: end loop

If Algorithm 1 is applied to the input data, with X_0 = {q, q'} given by the simple (1/√3)-approximation algorithm for the diameter, then it will yield an output set, X, that is of size O(1/ε²), as shown by Bădoiu et al. [5]. Their same proof, using Lemma 3.1 to address the case of balls, yields the following:

LEMMA 3.2. [5] For any set S of balls in R^d and any 0 < ε < 1 there exists a subset X ⊆ S, with |X| = O(1/ε²), such that the radius of MEB(S) is at most (1 + ε) times the radius of MEB(X). In other words, there exists a core-set X of size O(1/ε²).

In fact, the proof of Lemma 3.2 is based on showing that each iteration of Algorithm 1 results in an increase of the radius of the current ball, MEB(X), by a factor of at least (1 + ε²/16); this in turn implies that there can be at most O(1/ε²) iterations in going from the initial ball, of radius at least δ/(2√3), to the final ball (whose radius is at most δ). We bootstrap Lemma 3.2 to give an O(1/ε)-size core-set, as shown in Algorithm 2.

Algorithm 2  Outputs a (1+ε)-approximation of MEB(S) and an O(1/ε)-size core-set
Require: Input set of points S ⊂ R^d, parameter ε = 2^{-m}, subset X_0 ⊆ S
 1: for i = 1 to m do
 2:   Call Algorithm 1 with input S, ε = 2^{-i}, X_{i-1}
 3:   X_i ← the output core-set
 4: end for
 5: Return MEB(X_m), X_m

LEMMA 3.3. The number of points added to X in round i + 1 is at most 2^{i+6}.

Proof. Round i gave as output the set X_i, of radius r_i, which serves as the input core-set to round i + 1. Thus, we know that (1 + 2^{-i}) r_i ≥ r* ≥ r_i. For round i + 1, ε = 2^{-(i+1)}, so in each iteration of Algorithm 1 the radius goes up by a factor of at least (1 + ε²/16) = (1 + 2^{-2i-6}); thus, each iteration increases the radius by at least 2^{-2i-6} r_i. If in round i + 1 there are k_{i+1} points added, the ball at the end of the round has radius r_{i+1} ≥ r_i + k_{i+1} · 2^{-2i-6} r_i. Since we know that (1 + 2^{-i}) r_i ≥ r* and that r* ≥ r_{i+1}, we get that (1 + 2^{-i}) r_i ≥ (1 + k_{i+1} · 2^{-2i-6}) r_i, implying that k_{i+1} ≤ 2^{i+6}, as claimed. ∎

THEOREM 3.1. The core-set output by Algorithm 2 has size O(1/ε).

Proof. The size |X_m| is equal to the sum of the number of points added in each round, which is, by Lemma 3.3, at most Σ_{i=1}^{m} 2^{i+6} = O(2^m) = O(1/ε). ∎

THEOREM 3.2. A (1 + ε)-approximation to the MEB of a set of n balls in d dimensions can be computed in time O(nd/ε + (1/ε^4.5) log(1/ε)).

Proof. Since the size of the basis (core-set) is O(1/ε), each call to our SOCP solver incurs a cost of O(√(1/ε) · d²(1/ε + d) log(1/ε)). We parse through the input O(1/ε) times, so the total cost is O(nd/ε + (d²/ε^1.5)(1/ε + d) log(1/ε)). Putting d = O(1/ε), as in [5], we get a total bound of O(nd/ε + (1/ε^4.5) log(1/ε)). ∎

Remark. The above theorem also improves the best known time bounds for approximation algorithms independent of d for the 2-center clustering and k-center clustering problems [5].

4  Implementation and Experiments

We implemented our algorithm in Matlab. The total code is less than 200 lines; it is available at http://www.compgeom.com/meb/. No particular attention was given to optimizing the code; our goal was to demonstrate the practicality of the algorithm, even with a fairly straightforward implementation. The current implementation takes only point sets as input; extending it to input sets of balls should be relatively straightforward.
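Although the authors' code is in Matlab, the core-set loop of Algorithm 1 is easy to restate in C++; the sketch below assumes a black-box routine solve_meb (standing in for the SOCP subproblem solver) and a flat array of coordinates, both of which are our own choices.

    #include <vector>
    #include <cmath>
    #include <cstddef>

    struct Ball { std::vector<double> center; double radius; };

    // Assumed to be provided elsewhere (e.g., by an SOCP solver): the exact
    // MEB of the small core-set X (indices into the point array).
    Ball solve_meb(const std::vector<double>& pts, std::size_t d,
                   const std::vector<std::size_t>& X);

    // Algorithm 1: (1+eps)-approximate MEB of n points in dimension d,
    // growing the core-set X until the expanded ball covers all points.
    Ball approximate_meb(const std::vector<double>& pts, std::size_t n, std::size_t d,
                         double eps, std::vector<std::size_t> X)
    {
        for (;;) {
            Ball b = solve_meb(pts, d, X);
            // Find the point farthest from the current center.
            std::size_t worst = 0; double worst_dist2 = -1.0;
            for (std::size_t i = 0; i < n; ++i) {
                double dist2 = 0.0;
                for (std::size_t j = 0; j < d; ++j) {
                    double diff = pts[i*d + j] - b.center[j];
                    dist2 += diff * diff;
                }
                if (dist2 > worst_dist2) { worst_dist2 = dist2; worst = i; }
            }
            if (std::sqrt(worst_dist2) <= (1.0 + eps) * b.radius)
                return b;                 // S is covered by the expanded ball
            X.push_back(worst);           // otherwise add the violator to the core-set
        }
    }

Algorithm 2 simply wraps this routine in a loop over ε = 2⁻¹, ..., 2⁻ᵐ, reusing each round's core-set as the next round's X_0.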
We implemented Algorithms 1 and 2. We also implemented the new gradient-descent-type algorithm of [6], exactly the way it is presented in Claim 3.1 of that paper. The problem with this algorithm is that as soon as ε decreases, it starts taking too much time: as predicted by theory, it takes exactly O(dn/ε²) time. Our algorithm's analysis is not as tight, so it performs much better in practice than predicted by theory as ε decreases. Very recently, the running time of this algorithm (Claim 3.1 of [6]) was improved by another algorithm that also avoids quadratic programming [27]. Using this algorithm as the base-case solver in Theorem 3.2, the running time can be reduced further; this is slightly better than our running time and does not use SOCP. The improved algorithm might be a good competitor to our algorithm in practice.

For the SOCP component of the algorithm, we considered two leading SOCP solvers: SeDuMi [28] and SDPT3 [29]. Experimentation showed SDPT3 to be superior to SeDuMi for use in our application, so our results here are reported using SDPT3.

We found it advantageous to introduce random sampling in the step of the algorithm that searches for a point that is substantially outside the current candidate ball. This can speed the detection of a violator; a similar approach has been used recently by Pellegrini [25] for solving linear programming problems in moderately high dimension. In particular, we sample a subset, S', of the points, of size L·√n, where L is initialized to 1 and is incremented by 1 at each iteration. Only if a violator is not found in S' do we begin scanning the full set of points in search of a violator. We implemented random sampling only for the implementation of Algorithm 1. For Algorithm 2, we always found the violator farthest from the current center, in order to make the core-set size as small as possible. We recommend using Algorithm 2 only when the user wants to minimize the size of the core-set.

Another desirable property of the implementation is that it is I/O efficient, if we assume that we can solve the O(1/ε)-size subproblems in internal memory (this was always the case for our experiments, since the size of the core-set did not even approach 1/ε in practice). Under this assumption, the current implementation scans the input O(1/ε) times and thus performs O(nd/(Bε)) I/Os³, and the same bound also generalizes to the cache-oblivious model [12]; the implementation of Algorithm 2 has a correspondingly larger I/O bound. Some of the large problems we report results for here actually used more memory while running than we had installed on the system. For instance, in Figure 1, the sudden increase around dimension d = 400 for n = 10^5 points is due to the effects of paging from disk, as this is the largest input size that allowed subproblems to fit within main memory.
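A sketch of the sampled violator search described above; the use of rand() for index sampling and the function name are our own simplifications.

    #include <vector>
    #include <cmath>
    #include <cstdlib>
    #include <cstddef>

    // Returns the index of a point lying outside the (1+eps)-expanded ball,
    // or n if none is found. A random sample of size L*sqrt(n) is tried
    // first, as described in the text; only on failure is the full set scanned.
    std::size_t find_violator(const std::vector<double>& pts, std::size_t n, std::size_t d,
                              const std::vector<double>& center, double radius,
                              double eps, std::size_t L)
    {
        const double limit2 = (1.0 + eps) * radius * (1.0 + eps) * radius;
        auto outside = [&](std::size_t i) {
            double dist2 = 0.0;
            for (std::size_t j = 0; j < d; ++j) {
                double diff = pts[i*d + j] - center[j];
                dist2 += diff * diff;
            }
            return dist2 > limit2;
        };
        std::size_t sample = L * static_cast<std::size_t>(std::sqrt(double(n)));
        for (std::size_t k = 0; k < sample; ++k) {       // sampled pass
            std::size_t i = std::rand() % n;
            if (outside(i)) return i;
        }
        for (std::size_t i = 0; i < n; ++i)              // full scan fallback
            if (outside(i)) return i;
        return n;
    }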
We believe that with our algorithm and an efficient implementation (C++), really large problems (n ≈ 10^7, d ≈ 10^4, ε ≈ 10^-4) would become tractable in practice on current state-of-the-art systems with sufficient memory and hard disk space.

Platform. All of the experimental results reported in this paper were obtained on a Pentium III 1 GHz, 512 MB notebook computer running Windows 2000. The hard disk used was a 4200 rpm / 20 GB hard disk drive (Fujitsu MHM2200AT).

Datasets. Most of our experiments were conducted on randomly generated point data, according to various distributions. We also experimented with the USPS data⁴, which is a dataset of handwritten characters created by the US Postal Service. We used Matlab to generate random matrices: for generating uniform data we used rand, for generating specific distributions we used random, and for generating normally distributed random numbers we used randn. Specifically, we considered the following four classes of point data:

• points uniformly distributed within a unit cube;
• points uniformly distributed on the vertices of a unit cube;
• points normally distributed in space, with each coordinate chosen independently according to a normal distribution with mean 0 and variance 1;
• point coordinates that are Poisson random variables, with parameter λ = 1.

Methods for comparison. Bernd Gärtner [15] provides a code on his website that we used. We also used the CGAL 2.4 implementation (using Welzl's move-to-front heuristic, together with Gärtner's method [15]). We were not able to compile the code available from David White's web page⁵. We could not replicate the timings reported in the paper by Gärtner and Schönherr [16]. In the near future, a recent implementation of [16] is going to appear in CGAL. We decided not to include a comparison with this version because the implementation is not yet robust enough and sometimes gives wrong results.

Experimental results. In Figure 1 we show how the running time of Algorithm 1 varies with dimension, for each of three input sizes (n = 10^3, 10^4, and 10^5) for points that are normally distributed (with mean μ = 0 and variance σ = 1). Here, ε = 0.001.

3. Here B denotes the disk block size.
4. http://www.kernel-machines.org/data/ups.mat.gz, 29MB
5. http://vision.ucsd.edu/~dwhite
Corresponding to the same experiment, Figure 2 shows how the number of iterations in Algorithm 1 varies with dimension. Recall that the number of iterations is simply the size of the core-set that we compute. Note that, while the core-set size is seen to increase with dimension, it is nowhere near the worst case predicted by theory (of O(1/ε)). Also notable is the fact that, as the dimension grows, the timings in Figure 1 do not seem to increase linearly (as predicted by theory). This seems to stem from the fact that the core-set size is not constant with ε fixed, and the SOCP solver takes more time as the core-set becomes bigger. These experiments also point to the fact that the theoretical bounds, both for the core-set size and the running time, are not tight, at least for normally distributed points.

In Figures 3 and 4 we show how the running time and the core-set size vary with dimension for each of the four distributions of input points, with n = 10,000 and ε = 0.001. Note again the correlation between running times and core-set sizes. Figure 5 shows a timing comparison between our algorithm, the CGAL 2.4 implementation, and Bernd Gärtner's code available from his website. Both of these codes assume that the dimension of the input point set is fixed and have a threshold dimension beyond which the computation time blows up.

Figures 6, 7, 8 and 9 compare the implementations of Algorithms 1 and 2. For all these experiments the points were picked from a normal distribution with μ = 0, σ = 1, and n = 10000. Figure 6 compares the timings of Algorithms 1 and 2 for ε = 2^-10. Note that Algorithm 2 does not implement random sampling and hence is quite slow in comparison to Algorithm 1. Recall that Algorithm 2 was designed to keep the core-set size as small as possible; also, the implementation of Algorithm 2 is not well optimized. Figures 7 and 8 compare the radii computed by both algorithms on the same input point set; Figure 8 plots the difference of the radii computed by the two algorithms on the same input. Note that the radii computed by both algorithms are always inside the window of variability they are allowed. For instance, at dimension d = 1400, the radii output by Algorithms 1 and 2 could vary by at most εr = 2^-10 × 39.2 < 0.04, whereas they actually differ by only 0.000282. Figure 9 shows the difference between the core-set sizes computed by Algorithms 1 and 2. It is surprising that the theoretical improvement suggested by the worst-case core-set sizes does not show up at all in practice. This unexpected phenomenon might be the result of Step 7 of Algorithm 1, which chooses the farthest violator and hence maximizes the expansion of the current ball. The slight difference between the
core-set sizes of Algorithms 1 and 2 seems to be there because Algorithm 1 does not find the exact farthest violator but uses random sampling to find a good one, as explained in Section 4. It seems that Algorithm 1 is as good as Algorithm 2 in practice as far as core-set sizes are concerned, but it remains open to prove a tight bound on the core-set size of Algorithm 1.

Figures 10 and 11 are an attempt to show the running times and core-set sizes of Algorithm 1 on real-world data sets. Note that these graphs suggest that the core-set sizes grow roughly linearly in log(1/ε)!

In Figures 12 and 13 we show results for low dimensions (d = 2, 3), as the number n of points increases, for two choices of ε. Note the logarithmic scale. In both these experiments the core-set size was less than 10 for all runs. This means that 2-center clustering on these data sets would take at most O(2^10 n) time. It remains an open problem to determine whether 2-center clustering is really practical in 2 or 3 dimensions.

Finally, in Figure 14 we compare our implementation with the implementation of Claim 3.1 of [6] for different values of ε. The number of points n for this experiment is 1000. Note that for ε = 0.03 this algorithm is already very slow compared to Algorithm 1. We have not implemented the improved version of [6], which has a slightly slower running time than our algorithm, but it seems that when ε is small, the running time of the improved algorithm might suffer because of the base-case solver (Claim 3.1 of [6]). Certainly, the improved algorithm suggested in Section 4, using [27], seems to be a better candidate for implementation than [6].

5  Open Problems

There are interesting theoretical and practical problems that this research opens up.

▷ In practice: Can one do MEB with outliers in practice? 1-cylinder, 2-center and k-center approximations? Is computing the minimum enclosing ellipsoid approximately feasible in higher dimensions? Are there core-sets for ellipsoids of size less than Θ(d²)? Does dimension reduction help us to solve large dimensional problems in practice [19]? Can one use warm-start strategies to improve running times by giving a good starting point at every iteration [32]? How does the improved algorithm suggested in Section 4, using the new base-case algorithm of [27], compare with the implementation of Algorithm 1?
▷ In theory: Of particular theoretical interest is the question of the optimal core-set size for the MEB problem. Are there similar dimension-independent core-sets for other LP-type problems? It remains an open question to tighten the core-set bounds for different distributions when d < 1/ε. From our experiments, at least for normal distributions, it seems that the core-set size is less than min(1/ε, d + 1).
Acknowledgements

The first author would like to thank Sariel Har-Peled and Edgar Ramos for helpful discussions. The authors thank Sachin Jambawalikar for help with importing the USPS data set in their code.
Figure 2: The number of iterations (core-set size) vs. dimension for n = 10^3, 10^4, 10^5 and ε = 0.001, for inputs of normally distributed points (μ = 0, σ = 1).
Figure 1: Running time in seconds of Algorithm 1's implementation vs. dimension for n = 10^3, 10^4, 10^5 and ε = 0.001, for inputs of normally distributed points (μ = 0, σ = 1).
Figure 3: Running times for four distributions: uniform within a unit cube, normally distributed, Poisson distribution, and random vertices of a cube. Here, n = 10000, ε = 0.001.
Figure 4: Number of iterations (core-set size), as a function of d, for four distributions. Here, n = 10000, ε = 0.001.

Figure 5: Timing comparison with CGAL and Bernd Gärtner's code [15]. ε = 10^-6, n = 1000, normally distributed points (μ = 0, σ = 1).

Figure 6: Timing comparison between Algorithms 1 and 2 (n = 10000, ε = 2^-10, input from normal distribution, μ = 0, σ = 1).

Figure 7: Radius comparison of Algorithms 1 and 2 (n = 10000, ε = 2^-10, input from normal distribution, μ = 0, σ = 1).
Figure 8: Radius difference plot of Algorithms 1 and 2 on the same input data (n = 10000, ε = 2^-10, input from normal distribution, μ = 0, σ = 1).

Figure 9: Core-set size comparison between Algorithms 1 and 2. Note that Algorithm 2 does not implement random sampling (n = 10000, ε = 2^-10, input from normal distribution, μ = 0, σ = 1).

Figure 10: USPS data timing comparison with normally distributed data (μ = 0, σ = 1). The data contains 7291 points in 256 dimensions and is a standard data set of digitized handwritten characters used in the clustering and machine learning literature.

Figure 11: USPS data core-set size comparison with normally distributed data (μ = 0, σ = 1).
Figure 12: Experiments in 2D with different ε. All core-set sizes for this experiment were less than 10. Input from normal distribution (μ = 0, σ = 1).
Figure 14: Our algorithm compared with BC [6] for different values of ε. As is evident, as soon as ε becomes small, their algorithm's performance drastically suffers. (This is expected because of the 1/ε² iterations over the point set.)
Figure 13: Experiments in 3D with different ε. Input from normal distribution (μ = 0, σ = 1). The core-set sizes in this experiment were all less than 10. This shows that 2-center approximation in 2D/3D is practical. This could have applications in the computation of spatial hierarchies based on balls [18].
References

[1] F. Alizadeh and D. Goldfarb. Second-order cone programming. Technical Report RRR 51, Rutgers University, Piscataway, NJ 08854, 2001.
[2] N. Alon, S. Dar, M. Parnas and D. Ron. Testing of clustering. In Proc. 41st Annual Symposium on Foundations of Computer Science, pages 240-250. IEEE Computer Society Press, Los Alamitos, CA, 2000.
[3] Y. Bulatov, S. Jambawalikar, P. Kumar and S. Sethia. Hand recognition using geometric classifiers. Manuscript.
[4] A. Ben-Hur, D. Horn, H. T. Siegelmann and V. Vapnik. Support vector clustering. Journal of Machine Learning, revised version Jan 2002, 2002.
[5] M. Bădoiu, S. Har-Peled and P. Indyk. Approximate clustering via core-sets. In Proceedings of 34th Annual ACM Symposium on Theory of Computing, pages 250-257, 2002.
[6] M. Bădoiu and K. L. Clarkson. Smaller core-sets for balls. In Proceedings of 14th ACM-SIAM Symposium on Discrete Algorithms, to appear, 2003.
[7] M. Bădoiu and K. L. Clarkson. Optimal core-sets for balls. Manuscript.
[8] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[9] T. M. Chan. Approximating the diameter, width, smallest enclosing cylinder, and minimum-width annulus. In Proceedings of 16th Annual ACM Symposium on Computational Geometry, pages 300-309, 2000.
[10] O. Chapelle, V. Vapnik, O. Bousquet and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1/3):131, 2002.
[11] O. Egecioglu and B. Kalantari. Approximating the diameter of a set of points in the Euclidean space. Information Processing Letters, 32:205-211, 1989.
[12] M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of 40th Annual Symposium on Foundations of Computer Science, 1999.
[13] D. J. Elzinga and D. W. Hearn. The minimum covering sphere problem. Management Science, 19(1):96-104, Sept. 1972.
[14] K. Fischer. Smallest enclosing ball of balls. Diploma thesis, Institute of Theoretical Computer Science, ETH Zurich, 2001.
[15] B. Gärtner. Fast and robust smallest enclosing balls⁶. In Proceedings of 7th Annual European Symposium on Algorithms (ESA). Springer-Verlag, 1999.
[16] B. Gärtner and S. Schönherr. An efficient, exact, and generic quadratic programming solver for geometric optimization. In Proceedings of 16th Annual ACM Symposium on Computational Geometry, pages 110-118, 2000.
[17] A. Goel, P. Indyk and K. R. Varadarajan. Reductions among high dimensional proximity problems. In Proceedings of 13th ACM-SIAM Symposium on Discrete Algorithms, pages 769-778, 2001.
[18] P. M. Hubbard. Approximating polyhedra with spheres for time-critical collision detection. ACM Transactions on Graphics, 15(3):179-210, July 1996.
[19] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26, pages 189-206, 1984.
[20] M. S. Lobo, L. Vandenberghe, S. Boyd and H. Lebret. Applications of second-order cone programming. Linear Algebra and Its Applications, 248:193-228, 1998.
[21] J. Matousek, M. Sharir and E. Welzl. A subexponential bound for linear programming. In Proceedings of 8th Annual ACM Symposium on Computational Geometry, pages 1-8, 1992.
[22] Y. E. Nesterov and A. S. Nemirovskii. Interior Point Polynomial Methods in Convex Programming. SIAM Publications, Philadelphia, 1994.
[23] Y. E. Nesterov and M. J. Todd. Self-scaled barriers and interior-point methods for convex programming. Mathematics of Operations Research, 22:1-42, 1997.
[24] Y. E. Nesterov and M. J. Todd. Primal-dual interior-point methods for self-scaled cones. SIAM Journal on Optimization, 8:324-362, 1998.
[25] M. Pellegrini. Randomized combinatorial algorithms for linear programming when the dimension is moderately high. In Proceedings of 13th ACM-SIAM Symposium on Discrete Algorithms, 2001.
[26] J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. MPS/SIAM Series on Optimization 3. SIAM Publications, Philadelphia, 2001.
[27] S. Har-Peled. Personal communication.
[28] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11/12:625-653, 1999.
[29] K. C. Toh, M. J. Todd and R. H. Tütüncü. SDPT3: a Matlab software package⁷ for semidefinite programming. Optimization Methods and Software, 11:545-581, 1999.
[30] R. H. Tütüncü, K. C. Toh and M. J. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Technical report, Cornell University, 2001. To appear in Mathematical Programming.
[31] S. Xu, R. Freund and J. Sun. Solution methodologies for the smallest enclosing circle problem. Technical report, Singapore-MIT Alliance, National University of Singapore, Singapore, 2001.
[32] E. A. Yıldırım and S. J. Wright. Warm-start strategies in interior-point methods for linear programming. SIAM Journal on Optimization, 12(3):782-810.
[33] G. Zhou, J. Sun and K.-C. Toh. Efficient algorithms for the smallest enclosing ball problem in high dimensional space. Technical report, 2002. To appear in Proceedings of Fields Institute of Mathematics.
6. http://www.inf.ethz.ch/personal/gaertner
7. http://www.math.nus.edu.sg/~mattohkc/sdpt3.html
Interpolation over Light Fields with Applications in Computer Graphics*

F. Betul Atalay†        David M. Mount‡
1 Introduction There is a growing interest in algorithms and data structures that combine elements of discrete algorithm design with continuous mathematics. This is particularly true in computer graphics. Consider for example the process of generating a photo-realistic image. The most popular method for doing this is ray-tracing [12]. Ray-tracing models the light emitted from light sources as traveling along rays in 3-space. The color of a pixel in the image is a reconstruction of the intensity of light traveling along various rays that are emitted from a light source, transmitted and reflected among the objects in the scene, and eventually entering the viewer's eye. There are many different methods for mapping this approach into an algorithm. At an abstract level, all ray-tracers involve forming an image by combining various continuous quantities, or attributes, that have been generated from a discrete set of sampled rays. These continuous attributes include color, radiance, surface normals, and reflection and refraction vectors. These attributes vary continuously either as a function of the location on the surface of an object or as a us material is based upon work supported by the National Science Foundation under Grant No. 0098151. t Department of Computer Science, University of Maryland, College Park, Maryland. Email: [email protected]. * Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland. Email: [email protected].
function of the location of the viewer and the locations of the various light sources in 3-space. The reconstruction process involves combining various discretely sampled attributes in the context of some illumination model.

Producing images by ray-tracing is a computationally intensive process. The degree of realism in the final image depends on a number of factors, including the density and number of samples that are used to compute a pixel's intensity and the fidelity of the illumination model to the physics of illumination. Scenes can involve hundreds of light sources and from thousands to millions of objects, often represented as smooth surfaces, including implicit surfaces [6], subdivision surfaces [22], and Bezier surfaces and NURBS [9]. Reflective and transparent objects cause rays to be reflected and refracted, further increasing the number of rays that need to be traced.

In traditional ray-tracing solutions, each ray is traced through the scene as needed to compute the intensity of a pixel in the image [12]. Much of the work involves determining the first object that is intersected by each ray and the location that the ray hits. In this paper, we propose one approach to help accelerate this process by reducing the number of intersection calculations. Our algorithm facilitates fast, approximate rendering of a scene from any viewpoint, and is most useful when the scene is rendered from multiple viewpoints, as arises in computing animations. Rather than tracing each input ray to compute the required attributes, we collect and store a relatively sparse set of sampled rays based on demand and associate some continuous geometric attributes with each sample in a fast data structure. We can then use inexpensive interpolation methods to approximate the value of these sampled quantities for other input rays. Using an adaptive strategy, it is possible to avoid oversampling in smooth areas while providing sufficiently dense sampling in regions of high variation. We dynamically maintain a cache of the most recently generated samples, in order to reduce the space requirements of the data structure. The information associated with a given ray is indexed according to the directed line that supports the ray, which in turn is modeled as a point in a 4-dimensional line space. Given a ray to be traced, we access the data structure to locate a number of neighboring sampled rays in line space. We then interpolate the continuous information from the
neighboring rays. The resulting data structure is called an RI-tree, or ray interpolant tree.

1.1  Related Work. The idea of associating radiance information with points in line space has a considerable history, dating back to work in the 1930's by Gershun on vector irradiance fields [10] and Moon and Spencer's concept of photic fields [17], and more recent papers by Levoy and Hanrahan [16] and Gortler, Grzeszczuk, Szeliski, and Cohen [13]. The term "light field" in our title was coined by Levoy and Hanrahan, but our notion is more general than theirs because we consider interpolation of any continuous information, not just radiance. Most methods for storing light field information in computer graphics are based on discretizing the space into uniform grids. In contrast, we sample rays adaptively, concentrating more samples where they are of greatest value. The most closely related work to ours is the Interpolant Ray-tracer system introduced by Bala, Dorsey, and Teller [4], which combines adaptive sampling of radiance information, caching, and interpolation for rendering convex objects. Our method generalizes theirs by storing and interpolating not only radiance information but other sorts of continuous information, which may be relevant to the rendering process. We also allow nonconvex objects. Unlike their method, however, we do not provide guarantees on the worst-case approximation error. In addition, there has been earlier research on accelerating ray tracing by reducing the cost of intersection computations using bounding volume hierarchies [19], space partitioning structures [11, 15], and methods exploiting ray coherence [2, 14, 1, 18]. Some of these methods can be applied in combination with ours.

A simpler variant of this data structure was introduced by the authors in [3]. The current data structure has two main improvements over the previous one. Firstly, the previous data structure was not dynamic: as a pre-processing phase, the entire data structure was built by sampling millions of rays originating from the entire space of viewpoints in various directions. This resulted in high pre-processing times and high space requirements. The current data structure, on the other hand, is built dynamically, generating samples on demand (only when they are actually needed), and is maintained by a caching mechanism. Secondly, the current system generalizes the framework and can handle general scenes containing multiple objects with different surface reflectance properties, whereas the previous system focused on single reflective or transparent objects. In addition, a systematic analysis of the performance of the data structure is presented in this paper.

1.2  Design Issues. The approach of computing a sparse set of sample rays and interpolating the results of ray shooting is most useful for rendering smooth objects that are reflective or transparent, for rendering animations when the viewpoint varies smoothly, and for generating high-resolution images and/or antialiased images generated by supersampling [12], in which multiple rays are shot for each pixel of the image. Although we have motivated the RI-tree data structure from the perspective of ray-tracing, there are a number of applications having to do with lines in 3-space that can benefit from this general approach. To illustrate this, we have studied two different applications, one involving image generation through ray-tracing and the other involving volume visualization with applications in medical imaging for radiation therapy.

There are a number of issues that arise in engineering a practical data structure for interpolation in line space. These include the following.

How and where to sample rays? Regions of space where continuous information varies more rapidly need to be sampled with greater density than regions that vary smoothly.

Whether to interpolate? In the neighborhood of a discontinuity, the number of rays that may need to be sampled to produce reasonable results may be unacceptably high. Because the human eye is very sensitive to discontinuities near edges and silhouettes, it is often wise to avoid interpolating across discontinuities. This raises the question of how to detect discontinuities. When they are detected, is it still possible to interpolate, or should we avoid interpolation and use standard ray-tracing instead?

How many samples to maintain? Even for reasonably smooth scenes, the number of sampled rays that would need to be stored for an accurate reconstruction runs well into the millions. For this reason, we cache the results of only the most relevant rays.

What are the space-time tradeoffs involved with this approach?

We investigate these and other questions in the context of a number of experiments based on the applications mentioned above. The paper is organized as follows. In the rest of this section, we give a brief overview of our algorithm in the context of ray-tracing. In Section 2, we explain the construction of our data structure and its use for answering interpolation queries. In Section 3, we report experimental results.

1.3  Algorithm Overview. In order to make things concrete, we will describe our data structure in terms of a ray-tracing application. As mentioned in the introduction, ray-tracing works by simulating the light propagation in a scene. A traditional ray-tracer shoots one or more rays from the viewpoint through each pixel of the image plane. The ray is traced through the scene and the intensity gathered constitutes the color of that pixel. To reduce problems of aliasing,
57
multiple rays may also be traced for each pixel and these results are interpolated. We can distinguish two major tasks in a ray-tracer. The geometric component is responsible for calculating the closest visible object point along a specific ray, as the shading component computes the color of that point. If the object is reflective or transparent, the reflection and transmission rays are traced recursively to gather their contribution to the intensity. The primary expense in ray-tracing lies in the geometric component, especially for scenes that contain complex objects such as Bezier or NURBS surfaces, and/or reflective and refractive objects. Each object can be modeled abstractly as a function / mapping input rays to a set of geometric attributes that are used in color computation. The attributes depend on the object's surface reflectance properties. For objects whose surfaces are neither reflective nor transparent, denoted simple surfaces, the function returns the point of intersection and the surface normal at this point. For objects whose surfaces are either reflective or transparent the function additionally returns the exit ray, that is, the reflected or refracted ray, respectively, that leaves the object's surface. The exit ray is represented by its origin, the exit point, and directional exit vector. In general, objects that are both reflective and refractive could be handled by associating multiple exit rays with an input ray, but our implementation currently does not support this. These quantities are depicted in Fig. 1 and the function is described schematically below. We will refer to the combination of the underlined attributes below as the output ray.
Figure 2: The two-plane parameterization of directed lines and rays. The +X plane pair is shown.
refractions. In the neighborhood of discontinuities, however, nearby input rays may follow quite different paths. In cases where we cannot find sufficient evidence to interpolate, we perform ray-tracing instead. In a traditional ray-tracer, each object is associated with a procedure that computes intersections between rays and this object. For objects whose boundaries are sufficiently smooth, we replace this intersection procedure with a data structure, which will be introduced in the next section. This data structure approximates the function / through interpolation.
2 The Ray Interpolant Tree In this section we introduce the main data structure used in our algorithm, the Rl-tree or ray interpolant tree. Each Rl-tree is associated with a single object of the scene, Otherwise: which is loosely defined to be a collection of logically / : Ray -> {Normal.IntersectionPoint,ExitPointExitVector} related surfaces. The object is enclosed by an axis-aligned bounding box. The data structure stores the geometric attributes associated with some set of sampled rays, which may originate from any point in space and intersect the object's bounding box.
For simple surfaces: / : Ray —>• {NormalJntersectionPoint}
Figure 1: Geometric attributes. For many real world objects, which have large smooth surfaces, / is expected to vary smoothly. In the context of ray-tracing, this is referred to as ray coherence. Nearby rays follow similar paths, hit nearby points having similar normal vectors and hence are subject to similar reflections and/or
2.1 Parameterizing Rays as Points We will model each ray by the directed line that contains the ray. Directed lines can be represented as a point lying on a 4-dimensional manifold in 5-dimensional projective space using Pliicker coordinates [21], but we will adopt a simpler popular representation, called the two-plane parameterization [13, 16, 4]. A directed line is first classified into one of 6 different classes (corresponding to 6 plane pairs) according to the line's dominant direction, defined to be the axis corresponding to the largest coordinate of the line's directional vector and its sign. These classes are denoted +X, -X, +Y, -Y, +Z, -Z. The directed line is then represented by its two intercepts (s, t) 58
tributes have greater variation. For this reason, the subdivision is carried out adaptively based on the distance between output attributes. The distance between two sets of output attributes are defined as the distance between their associated output rays. We define the distance between two rays to be the L% distance between their 4-dimensional representations. To determine whether a cell should be subdivided, we first compute the correct output ray associated with the midpoint of the cell, and then we compute an approximate output ray by interpolation of the 16 comer rays for the same point. If the distance between these two output rays exceeds a given user-defined distance threshold and the depth of the cell in the tree is less than a user-defined depth constraint, the cell is subdivided. Otherwise the leaf is said to be final. If we were to expand all nodes in the tree until they are final, the resulting data structure could be very large, depending on the distance threshold and the depth constraint. For this reason we only expand a node to a final leaf if this leaf node is needed for some interpolation. Once a final leaf node is used, it is marked with a time stamp. If the size of the data structure exceeds a user-defined cache size, then the the tree is pruned to a constant fraction of this size by removing all but the most recently used nodes. In this way, the Rl-tree behaves much like an LRU-cache.
Figure 3: Subdivision along s-axis.
and (u, v) with frontplane and backplane, respectively, that are orthogonal to the dominant direction and coinciding with the object's bounding box. For example, as shown in Fig. 2, ray R with dominant direction +X first intersects the front plane of the +X plane pair at (s, t), and then the back plane at (u, v), and hence is parameterized as (s, £, u, v). Note that, the +X and —X involve the same plane pair but differ in the distinction between front and back plane. 2.2 The Rl-tree The Rl-tree is a binary tree based on a recursive subdivision of the 4-dimensional space of directed lines. It consists of six separate 4-dimensional kd-trees [5, 20] one for each of the six dominant directions. The root of each kd-tree is a 4-dimensional hypercube in line space containing all rays that are associated with the corresponding plane pair. The 16 corner points of the hypercube represent the 16 rays from each of the four corners of the front plane to the each of the four corners of the back plane. Each node in this data structure is associated with a 4-dimensional hyperrectangle, called a cell. The 16 corner points of a leaf cell constitute the ray samples, which form the basis of our interpolation. When the leaf cell is constructed, these 16 rays are traced and the associated geometric attributes are stored in the leaf. 2.3 Adaptive Subdivision and Cache Structure The Rltree grows and shrinks dynamically based on demand. Initially only the root cell is built by sampling its 16 corner rays. A leaf cell is is subdivided by placing a cut-plane at the midpoint orthogonal to the coordinate axis with the longest length. In terms of the plane pair, this corresponds to dividing the corresponding front or back plane through the midpoint of the longer side. We partition the existing 16 corner samples between the two children, and sample eight new corner rays that are shared between the two child cells. (These new rays are illustrated in Fig. 3 in the case that the s-axis is split.) Rays need to be sampled more densely in some regions than others, for example, in regions where geometric at-
2.4 Rendering and Interpolation Queries Recall that our goal is to use interpolation between sampled output rays whenever things are sufficiently smooth. Rl-tree can be used to perform a number of functions in rendering, including determining the first object that a ray hits, computing the reflection or refraction (exit) ray for nonsimple objects, and answering various visibility queries, which are used for example to determine whether a point is visible to a light source or in a shadow. Let us consider the interpolation of a given input ray R. We first map R to the associated point in the 4-dimensional directed line space and, depending on the dominant direction of this line, we find the leaf cell of the appropriate kd-tree through a standard descent. Since the nodes of the tree are constructed only as needed, it is possible that R will reside in a leaf that is not marked as final. This means that this particular leaf has not completed its recursive subdivision. In this case, the leaf is subdivided recursively, along the path R would follow, until the termination condition is satisfied, and the final leaf containing R is now marked as final. (Other leaves generated by this process are not so marked.) Given the final leaf cell containing R, the output attributes for R can now be interpolated. Interpolation proceeds in two steps. First we group the rays in groups of four, which we call the directional groups. Rays in the same group originate from the same corner point on the front plane, and pass through each of the four corners of the back plane (For example, Fig. 4 shows the rays that originate from the north-
59
Figure 4: Sampled rays within a directional group.
east corner of the front plane). Within each directional group bilinear interpolation with respect to (u, v) coordinates is performed to compute intermediate output attributes. The outputs of these interpolations are then bilinearly interpolated with respect to (s, t) coordinates to get the approximate output attributes for R. Thus, this is essentially a quadrilinear interpolation. 2.S Handling Discontinuities and Regions of High Curvature Through the use of interpolation, we can greatly reduce the number of ray samples that would otherwise be needed to render a smooth surface. However, if the rayoutput function / contains discontinuities, as may occur at the edges and the outer silhouettes of the object, then we will observe bleeding of colors across these edges. This could be remedied by building a deeper tree, which might involve sampling of rays up to pixel resolution in the discontinuity regions. This would result in unacceptably high memory requirements. Instead our approach will be to detect and classify discontinuity regions. In some cases we apply a more sophisticated interpolation. Otherwise we do not interpolate and instead simply revert to ray-tracing. We will present a brief overview of how discontinuities are handled here. Further details are presented in [3]. Our objects are specified as a collection of smooth surfaces, referred to as patches. Each patch is assigned a patch-identifier. Associated with each sample ray, we store the patch-identifier of the first patch it hits. Since each ray sample knows which surface element it hits, it would be possible to disallow any interpolation between different surfaces. It is often the case, however, that large smooth surfaces are composed of many smaller patches, which are joined together along edges so that first and second partial derivatives vary continuously across the edge. In such cases interpolation is allowed. We assume that the surfaces of the scene have been provided with this information, by partitioning patches into surface equivalence classes. If the patch-identifiers associated with the 16 corner ray samples of a final leaf are in the same equivalence class, we conclude that there is no discontinuities crossing the region surrounded by the 16 ray hits, and we apply the interpolation process described above.
Requiring that all 16 patches arise from the same equivalence class can significantly limit the number of instances in which interpolation can be applied. After all, linear interpolation in 4-space can be performed with as few as 5 sample points. If the patch-identifiers for the 16 corner samples of the leaf arise from more than two equivalence classes, then we revert to ray tracing. On the other hand, if exactly two equivalence classes are present, implying that there is a single discontinuity boundary, then we perform an intersection test to determine which patch the query ray hits. Let pr denote this patch. This intersection test is not as expensive as a general tracing of the ray, since typically only a few patches are involved, and only the first level intersections of a raytracing procedure is computed. Among the 16 corner ray samples, only the ones that hit a patch in the same equivalence class as pr are usable as interpolants. These are the ray samples hitting the same side of a discontinuity boundary as the query ray. If we determine that there is a sufficient number of usable ray samples, we then interpolate the ray. Otherwise, we use ray-tracing. See [3] for further details. Even if interpolation is allowed by the above criterion, it is still possible that interpolation may be inadvised because the surface has high curvature, resulting in very different output rays for nearby input rays. High variations in the output ray (i.e. normal or the exit ray), signal a discontinuous region. As a measure to determine the distance between two output rays, we use the angular distance between their directional vectors. If any pairwise distance between the output rays corresponding to the usable interpolants is greater than a given angular threshold, then interpolation is not performed. 3 Experimental Results The data structure described in the previous section is based on a number of parameters, which directly influence the algorithm's accuracy and the size and depth of the tree, and indirectly influences the running time. We have implemented the data structure and have run a number of experiments to test its performance as a function of a number of these parameters. We have performed our comparisons in the context of two applications. Ray-tracing: This has been described in the previous section. We are given a scene consisting of objects that are either simple, reflective or transparent and a number of light sources. The output is a rendering of the scene from one or more viewpoints. Volume Visualization: This application is motivated from the medical application of modeling the amount of radiation absorbed in human tissue [7]. We wish to visualize the absorption of radiation through a set of nonintersecting objects in 3-space. In the medical application these objects may be models of human organs, bones, and tumors. For visualization purposes,
60
generated, giving rise to 4n Bezier surface patches. The volumes are used both for the ray-tracing and the volume visualization experiments. For ray-tracing we rendered anti-aliased images of size 300 x 300 (with 9 rays shot per pixel). For volume visualization we rendered 600 x 600 images without antialiasing. Results are averaged over three different random scenes containing 8, 6, and 5 volumes respectively. Fig. 11 shows a scene of refractive volumes.
we treat these shapes as though they are transparent (but do not refract light rays). If we imagine illuminating such a scene by x-rays, then the intensity of a pixel in the image is inversely proportional to the length of its intersection with the various objects of the scene. For each object stored as an Rl-tree, the geometric attribute associated with each ray is this intersection length.
We know of no comparable algorithms or data structures with which to compare our data structure. Existing imagebased data structures [13,16] assume a dense sampling of the Tomatoes: This is a realistic scene used to demonstrate the performance and quality of our algorithm for real light field, which would easily exceed our memory resources scenes. The scene consists of a number of tomatoes, at the resolutions we would like to consider. The Interpolant modeled as spheres, placed within a reflective bowl, Ray-tracer system by Bala, et al. [4] only deals with convex modeled using Bezier surfaces. This is covered by objects, and only interpolates radiance information. Our data a reflective and transparent but non-refractive plastic structure can handle nonconvex objects and may interpolate wrap (the same Bezier surface described above). There any type of continuous attributes. is a Bezier surface tomato next to the bowl, and they are both placed on a reflective table within a large sphere. 3.1 Test Inputs We have generated a number of input The wrap reflects the procedurally textured sphere. The scenes including different types of objects. As mentioned scene is shown in Fig. 9. earlier, for each object in a scene we may choose to represent it in the traditional method or to use our data structure. Our choice of input sets has been influenced by the fact that the Rl-tree is most beneficial for high-resolution renderings of smooth objects, especially those that are reflective or transparent. We know of no appropriate benchmark data sets satisfying these requirements, and so we have generated our own data sets. Bezier Surface: This is a surface is used to demonstrate the results of interpolation algorithm for smooth reflective objects. It consists of a reflective surface consisting of 100 Bezier patches, joined with C2 continuity at the edges. The surface is placed within a large sphere, which has been given a pseudo-random procedural texture [8]. Experiments run with the Bezier surface have been averaged over renderings of the surface from 3 different viewpoints. Fig. 10(a) shows the Bezier surface from one viewpoint. We rendered images of size 600 x 600 without antialiasing (that is, only one ray per pixel is shot.) Random volumes: We ran another set of experiments on randomly generated refractive, nonintersecting, convex Bezier objects. In order to generate nonintersecting objects, a given region is recursively subdivided into a given number of non-intersecting cells by randomly generated axis-aligned hyperplanes, and a convex object is generated within each such cell. Each object is generated by first generating a random convex planar polyline that defines the silhouette of right half of the object. The vertices of the polyline constitute the control points for a random number (n) of Bezier curves, ranging from 5 to 16. Then a surface of revolution is
3.2 Metrics We investigated the speedup and actual error committed as a function of four different parameters. Speedup is defined both in terms of number of floating point operations, or FLOPs, and CPU-time. FLOP speedup is the ratio of the number of FLOPs performed by traditional raytracing to the number of FLOPs used by our algorithm to render the same scene. Similarly, CPU speedup is the ratio of CPU-times. Note that FLOPs and CPU-times for our algorithm include both the sampling and interpolation time. The actual error committed in a ray-tracing application is measured as the average LZ distance between the RGB values of corresponding pixels in a ray-traced image and the interpolated image. RGB value is a 3-dimensional vector with values normalized to the range [0,1]. Thus the maximum possible error is v/3- The error in a volume visualization application is measured as the average distance between the actual length attribute and the corresponding interpolated length attribute. 3.3 Varying the Distance Threshold Recall that the distance threshold, described in Section 2.3, is used to determine whether an approximate output ray and the corresponding actual output ray are close enough (in terms of L? distance) to terminate a subdivision process. We varied the distance threshold from 0.01 to 0.25 while the other parameters are fixed. The results for the Bezier surface scenes are shown in Fig. 5. As expected, the actual error decreases as the threshold is lowered, due to denser sampling. But, the overhead of more sample computations reduces the speedup. However, even for low thresholds where the image quality is high, the CPU-speedup is greater than 2 and the FLOP-speedup
61
Figure 5: Varying the distance threshold, (angular threshold = 30°, maximum tree depth = 28, 600 x 600 image, nonantialiased). Note that the y-axis does not always start at 0. is greater than 3. These speedups can be quite significant for ray-tracing, where a single frame can take a long time to render. Fig. 10 (b) and (c) demonstrate how the variation in error reflects the changes in the quality of the rendered image. Notice the blockiness in part (c) when the data structure is not subdivided as densely as in part (b). 3.4 Varying the Angular Threshold The angular threshold, described in Section 2.5, is applied to each query to determine whether the surface curvature variation is too high to apply interpolation. We investigated the speedup and error as a function of the angular threshold over the renderings of three different random volume scenes. The angular threshold is varied from 5° to 30°. The results are shown in Fig. 6. For lower thresholds, fewer rays could be interpolated due to distant interpolants, and those rays are traced instead. In this case, the actual error committed is smaller but at the expense of lower speedups. However, the speedups are still
acceptable even for low thresholds. 3.5 Varying the Maximum Tree Depth Recall that the maximum tree depth, described in Section 2.3, is imposed to avoid excessive tree depth near discontinuity boundaries. We considered maximum depths ranging from 22 to 30. (Because this is a kd-tree in 4-space, four levels of descent are generally required to halve the the diameter of a cell.) The results for the Bezier surface scenes are shown in Fig. 7. The angular threshold is fixed at 30°, and the distance threshold is fixed at 0.05. As the tree is allowed to grow up to a higher depth, rays are sampled with increasing density in the regions where the geometric attributes have greater variation, and thus, error committed by the interpolation algorithm decreases with higher depths. The speedup graph shows a more interesting behavior. Up to a certain depth, the speedup increases with depth. This is due to the fact that for lowdepth trees, many of the interpolants cannot pass the angular
62
Figure 6: Varying angular threshold (distance threshold=0.25, maximum depth=28,300 x 300, antialiased).
Figure 7: Varying tree depth (distance threshold=0.05, angular threshold=30,600 x 600, non-antialiased). threshold test, and many rays are being traced rather than interpolated. And so, the speed-ups are low for low-depth trees. Until the peak value of speed-up at some depth value is reached, the performance gain we get from replacing raytraced pixels by interpolations dominates the overhead of denser sampling. However, with sufficiently large depth values, the speedup decreases as the tree depth becomes higher, since the overhead caused by denser sampling starts dominating. It seems that a wise choice of depth would be a value that results in both a lower error, and reasonable speedup. For example for the given graph, depth 28 could be a good choice. In addition, Table 2 shows the required memory when depth is varied. When the tree is unnecessarily deep, not only does the speedup decrease, but space requirements increase as well.
3.6 Varying the Cache Size As mentioned earlier, the RItree functions as an LRU cache. If an upper limit for the available memory—the cache size—is specified, the least recently used paths are pruned based on time stamps set whenever a path is accessed. Excessively small cache sizes can result in frequent regeneration of the same cells. For the Bezier surface scene, we have varied the cache size from 0.128 to 2.048 megabytes (MB). The resulting speedup graph is shown in Fig. 8. Notice that we used small cache sizes to demonstrate the sudden increase in speedup as the cache size approaches a reasonable value. Normally, we set the cache size to 100MB which is high enough to handle bigger scenes with many data structures. There are additional parameters involved in garbage collection, such as what percentage of the cache should be pruned. In these experiments, each garbage
63
Figure 8: Varying cache size (distance threshold = 0.05, angular threshold = 30, maximum tree depth = 28,600 x 600 image, non-antialiased). Dist Thresh 0.25 0.05
Ang Thresh 30 10
Tree Depth 28 28
Speedup (FLOP) 2.65 2.40
Speedup(CPU-time) 1.89 1.63
Error 0.00482 0.00190
Memory (MB) 34 47
Table 1: Sample results for tomatoes scene (1200 x 900 non-antialiased).
collection prunes 70% of the cache.
Note that the closest objects along the eye rays are correctly determined by interpolation, as are the reflection rays from the wrap and the bowl, and the shadows. The sky is reflected on the wrap. As expected, for lower threshold values we can get a very high quality image and still achieve speedups of 2 or higher. If quality is not the main objective, we can get approximate images at higher speedups. The error given is the average RGB-error as explained above.
3.7 Volume Visualization Experiments We have tested the algorithm for the volume visualization application using the same random volumes we have used for refractive objects. Images are 600 x 600 and not antialiased. Sample run results are shown in Table 2. The FLOP speedup varies from 2.817 to 3.549, and CPU speedup varies from 2.388 to 2.814. For higher resolutions, or anti-aliased images the speedups could be higher. The error could be as low as 0.008 for low distance thresholds, and is still at a reasonable value References for higher thresholds. Fig. 12 shows the actual image, and [1] J. Amanatides. Ray tracing with cones. Computer Graphics the interpolated image visualizing one of the random volume (Proc. ofSIGGRAPH84), 18(3): 129-135, 1984. scenes. All objects have 0.5 opacity, and all have solid gray [2] J. Arvo and D. Kirk. Fast ray tracing by ray classification. colors. Computer Graphics (Proc. of SIGGRAPH 87), 21(4): 196205, 1987. 3.8 Performance and Error for Tomatoes Scene Finally, [3] F.B. Atalay and D.M. Mount. Ray interpolants for fast raywe have tested the algorithm on the tomatoes scene generattracing reflections and refractions. Journal of WSCG (Proc. ing an image of size 1200 x 900, non-antialiased. Table 1 International Conf. in Central Europe on Comp. Graph., shows sample results for the tomato scene and Fig. 9 shows Visualization and Comp. Vision), 10(3): 1-8, 2002. the corresponding images. Fig. 9(a) shows the ray-traced [4] K. Bala, J. Dorsey, and S. Teller. Radiance interpolants for accelerated bounded-error ray tracing. ACM Trans, on image. Part (b) shows the interpolated image, and a corGraph., 18(3), August 1999. responding color-coded image in which the white regions [5] J. L. Bentley. Multidimensional binary search trees used denote the pixels that were traced rather than interpolated. for associative searching. Commun. of ACM, 18(9): 509-517, Part (c) shows the interpolated image generated with lower 1975. thresholds and the corresponding color-coded image. Notice [6] J. Bloomenthal. An Introduction to Implicit Surfaces. that the artifacts in part (b) are corrected in part (c). Morgan-Kaurmann, San Francisco, 1997. 64
[7] J. B. Van de Kamer and J. J. W. Lagendijk. Computation of high-resolution SAR distributions in a head due to a radiating dipole antenna representing a hand-held mobile phone. Physics in Medicine and Biology, 47:1827-1835, 2002. [8] D. S. Ebert, F. K. Musgrave, D. Peachey, K. Perlin, and S. Worley. Texturing and Modelling. Academic Press Professional, San Diego, 1998. [9] J. Foley, A. van Dam, S. Feiner, and J. Hughes. Computer Graphics Principles and Practice. Addison-Wesley, Reading, Mass., 1990. [10] A. Gershun. The light field. Journal of Mathematics and Physics, XVIll:5l-\5\, 1939. Moscow, 1936, Translated by P. Moon and G. Timoshenko. [11] A. S. Glassner. Space subdivision for fast ray tracing. IEEE Comp. Graph. andAppl., 4(10): 15-22, October 1984. [12] A. S. Glassner(editor). An Introduction to Ray Tracing. Academic Press, San Diego, 1989. [13] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. Computer Graphics (Proc. of SIGGRAPH 96), pages 43-54, August 1996. [14] P. S. Heckbert and P. Hanrahan. Beam tracing polygonal objects. Computer Graphics (Proc. of SIGGRAPH 84), 18(3): 119-127, July 1984. [15] M. R. Kaplan. Space tracing a constant time ray tracer. State of the Art in Image Synthesis (SIGGRAPH 85 Course Notes), 11, July 1985. [16] M. Levoy and P. Hanrahan. Light field rendering. Computer Graphics (Proc. of SIGGRAPH 96), pages 31-42, August 1996. [17] P. Moon and D. E. Spencer. The Photic Field. MIT Press, Cambridge, 1981. [18] M. Ohta and M. Maekawa. Ray coherence theorem and constant time ray tracing algorithm. Computer Graphics 1987 (Proc. of CG International '87), pages 303-314, 1987. [19] S. Rubin and T. Whitted. A three-dimensional representation for fast rendering of complex scenes. Computer Graphics (Proc. of SIGGRAPH 80), 14(3): 110-116, July 1980. [20] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989. [21] D. M. Y. Sommerville. Analytical Geometry in Three Dimensions. Cambridge University Press, Cambridge, 1934. [22] D. Zorin, P. Schroder, and W. Sweldens. Interpolating subdivision for meshes with arbitrary topology. Computer Graphics (Proc. of SIGGRAPH 96), pages 189-192, 1996.
65
Test Input Bezier Surface
Random Volumes (ray-tracing)
Random Volumes (volume visualization) Test Input Bezier Surface
Random Volumes (ray-tracing)
Input Scene Bezier Surface
Random Volumes (ray-tracing)
Random Volumes (volume visualization)
Dist Thresh 0.010 0.025 0.050 0.075 0.100 0.125 0.150 0.175 0.200 0.225 0.250 0.010 0.025 0.050 0.075 0.100 0.125 0.150 0.175 0.200 0.225 0.250 0.050 0.150 0.250 Tree Depth 22 23 24 25 26 27 28 29 30 22 23 24 25 26 27 28 29 30 Ang Thresh 5 10 15 20 25 30 5 10 15 20 25 30 10 15 20 30
Speedup (FLOP) 3.12704 3.43796 3.74473 3.93950 4.08325 4.19358 4.28194 4.35214 4.41940 4.48503 4.52146 3.12173 3.26194 3.40527 3.49906 3.56870 3.62689 3.67422 3.70776 3.74041 3.77341 3.80292 2.95084 3.31043 3.54958 Speedup (FLOP) 3.50486 3.86223 4.05344 4.03178 3.97010 3.85335 3.74473 3.53944 3.36016 3.10450 3.41967 3.70909 3.85445 3.85108 3.84435 3.80292 3.56893 3.34045 Speedup (FLOP) 2.68103 3.51840 3.68553 3.72731 3.74471 3.74473 2.56317 3.44274 3.70928 3.76320 3.78206 3.80292 2.81703 3.21517 3.40653 3.54958
Speedup(CPU-time) 1.96466 1.99712 2.07705 2.11372 2.24707 2.24816 2.29041 2.32532 2.32863 2.34465 2.34591 2.63532 2.64317 2.71941 2.76194 2.79244 2.84409 2.88046 2.89190 2.92770 2.94416 2.91917 2.42804 2.67274 2.81416 Speedup(CPU-time) 2.05642 2.14654 2.21946 2.17521 2.15906 2.05680 2.07705 2.04811 1.97434 2.56453 2.63197 2.74675 2.90357 2.93271 2.87188 2.91917 2.79997 2.73197 Speedup(CPU-time) 1.68226 2.01129 2.11734 2.12195 2.11754 2.07705 2.15410 2.67800 2.83973 2.88208 2.89311 2.91917 2.38833 2.62693 2.73348 2.81416
Error 0.00377 0.00483 0.00676 0.00858 0.0103 0.01185 0.01331 0.01532 0.01655 0.01763 0.01867 0.00627 0.00645 0.00679 0.00722 0.00780 0.00853 0.00890 0.00945 0.00989 0.01048 0.01076 0.00850 0.01179 0.01488 Error 0.02098 0.01491 0.01112 0.01032 0.00953 0.00760 0.00676 0.00663 0.00653 0.01859 0.01708 0.01526 0.01449 0.01305 0.01187 0.01076 0.01026 0.00987 Error 0.00424 0.00591 0.00663 0.00673 0.00676 0.00676 0.00478 0.00896 0.01007 0.01055 0.01063 0.01076 0.01047 0.01340 0.01411 0.01488
Memory (MB) 2.925 2.371 1.931 1.699 1.549 1.442 1.361 1.301 1.253 1.212 1.185 19.252 17.518 15.799 14.765 14.088 13.603 13.183 12.875 12.583 12.331 12.094 11.773 9.503 8.344 Memory (MB) 0.565 0.706 0.881 1.084 1.318 1.603 1.931 2.265 2.629 3.729 4.431 5.441 6.560 7.989 9.660 12.094 14.361 17.413
Table 2: Speedup and actual error on Bezier Surface and Random Volumes (ray-tracing and volume visualization) for various parameter values.
66
Figure 9: (a) Ray-traced image, (b) Interpolated image (distance threshold=0.25, angular threshold=30) and corresponding color-coded image, white areas show the ray-traced regions, (c) Interpolated image (distance threshold=0.05, angular threshold=10) and corresponding color-coded image, showing ray-traced pixels.
67
Figure 10: (a) Ray-traced image, (b) Lower right part of interpolated image (distance threshold=0.01), error = 0.00377, (c) Lower right part of interpolated image (distance threshold=0.15), error = 0.01331.
Figure 11: (a) Ray-traced image, (b) Interpolated image (distance threshold=0.05) and the corresponding color-coded image where white regions indicate pixels that were ray-traced.
Figure 12: (a) Ray-traced image, (b) Interpolated image (distance threshold=0.25).
68
Practical Construction of Metric t-Spanners Gonzalo Navarro t
Rodrigo Paredes *
Abstract Let G(V,A) be a connected graph with a nonnegative cost function d : A —>• M+. Let d<s(u, v) be the cost of the cheapest path between «,v G V. A tspanner of G is a subgraph G'(V,E], E C A, such that V u,v € V, d'(w,v) < t • <Mu,v), t > 1. We focus on the metric space context, which means that A = VxV,dis& metric, and t < 2. Several algorithms to build t-spanners are known, but they do not seem to apply well to our case. We present four practical algorithms to build t-spanners with empirical time costs of the form « • n2* *-» and number of edges of the form Ce - n1+ '*'-»' . These algorithms are useful on general graphs as well.
One of the best existing algorithms to search metric spaces is AESA [17]. AESA precomputes and stores the matrix of n(n — l)/2 distances among elements of U. This huge space requirement makes it unsuitable for most applications, however. This matrix can be seen as a complete graph G( V, A) where the set of vertices V = U corresponds to the objects of the metric space, and the set of edges A corresponds to the n(n — l)/2 distances among these objects. A t-spanner G' of G would represent all these distances using a small number of edges E,E C A, and still would be able to approximate all the distances with a maximum error t, that is:
1 Introduction Let G be a connected graph G(V> A) with a nonnegative cost function d(e) assigned to its edges e 6 A. The shortest path among every pair of vertices tt, v 6 V is the one minimizing the sum of the cost of the edges traversed, do(tt, v). This can be computed with Floyd's algorithm or with \V\ iterations of Dijkstra's algorithm considering each vertex as the origin node [18]. A t-spanner it is a subgraph Cr;(V, E), with E C A, which permits to compute paths with stretch t, that is, ensuring that V w, v € V, d&fa v) < t • da(u, v) [13]. We call this the t-condition. In this work we are interested in using t-spanners as tools for searching metric spaces [6]. A metric space is a set of objects X and a distance function d defined among objects, which satisfies the metric properties (positiveness, reflexivity, symmetry, triangle inequality). Given a finite subset U C X, of size n, the goal is to build a data structure over U such that later, given a query object q E X, one can find the elements of U close to q with as few distance computations as possible.
In most metric spaces the distance histogram follows a distribution that becomes concentrated as the dimension increases [6]. This means that in practice we are interested in the range t 6 (1> 2]. We pursue this line in [12], where we focus on the search process but not on t-spanner construction. In that paper we show that the search algorithm is competitive against current approaches, e.g., we need 1.09 times the time cost of AESA using only 3.83% of its space requirement, in a metric space of documents; and 1.5 times the time cost of AESA using only 3.21% of its space requirement, in a metric space of strings. We also show that t-spanners provide better space-time tradeoffs than classical alternatives such as pivot-based indexes. There are metric spaces where computing the distance evaluation among the objects is highly expensive. For instance, in the metric space of documents under the cosine distance [3], in order to compute the distance numerous disk accesses and million of basics arithmetics operations are required. In this case the distance evaluation could take hundredths of seconds, which is really expensive even compared against the operations introduced by the graph. In particular, the cost of the distance evaluation absorbs the cost of the shortest paths computation using Dijkstra's algorithm. Hence our interest in this paper is in building tspanners over metric spaces which work well in practice. Few algorithms exist apart from the basic O(mn2) technique (m = |i?|), which inserts the edges needed
'This work has been supported in part by the Millenium Nucleus Center for Web Research, Grant P01-029-F, Mideplan, Chile and MECESUP Project UCH0109 (Chile). t Center for Web Research, Dept. of Computer Science, University of Chile. Blanco Encalada 2120, Santiago, Chile, {gnavaxr o, rapar«da }Cdc c. nchil*. c 1
69
one by one and recomputes all the shortest paths to every edge inserted. Four t-spanner construction algorithms are presented in this paper, with the goals of decreasing CPU and memory cost and of producing t-spanners of good quality, i.e., with few edges. Our four algorithms are: 1. An optimized basic algorithm, where we limit the propagation of edge insertions. 2. A massive edge insertion algorithm, where we amortize the cost of recomputing distances across many edge insertions. 3. An incremental algorithm, where nodes are added one by one to a correct t-spanner. 4. A recursive algorithm applying a divide and conquer technique. Table 1 shows the complexities obtained. We obtain empirical time costs of the form Ct • n2*24 and number of edges of the form Ce • n1'13. This shows that good quality t-spanners can be built in reasonable time (just the minimum spanning tree computation needs O(n2) time). We take no particular advantage of the metric properties of the edge weights, so our algorithms can be used on general graphs too. As far as we know, there has not been previous work on comparing, in practice, t-spanner construction algorithms on metric spaces.
Basic Basic optimized Massive edge insertion Incremental Recursive
CPU time 0(mn2) 0(mk2)
Memory
0(n») 0(n2)
Distance evaluations O(n') 0(n2)
O(nralogra)
0(m)
O(nm)
O(nralogm) O(nralogm)
O(m) 0(m)
0(n«) 0(n?)
Table 1: t-Spanner algorithm complexities comparison. The value k refers to the number of nodes that have to be checked when updating distances due to a new inserted edge. 2
Previous Work
Several studies on general graph t-spanners have been undertaken [8, 13, 14]. Most of them resort to the basic O(mn2) time construction approach detailed in the next section, where n = \V\ and m = \E\ refer to the resulting t-spanner. It was shown in [1, 2] that this
technique produces t-spanners with n1+°(*-») edges on general graphs of n nodes. More sophisticated algorithms have been proposed in [7], producing t-spanners with guaranteed O(n1+(2+e)(1+lo«»m)/t) edges in worst case time O(mn(2+*)(1+1°s»m)/*), where in this case m refers to the original graph. In a metric space m = 0(n2), which means worst case time O(n5). Additionally, the algorithms in [7] work for t € [2, logn], unsuitable for our application. (Some of these algorithms could be adapted to work heuristically for smaller t, but to the best of our knowledge, this has not been done so far.) Other recent algorithms [16] work only for t = 1, 3, 5, ... also unsuitable for us. Parallel algorithms have been pursued in [11], but they do not give new sequential algorithms. As it regards to Euclidean t-spanners, i.e., the subclass of metric t-spanners where the objects are points in a £>-dimensional space with Euclidean distance, much better results exist [8, 1, 2, 10, 9, 15], showing that one can build t-spanners with O(n) edges in O(nlog n) time. These results, unfortunately, make heavy use of coordinate information and cannot be extended to general metric spaces. Other related results refer to probabilistic approximations of metric spaces using tree metrics [4, 5]. The idea is to build a set of trees such that their union makes up a t-spanner with high probability. However, the t values are of the form O (logn log logn). Hence the need to find practical algorithms that allow building appropriate t-spanners for metric spaces, that is, with t < 2, for complete graphs, and taking advantage of the triangle inequality. 3
Basic t-Spanner Construction Algorithm
The intuitive idea to solve this problem is iterative. We begin with an initial t-spanner that contains all the vertices and no edges, and calculate the distance estimations among all vertex pairs. These are all infinite at step zero, except for the distances between a node and itself (d(u, u) = 0). The edges are then inserted until all the distance estimations fulfill the t-condition. The edges are considered in ascending cost order, so we start by sorting them. Using smaller-cost edges first is in agreement with the geometric idea of inserting edges between near neighbors and making up paths from low cost edges in order to use few edges overall. Hence the algorithm uses two matrices. The first, real, contains the true distance between all the objects, and the second, estira, contains the distance estimations obtained with the t-spanner under construction. The tspanner is stored in an adjacency list. The insertion criterion is that an edge is added to the set E only when its current estimation does not 70
satisfy the t-condition. After inserting the edge, it is necessary to update all the distance estimations. The update mechanism is similar to the distance calculation mechanism of Floyd's algorithm, but considering that edges, not nodes, are inserted into the set. Figure 1 depicts the basic t-spanner construction algorithm.
Figure 2: Propagation of distance estimations. shortest path estimations due to the inserted edge. • check: The adjacency of ok, check — adyacency(ok) — ok = {u € U, 3v 6 ok, (u, v) £ E} — ok. These are the nodes that we still need to update. Note that it is necessary to propagate the computation only to the nodes that improve their estimation to ai or 02. The complete algorithm reviews all the edges of the graph. For each edge, it iterates until no Figure 1: Basic t-spanner construction algorithm (t- further propagation is necessary. Figure 3 depicts the optimized basic algorithm. Spanner 0). This algorithm makes O(n2) distance evaluations, like AESA [17]; O(mn2) CPU time (recaU that n = \V\ and m = \E\); and O(n2 + m) = O(n2) memory. Its main deficiencies are excessive edge insertion cost and too high memory requirements. 4 Optimized Basic Algorithm Like the basic algorithm (Section 3), this algorithm considers the use of real and estim matrices, choosing the edges in increasing weight order. The optimization focuses on the distance estimation update mechanism. The main idea is to control the propagation of the computation, that is, only updating the distance estimations that are affected by the insertion of a new edge. Figure 2 shows the insertion of a new edge. In the first update we must modify only the edge that was inserted, between nodes ai and a%. The computation then propagates to the neighbors of the 04 nodes, namely the nodes {61,62,63}; then to the nodes {ci, 02} and finally d±. The propagation stops when a node does not improve its current estimation or when it does not have further neighbors. In order to control the propagation, the algorithm uses two sets, ok and check. ok: The nodes that already have updated their
Figure 3: Optimized basic algorithm (t-Spanner 1). 71
This algorithm takes O(n2) distances evaluations. In terms of CPU time it takes O(m&2), where k is the number of neighbors to check when inserting an edge. In the worst case this becomes O(mn2) just like the basic algorithm, but the average is much better. From the point of view of the memory it still takes O(n2 + m) = O(n2). This algorithm reduces the CPU time used, but even so this is still very high, and the memory requirements are still too high. A good feature of this algorithm is that, just like the basic algorithm, it produces good-quality t-spanners (few edges). So we have used its results to predict the expected number of edges per node in order to speed up other algorithms that rely on massive edge insertion. We call Et-Spann«ri(nid,t) the expected number of edges in a metric space of n objects, distance function d, and stretch t. In Section 8 we show some estimations obtained, see Table 2. 5
size is HZ = 1.2 • \E\t where E refers to the tspanner under construction. We made preliminar experiments in order to fix this value. With values lower than 1.2 the algorithm takes more processing time without improving the number of edges, and with higher values the algorithm inserts more edges than necessary and needs more memory to build the t-spanner. The algorithm stages are: 1. We insert successive MSTs to the t-spanner. The first MST follows the basics Prim algorithm [18], but the next MSTs are built using Prim over the edges that have not been inserted yet. We traverse the nodes sequentially, building the list of pending edges (wrongly t-estimated). At the same time, we insert successive MSTs and remove pending edges accordingly. Additionally, when the current node has no more than HI/2 pending edges, we just insert them (since we only need a small set of edges in order to correct the distance estimations of this node). The insertion of MSTs continues as long as there are more than H% pending edges (note that HI depends on the current t-spanner size \E\).
Massive Edges Insertion Algorithm
This algorithm tries to reduce both the CPU processing time and memory requirements. To reduce the CPU time, the algorithm updates the distance estimations only after performing many edge insertions, using an O(mlogn)-time Dijkstra's algorithm to update distances. To reduce the memory requirement, it computes the distances between objects on the fly. Since we insert edges less carefully than before, the resulting t-spanner is necessarily of lower quality. Our effort is in minimizing this effect. The algorithm has three stages. In the first one, it builds the t-spanner backbone by inserting whole minimum spanning trees (MSTs), and determines the global wrongly t-estimated edge list (pending); in the second one, it refines the t-spanner by adding more edges to improve the wrongly t-estimated edges; and in the third one, it inserts all the remaining "hard" edges. This algorithm uses two heuristic values:
This stage continues until we review all the nodes. The output is the t-spanner backbone (t-Spanner) and the gobal list of pending edges (pending). 2. In the second stage we reduce the pending list. For this sake, we traverse the list of nodes with pending edges (pendingNodes), from more to less pending edges. For each such node, we check which edges have to improve their t-estimation and which do not (edges originally in the pending list may have become well t-estimated along the process). From the still wrongly t-estimated edges, we insert a set of the smaller cost edges of size HI/4: and proceed to the next node (we need to insert more edges in order to improve the distance estimation; with values lower than HI/4 the algorithm takes more processing time without improving the number of edges and with higher values the algorithm inserts more edges than necessary). This allows us to review in the first place the nodes that require more attention, without concentrating all the efforts in the same node. The process considers two special cases. The first one is that we have inserted more than n edges, in which case we regenerate and re-sort the pending Nodes list and restart the process. The second one is that the pending list of the current node is so small that we simply insert its elements.
HI determines the expected number of edges per node, and it is obtained from the t-Spanner 1 edge model: HI = |-Et_sPanneri(n,<M)|/n. With HI we will define thresholds to determine whether or not to insert the remaining edges (those still wrongly t-estimated) of the current node. Note that |Et-Spanner i(n, d, t)\ is an optimistic predictor of the resulting t-spanner estimated size using the massive edges insertion algorithm, so we can use #1 = |-E*-Spannerl(™>d,t)|/n *S * lower bound of
the number of edges per node.
HI is used to determine the pending list size and will give a criterion to determine when to insert an additional MST. The maximum pending list 72
The output condition of the second stage is that t-spanner is increased; inserting less edges at a time, the pending list size is smaller than n/2 (we made increases the processing times whitout decreasing the preliminar experiments in order to fix this value, t-spanner size. and we obtained the best results with n/2). For the distance verification we use an incremental Dijkstra's algorithm with limited propagation, that is, 3. We insert the pending list to the t-spanner. the first time, Dijkstra's algorithm takes an array with Figure 4 depicts the massive edges insertion algo- precomputed distances initialized at t*d(ui,Uj}+e, with rithm. This algorithm takes O(nra) distance evalua- e > 0, j € [1>* — 1]. This is because, if a distance to tions, O(nralogm) CPU time (since we run Dijkstra's node i is not well t-estimated, we do not really need to algorithm once per node), and O(n + m) = O(m) mem- know how bad estimated it is. For the next iterations, ory. It is easy to see that the space requirement is O(m): Dijkstra's algorithm reuses the previously computed the pending list is never larger than O(m) because at array, because there is no need to propagate distances each iteration of stage 1 it grows at most by n, and as from nodes whose estimation has not improved. Figure 5 depicts the incremental node insertion alsoon as it becomes larger than 1.2 • m (#2) we take out gorithm. This algorithm takes O(n2) distance evaluaedges from it by adding a new MST, until it becomes short enough. The CPU time comes from running Dijk- tions, O(nmlogm) CPU time, and O(n + m) = O(m) stra's algorithm once per node at stage 1. At stage 2 we memory. The CPU time comes from the fact that every insert edges in groups of O(m/n), running Dijkstra's node runs Dijkstra's algorithm n/S — 0(1) times. algorithm after each insertion, until we have inserted \pending\ — n/2 — O(m) edges overall. This accounts for other n times we run Dijkstra's algorithm. Hence the O(nmlogm) complexity. This algorithm reduces both CPU time and memory requirements, but the amount of distance evaluations is very high (O(nm) > O(n2)). 6
Incremental Node Insertion Algorithm
This version reduces the amount of distance evaluations to just n(n — l)/2, while preserving the amortized update cost idea. This algorithm, unlike the previous ones, makes a local analysis of nodes and edges, that is, it makes decisions before having knowledge of the whole edge set. We insert the nodes one by one, not the edges. The invariant is that for nodes 1... i — 1 we have a well formed t-spanner, and we want to insert the i-th node to the growing t-spanner. Since the insertion process only locally analyzes the edge set, the resulting t-spanner is suboptimal. For each new node i, the algorithm makes two operations: the first is to connect the node to the growing t-spanner using the cheapest edge (towards a node < i); the second one is to verify that the distance estimations satisfy the t-condition, adding some edges to node i until the invariant is restored. We repeat this process until we insert the whole node set. We also use the HI heuristic, with the difference that we recompute HI at every iteration (since the tspanner size changes). We fixed that the number of edges to insert at a time should be $ = #i/(5 • i) in order to reduce the processing time and the amount of edges inserted to the t-spanner. Inserting more edges at a time obtains lower processing time but the size of the 73
Figure 5: Incremental node insertion algorithm (tSpanner 3). 7
Recursive Algorithm
The incremental algorithm is an efficient approach to construct t-spanners, but it does not consider spatial proximity (or remoteness) among the objects. A way to solve this is that the set in which the t-spanner is incrementally built is made up of near objects. Following this principle, we present a solution that recursively divides the object set into two compact
Figure 4: Massive edges insertion algorithm (t-Spanner 2), pending(u) denotes {e G pending, 3v, e = subsets, builds sub-t-spanners in the subsets, and then merges them. For the initial set division we take two far away objects, pi and j>2> that we call representatives, and then generate two subsets: objects nearer to p\ and nearer to p2* Figure 6 shows the concept graphically. For the recursive divisions we reuse the representative as one of the two objects, and the element farthest to it as the other. The recursion finishes when we have less than 3 Figure 6: We select pi and p2> and then divide the set. objects. The merge step also takes into account the spatial proximity among the objects. When we merge the subthe sub-t-spanner represented by p2 (S*5P2)» we choose t-spanners, we have two node subsets V\ and V^ where the object closest to p\ (it), and insert it into the sub-t|Vi| > \V%| (otherwise we swap the subsets). Then, in spanner represented by pi (stspi) verifying that all the 74
distances towards V\ are well t-estimated. Note that this is equivalent to consider that we use the incremental algorithm, where we insert u into the growing t-spanner stspi. We continue with the second closest and repeat the procedure until all the stspz nodes are inserted into stspi. Figure 7 illustrates. Note that the edges already present in stspz are conserved.
Figure 7: The merge step takes the objects according to their distances towards p\. This algorithm also uses an incremental Dykstra's algorithm with limited propagation, but this time we are only interested in limiting the propagation towards stspi nodes (because we know that towards stsp^ we already satisfy the t-condition). Hence, Dykstra's algorithm takes an array with precomputed distances initialized at t • d(tt», Uj) + e for (t^, Uj) € Vi x Vi, and oo for (u»,iij) € Vi x Vz, where e is a small positive constant. For the next iterations, Dykstra's algorithm reuses the previously computed array. Figure 8 depicts the recursive algorithm and the auxiliary functions used to build and merge sub-tspanners. This algorithm takes O(n2) distance evaluations, O(nmlogm) CPU time, and O(n + ra) = O(m) memory. The cost of dividing the sets does not affect that of the underlying incremental construction. 8
Experimental Results
We have tested our algorithms on synthetic and reallife metric spaces. The synthetic set is formed by 2,000 points in a 20-dimensional space with coordinates in the range [—1,1], with Gaussian distribution forming 256 randomly placed clusters. We consider three different standard deviations to make more crisp or more fuzzy clusters (a = 0.1, 0.3, 0.5). Of course, we have not used the fact that the space has coordinates, but have treated the points as abstract objects in an unknown metric space. Two real-life data sets were tested. The first is a string metric space using the edit distance (a discrete function that measures the minimum number of character insertions, deletions and replacements needed to make the strings equal). The strings form an English dictionary, where we index a subset of n = 24,000 words.
The second is a space of 1,215 documents under the Cosine similarity, which is used to retrieve documents more similar to a query under the vector space model. In this model the space has one coordinate per term and documents are seen as vectors in this high-dimensional space. The similarity corresponds to the cosine of the angle (inner product) among the vectors, and a suitable distance measure is the angle itself. Both spaces are of interest to Information Retrieval applications [3]. The experiments were run on an Intel Pentium IV of 2 GHz, with 512 MB of RAM and a local disk. We are interested in measuring the CPU time needed and the number of edges generated by each algorithm. For shortness we have called t-Spanner 1 the optimized basic algorithm, t-Spanner 2 the massive edges insertion algorithm, t-Spanner 3 the incremental algorithm, and t-Spanner 4 the recursive algorithm. Figures 9 and 10 show a comparison among the four algorithms on the Gaussian data set where we vary the stretch t and the number of nodes, respectively. As can be seen, all the algorithms produce t-spanners of about the same quality, although the optimized basic algorithm is consistently better, as explained. It is interesting to note that in the case of σ = 0.1, t-Spanner 2 has the worst edge performance. This is because, in its first stage, the algorithm inserts a lot of intra-cluster edges and then it tries to connect both inner and peripheral objects among the clusters. Since we need to connect just the peripheral objects, there are a lot of redundant edges that do not improve other distance estimations in the resulting t-spanner. In the construction time, on the other hand, there are large differences. The optimized basic algorithm is impractically costly, as expected. Also, the massive edges insertion algorithm is still quite costly in comparison to the incremental and recursive algorithms. This is due to its large number of distance computations. This reinforces the idea that t-Spanner 2 is not suitable for metric spaces with highly expensive distance evaluation functions. However, we notice that, unlike all the others, this algorithm improves instead of degrading as the clusters become more fuzzy, becoming a competitive choice on uniformly distributed datasets. The quality of the t-spanner also varies from (by far) the worst t-spanner on crisp clusters to the second best on fuzzier clusters. This could be due to two phenomena. The first is that there are fewer redundant edges among the clusters, and the second is that, on a uniform space, t-Spanner 2 inserts "better" edges since they come from MSTs (which use the shortest possible edges). The incremental and recursive algorithms are quite close in both measures, being by far the fastest algorithms.
Figure 8: Recursive algorithm (t-Spanner 4).
The recursive algorithm usually produces slightly better t-spanners thanks to the more global edge analysis. Note that, for t as low as 1.5, we obtain t-spanners whose size is 5% to 15% of the full graph. It is interesting to notice that, for crisp clusters, there is a big jump in the construction time and t-spanner size when we move from t = 1.5 to t = 1.4. The effect is much smoother for fuzzier clusters. A possible explanation is that, for crisp clusters and large enough t, a single edge among cluster centers is enough to obtain a t-spanner. However, when t is reduced below 1.5, this becomes suddenly insufficient and we start having many edges across cluster pairs. We show in Table 2 our least squares fittings on
the data using models of the form |E| = a·n^(1+ε) and time = a·n^(2+ε) microseconds, where ε is a small positive exponent that depends on t. This model has been chosen according to the analytical results of [1, 2]. As can be seen, t-spanner sizes are slightly superlinear and times are slightly superquadratic. This shows that our algorithms represent in practice a large improvement over the current state of the art. We now show some results on the metric space of strings, this time focusing on the behavior in terms of the database size n. Since these tests are more massive, we leave out the optimized basic and the massive edges insertion algorithms: they were really slow even for small subsets. This means, in particular for the massive edges insertion algorithm, that this space is far from uniform.
Figure 9: t-Spanner construction in the synthetic metric space of 2,000 nodes, as a function of t. On the left, edges generated (t-spanner quality). On the right, construction time. Each row corresponds to a different variance.
Figure 10: t-Spanner construction in the synthetic metric space of 2,000 nodes, as a function of the number of nodes. On the left, edges generated (t-spanner quality). On the right, construction time. Each row corresponds to a different variance.
Table 2: Empirical complexities of our algorithms, as a function of n and t. Time is measured in microseconds.
Figure 11 shows that, also for strings, the number of edges generated is slightly superlinear (with leading constant 8.03 for the incremental algorithm and 8.45 for the recursive one), and the construction time is slightly superquadratic (with leading constants 1.46 microseconds for the incremental algorithm and 1.67 for the recursive one). The recursive algorithm is almost always a bit better than the incremental algorithm in both aspects. Finally, Figure 12 shows experiments on the space of documents. We have excluded the massive edges insertion algorithm, which was too slow. The reason this time is that it is the algorithm that makes, by far, the most distance computations, which are clearly the dominant term in this space (comparing two document vocabularies takes several milliseconds). We can see again that, although all the algorithms produce t-spanners of about the same quality, the optimized basic algorithm is much more expensive than the other two, which are rather similar.
9 Conclusions
We have presented several algorithms for t-spanner construction when the underlying graph is the complete graph representing distances in a metric space. This is motivated by our recent research on searching metric spaces, which shows that t-spanners are well suited as data structures for this problem. For this purpose, we need practical construction algorithms for 1 < t < 2. To the best of our knowledge, no existing technique has been shown to work well under this scenario (complete graph, metric distances, small t, practical construction time) and no practical study has been carried out on the subject. However, our algorithms are also well suited to general graphs. Our focus has been on practical algorithms. We
have shown that it is possible to build good-quality t-spanners in reasonable time. We have empirically obtained time costs of the form Ct·n^(2+ε) and numbers of edges of the form Ce·n^(1+ε), for a small exponent ε that depends on t. Note that just computing the minimum spanning tree requires O(n^2) time. Moreover, just computing all the distances in a general graph requires O(n^3) time. Compared to the existing algorithms, our contribution represents in practice a large improvement over the current state of the art. Note that in our case we do not provide a guarantee on the number of edges. Rather, we show that in practice we generate t-spanners with few edges using fast algorithms. It is possible to add and remove elements from the t-spanner in reasonable time while preserving its quality. The incremental algorithm permits adding new elements. Removal of a node can be arranged by adding a clique among its neighbors and periodically reconstructing the t-spanner with the recursive algorithm. Future work involves using t-spanners where t depends on the actual distance between the nodes; basically, we are more interested in approximating short distances well than long ones. On the other hand, we are investigating fully dynamic t-spanners, which means that the t-spanner allows object insertions and deletions while preserving its quality. This is important when we use t-spanners to build an index for metric databases in real applications. Another direction is probabilistic t-spanners, where distances are well t-estimated with high probability, so that with far fewer edges we find most of the results.
Acknowledgement The second author wishes to thank AT&T LA Chile for the use of their computer to run the experiments and for letting him continue his doctoral studies.
Figure 11: t-Spanner construction on the space of strings, for increasing n. (a) number of edges generated, (b) construction time.
Figure 12: t-Spanner construction on the set of documents, as a function of t. (a) number of edges generated, (b) construction time.
References
[1] I. Althofer, G. Das, D. Dobkin, and D. Joseph. Generating sparse spanners for weighted graphs. In Proc. 2nd Scandinavian Workshop on Algorithm Theory (SWAT'90), LNCS 447, pages 26-37, 1990.
[2] I. Althofer, G. Das, D. Dobkin, D. Joseph, and J. Soares. On sparse spanners of weighted graphs. Discrete and Computational Geometry, 9:81-100, 1993.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[4] Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th Symposium on the Theory of Computing (STOC'98), pages 161-168, 1998.
[5] M. Charikar, C. Chekuri, A. Goel, S. Guha, and S. Plotkin. Approximating a finite metric by a small number of tree metrics. In Proc. 39th Symp. on Foundations of Computer Science (FOCS'98), pages 379-388, 1998.
[6] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Proximity searching in metric spaces. ACM Computing Surveys, 33(3):273-321, September 2001.
[7] E. Cohen. Fast algorithms for constructing t-spanners and paths with stretch t. SIAM J. on Computing, 28:210-236, 1998.
[8] D. Eppstein. Spanning trees and spanners. In Handbook of Computational Geometry, pages 425-461. Elsevier, 1999.
[9] J. Gudmundsson, C. Levcopoulos, and G. Narasimhan. Improved greedy algorithms for constructing sparse geometric spanners. In Proc. 7th Scandinavian Workshop on Algorithm Theory (SWAT 2000), LNCS 1851, pages 314-327, 2000.
[10] J. M. Keil. Approximating the complete Euclidean graph. In Proc. 1st Scandinavian Workshop on Algorithm Theory (SWAT'88), LNCS 318, pages 208-213, 1988.
[11] W. Liang and R. Brent. Constructing the spanners of graphs in parallel. Technical Report TR-CS-96-01, Dept. of CS and CS Lab, The Australian National University, January 1996.
[12] G. Navarro, R. Paredes, and E. Chávez. t-Spanners as a data structure for metric space searching. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE 2002), LNCS 2476, pages 298-309. Springer, 2002.
[13] D. Peleg and A. Schaffer. Graph spanners. Journal of Graph Theory, 13(1):99-116, 1989.
[14] D. Peleg and J. Ullman. An optimal synchronizer for the hypercube. SIAM J. on Computing, 18:740-747, 1989.
[15] J. Ruppert and R. Seidel. Approximating the d-dimensional complete Euclidean graph. In 3rd Canadian Conference on Computational Geometry, pages 207-210, 1991.
[16] Mikkel Thorup and Uri Zwick. Approximate distance oracles. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 183-192. ACM Press, 2001.
[17] E. Vidal. An algorithm for finding nearest neighbors in (approximately) constant average time. Patt. Recog. Lett., 4:145-157, 1986.
[18] Mark Allen Weiss. Data Structures and Algorithm Analysis. Addison-Wesley, 2nd edition, 1995.
I/O-efficient Point Location using Persistent B-Trees
Lars Arge*
Andrew Danner†
Sha-Mayn Teh
Department of Computer Science, Duke University
Abstract
We present an external planar point location data structure that is I/O-efficient both in theory and practice. The developed structure uses linear space and answers a query in optimal O(logB N) I/Os, where B is the disk block size. It is based on a persistent B-tree, and all previously developed such structures assume a total order on the elements in the structure. As a theoretical result of independent interest, we show how to remove this assumption. Most previous theoretically I/O-efficient planar point location structures are relatively complicated and have not been implemented. Based on a bucket approach, Vahrenhold and Hinrichs therefore developed a simple and practical, but theoretically non-optimal, heuristic structure. We present an extensive experimental evaluation that shows that on a range of real-world Geographic Information Systems (GIS) data, our structure uses fewer I/Os than the structure of Vahrenhold and Hinrichs to answer a query. On a synthetically generated worst-case dataset, our structure uses significantly fewer I/Os.
1 Introduction
The planar point location problem is the problem of storing a planar subdivision defined by N segments such that the region containing a query point p can be computed efficiently. Planar point location has many applications in, e.g., Geographic Information Systems (GIS), spatial databases, and graphics. In many of these applications, the datasets are larger than the size of physical memory and must reside on disk. In such cases, the transfer of data between the fast main memory and slow disks (I/O), not the CPU computation time, often becomes the bottleneck. Therefore, we are interested in planar point location structures that minimize the number of I/Os needed to answer a query. While several theoretically I/O-efficient planar point location structures have been developed, e.g., [17, 1, 8], they are all relatively complicated and consequently none of them have been implemented. Based on a
* Supported in part by the National Science Foundation through ESS grant EIA-9870734, RI grant EIA-9972879, CAREER grant CCR-9984099, ITR grant EIA-0112849, and U.S.-Germany Cooperative Research Program grant INT-0129182. Email: [email protected].
† Supported in part by the National Science Foundation through grant CCR-9984099. Email: [email protected].
bucket approach, Vahrenhold and Hinrichs developed a simple and practically efficient, but theoretically non-optimal, heuristic structure [24]. In this paper, we show that a point location structure based on a persistent B-tree is efficient both in theory and practice; the structure obtains the theoretically optimal bounds, and our experimental investigation shows that, for a wide range of real-world GIS data, it uses fewer I/Os to answer a query than the structure of Vahrenhold and Hinrichs. For a synthetically generated worst-case dataset, the structure uses significantly fewer I/Os to answer a query.
1.1 I/O-model and previous results
We work in the standard I/O model of computation proposed by Aggarwal and Vitter [2]. In this model, computation can only occur on data stored in a main memory of size M, and an I/O transfers a block of B consecutive elements between an infinite-sized disk and main memory. The complexity measures in this model are the number of I/Os used to solve a problem (answer a query) and the number of disk blocks used. Aggarwal and Vitter showed that O((N/B) logM/B(N/B)) is the external memory equivalent of the well-known O(N log2 N) internal memory sorting bound. Similarly, the B-tree [10, 12, 18] is the external equivalent of an internal memory balanced search tree. It uses linear space, O(N/B) blocks, to store N elements; supports updates in O(logB N) I/Os; and performs one-dimensional range queries in optimal O(logB N + T/B) I/Os, where T is the number of reported elements. Using a general technique by Driscoll et al. [14], persistent versions of the B-tree have also been developed [11, 26]. A persistent data structure maintains a history of all updates performed on it, such that queries can be answered on any of the previous versions of the structure, while updates can only be performed on the most recent version (thus creating a new version). (The type of persistence we describe here is often called partial persistence, as opposed to full persistence, where updates can be performed on any previous version.) A persistent B-tree uses O(N/B) space, where N is the number of updates performed, and updates and range queries can be performed in O(logB N) and O(logB N + T/B) I/Os,
respectively [11, 26]; note that the structure requires that all elements stored in it during its entire lifespan are comparable, that is, that the elements are totally ordered. In the RAM model, several linear-space planar point location structures that can answer a query in optimal O(log2 N) time have been developed (e.g., [16, 20, 19]). One of these structures, due to Sarnak and Tarjan [20], is based on a persistent search tree. In this application of persistence, not all elements (segments) stored in the structure over its lifespan are comparable; thus a similar I/O-efficient structure cannot directly be obtained using a persistent B-tree. See the recent survey by Snoeyink [21] for a full list of RAM model results. In the I/O-model, Goodrich et al. [17] developed an optimal static point location structure using linear space and answering a query in O(logB N) I/Os. Agarwal et al. [1] and Arge and Vahrenhold [8] developed dynamic structures. Several structures for answering a batch of queries have also been developed [17, 9, 13, 24]. Refer to [4] for a survey. While these structures are all theoretically I/O-efficient, they are all relatively complicated and consequently none of them have been implemented. Based on an internal memory bucket approach [15], Vahrenhold and Hinrichs therefore developed a simple but non-optimal heuristic structure, which performs well in practice [24]. The main idea in this structure is to impose a grid on the segments defining the subdivision and store each segment in a "bucket" corresponding to each grid cell it intersects. The grid is constructed such that for certain kinds of "nice" data, each segment is stored in O(1) buckets (such that the structure requires linear space) and each bucket contains O(B) segments (such that a query can be answered in O(1) I/Os). In the worst case, however, each segment may be stored in Ω(√(N/B)) buckets and consequently the structure may use Ω((N/B)·√(N/B)) space. In this and some other cases, there may be a bucket containing O(N) segments such that a query takes O(N/B) I/Os. Most of the structures in the above results actually solve a slightly generalized version of the planar point location problem, namely the vertical ray-shooting problem: given a set of N non-intersecting segments in the plane, the problem is to construct a data structure such that the segment directly above a query point p can be found efficiently. This is also the problem we consider in this paper.
1.2 Our results
The main result of this paper is an external data structure for vertical ray-shooting (and thus planar point location) that is I/O-efficient both in theory and practice. The structure, described in Section 2, uses linear space and answers a query in optimal O(logB N) I/Os. It is based on the persistent search tree idea of Sarnak and Tarjan [20], and as a theoretical contribution of independent interest, we show how to modify the known persistent B-tree such that only elements present in the same version of the structure need to be comparable, that is, so no total order is needed. In Section 3, we then present an extensive experimental evaluation of the structure's practical performance compared to the heuristic grid structure of Vahrenhold and Hinrichs using both real-world and artificial (worst-case) datasets. In their original experimental evaluation, Vahrenhold and Hinrichs [24] used hydrology and road feature data extracted from the U.S. Geological Survey Digital Line Graph dataset [23]. On similar "nicely" distributed sets of short segments, our structure answers queries in about half as many I/Os as the grid structure but requires about twice as much space. On less "nice" data, our structure performs significantly better than the grid structure; we present one example where our structure answers queries using 90% fewer I/Os and requires 94% less space.
2 Ray-shooting using persistent B-trees
Our structure for answering vertical ray-shooting queries among a set of non-intersecting segments in the plane is based on the persistent search tree idea of Sarnak and Tarjan [20]. This idea utilizes the fact that any vertical line l in the plane naturally introduces an "above-below" order on the segments it intersects. This means that if we conceptually sweep the plane from left to right (−∞ to ∞) with a vertical line, inserting a segment in a persistent search tree when its left endpoint is encountered and deleting it again when its right endpoint is encountered, we can answer a ray-shooting query p = (x, y) by searching for the position of y in the version of the search tree we had when l was at x. Refer to Figure 1.
Figure 1: Vertical ray-shooting using sweep and persistent search tree.
Note that two segments that cannot be intersected with the same vertical line are not "above-below" comparable. This
means that not all elements (segments) stored in the persistent structure over its lifespan are comparable and thus an I/O-efficient structure cannot directly be obtained using a persistent B-tree. To make the structure I/O-efficient, we need a persistent B-tree that only requires elements present in the same version of the structure to be comparable. In Section 2.1, we first sketch the persistent B-tree of [11] and then in Section 2.2 we describe the modifications needed to use the tree in a vertical ray-shooting structure.
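As a deliberately naive illustration of this sweep idea, the C++ sketch below materializes one fully copied version of the above-below order per sweep event and answers a query by locating the latest version at or before the query's x-coordinate. It assumes segments in general position (non-vertical, pairwise distinct, no shared endpoint x-coordinates) and, unlike the persistent B-tree developed in this section, it shares nothing between versions, so it is neither space-efficient nor I/O-efficient; it only shows where the versions come from and how they are queried.

```cpp
#include <algorithm>
#include <iterator>
#include <optional>
#include <vector>

struct Seg { double x1, y1, x2, y2; };                  // assumes x1 < x2, segments non-intersecting

static double y_at(const Seg& s, double x) {            // y-coordinate of s at abscissa x
    return s.y1 + (s.y2 - s.y1) * (x - s.x1) / (s.x2 - s.x1);
}

struct Version { double x; std::vector<Seg> active; };  // active segments, sorted bottom-to-top

// One snapshot ("version") per sweep event; a persistent search tree would instead
// share unchanged nodes between consecutive versions.
std::vector<Version> build_versions(const std::vector<Seg>& segs) {
    struct Event { double x; bool insert; Seg s; };
    std::vector<Event> ev;
    for (const Seg& s : segs) { ev.push_back({s.x1, true, s}); ev.push_back({s.x2, false, s}); }
    std::sort(ev.begin(), ev.end(), [](const Event& a, const Event& b) { return a.x < b.x; });

    std::vector<Version> versions;
    std::vector<Seg> active;
    for (const Event& e : ev) {
        if (e.insert) {                                  // insert keeping the above-below order
            auto pos = std::lower_bound(active.begin(), active.end(), e.s,
                [&](const Seg& a, const Seg& b) { return y_at(a, e.x) < y_at(b, e.x); });
            active.insert(pos, e.s);
        } else {                                         // remove the segment ending at this event
            active.erase(std::remove_if(active.begin(), active.end(),
                [&](const Seg& a) { return a.x1 == e.s.x1 && a.y1 == e.s.y1 &&
                                           a.x2 == e.s.x2 && a.y2 == e.s.y2; }),
                         active.end());
        }
        versions.push_back({e.x, active});               // full copy = a new "version"
    }
    return versions;
}

// Vertical ray-shooting: the segment directly above query point (qx, qy), if any.
std::optional<Seg> shoot_up(const std::vector<Version>& versions, double qx, double qy) {
    auto it = std::upper_bound(versions.begin(), versions.end(), qx,
        [](double x, const Version& v) { return x < v.x; });
    if (it == versions.begin()) return std::nullopt;     // query lies before the first event
    for (const Seg& s : std::prev(it)->active)           // scan bottom-to-top within the slab
        if (s.x1 <= qx && qx <= s.x2 && y_at(s, qx) >= qy) return s;
    return std::nullopt;
}
```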
2.1 Persistent B-tree
A B-tree, or more generally an (a, b)-tree [18], is a balanced search tree with all leaves on the same level, and with all internal nodes except possibly the root having Θ(B) children (typically between B/2 and B). Normally, elements are stored in the leaves, and the internal nodes contain "routing elements" used to guide searches (sometimes called a B+-tree). Since each node and leaf contains Θ(B) elements, it can be stored in O(1) blocks, which in turn means that the tree uses linear space (O(N/B) disk blocks). Since the tree has height O(logB N), a range search can be performed in O(logB N + T/B) I/Os. Insertions and deletions can be performed in O(logB N) I/Os using split, merge, and share operations on the nodes on a root-leaf path [18, 10, 12]. A persistent (or multiversion) B-tree as described in [11, 26] is a directed acyclic graph (DAG) with elements in the sinks (leaves) and "routing elements" in internal nodes. Each element is augmented with an insert and a delete version (or time), defining the existence interval of the element; an element is alive in its existence interval and dead otherwise. Similarly, an existence interval is associated with each node, and it is required that the nodes and elements alive at any time t (in version t) form a B-tree with fanout between αB and B for some constant 0 < α < 1/2. Given the appropriate root (in-degree 0 node) we can thus perform a range search in any version of the structure in O(logB N + T/B) I/Os. To be able to find the appropriate root at time t in O(logB N) I/Os, the roots are stored in a standard B-tree, called the root B-tree. An update in the current version of a persistent B-tree (and thus the creation of a new version) may require structural changes and creation of new nodes. To control these changes and obtain linear space use, an additional invariant is imposed on the structure: whenever a new node is created, it must contain between (α + γ)B and (1 − γ)B alive elements (and no dead elements) for 0 < γ < 1/2 − α/2. To insert a new element x in the current version of a persistent B-tree we first perform a search for the relevant leaf l using O(logB N) I/Os. Then we insert
x in l. If l now contains more than B elements, we have a block overflow. In this case we perform a version-split; we copy all, say k, alive elements from l and mark l as deleted, i.e., we update its existence interval to end at the current time. If (α + γ)B ≤ k ≤ (1 − γ)B, we create a new leaf with the k elements and recursively update parent(l) by persistently deleting the reference to l and inserting a reference to the new node. If on the other hand k < (α + γ)B or k > (1 − γ)B, we have a strong underflow or strong overflow, respectively. The strong overflow case is handled using a split; we simply create two new nodes with approximately half of the k elements each, and update parent(l) recursively in the appropriate way. To guarantee that a split cannot result in a strong underflow we require that ½(1 − γ)B + 1 ≥ (α + γ)B, or equivalently that B(2α + 3γ − 1) ≤ 2. Note the similarity with a split rebalancing operation on a normal B-tree. Similarly, the strong underflow case is handled with operations similar to merge and share rebalancing operations on normal B-trees; we perform a version-split on a sibling l' of l to obtain k' ≥ αB other alive elements. To guarantee that we now have enough alive elements for a new node we require that k + k' ≥ 2αB − 1 ≥ (α + γ)B, or equivalently that B ≥ 1/(α − γ). If k + k' ≤ (1 − γ)B, we create a new leaf with the k + k' elements. If on the other hand k + k' > (1 − γ)B, we first perform a split in order to create two new leaves. The first case corresponds to a merge and the second to a share. Finally, we recursively update parent(l) appropriately. A deletion is handled similarly to an insertion; first the relevant element x in a leaf l is found and marked as deleted. This may result in l containing fewer than αB alive elements, also called a block underflow. To reestablish the invariants we simply perform a version-split followed by a merge (and possibly a split as before). Figure 2 illustrates the "rebalance operations" needed as a result of an insertion or deletion.
Figure 2: Illustration of "rebalancing operations" needed when updating a persistent B-tree.
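The thresholds driving these rebalancing operations can be captured in a few lines. The C++ sketch below encodes only the decision taken once the alive elements of a node have been counted during a version-split; the α and γ defaults are illustrative values consistent with the space-optimizing choice discussed in Section 3.1, not constants taken from the actual implementation.

```cpp
#include <cstdio>

// Invariant from the text: every newly created node must receive between
// (alpha + gamma) * B and (1 - gamma) * B alive elements.
struct Params { double B, alpha, gamma; };

enum class Action { NewNode, Split, MergeOrShare };

// k = number of alive elements copied out of a node during a version-split.
Action after_version_split(const Params& p, int k) {
    const double lo = (p.alpha + p.gamma) * p.B;   // below this: strong underflow
    const double hi = (1.0 - p.gamma) * p.B;       // above this: strong overflow
    if (k < lo) return Action::MergeOrShare;       // borrow alive elements from a sibling
    if (k > hi) return Action::Split;              // create two new nodes with ~k/2 elements each
    return Action::NewNode;                        // one new node with the k elements
}

int main() {
    Params p{290.0, 0.2, 0.19};                    // e.g. B = 290 leaf elements, alpha = 1/5
    for (int k : {20, 150, 280})
        std::printf("k = %3d -> action %d\n", k, static_cast<int>(after_version_split(p, k)));
    return 0;
}
```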
Both insertions and deletions are performed in O(logB N) I/Os, since the changes needed on the leaf level can be performed in O(1) I/Os, and the "rebalancing" at most propagates from l to the root of the current version. Since Ω(γB) updates have to be performed in a leaf from the time it is created until a new rebalancing operation is needed, the total number of leaves created during N updates is O(N/(γB)) = O(N/B). Each time a leaf is created or marked dead, a corresponding insertion or deletion is performed recursively one level up the tree. Thus the number of nodes created one level above the leaves during N updates is O(N/(γB)^2). In total, the number of nodes created is the sum over all levels, Σ_i O(N/(γB)^i) = O(N/B). Details can be found in [11, 26]. Theorem 1 ([11, 26]) After N insertions and deletions of elements from a total order into an initially empty persistent B-tree, the structure uses O(N/B) space and supports range queries in any version in O(logB N + T/B) I/Os. An update can be performed on the newest version in O(logB N) I/Os.
2.2 Modified persistent B-tree
In the persistent B-tree as described above, elements were stored in the leaves only. To search efficiently, internal nodes also contain elements ("routing elements"). In our discussion of the persistent B-tree so far we did not discuss precisely how these elements are obtained, that is, we did not discuss what element is inserted in parent(v) when a new node (or leaf) v is created and a new reference is inserted in parent(v). As in normal B-trees, a copy of the maximal element in v is used as a routing element in parent(v) when v is created. Since copies of an element can be present as routing elements in internal nodes long after the element is deleted, all elements stored in the persistent B-tree during its entire lifespan need to be comparable (the elements need to be totally ordered). As mentioned, this means that the persistent B-tree cannot be used to design an I/O-efficient vertical ray-shooting structure. To obtain a persistent B-tree structure that only requires elements present in the same version to be comparable, we modify the structure to store actual elements in all nodes; we impose the new invariant that the alive elements at any time t form a B-tree with data elements in internal as well as leaf nodes (as opposed to just in leaves). Except for slight modifications to the version-split, split, merge, and share operations, the insert algorithm remains unchanged. In the delete algorithm we need to be careful when deleting an element x in an internal node u. Since x is associated with a reference to a child u_c of u, we cannot simply mark x dead by updating its existence interval. Instead, we first find its predecessor y in a leaf below u and persistently delete it. Then we delete x from u, insert y with a reference to the child u_c, and perform the relevant rebalancing. What remains is to describe the modifications to the rebalance operations.
Version-split. Recall that a version-split (not leading to a strong underflow or overflow) consists of copying all alive elements in a node u, using them to create a new node v, deleting the reference to u and recursively inserting a reference to v in parent(u). Since the reference to u has an element x associated with it, we cannot simply mark it deleted by updating its existence interval. However, since we are also inserting a reference to the new node v, and since the elements in v are a subset of the elements in u, we can use x as the element associated with the reference to v. Thus we can perform the version-split almost exactly as before, while maintaining the new invariant, by simply using x as the element associated with the reference to v, as shown in Figure 3.
Figure 3: Illustration of a version-split. Partially shaded regions correspond to alive elements, black to dead elements, and white regions to unused space. (The same shading is used at the top of individual elements to indicate their status.)
Split. When a strong overflow occurs after a version-split of u, a split is needed; two new nodes v and v' are created and two references need to be inserted in parent(u). As in the version-split case, the element x associated with u in parent(u) can be used as the element associated with the reference to v'. To maintain the new invariant we then "promote" the maximal element y in v to be used as the element associated with the reference to v in parent(u), that is, instead of storing y in v, we store it in parent(u). Otherwise a split remains unchanged. Refer to Figure 4.
Figure 4: Illustration of a split.
Merge. When a strong underflow occurs after a version-split of u, we perform a version-split of its sibling u' and create a new node v with the obtained elements. We then delete the references to u and u' in parent(u) by marking the two elements x and y
associated with the references to u and u' as deleted. We can reuse the maximal of these elements, say y, as the reference to the new node v. To maintain the new invariant (preserve all elements) we then "demote" x and store it in the new node v. Otherwise a merge remains unchanged. Refer to Figure 5.
Figure 5: Illustration of a merge.
Share. When a merge would result in a new node with a strong overflow, we instead perform a share. We first perform a version-split on the two sibling nodes u and u', and create two new nodes v and v' with the obtained elements. As in the merge case, we delete the references to u and u' in parent(u) by marking the two elements x and y associated with u and u' as deleted. We can reuse the maximal element y as the reference to v', but x cannot be used as a reference to v. Instead, we demote x to v and promote the maximal element z in v to parent(u). Refer to Figure 6.
Figure 6: Illustration of a share.
Theorem 2 After N updates on an initially empty modified persistent B-tree, the structure uses O(N/B) space and supports range queries in any version in O(logB N + T/B) I/Os. An update can be performed on the newest version in O(logB N) I/Os.
Corollary 1 A set of N non-intersecting segments in the plane can be processed into a data structure of size O(N/B) in O(N logB N) I/Os such that a vertical ray-shooting query can be answered in O(logB N) I/Os.
While trivially performing a sequence of N given updates on a (modified as well as unmodified) persistent B-tree takes O(N logB N) I/Os, it has been shown how the N updates can be performed in O((N/B) logM/B(N/B)) I/Os (the sorting bound) on a normal (unmodified) persistent B-tree [25, 3, 6]. In the modified B-tree case, the lack of a total order seems to prevent us from performing the updates in this much smaller number of I/Os. In fact, the existence of such a fast algorithm would immediately lead to a semi-dynamic insert-efficient vertical ray-shooting structure using the external version of the logarithmic method [8].
3 Experimental results
In this section we describe the results of an extensive set of experiments designed to evaluate the performance of the (modified) persistent B-tree when used to answer vertical ray-shooting queries, compared to the grid structure of Vahrenhold and Hinrichs [24].
3.1 Implementations
We implemented both the persistent B-tree and the grid structure using TPIE. TPIE is a library of templated C++ classes and functions designed to ease the implementation of I/O-efficient algorithms and data structures [5, 7]. Below we discuss the two implementations separately.
Persistent B-tree implementation. When using the persistent B-tree to answer vertical ray-shooting queries, the elements (segments) already implicitly contain their existence interval (the x-coordinates of the endpoints). Thus we implemented the structure without explicitly storing existence intervals. (We have also implemented a general persistent B-tree with existence intervals; both implementations will be made available in the next TPIE release.) This way each element occupied 28 bytes. To implement the root B-tree we used the standard B-tree implementation in the TPIE distribution [7]. In this implementation each root element requires 16 bytes. Two parameters α and γ are used in the definition of the persistent B-tree. Since we are working with large datasets, we chose these parameters to optimize for space. Space is minimized when γ (and thus the number of updates needed between creations of nodes) is maximized, and the constraints on α and γ require that we choose γ ≤ min{α − 1/B, 1/3 − 2α/3 + 2/(3B)}. The maximum value of γ is attained when α = 1/5 + 1/B, so we chose α = 1/5 and γ = 1/5 − 1/B accordingly. If we wanted to optimize for query performance, we should choose a larger value of α, leading to a smaller tree height, but a few initial experiments indicated that increasing α had relatively little impact on query performance.
Grid structure implementation. The idea in the grid structure of Vahrenhold and Hinrichs [24] is to impose a grid on the (minimal bounding box of the) segments and store each segment in a bucket corresponding to each grid cell it intersects. The grid is designed to have N/B cells and ideally each bucket is of size B. To adapt to the distribution of segments, a slightly different grid than the natural √(N/B) × √(N/B) grid is used; two parameters, Fx = (1/N) Σ_{i=1..N} |x_i2 − x_i1| and Fy = (1/N) Σ_{i=1..N} |y_i2 − y_i1|, where the i'th segment is given by ((x_i1, y_i1), (x_i2, y_i2)), are used to estimate the
amount of overlap of the x-projections and y-projections of the segments, respectively. Then the number of rows and columns in the grid is calculated as Nx = ax·√(N/B) and Ny = ay·√(N/B), where ax/ay = Fy/Fx and ax·ay = 1. In the TPIE implementation of the grid structure [24], the segments are first scanned to compute Nx and Ny. Then they are scanned again and a copy of each segment is made for each cell it crosses. Each segment copy is also augmented with a bucket ID, such that each copy occupies 32 bytes. Finally, the new set of segments is sorted by bucket ID using TPIE's I/O-optimal merge sort algorithm [2]. The buckets are stored sequentially on disk and a bucket index is constructed in order to be able to locate the position of the segments in a given bucket efficiently. The index is simply an Nx × Ny array with entry (i, j) containing the position of the first segment in the bucket corresponding to cell (i, j) (as well as the number of segments in the bucket). Each index entry is 16 bytes. If L is the number of segment copies produced during the grid structure construction, the algorithm uses O((L/B) logM/B(L/B)) I/Os. Ideally, this would be O((N/B) logM/B(N/B)) I/Os, but in the worst case each segment can cross √(N/B) buckets and the algorithm requires O((N/B)·√(N/B)·logM/B(N/B)) I/Os. Refer to Figure 7 for an example of such a dataset.
Figure 7: Worst-case dataset for the grid method.
Answering a query using the grid method simply involves looking up the position of the relevant bucket (the cell containing the query point) using the bucket index and then scanning the segments in the bucket to answer the query. In the ideal case, each bucket contains O(B) segments and the query can be performed in O(1) I/Os. However, in the worst case, a query takes O(N/B) I/Os.
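For concreteness, here is a small in-memory C++ rendering of the grid heuristic just described. It is not the TPIE implementation: the Fx/Fy projection sums follow the reconstruction given above and may differ in constant factors, segments are assigned to every cell their bounding box overlaps rather than only the cells they actually cross, and the query scans only the bucket of the query cell, so a full implementation would need a fallback when that bucket contains no answer.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Seg { double x1, y1, x2, y2; };                 // assumes x1 < x2 (no vertical segments)

struct Grid {
    double minx = 0, miny = 0, w = 1, h = 1;           // bounding box origin and cell size
    int Nx = 1, Ny = 1;
    std::vector<std::vector<Seg>> buckets;             // row-major: buckets[j * Nx + i]

    Grid(const std::vector<Seg>& segs, double B) {     // B ~ segments per disk block
        double Fx = 0, Fy = 0, maxx = -1e300, maxy = -1e300;
        minx = miny = 1e300;
        for (const Seg& s : segs) {
            Fx += std::fabs(s.x2 - s.x1);              // summed x- and y-projection lengths
            Fy += std::fabs(s.y2 - s.y1);
            minx = std::min({minx, s.x1, s.x2}); maxx = std::max({maxx, s.x1, s.x2});
            miny = std::min({miny, s.y1, s.y2}); maxy = std::max({maxy, s.y1, s.y2});
        }
        double root = std::sqrt(std::max(1.0, segs.size() / B));   // aim for ~N/B cells
        double ax = std::sqrt(Fy / std::max(Fx, 1e-12));           // ax/ay = Fy/Fx, ax*ay = 1
        Nx = std::max(1, (int)std::lround(ax * root));
        Ny = std::max(1, (int)std::lround(root / ax));
        w = std::max((maxx - minx) / Nx, 1e-12);
        h = std::max((maxy - miny) / Ny, 1e-12);
        buckets.assign(Nx * Ny, {});
        for (const Seg& s : segs)                      // copy into every overlapped cell
            for (int j = cy(std::min(s.y1, s.y2)); j <= cy(std::max(s.y1, s.y2)); ++j)
                for (int i = cx(s.x1); i <= cx(s.x2); ++i)
                    buckets[j * Nx + i].push_back(s);
    }
    int cx(double x) const { return std::min(Nx - 1, std::max(0, (int)((x - minx) / w))); }
    int cy(double y) const { return std::min(Ny - 1, std::max(0, (int)((y - miny) / h))); }

    // Vertical ray-shooting restricted to the query cell's bucket.
    const Seg* shoot_up(double qx, double qy) const {
        const Seg* best = nullptr;
        double besty = 1e300;
        for (const Seg& s : buckets[cy(qy) * Nx + cx(qx)]) {
            if (qx < s.x1 || qx > s.x2) continue;
            double y = s.y1 + (s.y2 - s.y1) * (qx - s.x1) / (s.x2 - s.x1);
            if (y >= qy && y < besty) { besty = y; best = &s; }
        }
        return best;
    }
};
```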
3.2 Data
To investigate the efficiency of our vertical ray-shooting data structure, we used road data from the US Census TIGER/Line dataset containing all roads in the United States [22]. In this dataset one curved road is represented as a series of short segments. Roads (segments) are also broken at intersections, such that no two segments intersect other than at endpoints. Because of various errors, the dataset actually contains a few intersecting segments, which we removed in a preprocessing step.
Our first 6 datasets, containing between 16 million segments (374 MB) and 80 million segments (1852 MB), consist of the roads in connected parts of the US (corresponding to the six CDs on which the data is distributed). A summary of the number of segments in each dataset is shown in Table 1, and pictures of the datasets appear in Figure 8. In addition to these large datasets, we also used four smaller datasets with disjoint data regions; dataset CL1 consists of the four US states Washington, Minnesota, Maine and Florida, and CL2 excludes Minnesota. The bounding boxes of CL1 and CL2 are relatively empty with the exception of three or four "hot spots" where the states are located. These datasets were included to investigate the effect of non-uniform data distributions. The dataset DST spans a sparsely populated region of the United States and was included to investigate the effect of longer but more uniformly distributed segments; we expect that the segments are longer and more uniformly distributed in the desert than in metropolitan areas. The dataset ATL, on the other hand, was chosen as a dense, but maybe less uniform, dataset. Finally, in addition to the real-world data, we also generated a dataset LONG, designed to illustrate the worst-case behavior of the grid method corresponding to Figure 7. For query point sets, we generated a list of 100,000 randomly sampled points from inside each of the datasets; for each query point p, we require that segments are hit by vertical rays emanating from p in both the positive and negative y-directions.
Table 1: Number of segments and raw dataset size (assuming 24 bytes per segment) of the datasets.
Data Set   Segments (in Millions)   Size (MB)
D1         16.36                     374
D1-2       31.22                     714
D1-3       41.78                     956
D1-4       57.33                    1312
D1-5       69.82                    1598
D1-6       80.91                    1852
CL1         6.69                     153
CL2         5.09                     116
ATL        10.84                     248
DST         6.40                     146
LONG        1.00                      23
Figure 8: Illustration of the real-world datasets used in our experiments. Black regions indicate areas in the dataset. The outline of the continental US serves as a visualization guide and is not part of the actual data set.
3.3 Experimental setup
We ran our experiments on an Intel PIII 500 MHz machine running FreeBSD 4.5. All code and data resided on a local 10,000 RPM 36 GB SCSI disk. The disk block size was 8 KB, so each leaf of the persistent B-tree could store 290 elements and each internal node (also containing references) 225 elements. In the root B-tree, each leaf could hold 510 elements and each internal node 681 elements. The grid method bucket index has N/256 entries of 16 bytes each. We limited the RAM of the machine to 128 MB and, since the OS was observed to use 60 MB, we limited the amount of memory available to TPIE to 12 MB. Constraining the memory provides insight into how the structures would perform if the system had more total memory but was under heavy load. To increase realism in our experiments, we used TPIE's built-in (8-way set-associative LRU-style) cache mechanism to cache parts of the data structures. Since a query on the persistent B-tree always begins with a search in the relatively small root B-tree, we used a 72-block cache (8 internal nodes, 64 leaf nodes) for the root B-tree, enough to cache the entire structure in all our experiments. We also used a separate 16-block cache for the internal persistent B-tree nodes and a 32-block cache for the leaf nodes. Separate caches were used
for internal nodes and leaves to ensure that accesses to the many leaves did not result in eviction of the few internal nodes from the cache. In total the caches used for the entire persistent structure were of size 120 blocks or 960 KB. For the grid structure, we (in analogy with the root B-tree in the persistent case) cached the entire bucket index, of size 2.41 MB for the largest dataset, D1-6.
3.4 Experimental results
Structure size. Figure 9 shows the size of the grid and persistent B-tree data structures constructed on the 11 datasets. For the real life (TIGER) data, the grid method uses about 1.4 times the space of the raw data, whereas the persistent B-tree sometimes uses almost 3 times the raw data size. The low overhead of the grid method is a result of relatively low segment duplication (around 1.02 average copies per segment) due to the very short segments. The rest of the space is used to store the bucket index. The larger overhead of the persistent B-tree is mainly due to each structural change (rebalance operation) resulting in the creation of multiple copies of each segment; we found that there are roughly 2.4 copies of each segment in a persistent B-tree structure. Analyzing the numbers for the real datasets in more detail reveals that for the first six datasets and ATL the space utilization is quite uniform. For datasets DST, CL1, and CL2 the grid structure uses slightly less space while the persistent B-tree uses more space. The persistent B-tree method uses more space because the relatively small datasets are sparsely populated with segments. At any given time during the construction sweep, the sweep line intersects relatively few segments. As a result, many transitions are made between a height one (one leaf) and height two (with a low-degree root) tree, resulting in relatively low block utilization. For the artificially generated dataset LONG, the space usage of both structures increases dramatically.
Figure 9: Space utilization of grid and persistent B-tree based structures. Size is relative to the raw dataset size.
As expected, the cause of the enormous space use of the grid structure is a high number of segment copies (93 per segment on the average). The persistent B-tree also has a significant increase in space, but not nearly as much as the grid structure: the structure is 94% smaller than the grid structure. The reason for the increased space usage in the persistent B-tree is that all the segments are long and thus they stay in the persistent structure for a long time (in many versions), resulting in a high tree. Furthermore, most of the structural changes in the beginning of the construction algorithm are splits, leading to many copies of alive elements. Similarly, most of the structural changes at the end of the execution are merges, again leading to a large redundancy. Construction efficiency. Figure 10 shows construction results in terms of I/O and physical time. For all the real-life datasets, the persistent B-tree structure uses around 1.5 times more I/Os than the grid structure. This is rather surprising since the theoretical construction bound for the persistent B-tree is O(N logB N) I/Os, compared to the (good case) O((N/B) logM/B(N/B)) bound for the grid structure. Theory thus predicts that the tree construction algorithm should perform about B times as many I/Os as the grid method. The reason for this discrepancy between theory and practice is that during the construction sweep the average number of segments intersecting the sweep line is relatively small. For the D1-6 dataset, it is less than 2500 segments (see Figure 11). The size of the persistent B-tree accessed during each update is small and as a result the caches can store most of the nodes in the tree. While it only takes about 50% more I/Os to construct the persistent B-tree structure than to construct the grid structure, it takes nearly 17 times more physical time. One reason for this is that most of the I/Os performed by the grid construction algorithm are sequential, while the I/Os performed by the persistent B-tree algorithm are more random. Thus the grid algorithm takes advantage of the optimization of the disk controller and the OS file system for sequential I/O, e.g., by prefetching. Another reason is that construction of the persistent B-tree structure is more computationally intensive than construction of the grid. Our trace logs show that over 95% of the construction time for the persistent B-tree is spent in internal memory compared to only 50% for the grid method. While the grid construction algorithm outperforms the persistent B-tree algorithm on the real-life datasets, the worst-case dataset, LONG, causes significant problems. For this dataset, the grid construction takes 48 minutes, compared to 53 minutes for the 80 times bigger dataset D1-6, and compared to 10 minutes for the persistent B-tree. The reason is that the large size of the structure results in a high I/O construction cost.
Figure 10: Construction performance: a) Number of I/Os per 100 segments, b) Construction time per segment.
Figure 11: The number of segments intersecting the sweep line as a function of sweep line position during the construction of the persistent B-tree for the D1-6 dataset.
The persistent B-tree construction also takes relatively more I/Os (and time), mainly due to the high average number of segments intersecting the sweep-line (500000 at the peak). Query efficiency. Our query experiments illustrate the advantage of the persistent B-tree structure over the grid structure; Figure 12 shows that a query is consistently answered in less than two I/Os on the average, while the grid structure uses between approximately 2.6 and 28 I/Os on the average for the real-world datasets, and 126 I/Os for the LONG dataset.
Figure 12: Query performance: a) Number of I/Os per query, b) Time per query in milliseconds.
Analyzing the grid structure results for the real-world datasets in more detail reveals that the performance is mostly a function of the distribution of segments within their bounding box. The bounding boxes become more full as we move from D1 to D1-6, and as a result the average number of I/Os per query drops from 3.76
to 2.65. The dataset ATL overlaps with a significant portion of D1, but because of the better distribution of segments within the bounding box the I/O performance is better for ATL. Datasets CL1 and CL2 exacerbate the problem with non-uniform distributions. For these datasets, most grid cells, more than 90%, are completely empty. As a result, a query within a non-empty cell is very expensive. Finally, as expected, the LONG dataset shows the grid structure's vulnerability to long segments; on the average, a query takes 126 I/Os. Analyzing the persistent B-tree results in detail reveals that its performance is mostly a function of the average number of segments intersecting the sweep line; datasets D1 through D1-6 and ATL have higher average I/O cost than DST, CL1 and CL2. Similarly, dataset D1 has a lower average cost than ATL, since the region of D1 not in ATL has a smaller number of sweepline-segment intersections. This is the opposite of the behavior of the grid method, whose query performance is worse for D1 than ATL. Finally, even though the average number of sweepline-segment intersections for the LONG dataset is more than half a million, the height of the tree is no more than three at any time (version), and as a result the query efficiency is maintained. Further experiments. In our experiments with the TIGER data, we noticed that non-empty buckets in the grid structure often contained three or more disk blocks of segments. Therefore we expected that by reducing the grid spacing in both the x- and y-direction by a factor of two, creating four times as many buckets, each bucket would likely contain only one disk block, leading to improved query performance. Such an improvement is of course highly data dependent. It would also come at the cost of space, since we must index four times as many buckets, and the denser grid may result in more segment duplication. To investigate this we ran our tests again using such a modified grid and found that for the real-life datasets, the number of segment copies did not increase significantly (from 1.02 to 1.04 copies per segment). Thus the construction performance was maintained. In terms of query performance, we found that the four-fold increase in the number of buckets leads to a factor of two to three improvement in the query performance. Refer to Figure 13. In these experiments we cached the entire bucket index, which is four times larger than in the regular grid method. For D1 the index is 2 MB, and for D1-6 the index is 10 MB, ten times larger than the cache used in the persistent B-tree. For the LONG dataset, the modified grid uses twice the space of the standard grid and still has poor query performance compared to the persistent B-tree. Finally, to investigate the influence of caches we ran a series of experiments without caching. In these
experiments we found that one additional I/O is used per query in the grid structure in order to access the bucket index. In the persistent B-tree structure one or two extra I/Os are used, depending on the height of the root B-tree. This can immediately be reduced by one I/O per query by just caching the block containing the root of the root B-tree. Thus we conclude that the query performance does not depend critically on the existence of caches.
Figure 13: Query performance of the grid method using four times as many buckets (1/2 Grid): a) Number of I/Os per query, b) Time per query in milliseconds.
4 Conclusions
In this paper, we have presented an external point location data structure based on a persistent B-tree that is efficient both in theory and practice. One major open problem is to construct the structure in O((N/B) logM/B(N/B)) I/Os, as compared to the (trivial) O(N logB N) algorithm discussed in this paper.
Acknowledgments We thank Jan Vahrenhold for providing and explaining the grid method code and Tavi Procopiuc for help with the TPIE implementation.
References
[1] P. K. Agarwal, L. Arge, G. S. Brodal, and J. S. Vitter. I/O-efficient dynamic point location in monotone planar subdivisions. In Proc. ACM-SIAM Symp. on Discrete Algorithms, pages 1116-1127, 1999.
[2] A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.
[3] L. Arge. The buffer tree: A new technique for optimal I/O-algorithms. In Proc. Workshop on Algorithms and Data Structures, LNCS 955, pages 334-345, 1995.
[4] L. Arge. External memory data structures. In J. Abello, P. M. Pardalos, and M. G. C. Resende, editors, Handbook of Massive Data Sets, pages 313-358. Kluwer Academic Publishers, 2002.
[5] L. Arge, R. Barve, O. Procopiuc, L. Toma, D. E. Vengroff, and R. Wickremesinghe. TPIE User Manual and Reference (edition 0.9.01a). Duke University, 1999. The manual and software distribution are available on the web at http://www.cs.duke.edu/TPIE/.
[6] L. Arge, K. H. Hinrichs, J. Vahrenhold, and J. S. Vitter. Efficient bulk operations on dynamic R-trees. Algorithmica, 33(1):104-128, 2002.
[7] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient data structures using TPIE. In Proc. Annual European Symposium on Algorithms, 2002.
[8] L. Arge and J. Vahrenhold. I/O-efficient dynamic planar point location. In Proc. ACM Symp. on Computational Geometry, pages 191-200, 2000.
[9] L. Arge, D. E. Vengroff, and J. S. Vitter. External-memory algorithms for processing line segments in geographic information systems. In Proc. Annual European Symposium on Algorithms, LNCS 979, pages 295-310, 1995. To appear in special issue of Algorithmica on Geographical Information Systems.
[10] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173-189, 1972.
[11] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. An asymptotically optimal multiversion B-tree. VLDB Journal, 5(4):264-275, 1996.
[12] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137, 1979.
[13] A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. Ramos. Randomized external-memory algorithms for some geometric problems. International Journal of Computational Geometry & Applications, 11(3):305-337, June 2001.
[14] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38:86-124, 1989.
[15] M. Edahiro, I. Kokubo, and T. Asano. A new point-location algorithm and its practical efficiency: comparison with existing algorithms. ACM Trans. Graph., 3(2):86-109, 1984.
[16] H. Edelsbrunner and H. A. Maurer. A space-optimal solution of general region location. Theoret. Comput. Sci., 16:329-336, 1981.
[17] M. T. Goodrich, J.-J. Tsay, D. E. Vengroff, and J. S. Vitter. External-memory computational geometry. In Proc. IEEE Symp. on Foundations of Comp. Sci., pages 714-723, 1993.
[18] S. Huddleston and K. Mehlhorn. A new data structure for representing sorted lists. Acta Informatica, 17:157-184, 1982.
[19] D. G. Kirkpatrick. Optimal search in planar subdivisions. SIAM J. Comput., 12(1):28-35, 1983.
[20] N. Sarnak and R. E. Tarjan. Planar point location using persistent search trees. Communications of the ACM, 29:669-679, 1986.
[21] J. Snoeyink. Point location. In J. E. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 30, pages 559-574. CRC Press LLC, Boca Raton, FL, 1997.
[22] TIGER/Line™ Files, 1997 Technical Documentation. Washington, DC, September 1998. http://www.census.gov/geo/tiger/TIGER97D.pdf.
[23] U.S. Geological Survey. 1:100,000-scale line graphs (DLG). Accessible via URL http://edcwww.cr.usgs.gov/doc/edchome/ndcdb_bk/ (Accessed 9 September 2002).
[24] J. Vahrenhold and K. H. Hinrichs. Planar point location for large data sets: To seek or not to seek. In Proc. Workshop on Algorithm Engineering, LNCS 1982, pages 184-194, 2001.
[25] J. van den Bercken, B. Seeger, and P. Widmayer. A generic approach to bulk loading multidimensional index structures. In Proc. International Conf. on Very Large Databases, pages 406-415, 1997.
[26] P. J. Varman and R. M. Verma. An efficient multiversion access structure. IEEE Transactions on Knowledge and Data Engineering, 9(3):391-409, 1997.
Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
Ranjan Sinha*  Justin Zobel†
Abstract
Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms. The size of main memory in typical computers continues to grow, but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing string-sorting algorithms as they do not utilise cache particularly well for large data sets. We propose a new sorting algorithm for strings, burstsort, based on dynamic construction of a compact trie in which strings are kept in buckets. It is simple, fast, and efficient. We experimentally compare burstsort to existing string-sorting algorithms on large and small sets of strings with a range of characteristics. These experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous algorithm, due primarily to a lower rate of cache misses.
1 Introduction
Sorting is one of the fundamental problems of computer science. In many current applications, large numbers of strings may need to be sorted. There have been several recent advances in fast sorting techniques designed for strings. For example, many improvements to quicksort have been described since it was first introduced, an important recent innovation being the introduction of three-way partitioning in 1993 by Bentley and McIlroy [5]. Splaysort, an adaptive sorting algorithm, was introduced in 1996 by Moffat, Eddy, and Petersson [14];
* School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia. [email protected]
† School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia. [email protected]
it uses a combination of the splaytree data structure and insertionsort. Improvements to radixsort for string data were proposed in 1993 by McIlroy, Bostic, and McIlroy [13]; forward and adaptive radixsort for strings were introduced in 1996 by Andersson and Nilsson [2, 15]; a hybrid of quicksort and MSD radixsort named three-way radix quicksort was introduced in 1997 by Bentley and Sedgewick [4, 18]; and, as an extension to keys that are made up of components, three-way radix quicksort was extended by Bentley and Sedgewick in 1997 [3] to give multikey quicksort. While these algorithms are theoretically attractive, they have practical flaws. In particular, they show poor locality of memory accesses. This flaw is of increasing significance. A standard desktop computer now has a processor running at over 1 GHz and 256 Mb or more of memory. However, memory and bus speeds have not increased as rapidly, and a delay of 20 to 200 clock cycles per memory access is typical. For this reason, current processors have caches, ranging from 64 or 256 kilobytes on a Celeron to 8 megabytes on a SPARC; however, these are tiny fractions of typical memory volumes, of 128 to 256 megabytes on the former and many gigabytes on the latter. On a memory access, a line of data (32 or 128 bytes say) is transferred from memory to the cache, and adjacent lines may be pro-actively transferred until a new memory address is requested. The paging algorithms used to manage cache are primitive, based on the low-order bits of the memory address. Thus, some years ago, the fastest algorithms were those that used the least number of instructions. Today, an algorithm can afford to waste instructions if doing so reduces the number of memory accesses [12]: an algorithm
that is efficient for sorting a megabyte of data, or whatever the cache size is on that particular hardware, may rapidly degrade as data set size increases. Radixsorts are more efficient than older sorting algorithms, due to the reduced number of times each string is handled, but are not necessarily particularly efficient with regard to cache. The degree to which algorithms can effectively utilise cache is increasingly a key performance criterion [12, 20]. Addressing this issue for string sorting is the subject of our research. We propose a new sorting algorithm, burstsort, which is based on the burst trie [9]. A burst trie is a collection of small data structures, or containers, that are accessed by a conventional trie. The first few characters of strings are used to construct the trie, which indexes buckets containing strings with shared prefixes. The trie is used to allocate strings to containers, the suffixes of which are then sorted using a method more suited to small sets. In principle burstsort is similar to MSD radixsort, as both recursively partition strings into small sets character position by character position, but there are crucial differences. Radixsort proceeds position-wise, inspecting the first character of every string before inspecting any subsequent characters; only one branch of the trie is required at a time, so it can be managed as a stack. Burstsort proceeds string-wise, completely consuming one string before proceeding to the next; the entire trie is constructed during the sort. However, the trie is small compared to the set of strings, is typically mostly resident in cache, and the stream-oriented processing of the strings is also cache-friendly. Using several small and large sets of strings derived from real-world data, such as lexicons of web collections and genomic strings, we compared the speed of burstsort to the best existing string-sorting algorithms. Burstsort has high initial costs, making it no faster than the best of the previous methods for small sets of strings. For large sets of strings, however, we found that burstsort is typically the fastest by almost a factor of two. Using artificial data, we found that burstsort is insensitive to adverse cases, such as all characters being identical or strings that are hundreds of characters in length.
For large sets of strings, burstsort is the best sorting method. Using a cache simulator, we show that the gain in performance is due to the low rate of cache misses. Not only is it more efficient for the data sets tested, but it has better asymptotic behaviour.
2 Existing approaches to string sorting
Many sorting algorithms have been proposed, but most are not particularly well suited to string data. Here we review string-specific methods. In each case, the input is an array of pointers to strings, and the output is the same array with the strings in lexicographic order.

Quicksort Quicksort was developed in 1962 by Hoare [10]. Bentley and McIlroy's variant of quicksort was proposed in the early 1990s [5] and has since then been the dominant sort routine used in most libraries. The algorithm was originally intended for arbitrary input and hence has some overhead for specific datatypes. For our experiments, we use a stripped-down version by Nilsson [15] that is specifically tailored for character strings, designated as Quicksort.

Multikey quicksort Multikey quicksort was introduced by Sedgewick and Bentley in 1997 [3]. It is a hybrid of quicksort and MSD radixsort. Instead of taking the entire string and comparing with another string in its entirety, at each stage it considers only a particular position within each string. The strings are then partitioned according to the value of the character at this position, into sets less than, equal to, or greater than a given pivot. Then, like radixsort, it moves onto the next character once the current input is known to be equal in the given character. Such an approach avoids the main disadvantage of many sorting algorithms for strings, namely, the wastefulness of a string comparison. With a conventional quicksort, for example, as the search progresses it is clear that all the strings in a partition must share a prefix. Comparison of this prefix is redundant [18]. With the character-wise approach, the length of the shared prefix is known at each stage. However, some of the disadvantages of quicksort are still present. Each character is inspected multiple times, until it is in an "equal to pivot" partition. Each string is re-accessed each time a character in it is inspected, and after the first partitioning these accesses are effectively random. For a large set of strings, the rate of cache misses is likely to be high. In our experiments, we have used an implementation by Sedgewick [3], designated as Multikey quicksort.
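To make the character-wise, three-way partitioning concrete, the following is a minimal sketch in C of the step just described; it is our own illustration under stated assumptions (middle-element pivot, exclusive upper bound), not Sedgewick's implementation.

/* Sort strings[lo..hi-1] by the characters at position depth and beyond.
   Initial call: mkqsort(strings, 0, n, 0). */
static void str_swap(char **a, int i, int j) {
    char *t = a[i]; a[i] = a[j]; a[j] = t;
}

static void mkqsort(char **strings, int lo, int hi, int depth) {
    if (hi - lo < 2)
        return;
    int pivot = (unsigned char) strings[(lo + hi) / 2][depth];
    int lt = lo, gt = hi, i = lo;
    while (i < gt) {                        /* three-way partition on one character */
        int c = (unsigned char) strings[i][depth];
        if (c < pivot)      str_swap(strings, lt++, i++);
        else if (c > pivot) str_swap(strings, i, --gt);
        else                i++;
    }
    mkqsort(strings, lo, lt, depth);        /* "less than" partition                */
    if (pivot != '\0')                      /* "equal" partition: move on to the    */
        mkqsort(strings, lt, gt, depth + 1);/* next character, as radixsort does    */
    mkqsort(strings, gt, hi, depth);        /* "greater than" partition             */
}

Note how the shared prefix is never re-compared: once strings fall into the equal partition, only deeper character positions are examined.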
Radixsort Radixsort is a family of sorting methods where the keys are interpreted as a representation in some base (often a power of 2) or as strings over a given small alphabet. Instead of comparing keys in their entirety, they are decomposed into a sequence of fixed-sized pieces, typically bytes. There are two fundamentally different approaches to radix sorting: most-significant digit (MSD) and least-significant digit (LSD) [18]. It is difficult to apply the LSD approach to a string-sorting application because of variable-length keys. Another drawback is that LSD algorithms inspect all characters of the input, which is unnecessary in MSD approaches. We do not explore LSD methods in this paper.

MSD radixsort MSD radixsort examines only the distinguishing prefixes, working with the most significant characters first, an attractive approach because it uses the minimum amount of information necessary to complete the sorting. The algorithm has time complexity O(n + S), where S is the total number of characters of the distinguishing prefixes; amongst n distinct strings, the minimum value of S is approximately n log_|A| n, where |A| is the size of the alphabet. The basic algorithm proceeds by iteratively placing strings in buckets according to their prefixes, then using the next character to partition a bucket into smaller buckets. The algorithm switches to insertionsort or another simple sorting mechanism for small buckets. In our experiments we have used the implementation of Nilsson [15], designated as MSD radixsort.
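The bucketing scheme just described can be sketched as follows; this is our own simplification (256-way counting per character position and an assumed cutoff of 16 for switching to insertionsort), not Nilsson's implementation.

#include <stdlib.h>
#include <string.h>

#define R      256            /* alphabet size                                     */
#define CUTOFF 16              /* assumed threshold for switching to insertionsort */

static void insertionsort(char **a, int n, int depth) {
    for (int i = 1; i < n; i++)
        for (int j = i; j > 0 && strcmp(a[j - 1] + depth, a[j] + depth) > 0; j--) {
            char *t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
        }
}

/* Sort a[0..n-1] on the characters at position depth and beyond. */
static void msd_radixsort(char **a, int n, int depth) {
    if (n < CUTOFF) { insertionsort(a, n, depth); return; }

    int count[R + 1] = {0};
    char **aux = malloc(n * sizeof *aux);       /* error handling omitted            */

    for (int i = 0; i < n; i++)                 /* count bucket sizes; bucket 0       */
        count[(unsigned char) a[i][depth] + 1]++;   /* holds strings ending here      */
    for (int c = 0; c < R; c++)                 /* turn counts into start offsets     */
        count[c + 1] += count[c];
    for (int i = 0; i < n; i++)                 /* distribute into buckets            */
        aux[count[(unsigned char) a[i][depth]]++] = a[i];
    memcpy(a, aux, n * sizeof *aux);
    free(aux);

    int start = count[0];                       /* bucket 0 needs no further sorting  */
    for (int c = 1; c < R; c++) {               /* recurse on each remaining bucket,  */
        msd_radixsort(a + start, count[c] - start, depth + 1);  /* one character deeper */
        start = count[c];
    }
}

The distinguishing-prefix property is visible here: a bucket stops being refined as soon as it falls below the cutoff or contains only exhausted strings.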
MBM radixsort Early high-performance string-oriented variants of MSD radixsort were presented by McIlroy, Bostic, and McIlroy [13]. Of the four variants, we found program C to be typically the fastest for large datasets. It is an array-based implementation of MSD radixsort that uses a fixed 8-bit alphabet and performs the sort in place. In agreement with Bentley and Sedgewick [3], we found it to be the fastest array-based string sort. In our experiments it is designated as MBM radixsort.

Forward radixsort Forward radixsort was developed by Andersson and Nilsson in 1994 [1, 15]. It combines the advantages of LSD and MSD radixsort and is a simple and efficient algorithm with good worst-case behavior. It addresses a problem of MSD radixsort, which has a bad worst-case performance due to fragmentation of data into many sublists. Forward radixsort starts with the most significant digit, performs bucketing only once for each character position, and inspects only the significant characters. A queue of buckets is used to avoid the need to allocate a stack of trie nodes, but even so, in our experiments this method had high memory requirements. In our experiments we have used the implementations of Nilsson [15], who developed 8-bit and 16-bit versions, designated as Forward-8 and Forward-16.

Adaptive radixsort Adaptive radixsort was developed by Nilsson in 1996 [15]. The size of the alphabet is chosen adaptively based on a function of the number of elements remaining, switching between two character sizes, 8 bits and 16 bits. In the 8-bit case it keeps track of the minimum and maximum character in each trie node. In the 16-bit case it keeps a list of which slots in the node are used, to avoid scanning large numbers of empty buckets. In our experiments we have used the implementation of Nilsson [15], designated as Adaptive radixsort.
3 Cache-friendly sorting with tries
A recent development in data structures is the burst trie, which has been demonstrated to be the fastest structure for maintaining a dynamic set of strings in sort order [8, 9]. It is thus attractive to consider it as the basis of a sorting algorithm. Burstsort is a straightforward implementation of sorting based on
burst trie insertion and traversal. We review the burst trie, then introduce our new sorting technique.

Burst tries The burst trie is a form of trie that is efficient for handling sets of strings of any size [8, 9]. It resides in memory and stores strings in approximate sort order. A burst trie is comprised of three distinct components: a set of strings, a set of containers, and an access trie. A container is a small set of strings, stored in a simple data structure such as an array or a binary search tree. The strings that are stored in a container at depth d are at least d characters in length, and the first d characters in the strings are identical. An access trie is a trie whose leaves are containers. Each node consists of an array (whose length is the size of the alphabet) of pointers, each of which may point to another trie node or to a container, and a single empty-string pointer to a container. A burst trie is shown in Figure 1. Strings in the burst trie are "bat", "barn", "bark", "by", "byte", "bytes", "wane", "way" and "west".

A burst trie can increase in size in two ways. First is by insertion, when a string is added to a container. Second is by bursting, the process of replacing a container at depth d by a trie node and a set of new containers at depth d + 1; all the strings in the original container are distributed in the containers in the newly created node. A container is burst whenever it contains more than a fixed number L of strings. Though the container is an unordered structure, the containers themselves are in sort order, and due to their small size can be sorted rapidly.

A question is how to represent the containers. In our earlier implementations we considered linked lists and other structures, but the best method we have identified is to use arrays. In this approach, empty containers are represented by a null pointer. When an item is inserted, an array of 16 pointers is created. When this overflows, the array is grown, using the realloc system call, by a factor of 8. The container is burst when the capacity L = 8192 is reached. (These parameters were chosen by hand-tuning on a set of test data, but the results are highly insensitive to the exact values.) In practice, with our test data sets, the space overhead of the trie is around one bit per string.

Insertion is straightforward. Let the string to be inserted be c_1, ..., c_n of n characters. The leading characters of the string are used to identify the container in which the string should reside. If the container is at a depth of d = n + 1, the container is under the empty-string pointer. The standard insertion algorithm for the data structure used in the container is used to insert the strings into the containers. For an array, a pointer to the string is placed at the left-most free slot.

To maintain the limit L on container size, the access trie must be dynamically grown as strings are inserted. This is accomplished by bursting. When the number of strings in a container exceeds L, a new trie node is created, which is linked into the trie in place of the container. The (d + 1)th character of the strings in the container is used to partition the strings into containers pointed to by the node. (In our implementation the string is not truncated, but doing so could save considerable space, allowing much larger sets of strings to be managed [9].) Repetitions of the same string are stored in the same list, and do not subsequently have to be sorted as they are known to be identical. In the context of burst tries, and in work completed more recently, we have evaluated the effect of varying parameters.

Burstsort The burstsort algorithm is based on the general principle that any data structure that maintains items in sort order can be used as the basis of a sorting method, simply by inserting the items into the structure one by one then retrieving them all in-order in a single pass.¹ Burstsort uses the burst trie data structure, which maintains the strings in sorted or near-sorted order. The trie structure is used to divide the strings into containers, which are then sorted using other methods. As is true for all trie sorts, placing the string in a container requires reading of at most the distinguishing prefix, and the characters in the prefixes are inspected once only.

¹Our implementation is available under the heading "String sorting", at the URL www.cs.rmit.edu.au/~jz/resources.
Figure 1: Burst trie with four trie nodes, five containers, and nine strings, without duplicates.

Also, in many data sets the most common strings are short; these strings are typically stored at an empty-string pointer and are collected while traversing the access trie without being involved in container sorting. Burstsort has similarities to MSD radixsort, but there are crucial differences. The main one is that memory accesses are more localized. During the insertion phase, a string is retrieved to place it in a container, then again when the container is burst (a rare event once a reasonable number of strings have been processed). Trie nodes are retrieved at random, but there are relatively few of these and thus most can be simultaneously resident in the cache. In contrast to this depth-first style of sorting, radixsort is breadth-first. Each string is refetched from memory per character in the string. With a typical set of strings, most leaf nodes in the access trie would be expected to have a reasonable number of containers, in the range 10-100 for an alphabet of 256 characters. Choosing L = 8,192 means that container sizes will typically be in the range 100 to 1,000, allowing fast sort with another sorting method. In preliminary experiments L = 8,192 gave the best results overall. Exploring the behaviour of this
parameter is a topic for further research. Considering the asymptotic computational cost of burstsort, observe that standard MSD radixsort uses a similar strategy. Trie nodes are used to partition a set of strings into buckets. If the number of strings in a bucket exceeds a limit L, it is recursively partitioned; otherwise, a simple strategy such as insertionsort is used. The order in which these operations are applied varies between the methods, but the number of them does not. Thus burstsort and MSD radixsort have the same asymptotic computational cost as given earlier.
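The following compressed sketch ties the pieces together: trie nodes with one slot per character, array containers grown with realloc and burst at L strings, and an in-order traversal that sorts each container. The struct layout and helper names are our own, duplicate handling is simplified, and qsort with strcmp stands in for the container sort only to keep the sketch short; it is an illustration of the scheme described above, not the authors' code.

#include <stdlib.h>
#include <string.h>

#define ALPHABET 256
#define LIMIT    8192                          /* burst threshold L, as in the text */

typedef struct { char **s; int n, cap; } Bucket;   /* growable array container      */

typedef struct node {
    struct node *child[ALPHABET];              /* non-null once the slot has burst   */
    Bucket bucket[ALPHABET];                   /* otherwise strings collect here     */
    Bucket end;                                /* empty-string pointer: exhausted strings */
} Node;

static Node *new_node(void) { return calloc(1, sizeof(Node)); }

static void bucket_add(Bucket *b, char *str) {
    if (b->n == b->cap) {                      /* start at 16, grow by a factor of 8 */
        b->cap = b->cap ? b->cap * 8 : 16;
        b->s = realloc(b->s, b->cap * sizeof *b->s);
    }
    b->s[b->n++] = str;
}

static void insert(Node *t, char *str, int depth) {
    for (;;) {
        unsigned char c = (unsigned char) str[depth];
        if (c == '\0') { bucket_add(&t->end, str); return; }
        if (t->child[c]) { t = t->child[c]; depth++; continue; }
        bucket_add(&t->bucket[c], str);
        if (t->bucket[c].n > LIMIT) {          /* burst: container becomes a node    */
            Bucket old = t->bucket[c];
            memset(&t->bucket[c], 0, sizeof old);
            t->child[c] = new_node();
            for (int i = 0; i < old.n; i++)    /* redistribute on the next character */
                insert(t->child[c], old.s[i], depth + 1);
            free(old.s);
        }
        return;
    }
}

static int cmp(const void *a, const void *b) {
    return strcmp(*(char *const *) a, *(char *const *) b);
}

static int collect(Node *t, char **out, int k) {   /* in-order traversal            */
    for (int i = 0; i < t->end.n; i++)             /* strings equal to the prefix first */
        out[k++] = t->end.s[i];
    for (int c = 1; c < ALPHABET; c++) {
        if (t->child[c]) {
            k = collect(t->child[c], out, k);
        } else if (t->bucket[c].n) {
            qsort(t->bucket[c].s, t->bucket[c].n, sizeof(char *), cmp);
            for (int i = 0; i < t->bucket[c].n; i++)
                out[k++] = t->bucket[c].s[i];
        }
    }
    return k;
}

void burstsort(char **strings, int n) {            /* sorts the array of pointers   */
    Node *root = new_node();
    for (int i = 0; i < n; i++)
        insert(root, strings[i], 0);
    collect(root, strings, 0);
    /* freeing the trie and containers is omitted from this sketch */
}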
4 Experiments
We have used three kinds of data in our experiments: words, genomic strings, and web URLs.² The words are drawn from the large web track in the TREC project [6, 7], and are alphabetic strings delimited by non-alphabetic characters in web pages (after removal of tags, images, and other non-text information). The web URLs have been drawn from the same collection. The genomic strings are from GenBank.

²Some of these data sets are available under the heading "String sets", at the URL www.cs.rmit.edu.au/~jz/resources.
Table 1: Statistics of the data collections used in the experiments.

                              Set 1    Set 2    Set 3     Set 4     Set 5      Set 6
Duplicates
  Size (Mb)                   1.013    3.136    7.954     27.951    93.087     304.279
  Distinct words (x10^5)      0.599    1.549    3.281     9.315     25.456     70.246
  Word occurrences (x10^5)    1        3.162    10        31.623    100        316.230
No duplicates
  Size (Mb)                   1.1      3.212    10.796    35.640    117.068    381.967
  Distinct words (x10^5)      1        3.162    10        31.623    100        316.230
  Word occurrences (x10^5)    1        3.162    10        31.623    100        316.230
Genome
  Size (Mb)                   0.953    3.016    9.537     30.158    95.367     301.580
  Distinct words (x10^5)      0.751    1.593    2.363     2.600     2.620      2.620
  Word occurrences (x10^5)    1        3.162    10        31.623    100        316.230
Random
  Size (Mb)                   1.004    3.167    10.015    31.664    100.121    316.606
  Distinct words (x10^5)      0.891    2.762    8.575     26.833    83.859     260.140
  Word occurrences (x10^5)    1        3.162    10        31.623    100        316.230
URL
  Size (Mb)                   3.03     9.607    30.386    96.156    304.118    —
  Distinct words (x10^5)      0.361    0.92354  2.355     5.769     12.898     —
  Word occurrences (x10^5)    1        3.162    10        31.623    100        —
For word and genomic data, we created six subsets, of approximately 10^5, 3.1623x10^5, 10^6, 3.1623x10^6, 10^7, and 3.1623x10^7 strings each. We call these SET 1, SET 2, SET 3, SET 4, SET 5, and SET 6 respectively. For the URL data we created SET 1 to SET 5. Only the smallest of these sets fits in cache. In detail, the data sets are as follows.

Duplicates. Words in order of occurrence, including duplicates. Most large collections of English documents have similar statistical characteristics, in that some words occur much more frequently than others. For example, SET 6 has just over thirty million word occurrences, of which just over seven million are distinct words.

No duplicates. Unique strings based on word pairs in order of first occurrence in the TREC web data.

Genome. Strings extracted from genomic data, a collection of nucleotide strings, each typically thousands of nucleotides long. The alphabet size is four characters. It is parsed into shorter strings by extracting n-grams of length nine. There are many duplications, and the data does not show the skew distribution that is typical of text.

Random. An artificially generated collection of strings whose characters are uniformly distributed over the entire ASCII range and the length of each string is randomly generated but less than 20. The idea is to force the algorithms to deal with a large number of characters where heuristics of visiting only a few buckets would not work well. This is the sort of distribution many of the theoretical studies deal with [17], although such distributions are not especially realistic.

URL. Complete URLs, in order of occurrence and with duplicates, from the TREC web data; average
length is high compared to the other sets of strings. Some other artificial sets were used in limited experiments, as discussed later. The aim of our experiments is to compare the performance of our algorithms to other competitive algorithms, in terms of running time. The implementations of sorting algorithms described earlier were gathered from the best source we could identify, and all of the programs were written in C. We are confident that these implementations are of high quality. In preliminary experiments we tested many sorting methods that we do not report here because they are much slower than methods such as MBM radixsort. These included UNIX quicksort, splaysort, and elementary techniques such as insertionsort. The time measured in each case is to sort an array of pointers to strings; the array is returned as output. Thus an in-place algorithm operates directly on this array and requires no additional structures. For the purpose of comparing the algorithms, it is not necessary to include the parsing time or the time used to retrieve data from the disk, since it is the same for all algorithms. We therefore report the CPU times, not elapsed times, and exclude the time taken to parse the collections into strings. The internal buffers of our machine are flushed prior to each run in order to have the same starting condition for each experiment. We have used the GNU gcc compiler and the Linux operating system on a 700 MHz Pentium computer with 2 Gb of fast internal memory and a 1 Mb L2 cache with block size of 32 bytes and 8-way associativity. In all cases the highest compiler optimization level O3 has been used. The total number of milliseconds of CPU time consumed by the kernel on behalf of the program has been measured, but for sorting only; I/O times are not included. The standard deviation was low. The machine was under light load, that is, no other significant I/O or CPU tasks were running. For small datasets, times are averaged over a large number of runs, to give sufficient precision.
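As an illustration of this measurement methodology (not the harness actually used), per-run CPU time can be read from the operating system before and after the sort and, for the figures, normalised by n log n; the use of getrusage and the function names here are assumptions.

#include <math.h>
#include <stdio.h>
#include <sys/resource.h>

static double cpu_ms(void) {                       /* user + system CPU time, in ms */
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000.0
         + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1000.0;
}

void time_sort(void (*sort)(char **, int), char **strings, int n) {
    double start = cpu_ms();
    sort(strings, n);                              /* parsing and I/O are excluded  */
    double elapsed = cpu_ms() - start;
    printf("%.0f ms; %.6g ms per n log n\n", elapsed, elapsed / (n * log((double) n)));
}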
5 Results
All timings are in milliseconds, of the total time to sort an array of pointers to strings into lexicographic order. In the tables, these times are shown unmodified. In the figures, the times are divided by n log n where n is the number of strings. With such normalisation, suggested by Johnson [11], the performance of an ideal sorting algorithm is a horizontal line. Table 2 shows the running times for the algorithms on duplicates. These are startling results. The previous methods show only moderate performance gains in comparison to each other, and there is no clear winner amongst the four best techniques. In contrast, burstsort is the fastest for all but the smallest set size tested, of 100,000 strings, where it is second only to MBM radixsort. For the larger sets, the improvement in performance is dramatic: it is more than twice as fast as MBM radixsort, and almost four times as fast as an efficient quicksort. The rate of increase in time required per key is shown in Figure 2, where as discussed the time is normalised by n log n. As can be seen, burstsort shows a low rate of growth compared to the other efficient algorithms. For example, the normalised time for MBM radixsort grows from approximately 0.00014 to approximately 0.00025 from SET 1 to SET 6, whereas burstsort does not grow at all. There are several reasons that burstsort is efficient. In typical text the most common words are small, and so are placed under the empty-string pointer and do not have to be sorted. Only containers with more than one string have to be sorted, and the distinguishing prefix does not participate in the sorting. Most importantly, the algorithm is cache-friendly: the strings are accessed in sequence and (with the exception of bursting, which only involves a small minority of strings) once only; the trie nodes are accessed repeatedly, but are collectively small enough to remain in cache. Figure 3 shows the normalised running times for the algorithms on no duplicates. Burstsort is again the fastest for all but the smallest data set, and almost twice as fast as the next best method for the largest data set.
Table 2: Duplicates, sorting time for each method (milliseconds).

                        Set 1    Set 2    Set 3    Set 4    Set 5     Set 6
Quicksort               122      528      1,770    7,600    30,100    114,440
Multikey quicksort      62       272      920      3,830    14,950    56,070
MBM radixsort           58       238      822      3,650    15,460    61,560
MSD radixsort           72       290      1,000    3,870    14,470    56,790
Adaptive radixsort      74       288      900      3,360    12,410    51,870
Forward-8               146      676      2,030    7,590    28,670    113,540
Forward-16              116      486      1,410    5,120    19,150    74,770
Burstsort               58       218      630      2,220    7,950     29,910
Figure 2: Duplicates. The vertical scale is time in milliseconds divided by n log n.

Elimination of duplicates has had little impact on relative performance. Table 3 shows the results for genome, a data set with very different properties: strings are fixed length, the alphabet is small (though all implementations allow for 256 characters), and the distribution of characters is close to uniform random. Burstsort is relatively even more efficient for this data than for the words drawn from text, and is the fastest on all data sets. For burstsort, as illustrated in Figure 4, the normalised cost per string declines with data set size; for all other methods, the cost grows. The URL data presents yet another distribution. The strings are long and their prefixes are highly
repetitive. As illustrated in Figure 5, burstsort is much the most efficient at all data set sizes. Taking these results together, relative behaviour is consistent across all sets of text strings—skew or not, with duplicates or not. For all of these sets of strings drawn from real data, burstsort is consistently the fastest method. We used the random data to see if another kind of distribution would yield different results. The behaviour of the methods tested is shown in Figure 6. On the one hand, burstsort is the most efficient method only for the largest three data sets, and by a smaller margin than previously. On the other hand, the normalised time per string does not increase at all from SET 1 to SET 6, while there is some increase for all of the other methods. (As
observed in the other cases, there are several individual instances in which the time per string decreases between set x and set x+1, for almost all of the sorting methods. Such irregularities are due to variations in the data.) Memory requirements are a possible confounding factor: if burstsort required excessive additional memory, there would be circumstances in which it could not be used. For SET 6 of duplicates we observed that the space requirements for burstsort are 790 Mb, between the in-place MBM radixsort's 546 Mb and adaptive radixsort's 910 Mb. The highest memory usage was observed by MSD radixsort, at 1,993 Mb, followed by forward-8 at 1,632 Mb and forward-16 at 1,315 Mb. We therefore conclude that only the in-place methods show better memory usage than burstsort.

Figure 3: No duplicates. The vertical scale is time in milliseconds divided by n log n.

Table 3: Genome, sorting time for each method (milliseconds).

                        Set 1    Set 2    Set 3    Set 4     Set 5     Set 6
Quicksort               156      674      2,580    9,640     35,330    129,720
Multikey quicksort      72       324      1,250    4,610     16,670    62,680
MBM radixsort           72       368      1,570    6,200     23,700    90,700
MSD radixsort           102      442      1,640    5,770     20,550    79,840
Adaptive radixsort      92       404      1,500    4,980     17,800    66,100
Forward-8               246      1,074    3,850    12,640    41,110    147,770
Forward-16              174      712      2,380    7,190     23,290    86,400
Burstsort               70       258      870      2,830     8,990     31,540

Other data In previous work, a standard set of strings used for sorting experiments is of library call numbers [3],³ of 100,187 strings (about the size of our SET 1). For this data, burstsort was again the fastest method, requiring 100 milliseconds. The times for multikey quicksort, MBM radixsort, and adaptive radixsort were 106, 132, and 118 milliseconds respectively; the other methods were much slower. We have experimented with several other artificially-created datasets, hoping to bring out the worst cases in the algorithms.

³Available from www.cs.princeton.edu/~rs/strings.
Figure 4: Genome sorting time. The vertical scale is time in milliseconds divided by n log n.

We generated three collections ranging in size from one to ten million strings, as follows.

A. The length of the strings is one hundred, the alphabet has only one character, and the size of the collection is one million.

B. The length of the strings ranges from one to a hundred, the alphabet size is small (nine) and the characters appear randomly. The size of the collection is ten million.

C. The length of the strings ranges from one to a hundred, and strings are ordered in increasing size in a cycle. The alphabet has only one character and the size of the collection is one million.

Table 4: Artificial data, sorting time for each method (milliseconds).

                        Data set
                        B         A         C
Quicksort               3,900     1,040     34,440
Multikey quicksort      11,530    18,750    5,970
MBM radixsort           18,130    40,220    19,620
MSD radixsort           5,630     10,580    26,380
Adaptive radixsort      4,270     7,870     20,060
Forward-8               6,580     12,290    38,800
Forward-16              4,450     8,140     27,890
Burstsort               2,730     10,090    1,420

Table 4 shows the running times. In each case, burstsort is dramatically superior to the alternatives, with the single exception of quicksort on SET A; this quicksort is particularly efficient on identical strings. In SET B, the data has behaved rather like real strings, but with exaggerated string lengths. In SET C, MBM radixsort—in the other experiments, the only method to ever do better than burstsort—is extremely poor.

Cache efficiency To test our hypothesis that the efficiency of burstsort was due to its ability to make better use of cache, we measured the number of cache misses for each algorithm and sorting task. We have used cacheprof, an open-source simulator for investigating cache effects in programs [19]; the cache parameters of our hardware were used. Figure 7 shows the rate of cache misses per key for no duplicates (the upper) and for URL (the lower). Similar behaviour to the no duplicates case was observed for the other data sets. Figure 7 is normalised by n (not n log n as in Figure 3) to show the number of cache misses per string.

For small data sets in the no duplicates case, burstsort and MBM radixsort show the greatest cache efficiency, while for large data sets burstsort is clearly superior, as the rate of cache miss grows relatively slowly across data set sizes.
Figure 5: URL sorting time. The vertical scale is time in milliseconds divided by n log n.

For the URL data, the difference in cache efficiency is remarkable. For all set sizes, burstsort has less than a quarter of the cache misses of the next best method. We then investigated each of the sorting methods in more detail. For quicksort, the instruction count is high, for example 984 instructions per key on SET 6 of duplicates; the next highest count was 362, for multikey quicksort. Similar results were observed on all data sets. As with most of the methods, there is a steady logarithmic increase in the number of cache misses per key. For multikey quicksort, the number of instructions per key is always above the radixsorts, by about 100 instructions. Although relatively cache efficient in many cases, it deteriorates the most rapidly with increase in data set size. For smaller string sets, MBM radixsort is efficient, but once the set of pointers to the strings is too large to be cached, the number of cache misses rises rapidly. MSD radixsort is very efficient in terms of the number of instructions per key, next only to adaptive radixsort, and for no duplicates the number of cache misses rises relatively slowly compared to the other radixsorts, again next only to adaptive radixsort. Adaptive radixsort is the most efficient of the previous methods in terms of
the number of instructions per key in all collections except random. The rate of cache miss rises slowly. Thus, while MBM radixsort is more efficient in many of our experiments, adaptive radixsort appears asymptotically superior. In contrast, forward-8 and forward-16 are the least efficient of the previous radixsorts, in cache misses, instruction counts, and memory usage. Burstsort is by far the most efficient in all large data sets, primarily because it uses the CPU cache effectively—indeed, it uses 25% more instructions than adaptive radixsort. For all collections other than URL, the number of cache misses per key only rises from 1 to 3; for URL it rises from 3 to 4. No other algorithm comes close to this. For small sets, where most of the data fits in the cache, the effect of cache misses is small, but as the data size grows they become crucial. It is here that the strategy of only referencing each string once is so valuable. Recent work on cache-conscious sorting algorithms for numbers [12, 16, 17, 20] has shown that, for other kinds of data, taking cache properties into account can be used to accelerate sorting. However, these sorting methods are based on elementary forms of radixsort, which do not embody the kinds of enhancements used in MBM radixsort and adaptive radixsort. The improvements cannot readily be adapted to variable-sized strings.
Figure 6: Random. The vertical scale is time in milliseconds divided by n log n.
6 Conclusions
We have proposed a new algorithm, burstsort, for fast sorting of strings in large data collections. It is based on the burst trie, an efficient data structure for managing sets of strings in sort order. To evaluate the performance of burstsort, we have compared it to a range of string sorting algorithms. Our experiments show that it is about as fast as the best algorithms on smaller sets of keys—of 100,000 strings—and is the fastest by almost a factor of two on larger sets, and shows much better asymptotic behaviour. The main factor that makes burstsort more efficient than the alternatives is the low rate of cache miss on large data sets. The trend of current hardware is for processors to get faster and memories to get larger, but without substantially speeding up, so that the number of cycles required to access memory continues to grow. In this environment, algorithms that make good use of cache are increasingly more efficient than the alternatives. Indeed, our experiments were run on machines that by the standards of mid-2002 had a slow processor and fast memory; on more typical machines burstsort should show even better relative performance. Our work shows that burstsort is clearly the most efficient way to sort a large set of strings.
References

[1] A. Andersson and S. Nilsson. A new efficient radix sort. In IEEE Symp. on the Foundations of Computer Science, pages 714-721, Santa Fe, New Mexico, 1994.
[2] A. Andersson and S. Nilsson. Implementing radixsort. ACM Jour. of Experimental Algorithmics, 3(7), 1998.
[3] J. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. Annual ACM-SIAM Symp. on Discrete Algorithms, pages 360-369, New Orleans, Louisiana, 1997. ACM/SIAM.
[4] J. Bentley and R. Sedgewick. Sorting strings with three-way radix quicksort. Dr. Dobbs Journal, 1998.
[5] J. L. Bentley and M. D. McIlroy. Engineering a sort function. Software—Practice and Experience, 23(11):1249-1265, 1993.
[6] D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271-289, 1995.
[7] D. Hawking, N. Craswell, P. Thistlewaite, and D. Harman. Results and challenges in web search evaluation. In Proc. World-Wide Web Conference, 1999.
[8] S. Heinz and J. Zobel. Practical data structures for managing small sets of strings. In M. Oudshoorn, editor, Proc. Australasian Computer Science Conf., pages 75-84, Melbourne, Australia, January 2002.
[9] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems, 20(2):192-223, 2002.
Figure 7: Cache misses, 1 Mb cache, 8-way associativity, 32 bytes block size. Upper: no duplicates. Lower: URLs.
[10] C. A. R. Hoare. Quicksort. Computer Jour., 5(1):10-15, 1962.
[11] D. S. Johnson. A theoretician's guide to the experimental analysis of algorithms. In Goldwasser, Johnson, and McGeoch, editors, Proceedings of the Fifth and Sixth DIMACS Implementation Challenges. American Mathematical Society, 2002.
[12] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In Proc. Annual ACM-SIAM Symp. on Discrete Algorithms, pages 370-379. ACM Press, 1997.
[13] P. M. McIlroy, K. Bostic, and M. D. McIlroy. Engineering radix sort. Computing Systems, 6(1):5-27, 1993.
[14] A. Moffat, G. Eddy, and O. Petersson. Splaysort: Fast, versatile, practical. Software—Practice and Experience, 26(7):781-797, 1996.
[15] S. Nilsson. Radix Sorting & Searching. PhD thesis, Department of Computer Science, Lund, Sweden, 1996.
[16] N. Rahman and R. Raman. Adapting radix sort to the memory hierarchy. ACM Jour. of Experimental Algorithmics. To appear.
[17] N. Rahman and R. Raman. Analysing cache effects in distribution sorting. ACM Jour. of Experimental Algorithmics, 5:14, 2000.
[18] R. Sedgewick. Algorithms in C, third edition. Addison-Wesley Longman, Reading, Massachusetts, 1998.
[19] J. Seward. Cacheprof—cache profiler, December 1999. http://www.cacheprof.org.
[20] L. Xiao, X. Zhang, and S. A. Kubricht. Improving memory performance of sorting algorithms. ACM Jour. of Experimental Algorithmics, 5:3, 2000.
Train Routing Algorithms: Concepts, Design Choices, and Practical Considerations*

Luzi Anderegg†    Stephan Eidenbenz‡¶    Martin Gantenbein†    Christoph Stamm†
David Scot Taylor§¶    Birgitta Weber†    Peter Widmayer†

*Work partially supported by the Swiss Federal Office for Education and Science under the Human Potential Programme of the European Union under contract no. HPRN-CT-1999-00104 (AMORE)
†Institute of Theoretical Computer Science, ETH Zentrum, Zurich, Switzerland. {lastname}@inf.ethz.ch
‡Los Alamos National Laboratory, [email protected]. LA-UR:02-7640
§Department of Computer Science, San Jose State University, [email protected]
¶Based on work performed while at ETH Zurich

Abstract

Routing cars of trains for a given schedule is a problem that has been studied for a long time. The minimum number of cars to run a schedule can be found efficiently by means of flow algorithms. Realistic objectives are more complex, with many cost components such as shunting or coupling/decoupling of trains, and also with a variety of constraints such as requirements for regular maintenance. These versions of the problem tend to be NP-hard, and thus the standard powerful tools (e.g. branch-and-bound, branch-and-cut, Lagrangian relaxation, gradient descent) have been used. These methods may guarantee good solutions, but not quick runtimes. In practice, two major railway companies, German Railway and Swiss Federal Railway, do not use either approach. Instead, their schedulers create solutions manually, modifying solutions from the previous year. In this paper, we advocate an intermediate, semiautomatic approach. In reality, not all constraints or objectives can be easily formulated mathematically. To allow for interactive human modification of solutions, the system must work rapidly, allowing a user to save desired subsolutions, and modify (or just start over on) the rest. After careful examination to find which constraints and costs we can easily integrate into a flow model approach, we heuristically modify network flow based solutions to account for remaining constraints. We present experimental results from this approach on real data from the German Railway and Swiss Federal Railway.
1 Introduction

Train companies face many algorithmically challenging questions. One of the most important problems in economic terms is to minimize the number of trains needed to fulfill a given schedule of train routes. This problem, known as the fleet size, rolling stock rostering, or vehicle routing problem, was modeled as a minimum cost circulation problem long ago by Dantzig and Fulkerson [3]; for a survey see Desrosiers et al. [4]. In practice, the problem is still solved by hand, because this flow model ignores many real-world constraints and objectives such as shunting, coupling/decoupling, and maintenance constraints. In this paper, we report on realistic rolling stock rostering, as it occurs for large train operators such as German Railway (DB) and Swiss Federal Railway (SBB). We consider the standard minimum cost flow model, but we more carefully address what different objectives and constraints it can represent. We focus on finding efficient solutions, because in practical situations, all constraints are not explicitly formulated, and all costs are not well defined. Realistic problem instances are so complex that it would be useful to have a system which allows people to interact with proposed solutions, until a satisfactory solution is obtained. For instance, the train companies prefer "robust" solutions, which avoid propagating delays. Unfortunately, they cannot characterize the rules for robustness. With manual solutions, this robustness comes from a combination of intuition by the planners, and from the fact that tested, robust portions of previous solutions are reused. Following a brief introduction to the problem (Sections 1.1, 1.2) is a description of the standard network flow approach (Sections 1.3, 1.4, 1.5). Next, Section 2 contains more extensive discussion of real-life problem variants. We consider additional requirements which can be modeled by flow in Section 3: these include more complex costs, different train types, and various station constraints. In Section 4, we address maintenance, which does not seem to easily fit into the standard flow model, and thus requires additional effort. We describe our heuristic modifications to the standard
flow model to satisfy maintenance constraints. In Section 5, we describe our experiments and results using data provided by the German Railway (DB) and the Swiss Federal Railway (SBB).

1.1 Terminology We begin with some basic definitions. For our purposes, the smallest interesting entity of a train is a train unit: one or more railcars or locomotives which are linked together. Train units are routed and maintained as an entity, never to be broken into smaller units in the problems considered here. Depending on different kinds of railcars and locomotives and on the formation of the single cars, there can be different train unit types. A train consists of a number of train units coupled together. The length of a train is the number of train units which form the train. (The physical length of the train, a distance measurement, is also implicitly used in some of the train scheduling constraints.) Some of the train units can move in only one direction, whereas others can move in both directions. These bidirectional units are called push-pull trains. A route is a path between two given end stations. Each route has a distance and a travel time associated with it. For our purposes we do not need to consider details such as intermediate stations, number of tracks between the two stations, etc. A ride is a train unit on a certain route with distinct departure and arrival time. We distinguish between two types of rides: scheduled rides correspond to all the entries from a timetable, unscheduled rides are additional rides needed to operate the rail network. Unscheduled rides are further divided into three types: empty, piggy-back and maintenance rides. Empty and piggy-back rides are used to transfer train units without passengers from one station to another. Piggy-back rides refer to train units coupled to scheduled rides, while empty rides (also known as deadheading rides) refer to units riding alone to new locations. Maintenance is performed at maintenance stations, which may or may not be the end station of a route. Maintenance rides are used to shuttle the trains to and from maintenance stations, and can also represent maintenance itself. Usually piggy-back rides are cheaper than empty rides because of the costs saved for the crew and track reservation. A rotation is the order of rides performed by one train unit or a number of train units coupled together. A rotation has a duration and length. If a rotation takes one week, the train unit serving this rotation repeats the same rides every week. The length of a rotation is the total distance a train unit travels during a single cycle through the rotation. A circulation is the set of rotations for all train units.
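The terminology maps naturally onto a handful of record types. The sketch below is only our illustration of these definitions; the field names and units are assumptions, not the authors' data model.

typedef enum { SCHEDULED, EMPTY, PIGGY_BACK, MAINTENANCE } RideKind;

typedef struct {                    /* a path between two given end stations          */
    int from_station, to_station;
    double distance_km;
    int travel_minutes;
} Route;

typedef struct {                    /* a train unit on a route at definite times       */
    const Route *route;
    int unit_type;                  /* the train unit type assigned to the ride        */
    int departure_minute, arrival_minute;   /* minutes within one period               */
    RideKind kind;
} Ride;

typedef struct {                    /* cyclic order of rides of one train unit         */
    const Ride **rides;
    int num_rides;
    int duration_periods;           /* a rotation may span several periods             */
    double length_km;               /* total distance per cycle through the rotation   */
} Rotation;

typedef struct {                    /* a circulation: the rotations for all units      */
    Rotation *rotations;
    int num_rotations;
} Circulation;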
1.2 Problem description The goal of rolling stock rostering is to determine a circulation for given scheduled rides. The input to the simplest version of this problem (rolling stock rostering with one train type and without maintenance) includes the scheduled rides with passenger demand, just one train unit type, and some cost functions for using specific train units of that type. The output assigns tasks to specific train units. This also defines how many train units are used for each ride, which empty and piggy-back rides are used to move train units between stations, and which cars will perform each of these tasks. A general goal is to minimize the cost of the assignment. Of course, realistic costs are determined by a long list of factors, including the total number of train units needed, how much mileage they cover, crews used, number of couplings and decouplings, etc. The number of train units used is clearly one of the most important costs, and has been the primary focus of much previous work. Besides these basic requirements, there are numerous additional requirements and constraints that may differ for each train company. We present some of these requirements for the specific cases of the German Railway (DB) and Swiss Federal Railway (SBB). Depending on the constraints considered, there are many versions of the problem, and these have rarely been treated in the literature. One of the most important constraints has to do with train maintenance: trains need regular inspections, maintenance work, and cleaning, all of which must be done at special facilities. Any solution that ignores maintenance requirements is useless to train companies.

1.3 Flow model Even the simplest version of the rolling stock rostering problem is usually broken into two separate phases: first, one solves the train length problem, in which the lengths of the trains on each route (including empty and piggy-back rides) are determined. The main goal here is to ensure that there are always enough train units available at each station for all scheduled rides departing from that station. In this phase rides are only assigned to train unit types, but not to specific train units. In the second phase, the train assignment problem is solved: specific train units are assigned to the scheduled rides. For each train unit a rotation is determined. A standard (and longtime) approach (see [2, 11]) for modeling the train length problem phase is to use a flow model: a periodic directed graph is used to model the scheduled rides. Each vertex in the graph represents one station at a specific time, that is, an (s, t) pair with station s at time t. A ride is represented by an edge, which leaves a vertex representing its departure (station and time), and enters a vertex representing its arrival (station and time).
Figure 1: Example periodic graph.

Table 1: Example timetable.

Departure              Arrival                Demand of
Station    Time        Station    Time        train units
A          7:00        B          10:00       2
C          8:00        B          10:00       3
B          12:00       C          14:00       2
B          15:00       A          18:00       2
C          17:00       B          19:00       3
A          22:00       C          3:00        1
Each edge has a lower and upper bound for the required flow, where the number of flow units corresponds to the number of train units. The lower bound corresponds to the passenger demand for the associated ride, while the upper bound corresponds to track/station limitations and thus a maximum number of allowed train units. Moreover, edges are added to the graph to represent trains waiting in a station after their arrival and potential empty rides. Example. A small one day schedule for a single train unit type is given in Table 1. We have three stations A, B, and C and seven scheduled rides. One row represents one ride, with departure, arrival station and times, and certain demands of train units. The corresponding graph is given in Figure 1. The number of train units that can wait within a station is five for station A, six for station B, and four train units can wait in station C. (To simplify the diagram, we have not put capacities on the overnight edges.) Generally, this graph model is used to solve the train length problem, with the flow on each edge
representing the number of train units. The periodicity of our graph model implies that some solutions can be repeated each period: for these solutions, each period will begin with the same number of train units in a station as the period before. Train assignment is done by breaking the train length solution into cycles, and assigning physical train units to each cycle. A cycle corresponds to a rotation for one specific train unit. These rotations may overlap each other, or contain an edge more than once when multiple train units are required for a route. For these "long train" edges, it is convenient to think of multiple edges, each of flow one, to distinguish among them (i.e. the first and second units of a train can be considered as separate rides, although they travel together). As a result of this approach we get a one period solution. A one period solution does not imply that each of the rotations needs one period to complete; they can take much longer (they do need at least one period). Instead, it means that once any train unit consecutively performs two rides, then in every period those two rides will both be consecutively serviced by one train unit. If a rotation needs x periods to complete, it will use x different train units to serve the rides of its cycle within each period. Example. Figure 2 shows a possible solution for the small example from Figure 1. The circulation consists of two rotations with three train units per rotation. Another representation of this circulation is pictured in Figure 3. Every rotation of one train unit is represented by one horizontal line. This representation is often used by train companies.
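To make the construction concrete, the following sketch builds the periodic event graph from a timetable: one vertex per (station, time) event and one edge per scheduled ride, with the passenger demand as lower bound and a track limit as upper bound. Waiting edges are added the same way, and the finished network would be handed to any minimum cost circulation solver. The fixed array sizes, field names, and linear event lookup are our own simplifications, not the authors' implementation.

#define MAX_EVENTS 4096
#define MAX_EDGES  16384

typedef struct { int station, minute; } Event;              /* an (s, t) vertex        */
typedef struct { int from, to, lower, upper; double cost; } Edge;

typedef struct {
    Event events[MAX_EVENTS]; int num_events;
    Edge  edges[MAX_EDGES];   int num_edges;
} FlowGraph;

static int event_id(FlowGraph *g, int station, int minute) {
    for (int i = 0; i < g->num_events; i++)                  /* linear scan keeps the sketch short */
        if (g->events[i].station == station && g->events[i].minute == minute)
            return i;
    g->events[g->num_events] = (Event){ station, minute };
    return g->num_events++;
}

static void add_edge(FlowGraph *g, int from, int to, int lower, int upper, double cost) {
    g->edges[g->num_edges++] = (Edge){ from, to, lower, upper, cost };
}

/* A scheduled ride: lower bound = passenger demand, upper bound = track limit. */
void add_scheduled_ride(FlowGraph *g, int dep_station, int dep_minute,
                        int arr_station, int arr_minute,
                        int demand, int track_limit, double unit_cost) {
    add_edge(g, event_id(g, dep_station, dep_minute),
                event_id(g, arr_station, arr_minute),
                demand, track_limit, unit_cost);
}

/* A waiting edge between two consecutive events at one station (wrapping
   around the period), with the station capacity as upper bound. */
void add_waiting_edge(FlowGraph *g, int station, int from_minute, int to_minute,
                      int station_capacity, double waiting_cost) {
    add_edge(g, event_id(g, station, from_minute),
                event_id(g, station, to_minute),
                0, station_capacity, waiting_cost);
}

For the first ride of Table 1, for example, one would call add_scheduled_ride(&g, A, 7*60, B, 10*60, 2, limit, cost), where A, B, limit, and cost are whatever identifiers and bounds the instance defines.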
Figure 3: Interval representation of circulation.

1.4 Model limitations The flow model has several strong limitations, even for solving simple instances with no maintenance constraints. In theory it is possible for one period solutions to cost more than solutions that are allowed to span more than one period (see also [6, 7] for further description of these problems and the ones below). DB and SBB prefer one period solutions for their conceptual simplicity, so will only consider one period solutions as feasible. This simplifies our approach considerably: it allows us to ignore unexpected (and often overlooked) complexities inherent in other solutions. In the past, purely theoretical approaches have overlooked other model limitations as well. For instance, depending on what types of empty ride edges have been used in the graph, past work has sometimes overlooked optimal solutions, while at other times it has allowed for solutions which break station capacity constraints. (Unless intermediate route stations are given, it is impossible to check station capacity constraints anyway.) Here, we do not claim to find optimal solutions: with realistic problems, such solutions seem well beyond our current abilities. Finally, the train length and assignment problems should both be solved at the same time. Because the train length problem does not assign tasks to specific train units, many specific constraints (e.g. coupling times, maintenance) cannot be considered. Thus, a solution to the train length problem may not have any feasible train assignment solution, if all constraints are followed. For some constraints (such as coupling), we can pad the train length problem instances by adding coupling times to the routes, to ensure that coupling will not affect the feasibility of a train assignment. Unfortunately, this padding time may sometimes eliminate the optimal problem solution. Starting with Section 2, we describe some of these real world constraints and our approach which finds feasible solutions on realistic data.
1.5 Previous work The flow approach to this problem is intuitive, and mentioned as early as 1954 [3]; it is standard enough to be included in textbooks [1] and surveys [4], yet it is also still being studied in more recent articles [2, 11]. As for more recent work, it is shown in [6] that the rolling stock rostering problem is NP-hard to approximate arbitrarily closely; this result even holds for problems with highly simplified maintenance constraints, with all train lengths set to 1, and the costs equal to the number of trains needed. In [7], a variant of a proof from [9] shows that the train length problem is also NP-hard, if arbitrary fixed costs are allowed. In this case (the train length problem), maintenance is not even a consideration, but instead more complex costs are allowed. In general, every realistic problem variant is difficult, but here we concentrate more on getting good solutions to real problem instances. Periodicity of schedules has been studied for the simple case of minimizing the number of trains needed to implement the given schedule. Polynomial time solutions for the periodic case are known that follow very different approaches. Orlin [10] uses a periodic version of Dilworth's Theorem [5], whereas Gertsbakh and Gurevich [8] use the concept of a "deficit function". Later, Serafini and Ukovich [12] propose a very general framework for periodic scheduling problems which also implies this result.
2 Real-life issues

This section further describes some of the real world problems which arise in rostering. As in Section 1.2, the primary data for all problems is a set of scheduled train routes, and the output must specify a set of trains and how they will be used to fulfill the input schedule. However, unless the basic model is extended to include additional costs and constraints, the solutions produced will be of little value to railway companies like DB and SBB. Clearly, some of these changes require additional input: specific knowledge for each route (such as length, traveling time, whether or not the track will accommodate electric locomotives) is important. For every type of train unit, maintenance requirements and infrastructures, cost per kilometer to run the train unit, cost to maintain a train unit, crew costs, and coupling/decoupling costs must also be considered. In the following subsections, we describe some of the most important problem variants, constraints, and costs which arose in discussions with DB and SBB.

2.1 Requirements Here, we list two types of requirements which the standard model currently ignores. In Section 3 we show how to account for these additional requirements.

• Non-identical train unit types: Each scheduled ride must be assigned to one train unit type out of a predefined set of possibilities. Typically, each ride can be performed by three or four alternative train unit types, each of which may have a suitability ranking. Any solution must assign a suitable train unit type to each scheduled ride. It is, however, possible to append train units as piggy-back units to a scheduled ride, even if the piggy-back units are not of suitable types for the scheduled ride, as long as there are enough suitable units in the scheduled ride. Train unit types model the different parts of a train: if, for example, a certain train route goes from station A to B starting at time t, and the train consists of a locomotive, three first-class cars, six second-class cars, and a dining car, we would model this as four different rides. Each has the same station endpoints and times, but one ride must be satisfied by a locomotive train unit, while another requires first class cars, etc. In our graph model, this will result in a multi-edge with multiplicity four.

• Minimum station turn around times: If a train reaches its arrival station, it might have to be decomposed into its train units by decoupling and
shunting; the train units are subsequently combined into new trains to carry out their next rides. The time needed to complete this procedure depends on the train unit types involved, on whether only coupling or decoupling or both are needed, and on the station topology. The time needed for this procedure is called station turn around time. These times are given as part of the input for each possible combination of coupling mode, train unit type, and station. Therefore, a solution of the train length problem may not have any feasible train assignment solution, since the turn around time at some station is violated. Any feasible solution must fulfill these timing requirements.

2.2 Maintenance requirements A feasible rostering solution must satisfy all maintenance requirements. The input information concerning maintenance consists of three parts:

• Maintenance types: Each train unit must have certain types of maintenance. The most common types are: refuel for diesel locomotives (T), interior cleaning (I+E), exterior cleaning (AHA), scheduled repairs (INST), and technical check-ups (V+A). If a train has to wait at a station for a longer period of time, in certain stations it is possible to park the train on a separate track. This is called siding (ABST), and will also be treated as maintenance.

• Maintenance interval: Each maintenance type must be performed on a train unit at certain intervals. These intervals can depend on elapsed time (since last maintenance), distance driven (since last maintenance), or on both. For example, interior cleaning should be done every 24 hours, exterior cleaning should be done before the train has run for 1,000 kilometers, and a technical check-up needs to be performed whenever one week has passed or 10,000 kilometers have been driven (whichever occurs first). This last requirement is both time- and distance-dependent. Of course, it is always possible to perform maintenance before the interval is reached, but this increases costs.

• Maintenance stations: Maintenance stations are scattered rather scarcely all over the network. For each maintenance station we are given information of which types of maintenance can be performed for each train unit type, and hours of operation for the station. Moreover, capacity constraints are also given as part of the input and it is specified how long each maintenance type takes.
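For reference, this maintenance input can be represented along the following lines. The sketch is our own illustration (type names, units, and the per-train-unit-type dimension omitted from the station record are assumptions), not the data format used by DB or SBB.

typedef enum { MAINT_T, MAINT_IE, MAINT_AHA,        /* refuel, interior, exterior     */
               MAINT_INST, MAINT_VA, MAINT_ABST,    /* repairs, check-up, siding      */
               NUM_MAINT_TYPES } MaintenanceType;

typedef struct {                      /* interval rule for one maintenance type       */
    MaintenanceType type;
    int max_hours;                    /* elapsed-time bound, 0 if not time-limited    */
    int max_km;                       /* distance bound, 0 if not distance-limited    */
} MaintenanceRule;

typedef struct {                      /* one maintenance station                      */
    int station;
    int open_minute, close_minute;                  /* hours of operation             */
    int capacity;                                    /* units serviceable at once     */
    int duration_minutes[NUM_MAINT_TYPES];           /* how long each type takes      */
    int offered[NUM_MAINT_TYPES];                    /* which types are offered here  */
} MaintenanceStation;

/* A unit is due when either bound of a rule is reached, e.g. a technical
   check-up every week or every 10,000 km, whichever occurs first. */
static int maintenance_due(const MaintenanceRule *r, int hours_since, int km_since) {
    return (r->max_hours && hours_since >= r->max_hours)
        || (r->max_km    && km_since    >= r->max_km);
}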
2.3 Costs Typically, railway companies have many "hazy" objectives concerning the rotations. For example, the train unit movement per period (measured in moving tune and distance) should be more or less equal for all train units. This objective ensures an almost equal aging for all train units, but it does not imply that all rotations should be of the same period length. Another possible objective is the minimization of the number of different stations during one period or in one rotation. This may increase the chance that a specific train unit will move along the same line during one period and thus increase the regularity of a rotation. More precisely, the overall objective of our problem is to minimize costs. Although many properties such as the ones above are desired, their costs are not well defined by the railway companies. Here, we consider costs that are easier to define and thus can be given as part of the input. We distinguish between three cost types (this list is not exhaustive):
for each train unit in a train or for a train as a whole. For example, cost per time must be paid for a train as a whole to reserve rails for the train's passage. 3. Maintenance costs: Whenever maintenance is carried out on a train unit, costs are incurred. These costs depend on the maintenance type and on the maintenance station that performs the work.
3 Flow model modifications Because of the numerous additional requirements that a real-life problem poses as explained hi the previous section, our straightforward two-phase approach of first solving the train length problem by building a graph and finding a minimum cost flow on it and then solving the train assignment problem by extracting rotations from the minimum cost flow may no longer be a viable approach. In this section, we consider some of the 1. Fixed costs: Fixed costs occur for each train unit additional requirements which can be treated by this that is used at some stage hi the rostering. Fixed approach by making modifications to the graph built costs include all costs that are incurred by simply for the train length problem. owning the train unit without using it at all; these costs include several items, but depreciation is the 3.1 Multiple train unit types We integrate the most important one. Traditionally, fixed costs are concept of having several train unit types by first deterconsidered to be the most crucial. The quality of mining an order hi which the different train unit types a train rostering solution is very often measured will be processed. We schedule the non-locomotive train simply by the number of train units used. Train unit types before mixed and locomotive train units; units are very expensive: aside from the capital within these categories, inexpensive train unit types are investment to buy the equipment, even unused scheduled first. So, given the train unit ordering, we simply iterate trains require thousands of dollars per year to the standard flow model on the cheapest unit first, for maintain. all routes where that car type can be used. Once we 2. Costs for each ride: Each ride, whether scheduled, have found a solution for a certain train unit we then empty or piggy-back, incurs a cost that is composed iterate on the remaining routes. of different cost factors. The most important cost Many complications are hidden here: for instance, factors are the following: if the available number of a certain type of train unit is • Cost of energy: Each train unit type has a limited, perhaps there are not enough to cover all routes certain cost associated to it for each kilometer for which that train unit is useful. This will leave more that it runs and for each ton that is trans- routes for the next iteration, where a "better" unit may ported on this kilometer. Cost examples in- be used to satisfy the demand. Also, the number of nonclude power, fuel usage, or wear and tear of locomotive train units on a route may determine the type and number of locomotives needed to pull those rails. • Cost per time: Each train unit type has a cars. Finally, in order to offer inexpensive piggy-back certain cost associated to it for each hour opportunities for a train unit type, edges from previous that it runs. Cost examples include heating, iterations, with appropriate weights, will be included. lighting. 3.2 Minimum station turn around times Station These cost factors are different for each type of turn around times can be integrated into the flow model ride, i.e., different for scheduled, empty, or piggy- as follows: after having obtained a valid rostering from back rides, with the empty rides being cheaper our approach, we check in the solution to see whenever than scheduled, but more expensive than piggy- minimum station turn around tunes have been violated. back rides. Moreover, these costs can occur either If they are violated, we artificially delay all incoming 111
edges by the amount of time needed to carry out a coupling operation and thus create a new vertex; this has the effect that the train unit on this incoming edge then has a higher probability of linking to another ride without violating minimum station turn around times. This yields a new graph, and we iterate this procedure until no violations occur. The number of iterations is bounded in our instances because the minimum station turn around tune is always less than four couplings. Thus, each vertex in the underlying graph will be delayed at most four times. In our experiments, to save calculation tune, we added the maximum needed delay to vertices with violations after the first iteration. We do not simply add the delay to all stations, as at some stations, the delay is quite high (~1 hour), and to do so would require extra train units in the rostering. An other important point is that station turn around times vary not only by station but also by type (coupling or decoupling). 3.3 Additional costs We can model fixed costs of owning each train unit by simply making one unit of flow on the overnight-edges cost as much as the fixed costs for one train unit. Similarly, we can compute the costs for each each ride by computing and combining the costs per distance, tune and ton kilometer for each train unit type. However, the costs that are incurred only for a train as whole (e.g. coupling/decoupling costs) and not for each train unit type are harder to model exactly. As an approximation, we allocate these train costs to a specific unit of the train, if possible the locomotive. This can lead to extra costs hi our calculations if two or more locomotives pull a train, but these should be small compared to others. 4 Maintenance Compliance with maintenance requirements is crucial to making routing solutions feasible: while ignoring certain real-world costs may lead to suboptimal solutions, solutions which ignore maintenance are invalid altogether. Maintenance is ignored by the standard minimum cost flow algorithm, yet it is of topmost concern to the railway companies, greatly affecting the cost and practicality of scheduling trains. Lack of maintenance in solutions produced by automated systems may be the primary reason that railway companies still schedule trains manually. Maintenance is performed at special maintenancecapable stations, with limited hours of operation and bounded work capacity. Even hi practice, maintenance is not always scheduled in advance: it is not practical to expect future plans to be followed precisely indefinitely.
Unexpected changes such as extra trains scheduled for special events, or train break-downs, happen frequently enough that trains occasionally get shuffled to cover for each other. It is more pragmatic to only schedule more frequent maintenance hi advance. Maintenance frequency may not match that of a rotation very well: it may be that a natural schedule for a train covers the same routes each week, for 5,000 total km, yet the train might need to be maintained once every 10,000 km. Although it seems clear that the train should just be maintained once every 2 weeks, this may disrupt the scheduled 1 week rotation. This adds to the complexity of scheduling maintenance. When considering maintenance, the straightforward two-phase approach seems to break down. While many of the additional requirements can be accounted for by making modifications to the underlying graph in the network flow phase (as discussed in Section 3), the crucial maintenance requirements seem evasive to integration into the two-phase approach. We thus introduce a third phase into our approach that we call maintenance insertion. There are several approaches possible as to how to integrate maintenance hi this third phase: some iterate over all three phases of our solution approach, others simply add a single phase after the two other phases have been completed. We propose three different approaches : 1. If the maintenance requirements are simple to perform (e.g. cleaning) — they are available in most stations and do not take long — it is possible to schedule them locally within a station. This approach and its weaknesses are presented in Section 4.1. 2. Train companies do not normally consider maintenance to be part of the rotation itself. Instead, they keep an over-stock of approximately 10% of trains, and try to keep these units maintained, and available within maintenance stations. They are swapped into a rotation as an unmaintained train is swapped out of it. This gives them flexibility to not fully schedule all maintenance far in advance. This approach is described hi Section 4.2. 3. The last approach combines the first two approaches and is described in Section 4.3. Approaches 1 and 2 work iteratively. It is necessary to recalculate the minimum cost flow circulation and extract new cycles after fixing certain maintenance constraints. The complexity of the problem instance will determine which approach is most useful.
112
capacity to carry out the required maintenance and should be reachable from event Vj_i without violating the maintenance requirement. Given such a station, we introduce a new event v^ at this station and connect v<_i to it by an empty-ride edge (thus unplicitly determining its tune); we then introduce a new event t>i'_! at the maintenance station at a later point in time such that an edge from v^ to v"_± allows enough time to carry out the necessary maintenance. Finally, we connect event v"_i to the next possible event on the original rotation by a third edge. If we cannot find a legally reachable maintenance station from event Vi-i, we consider v»_2 (or Vi-k as needed). To encourage the mhiimum cost flow algorithm to use this maintenance detour through events v\_^ and v"_i into the solution hi the next iteration, all three 1. Solve the train length problem using network flow. edges have cost zero, and the minimum flow requirement v 2. Compute a train assignment to the train length for edge (Vi_i> "-i) is set to one. problem. Example. An example of this approach is illustrated in Figure 4. In Station B at 12:00 the maintenance require3. Test the solution for maintenance requirement violations.
4.1 Locally added maintenance If the underlying network structure is rather simple, i.e., if it has abundant, well dispersed maintenance stations, we can hope for the best: the rotations obtained from the standard two-phase approach may already fulfill all or most maintenance requirements as the train units sometimes remain at maintenance stations for long enough time periods to perform required maintenance. In this case, a promising approach for integrating maintenance is to add "maintenance edges" (described below) into our our network flow graph. Using this approach, we repeat the two-phase solution on the new modified underlying graph until all needed maintenance is performed. This "local fix" approach iterates the following basic steps:
4. Introduce new maintenance edges into the graph. Once the solution no longer violates any maintenance requirements, the algorithm stops. All steps except for the introduction of maintenance edges work exactly as hi the basic approach described earlier. We introduce new maintenance edges into the graph as follows. Each rotation R consists of events vi,... ,V\R\. Each event Vi is a pair (s,£), where s is a station and t is a time. Considering the rotation from vi, we determine at each event whether or not any maintenance requirements of a particular train unit have been violated. If there is a violation at event v^ we consider if any of the following hold for Vi-k and k: 1. Maintenance performed at the station of event Vi-k will prevent any maintenance violations at event v». 2. The station of event Vi-k is capable of performing the maintenance required by event v». Besides being an appropriate maintenance station, it must also have enough capacity at the given tune.
Figure 4: Locally added maintenance.
3. The train spends enough at the station of event ments are not fulfilled. New edges to a maintenance station are introduced to the graph and an additional Vi-k to perform the required maintenance. edge within a maintenance station has capacity one. AfIf all these points can be answered affirmatively, we ter the maintenance is done, edges to every station — schedule maintenance at event «>»_*. We can use the representing possible empty rides — are introduced as first point above to decide whether it makes sense to well. check back further (that is, check higher k values) This local fix approach seems to work quite well hi the rotation. If it no longer makes sense to go back in the rotation, we find the maintenance station for problem instances in which maintenance really is closest to event v»_i. This station has to have the not the central issue, with simple to follow maintenance 113
constraints. In these cases, minor detours will allow 4.3 Iteratively fix rotations with maintenance all needed maintenance. Unfortunately, maintenance Our last approach to maintenance combines some aspects of both preceding strategies. As hi Section 4.1, requirements are rarely so simple. we try to fix violations by adding maintenance into cy4.2 Extra trains for maintenance In this ap- cles for free when possible, and by routing trains to proach, we try to model a strategy of the railway com- maintenance stations when it is not. Once maintenance panies: always try to keep a fully serviced spare train is completed on the train, we move this train back into unit (of any type needed) available at each maintenance the same cycle, allowing the cycle to continue later on station. These trains will be swapped into rotations as with a freshly maintained train as in Section 4.2. This needed to eliminate maintenance violations. The first altered cycle will not service every route on the origiand foremost goal is to use as few spare trains as possi- nal cycle, but the routes it does cover will be serviced ble. This approach is non-iterative and consists of solv- with maintained trains. These routes are removed from ing the train length and assignment problems without the schedule, and solutions for the remaining routes, exmaintenance, and then adding extra trains into the as- cluded from the altered cycles, will be iteratively found by starting over. signment to alleviate maintenance requirements. To add these extra trains, we proceed as follows: For a selected rotation R we order all vertices hi the graph Example. An example of this is shown in Figure 5. according to then: time, which results in an ordered In Station B a maintenance edge is added after the list R of vertices vi,... ,V|«|. We then process the resulting list R vertex by vertex, where we check at vertex v,, whether the maintenance requirements of the train units going through vertex Vi have reached some critical threshold with respect to maintenance types. If this is not the case, we proceed to the next vertex; if it is the case, we replace the current train unit by a replacement unit, which was swapped out of a rotation when it needed service sometime in the past. The train unit still requiring service is moved to a maintenance station and serviced. After that, it will be routed to some other station, where it becomes the replacement for another unit requiring service. If no replacement unit can be routed to a rotation when needed, an additional "service train" is added to the system, increasing the total number of extra units used. After processing all vertices all vertices from list R, Figure 5: Fixing maintenance constraints. we link each train at the end of the period to a vertex Vj at the beginning of the period such that the resulting maintenance parameters will satisfy the maintenance maintenance parameter got beyond a critical threshold. requirements of vertex Vj. This will result in feasible The rotation with maintenance meets the old rotation again in Station B at a later point in time. The rotations that satisfy all maintenance requirements. In this approach, we keep the original cycles intact, remaining edges will be covered in one of the next which mimics a strategy of DB and SBB. It can be iterations. In practice the number of iterations should thought of as solving the rostering problem with virtual be small. trains, needing no maintenance. 
The number of extra Of the three approaches, the final one worked best trains needed to maintain this illusion increases with overall for our test data. This approach evolved from the the difficulty of the maintenance constraints. While first two: during implementation it became clear that conceptually simple, examination of implementation the first two approaches would not work well on complex details shows hidden complexities, especially concerning real-world data. Therefore, only the third approach was how to best link the start and end of the rotations. implemented and used for all experiments. Simply linking the first and last stations in a cycle would not allow to "start" the cycle with a maintained train, 5 Experiments and some care must be taken to keep feasible solutions Here we discuss our test data and experiments. The while not requiring an extra maintenance. major goal of these experiments was to carefully revise
114
rides must be conducted with train units of type Baureihe 411.
our model, test its practicality, and to ensure that important details were not being glazed over as they had been in previous, purely theoretical, treatments. The overall goal hi each experiment is the same: the minimization of the total costs. However, some of the test data given to us by the railway companies was chosen to test the minimization of train units, while other test sets were designed to test whether or not our maintenance strategies give feasible solutions. Although the overall goal is the muumization of the total costs, our results summarized in Table 2 do not show these cost values. The railway companies have not disclosed their precise current solutions or their costs. Without the real values a comparison is not possible. We do compare the number of train units, which is the most important cost factor.
• BR218: Our last DB test consists of all locomotives of the train unit type Baureihe 218 and their associated passenger cars hi the region SchleswigHolstem. The given schedule contains 7666 routes of up to 32 different train unit types. Some of these 32 train unit types are only used as possible substitutions in case of car capacity problems. Here, we must schedule both train cars and locomotives, adding to the complexity of the problem. Additionally, the number of some train unit types are limited, and we must find appropriate substitutions in case of car capacity problems. • BR1210: Our last test set consists of all selflocomotive train compositions of the Zurich SBahn. A train unit hi this test is a composition of several cars with locomotive. S-Bahn trains are either one or two coupled push-pull train units. The given subset of the SBB schedule contains 6151 routes. This test set is the only one which supplies information about the type of turn within a station. Because all train units are composed the same way, the additional information allows the computation of the first class car positions within each station. Besides minimizing the number of train units used, the position of the first class cars within the station should be the same from day to day on any route. Maintenance information was not supplied.
5.1 Test sets We used five different test sets in our experiments, four of them were supplied to us by DB and one by SBB. To simplify further discussions on these test sets we enumerate them: • BR112: This set contains only locomotives of the same train unit type, called Baureihe 112. A train unit in this test set is a single locomotive. The given schedule is a subset of the DB timetable with 2098 routes (each specified only by its end stations). While this test set is quite large, the maintenance requirements are not overwhelming, and the main objective is to minimize the number of train units. • BR612: This test set only contains diesel railcars (self-locomotive cars) of the same train unit type Baureihe 612. Thus, the train unit consists of just one railcar. The given schedule is a subset of the regional timetable Saarland-Westpfalz hi Germany with 332 routes. The difficulty here lies in the frequent refueling requirements of these locomotives, as there are only a small number of fueling stations, and they have limited capacity.
5.2 Results The most interesting results of these test scenarios are the rolling stock rosters, but they are much too large to present here. We show a cutout of one of the rosters, and some values to summarize how efficient our results were. Our software is built of 20,000 lines of C++ code. Besides standard libraries we only used LEDA (Library of Efficient Data types and Algorithms)1. We have run our tests on a PC laptop with a 1.6 GHz Mobile Pentium 4 processor, 512 MByte of RAM under Windows XP. In Figure 6 you see a typical cutout of a graphical representation of a rolling stock roster. Several different depictions are meaningful; this format matches the one used by SBB. On the x-axis we see the tune between 4:00 of the third day hi period until 3:00 of the next day. Each line represents one train unit over one period and the bars on that line show the planned rides of the train unit. On the second line, for example, we see four
• BR411: This test set consists of two different Intercity Express (ICE) train units. The train unit type Baureihe 411 is a composition of seven cars including a locomotive. The train unit type Baureihe 415 is a similar but shorter composition. Two train units of the same type can be coupled together, but a substitution of a train unit of one type by the other is rarely possible. Both train types use the same maintenance infrastructures with capacity and tune limitations. Testing maintenance fea1 sibility is a main goal. The schedule is a subset of LEDA has been developed at the Max-Planck-Insitut fur Inthe international timetable between Germany and formatik, Saarbrucken (http://www.mpi-sb.mpg.de/LEDA/) and its neighbor countries with 496 rides. 330 of these is available at http://www.aigorithmic-solutions.com/. 115
Figure 6: Graphical representation of a rolling stock roster. consecutive productive rides starting and ending in SSH followed by an empty movement to SKL where the train unit is maintained. Several maintenance operations (T, ARA, I+E, V+A) are planned between 17:30 and 20:45. Finally, the train unit makes an empty movement to the next station where it is needed. Above the bars we see train identification numbers, labeling which train units belong to the same train. For example, the train 3305 starts at 6:00 in SSH and ends at 8:30 in FF. This train consists of four separate train units. The dashed horizontal lines separate train units of the same rotation. For example, the first three train units are hi the same rotation of length three. In Table 2 we summarize our results of the five test sets. As would be expected, ignoring maintenance constraints allows the automatic construction of a solution at least as good as the ones in use. Unfortunately, our results with maintenance are not as good. In the two test sets with the most maintenance constraints, BR612 and BR411, our results with maintenance suffer the most, 33% worse than the current solution of the railway companies. Especially in the case of BR612, this is not surprising: refueling is needed so frequently that it must be incorporated very efficiently into the schedule. Pulling a train out of rotation and traveling 50 km extra for each 10,000 km maintenance is very different than doing it for a 500 km tank of fuel.
These results indicate that our approach to maintenance is still not sophisticated enough. The extra train units for our system come from several sources. First, the rides removed from a rotation very often contain multi-edges. Each removed ride may then require an additional train unit. Next, the current approach does not check whether remaining routes can be combined with existing rotations. The integration of these two points into the existing Iteratively fix rotations with maintenance approach is part of future work. This approach to maintenance attempts to fully automate the rostering process. Our program would need additional modifications to be used in an interactive, semi-automated way. This would be a natural "first step" to take before trying to use it hi a fully automatic way to plan all rostering decisions, and our approach has been to make the program fast to make this interaction possible. Using the flow model, it is also simple to "freeze" parts of the solution and change others. 6 Conclusion We have presented an extended and therefore much more practical version of the standard fleet size problem. We have shown that many extensions of the rolling stock rostering problem can be well integrated into the standard flow model approach. For a constraint which we cannot easily integrate (maintenance), we
116
Test set
Graph: |V|, [E\
BR112 BR612 BR411
12521, 90050 1025, 165266 411: 3602, 1066810 415: 1124, 214768 Total 01: 3005, 11982 02: 3655, 14075 03: 6891, 40459 04: 304, 623 05: 6273, 22383 06: 3013, 12703 07: 7048, 30324 08: 574, 1351 C9: 8508, 62435 218: 22628, 368991 Total 36895, 319211
BR218
BR1210
Number of train units our solution our solution with without maintenance maintenance
in real solution
52 9 27 8 35 6 9 10 1 6 5 11 2 16 57 123 79
66 9 28 8 36
'
I
68
72 140 120
55 11 35 10 45 7 10 10 1 8 6 12 3 19 84 160
Runtime in seconds maintenance version
35 14 20 6 26 6 29 10 0 8 6 8 2 27 81 177 79
Table 2: Table of results. have developed some heuristics to modify the standard their support. flow approach. We have implemented our three phase approach and References tested it with real data from the German Railway and Swiss Federal Railway. Although our solutions are worse than the rostering solutions already used by those rail[1] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network way companies, our experiments are encouraging. It flows - theory, algorithms, and applications. Prentice Hall, 1993. would be overly optimistic to assume that the first at[2] P. Brucker, J.L. Hurink, and T. Rolfes. Routing of tempt of a fully automated system would improve upon railway carriages: A case study. In Memorandum long tested and used solutions. Each year, many personNo. 1498, Fac. of Mathematical Sciences. University years of work are used to refine rostering solutions for of Twente, 1999. minor scheduling changes from the previous year. Over [3] G. Dantzig and D. Pulkerson. Minimizing the number the years, it may well be that minor scheduling changes of tankers to meet a fixed schedule. Naval Research have also been made to accommodate the rostering soLogistics Quarterly, 1:217-222, 1954. lutions. Although finding improved fully automated so[4] J. Desrosiers, Y. Dumas, M.M. Solomon, and lutions is an ultimate goal, a more immediate goal was F. Soumis. Time constrained routing and scheduling. to build a system which could be used interactively, by In Handbooks in OR & MS, volume 8, pages 35-139. scheduling personnel, to help in their jobs. Given an Elsevier, 1995. [5] R. Dilworth. A decomposition theorem for partially automated solution, the scheduling personnel can exordered sets. Annals of Mathematics, 51:161-166, tract partial solutions which they believe look promis1950. ing, and then run the system iteratively on the parts [6] T. Erlebach, M. Gantenbein, D. Hiirlimann, G. Neyer, which looked worse. Given our quick runtime, on modA. Pagourtzis, P. Penna, K. Schlude, K. Steinhofel, est equipment, this interactivity is quite feasible. D. Taylor, and P. Widmayer. On the complexity of train assignment problems. In ISAAC: International Symposium on Algorithms and Computation, LNCS. Springer-Verlag, 2001. [7] M. Gantenbein. The train length problem. Diploma Thesis, Department of Computer Science, ETH Zurich, 2001. [8] I. Gerthsbak and Y. Gurevich. Constructing an opti-
7 Acknowledgements We would like to thank German Railway and Swiss Federal Railway for sharing their rostering problems and data. Special thanks go to Frank Wagner, Martin Neuser, Jean-Claude Strueby, and Daniel Hiirlimann for
117
[9] [10] [11] [12]
mal fleet for a transportation schedule. Transportation Science, 11:20-36, 1977. S. Hochbaum and A. Segev. Analysis of a flow problem with fixed charges. Networks, 19:291-312, 1989. J.B. Orlin. Minimizing the number of vehicles to meet a fixed periodic schedule: an application of periodic posets. Operations Research, 30:760-776, 1982. A. Schrijver. Minimum circulation of railway stock. CWI Quarterly, 6(3):205-217, 1993. P. Serafini and W. Ukovich. A mathematical model for periodic scheduling problems. SIAM J. Discrete Mathematics, 2(4):550-581, 1989.
118
On the implementation of a swap-based local search procedure for the p-median problem Renato F. Werneck*
Mauricio G. C. Resende* Abstract We present a new implementation of a widely used swap-based local search procedure for the p-median problem. It produces the same output as the best implementation described in the literature and has the same worst-case complexity, but, through the use of extra memory, it can be significantly faster in practice: speedups of up to three orders of magnitude were observed.
1
Introduction
efficiency of our method is presented in Section 5. Final remarks are presented in Section 6. Notation and assumptions. Before proceeding with the algorithms themselves, let us establish some notation. As already mentioned, F is the set of potential facilities and U the set of users that must be served. The basic parameters of the problem are n = |Z7|, m = |F|, and p, the number of facilities to open. Although 1 < p < m by definition, we will ignore trivial cases and assume that 1 < p < m and that p < n (if p > n, we just open the facility that is closest to each user). We assume nothing about the relationship between n and m. In this paper, u denotes a generic user, and / a generic facility. The cost of serving u with / is d(w, /), the distance between them. A solution S is any subset of F with p elements (representing the open facilities). Each user u must be assigned to the closest facility / € 5, the one that minimizes d(u, /). This facility will be denoted by 0i(u); similarly, the second closest facility to u in S will be denoted by ^2(w). To simplify notation, we will abbreviate d(u,(f>i(u)) as di(w), and d(u,<j>2(u)) as ^(w). We often deal specifically with a facility that is a candidate for insertion; it will be referred to as /» (by definition /$ ^5); similarly, a candidate for removal will be denoted by fr (fr e S, also by definition). Throughout this paper, we assume the distance oracle model, in which the distance between any customer and any facility can be found in O(l) time. This is the case if there is a distance matrix, or if facilities and users are points on the plane, for instance. In this model, all values of 0i and 02 for a given solution S can be straighforwardly computed hi O(pn) total time.
The p-median problem is defined as follows. Given a set F of ra facilities, a set U of n users (or customers), a distance function d : U x F —* 7£, and a constant p < ra, determine which p facilities to open so as to minimize the sum of the distances from each user to the closest open facility. Being a well-known NP-complete problem [2], one is often compelled to resort to heuristics to deal with it in practice. Among the most widely used is the swapbased local search proposed by Teitz and Bart [10]. It has been applied on its own [8, 13] and as a key subroutine of more elaborate metaheuristics [3, 7, 9,12]. The efficiency of the local search procedure is of utmost importance to the effectiveness of these methods. In this paper, we present a novel implementation of the local search procedure and compare it with the best alternative described in the literature, proposed by Whitaker in [13]. In practice, we were able to obtain significant (often asymptotic) gains. This paper is organized as follows. In Section 2, we give a precise description of the local search procedure and a trivial implementation. In Section 3, we describe Whitaker's implementation. Our own implementation 2 The Swap-based Local Search is described in Section 4. Experimental evidence to the Introduced by Teitz and Bart in [10], the standard local search procedure for the p-median problem is based W AT&T Labs Research, 180 Park Avenue, Florham Park, NJ on swapping facilities. For each facility fa # S, the 07932. Electronic address: mgcr4research.att.com. procedure determines which facility fr 6 S (if any) t Department of Computer Science, Princeton University, would improve the solution the most if fa and f were r 35 Olden Street, Princeton, NJ 08544. Electronic address: interchanged (i.e., if /» were inserted and fr removed rwerneck4cs.princeton.edu. The results presented in this paper were obtained while this author was a summer intern at AT&T from the solution). If one such "improving" swap exists, it is performed, and the procedure is repeated from the Labs Research. 119
new solution. Otherwise we stop, having reached a local minimum (or local optimum). Our main concern here is the time it takes to run each iteration of the algorithm: given a solution 5, how fast can we find a neighbor 5'? It is not hard to come up with an O(pmri) implementation for this procedure. First, in O(pn) time, determine the closest and second closest facilities for each user. Then, for each candidate pair (/i,/ r ), determine the profit that would be obtained by replacing fr with fi, using the following expression (recall that di(u) and dz(u) represent distances from u to the closest and second closest facilities, respectively):
r
The first summation accounts for users whose closest facility is not fr', these will be reassigned to fi only if that is profitable. The second summation refers to users originally assigned to fr, which will be reassigned either to their original second closest facilities or to fi, whichever is more advantageous. The entire expression can be computed in O(ri) time for each pair of facilities. Since there are p(m — p) = O(pm) pairs to test, the whole procedure takes O(pmn) time per iteration. There are several papers in the literature that use this implementation, or avoid using the swapbased local search altogether, mentioning its intolerable running time [7, 9, 12]. All these methods would greatly benefit from Whitaker's implementation (or from ours, for that matter), described in the next section.
[3], is presented in Figure I.1 Function findOut takes as input a candidate for insertion (fi) and returns f r , the most profitable facility to be swapped out, as well as the profit itself (profit). In the code, w represents the total amount saved by reassigning users to fi , independently of which facility is removed; it takes into account all users currently served by facilities that are farther than fi. The loss due to the removal of fr is represented by v(fr), which accounts for users that have fr as their closest facility. Each such user will be reassigned (if fr is removed) either to its original second closest facility or to /»; the reassignment cost is computed in line 7. Line 10 determines the best facility to remove, the one for which v(fr) is minimum. The overall profit of the corresponding swap (considering also the gains in w) is computed in line 11. Since this function runs in O(n) time, it is now trivial to implement the swap-based local search procedure in O(mn) time per iteration: simply call f indOut once for each of the m — p candidates for insertion and pick the most profitable one. If the best profit is positive, perform the swap, update the values of 0i and
4 An Alternative Implementation
Our implementation has some similarity with Whitaker's, in the sense that both methods perform the same basic operations. However, the order in which they are performed is different, and in our case partial results obtained are stored in auxiliary data structures. As we will see, this allows for the use of values computed in early iterations of the algorithm to 3 Whitaker's Implementation speed up later ones. In [13], Whitaker describes the so-called fast interchange heuristic, an efficient implementation of the local search procedure defined above. Even though it was published 4.1 Additional Structures. For each facility fi not in 1983, Whitaker's implementation was not widely used in S, let gain(fi) be the total amount saved when fi is until 1997, when Hansen and Mladenovic [3] applied added to S (thus creating a new solution with p + 1 it as a subroutine of a Variable Neighborhood Search facilities). The savings result from reassigning to fi (VNS) procedure. Their implementation is based on every customer whose current closest facility is farther Whitaker's, with a minor difference: Whitaker prefers a than i itself: first improvement strategy (a swap is made as soon as a profitable one is found), while Hansen and Mladenovic prefer best improvement (all swaps are evaluated and the most profitable executed). In our analysis, we Similarly, for every fr £ S, define loss(fr) as the increase in solution value resulting from the removal of assume best improvement is used. The key aspect of this implementation is its ability fr (with only p — 1 facilities remaining). This is the to find in 0(n) time the best possible candidate for 1 removal, given a certain candidate for insertion. The Expressions of the form a «— b in the code mean that the pseudocode for a function that does that, adapted from value of o is incremented by b units. 120
Figure 1: Function to determine, given a candidate for insertion (fi), the best candidate for removal (fr). Adapted from [3]. of the contribution of u to loss(fr) is overly pessimistic by d-z(u} — d(u, f i ) .
cost of transferring every customer assigned to fr to its second closest facility:
In the local search, we are interested in what happens when insertions and removals occur simultaneously. Let profit(fi, fr] be the amount saved when fi and fr are swapped. We claim it can be expressed as
for a properly defined function extra(fi,fr). Note that the profit will be negative if the neighboring solution is worse than S. To find the correct specification of extra, we observe that, for every customer u, one of the following cases must hold:
c. d(u,fi) < di(u) < d^(u). Customer u should be reassigned to /», with a profit of d\(u) — d(u,fi) (correctly accounted for in the computation of gain(fi)). However, the loss of di(u) — di(u) predicted in the computation of loss(fi) will not occur. The defintion of extra(fi,fr) must handle cases (26) and (2c) above, in which wrong predictions were made. Corrections can be expressed straightforwardly as individual summations, one for each case:
1. i (u) ^ fr (the customer was not assigned to fr before the swap). We may save something by reassigning it to fi, and the amount we save is included in gain(fi). 2. fa (u) = fr (the customer was assigned to fr before the swap). Three subscases present themselves:
These summations can be merged into one:
a. di(u) < (£3(11) < d(u,fi). Customer u should be assigned to 02(w), which was exactly the assumption made during the computation of loss(fr). The loss corresponding to this reassignment is therefore already taken care of. b. di(u) < d(u,fi) < d2(u). Customer u should be reassigned to fi, but during the computa- 4.2 Local Search. Assume all values of loss, gain, tion of loss(fr) we assumed it would be reas- and extra can be efficiently precalculated and stored in signed to
121
Figure 2: Pseudocode for updating arrays in the local search procedure a matrix for extra). Then, we can find the best swap in O(pm) time by computing the profits associated with every pair of candidates using Equation 4.3. To develop an efficient method to precompute gain, loss, and extra, we note that every entry in these structures is a summation over some subset of users (see Equations 4.1, 4.2, and 4.4). Moreover, the contribution of each user can be computed separately. Function updateStructures, shown in Figure 2, does exactly that. It takes as input a user u and its closest facilities (given by 0i and 02), and updates the contents of loss, gain, and extra. To compute all three structures from scratch, we just need to reset them (set all positions to zero) and call updateStructures once for each user. Together, these n calls perform precisely the summations defined in Equations 4.1, 4.2, and 4.4. We now have all the elements necessary to build a full local search algorithm in O(mn) time. In O(pri) time, compute 0i and 02 for all users. In O(pm) time, reset loss, gain, and extra. With n calls to updateStructures, each made in O(m) time, determine their actual values. Finally, in O(pm) time, find the best swap.
pute 0i and 02 > reset the other arrays, and then call updateStructures again for all users. But we can potentially do better than that. The actions performed by updateStructures depend only on u, 0i(w), and 02(u)] no value is read from other structures. Therefore, if 0i (u) and 02 (u) do not change from one iteration to another, there is no need to call updateStructures again for u. Given that, we consider a user u to be affected if there is a change in either 0i(w) or 02 (u) (or both) after a swap is made. Sufficient conditions for u to be affected after a swap between fi and fr are: (1) one of the two closest facilities is fr itself; or (2) the new facility is closer to u than the original 02 (u) is. Contributions to loss, gain, and extra need to be updated only for affected users. If the number of affected users is small (it often is) significant gains can be obtained. Note, however, that we cannot simply call updateStructures for these users, since this function simply adds new contributions. Previous contributions must be subtracted before new additions are made. We therefore need a function similar to updateStructures, with subtractions instead of additions.2 This function (call it undoUpdateStructures) must be called for all affected users before 0i and 02 are recomputed. Figure 3 contains the pseudocode for the entire local search procedure, already taking into account the observations just made. Apart from the functions just discussed, three others appear in the code. The first, resetStructures, just sets all entries in the auxiliary structures to zero. The second, findBestNeighbor, runs through these structures and finds the most prof-
4.3 Acceleration. So far, our implementation seems to be merely a more complicated alternative to Whitaker's; after all, both have the same worst-case complexity. Furthermore, our implementation has the clear disadvantage of requiring an O(pm)-sized matrix, while 0(n) memory positions are enough in Whitaker's. The extra memory, however, allows for significant accelerations, as this section will show. When a certain facility fr is replaced by a new facility fi, values hi gain, loss, extra, 0i, and 02 be4 This function is identical to the one shown in Figure 2, with come inaccurate. A straighforward way to update instead of incrementing all occurrences of <— replaced with them for the next local search iteration is to recomvalues, we decrement them.
122
Figure 3: Pseudocode for the local search procedure itable swap using Equation 4.3. It returns which facility to remove (/r), the one to replace it (/*), and the profit itself (profit). The third function is updateClosest, which updates 0i and 02, possibly using the fact that the facility recently opened was fi and the one closed was fr. The pseudocode reveals three potential bottlenecks of the algorithm: updating the auxiliarly data structures (loss, gain, and extra), updating closeness information, and finding the best neighbor once the updates are done. We now analyze each of these in turn. 4.3.1 Closeness. Updating closeness information, in our experience, has proven to be a relatively cheap operation. Deciding whether the newly inserted facility fi becomes either the closest or the second closest facility to each user is trivial and can be done in O(n) total time. A more costly operation is finding a new second closest facility for customers who had fr (the facility removed) as either the closest or the second closest element. Updating each of these users requires O(p) time, but since there usually are few of them, the total time spent tends to be small fraction of the entire local search procedure. One should also note that, in some settings, finding the set of closest and second closest elements from scratch is itself a cheap operation. For example, in the graph setting, where distances between customers and facilities are given by shortest paths on an underlying graph, this can be accomplished in O(|JE7|) time [11],
where \E\ is the number of edges in the graph. For experiments in this paper, however, specialized routines were not implemented; we always assume arbitrary distance matrices. 4.3.2 Best Neighbor. The number of potential swaps in a given solution is p(m — p). The straighforward way to find the most profitable is to compute profit(fi,fr) (as defined by Equation 4.3) for all pairs and pick the best, which requires 0(pra) operations. In practice, however, the best move can be found in less time. As defined, extra(fi,fr) can be interpreted as a measure of the interaction between the neighborhoods of fr and fi. After all, as Equation 4.4 shows, only users that have fr as their current closest facility and are also close to fi contribute to extra(fi,fr). In particular, if there are no users in this situation, extra(fi, fr) will be zero. It turns out that, in practice, this occurs rather frequently, especially for larger values of p, when the average number of vertices assigned to each /r is relatively small. Therefore, instead of storing extra as a full matrix, one may consider a sparse representation in which only nonzero elements are explicit: each row becomes a linked list sorted by column number. A drawback of the sparse representation (in general) is the impossibility to make random accesses in O(l) time. Fortunately, for our purposes, this is not necessary; updateStructures, undoUpdateStructures, and best Neighbor (the only functions that access the matrix) can be implemented
123
so as to go through each row sequentially. With the sparse matrix representation, one can implement bestNeighbor as follows. First, determine the facility /» that maximizes gain(fi) and the facility fr that minimizes loss(fr). Since all values in extra are nonnegative, this pair is at least as profitable as any pair (fi',frr) for which extra(fi',fr') is zero. Then, compute the exact profits (given by Equation 4.3) for all nonzero elements hi extra. The whole procedure takes O(m + Xpm) time, where A is the fraction of pairs whose extra value is nonzero. This tends to be smaller as p increases. An interesting side-effect of using sparse matrices is that they often need significantly less memory than the standard full matrix representation.
for the p-median problem. The third, RW, is introduced here as a particularly hard case for our method. Class TSP corresponds to three sets of points on the plane (with cardinality 1400, 3038, and 5934), originally used in the context of the traveling salesman problem [6]. In the case of the p-median problem, each point is both a user to be served and a potential facility, and distances are Euclidean. Following [4], we tested several values of p for each instance, ranging from 10 to approximately n/3. Class ORLIB, originally introduced in [1], contains 40 graphs with 100 to 900 nodes, each with a suggested value of p (ranging from 5 to 200). Each node is both a user and a potential facility, and distances are given by shortest paths in the graph. All-pairs shortest paths are computed in advance for all methods tested, as it is usually done in the literature [3, 4]. Each instance in class RW is a square matrix in which entry (w,/) (an integer taken uniformly at random from the interval [l,n]) represents the cost of assigning user u to facility /. Four values of n were tested (100, 250, 500, and 1000), each with values of p ranging from 10 to n/2, totaling 27 combinations.3 The program that created these instances (using the random number generator by Matsumoto and Nishimura [5]) is available from the authors upon request. All tests were performed on an SGI Challenge with 28 196-MHz MIPS R10000 processors (with each execution of the program limited to one processor) and 7.6 GB of memory. All algorithms were coded in C++ and compiled with the SGI MlPSpro C++ compiler (v. 7.30) with flags -03 -OPT:01imit=6586. All running times shown in this paper are CPU times, measured with the getrusage function, whose precision is 1/60 second. In some cases, actual running times were too small for this precision; therefore, each algorithm was repeatedly run for at least 5 seconds; overall times were measured, and averages reported here. For each instance tested, all methods were applied to the same initial solution, obtained by a greedy algorithm [13]: starting from an empty solution, we insert one facility at a time, always picking the one that reduces the solution cost the most. Running times mentioned in this paper refer to the local search only, they do not include the construction of the initial solution.
4.3.3 Updates. As we have seen, keeping track of affected users can reduce the number of calls to updateStructures. We now study how to reduce the time spent in each of these calls. Consider the pseudocode in Figure 2. Line 5 represents a loop through all facilities not in the solution, but line 6 shows that we can actually restrict ourselves to facilities whose distance to u (the candidate customer) is no greater than d% (the distance from u to its second closest facility). This may be a relatively small subset of the facilities, especially when p is large. This suggests a preprocessing step that builds, for each user w, a list with all facilities sorted in increasing order by their distance to u. During the local search, whenever we need the set of facilities whose distance to u is less than d^, we just take the appropriate prefix of the precomputed list — potentially much smaller than m. Building these lists takes O(nralogra) time, but it is done only once, not in every iteration of the local search procedure. This is true even if local search is applied several times within a metaheuristic (as in [3, 9], for instance): we still need to perform the preprocessing step only once. A more serious drawback of this approach is memory usage. Keeping n lists of size m requires 9 (ran) memory positions, which may be prohibitive. On the other hand, one should expect to need only small prefixes most of the time. Therefore, instead of keeping the whole list in memory, it might be good enough to keep only prefixes. The list would then be used as a cache: if d-2, is small enough, we just take a prefix of the candidate list; if it is larger than the largest element represented, 5.2 Results. This section presents an experimental we look at all possible neighbors. comparison between several variants of our implemen-
5 Empirical Analysis
3
More precisely: for n = 100, we used p = 10, 20, 30, 40, and 5.1 Instances and Methodology. We tested our 50; for n = 250, p = 10, 25, 50, 75, 100, and 125; for n = 500, algorithm on three classes of problems. Two of them, p = 10, 25, 50, 100, 150, 200, and 250; and for n = 1000, p = 10, TSP and ORLIB, are commonly studied in the literature 25, 50, 75, 100, 200, 300, 400, and 500.
124
tation and Whitaker's method, which will be referred to here as Fl (fast interchange). We implemented Fl based on the pseudocode in [3] (obtaining comparable running times); the key function is presented here in Figure 1. The same routine for updating closeness information (described in Section 4.3.1) was used for all methods (including Fl). We start with the most basic version of our implementation, in which extra is represented as a full (non-sparse) matrix. This version (called FM, for full matrix) incorporates some acceleration, since calls to updateStructures are limited to affected users only. However, it does not include the accelerations suggested in Sections 4.3.2 (sparse matrix) and 4.3.3 (preprocessing). For each instance tested, we computed the speedup obtained by our method when compared to Fl, i.e., the ratio between the running times of Fl and FM. Table 1 shows the best, (geometric) mean, and worst speedups thus obtained considering all instances in each class.4 Values larger than one favor our method, FM.
Table 2: Speedup obtained by SM (sparse matrix, no preprocessing) over Whitaker's Fl. CLASS
BEST
MEAN
WORST
ORLIB RW TSP
17.21
3.10 5.26 26.18
0.74 0.75 1.72
32.39 147.71
the somewhat larger TSP instances). However, bad cases become even worse. This happens mostly for instances with small values of p: with the number of nonzero elements in the matrix relatively large, a sparse representation is not the best choice. The last acceleration we study is the preprocessing step (Section 4.3.3), in which all potential facilities are sorted according to their distances from each of the users. Results for this variation (SMP, for sparse matrix with preprocessing) are presented in Table 3. Columns 2, 3, and 4 consider running times of the local Table 1: Speedup obtained by FM (full matrix, no search procedure only; columns 5, 6, and 7 also include preprocessing times. preprocessing) over Whitaker's Fl. CLASS
BEST
MEAN
WORST
ORLIB RW TSP
12.72 12.42 31.14
3.01 4.14 11.68
0.84 0.88 1.85
The table shows that even the basic acceleration scheme achieves speedups of up to 30 for some particularly large instances. There are cases, however, in which FM is actually slower than Whitaker's method. This usually happens for smaller instances (with n or p small), in which the local search procedure performs very few iterations, insufficent to ammortize the overhead of using a matrix. On average, however, FM has proven to be from three to almost 12 times faster than Fl. We now analyze a second variant of our method. Instead of using a full matrix to represent extra, we use a sparse matrix, as described in Section 4.3.2. We call this variant SM. The results, obtained by the same process as above, are presented in Table 2. As expected, SM has proven to be even faster than FM on average and in the best case (especially for 4 Since we are dealing with ratios, geometric (rather than arithmetic) means seem to be a more sensible choice; after all, if a method takes twice as much time for 50% of the instances and half as much for the other 50%, it should be considered roughly equivalent to the other method. Geometric means reflect that, whereas arithmetic means do not.
Table 3: Speedup obtained by SMP (sparse matrix, full preprocessing) over Whitaker's Fl. CLASS
LOCAL SEARCH ONLY BEST MEAN WORST
ORLIB RW TSP
67.0 113.9 862.1
8.7 15.1 177.6
1.30 1.40 3.27
INCL. PREPROCESSING BEST MEAN WORST
7.5 9.6 79.2
1.2 2.1 20.3
0.22 0.18 1.33
The table shows that the entire SMP procedure (including preprocessing) is in general still much faster than Whitaker's FI, but often worse than the other variants studied in this paper (FM and SM). However, as already mentioned, metaheuristics often need to run the local search procedure several times, starting from different solutions. Since preprocessing is run only once, its cost can be quickly amortized. Based on columns 2, 3, and 4 of the table, it is clear that, once this happens, SMP can achieve truly remarkable speedups with respect not only to FI, but also to the other variants studied in this paper. In the best case (instance rl5934 with p = 1000), it is almost 900 times faster than FI.

A more detailed analysis of this particular instance (rl5934, the largest we tested) is presented in Figure 4. It shows how p (the number of facilities to open) affects the running times of all four methods studied (FI, FM, SM, and SMP), and also some variants "between" SM and SMP. Recall that in SMP every user keeps a list of all facilities sorted by distance; SM keeps no list at all. In a variant of the form SMq, each user keeps a limited list with the qm/p closest facilities (this is the "cache" version described in Section 4.3.3). Running times in the graph do not include preprocessing, which takes approximately one minute for this particular instance.

Since all methods discussed here implement the same algorithm, the number of iterations does not depend on the method itself. It does, however, depend on the value of p: in general, these two have a positive correlation. For some methods, such as Whitaker's FI and the full-matrix variation of our implementation (FM), an increase in p leads to greater running times (although our method is still 10 times faster for p = 1500). For SMP, which uses sparse matrices, time spent per iteration tends to decrease even faster as p increases: the effect of swaps becomes more local, with fewer users affected and fewer neighboring facilities visited in each call to updateStructures. This latter effect explains why keeping even a relatively small list of neighboring facilities for each user seems to be worthwhile. The curves for variants SMP and SM5 are practically indistinguishable in Figure 4, and both are much faster than SM.

Figure 4: Instance rl5934: dependency of running times on p for different methods. Times are in logarithmic scale and do not include preprocessing.
6 Concluding Remarks
We have presented a new implementation of the swap-based local search for the p-median problem introduced by Teitz and Bart. Through the combination of several techniques (using a matrix to store partial results, a compressed representation for this matrix, and preprocessing), we were able to obtain speedups of up to three orders of magnitude with respect to the best previously known implementation, due to Whitaker. Our implementation is especially well suited to relatively large instances and, due to the preprocessing step, to situations in which the local search procedure is run several times for the same instance (such as within a metaheuristic). For small instances, Whitaker's implementation can still be faster, but not by a large margin.

Two lines of research suggest themselves from this work. First, it is still possible to improve the performance of our method, especially for small instances. One might consider, for example, incorporating the preprocessing step into the main procedure; this would allow operations to be performed as needed, instead of in advance (thus avoiding useless computations). A second line of research would be to test our method as a building block of more elaborate metaheuristics.
References

[1] J. E. Beasley. A note on solving large p-median problems. European Journal of Operational Research, 21:270-273, 1985.
[2] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
[3] P. Hansen and N. Mladenovic. Variable neighborhood search for the p-median. Location Science, 5:207-226, 1997.
[4] P. Hansen, N. Mladenovic, and D. Perez-Brito. Variable neighborhood decomposition search. Journal of Heuristics, 7(3):335-350, 2001.
[5] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3-30, 1998.
[6] G. Reinelt. TSPLIB: A traveling salesman problem library. ORSA Journal on Computing, 3:376-384, 1991. http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/.
[7] E. Rolland, D. A. Schilling, and J. R. Current. An efficient tabu search procedure for the p-median problem. European Journal of Operational Research, 96:329-342, 1996.
[8] K. E. Rosing. An empirical investigation of the effectiveness of a vertex substitution heuristic. Environment and Planning B, 24:59-67, 1997.
[9] K. E. Rosing and C. S. ReVelle. Heuristic concentration: Two stage solution construction. European Journal of Operational Research, 97:75-86, 1997.
[10] M. B. Teitz and P. Bart. Heuristic methods for estimating the generalized vertex median of a weighted graph. Operations Research, 16(5):955-961, 1968.
[11] M. Thorup. Quick k-median, k-center, and facility location for sparse graphs. In Proceedings of the 28th International Colloquium on Automata, Languages and Programming (ICALP 2001), volume 2076 of Lecture Notes in Computer Science, pages 249-260. Springer, 2001.
[12] S. Voss. A reverse elimination approach for the p-median problem. Studies in Locational Analysis, 8:49-58, 1996.
[13] R. Whitaker. A fast algorithm for the greedy interchange of large-scale clustering and median location problems. INFOR, 21:95-108, 1983.
Fast Prefix Matching of Bounded Strings

Adam L. Buchsbaum*   Glenn S. Fowler*   Balachander Krishnamurthy*   Kiem-Phong Vo*   Jia Wang*

*AT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA, {alb,gsf,bala,kpv,jiawang}@research.att.com.

Abstract
Longest Prefix Matching (LPM) is the problem of finding which string from a given set is the longest prefix of another, given string. LPM is a core problem in many applications, including IP routing, network data clustering, and telephone network management. These applications typically require very fast matching of bounded strings, i.e., strings that are short and based on small alphabets. We note a simple correspondence between bounded strings and natural numbers that maps prefixes to nested intervals so that computing the longest prefix matching a string is equivalent to finding the shortest interval containing its corresponding integer value. We then present retries, a fast and compact data structure for LPM on general alphabets. Performance results show that retries often outperform previously published data structures for IP look-up. By extending LPM to general alphabets, retries admit new applications that could not exploit prior LPM solutions designed for IP look-ups.
1 Introduction
Longest Prefix Matching (LPM) is the problem of determining from a set of strings the longest one that is a prefix of some other, given string. LPM is at the heart of many important applications. Internet Protocol (IP) routers [14] routinely forward packets by computing from their routing tables the longest bit string that forms a prefix of the destination address of each packet. Krishnamurthy and Wang [20] describe a method to cluster Web clients by identifying a set of IP addresses that with high probability are under common administrative control and topologically close together. Such clustering information has applications ranging from network design and management to providing on-line quality-of-service differentiation based on the origin of a request. The proposed clustering approach is network aware in that addresses are grouped based on prefixes in snapshots of Border Gateway Protocol (BGP) routing tables. Telephone network management and marketing applications often classify regions in the country by area codes or combinations of area codes and the first few digits of the local phone numbers. For example, the state of New Jersey is identified by area codes such as 201, 908, and 973. In turn, Morris County in New Jersey is identified by longer
telephone prefixes like 908876 and 973360. These applications typically require computing, in seconds or minutes, summaries of calls originating and terminating at certain locations from daily streams of telephone calls, up to hundreds of millions of records at a time. This requires very fast classification of telephone numbers by finding the longest matching telephone prefixes.

Similar to other string matching problems [17, 19, 27] with practical applications [1, 5], LPM solutions must be considered in the context of the intended use to maximize performance. The LPM applications discussed above have some common characteristics:

• Look-ups overwhelmingly dominate updates of the prefix sets. A router may route millions of packets before its routing table changes. Similarly, telephone number classifications rarely change, but hundreds of millions of phone calls are made daily.

• The look-up rate is extremely demanding. IP routing and clustering typically require LPM performance of 200 nanoseconds per look-up or better. This severely limits the number of machine instructions and memory references allowed.

• Prefixes and strings are bounded in length and based on small alphabets. For example, current IP addresses are 32-bit strings, and U.S. telephone numbers are 10-digit strings.

The first two characteristics mean that certain theoretically appealing solutions based on, e.g., suffix trees [22], string prefix matching [3, 4], or dynamic string searching [13] are not applicable, as their performance would not scale. Fortunately, the third characteristic means that specialized data structures can be designed with the desired performance levels. There are many papers in the literature proposing schemes to solve the IP routing problem [8, 9, 10, 11, 12, 21, 25, 28, 29] with various tradeoffs based on memory consumption or memory hierarchies. We are not aware of any published work that generalizes to bounded strings such as telephone numbers, however.

Work on routing Internet packets [21] exploits a simple relationship between IP prefixes and nested intervals of natural numbers. We generalize this idea to a correspondence between bounded strings and natural numbers, which shows that solutions to one instance of LPM may be usable for other instances. We present retries, a novel, fast, and compact data structure for LPM on general alphabets; and we perform simulation experiments based on trace data from real applications. On many test sets, retries outperform other published data structures for IP routing, often by significant margins. By extending LPM to general alphabets, retries also admit new applications that could not exploit prior LPM solutions designed for IP look-ups.
  (a) prefix set (value):   00100000/3 (a)   00101000/5 (b)   11000000/2 (c)   11010000/4 (d)
  (b) nested intervals:     [32,63]          [40,47]          [192,255]        [208,223]
  (c) disjoint intervals:   [32,39] (a)   [40,47] (b)   [48,63] (a)   [192,207] (c)   [208,223] (d)   [224,255] (c)
      disjoint prefixes:    00100000/5    00101000/5    00110000/4    11000000/4      11010000/4      11100000/3
Figure 1: (a) An example prefix set, with associated values, for matching 8-bit strings; (b) corresponding nested intervals; (c) corresponding disjoint intervals and the equivalent set of disjoint prefixes.

2 Prefixes and Intervals
Let A be an alphabet of finite size α = δ + 1. Without loss of generality, assume that A is the set of natural numbers in the range [0, δ]. Otherwise, map A's elements to their ranks in any fixed, arbitrary order. We can then think of elements of A as digits in base α, so that a string s = s_1 s_2 ... s_k over A represents an integer v = s_1 α^{k-1} + s_2 α^{k-2} + ... + s_k. We denote ι(s) = v and σ(v) = s. When we work with fixed-length strings, we shall let σ(v) have enough 0's padded on the left to gain this length. For example, when the string 1001 represents a number in base 2, ι(1001) is the decimal value 9. Conversely, in base 3 and with prescribed length 6, σ(9) is the string 000100. Clearly, for any two strings s and t with equal lengths, ι(s) < ι(t) if and only if s precedes t lexicographically.

2.1 Longest matching prefixes and shortest containing intervals. Let m be some fixed integer. Consider A^{≤m} and A^m, respectively the sets of strings over A with lengths ≤ m and lengths exactly equal to m. Let P ⊆ A^{≤m}, and with each p ∈ P let there be associated a data value; the data values need not be mutually distinct. We define an LPM instance (A, m) as the problem of finding the data value associated with the longest string in P that is a prefix of some string s ∈ A^m. P is commonly called the prefix set, and its elements are called prefixes. Following a convention in IP routing, we shall write s/k to indicate the length-k prefix of a string s (k ≤ len(s)).

To show examples of results as they develop, we shall use the binary alphabet A = {0, 1} and maximum string length m = 8. Figure 1(a) shows an example prefix set of four strings and associated values. For example, the first string in the set would best match the string 00100101, yielding result a. On the other hand, the second string would best match 00101101, yielding b.

For any string s in A^{≤m}, let s^0 = s0...0 and s^δ = sδ...δ be two strings in which enough 0's and δ's, respectively, are padded on the right to gain length m, and define the interval of s as I(s) = [ι(s^0), ι(s^δ)]. Thus, we have:

LEMMA 2.1. Let s be a string in A^{≤m} and v < α^m. Then s is a prefix of σ(v) if and only if v is in I(s).

For any prefix set P, we use I(P) to denote the set of intervals associated with prefixes in P. Now consider two prefixes p_1 and p_2 and their corresponding intervals I(p_1) and I(p_2). Applying Lemma 2.1 to the endpoints of these intervals shows that either the intervals are completely disjoint or one is contained in the other. Furthermore, I(p_1) contains I(p_2) if and only if p_1 is a prefix of p_2. Next, when s has length m, ι(s^0) = ι(s^δ) = ι(s). Lemma 2.1 asserts that if p is a prefix of s then I(p) must contain ι(s). The nested property of intervals in a prefix set P then gives:

THEOREM 2.1. Let P be a prefix set and s a string in A^m. Then p is the longest prefix matching s if and only if I(p) is the shortest interval in I(P) containing ι(s).

Figure 1(b) shows the correspondence between prefixes and intervals. For example, the string 00101101 with numerical value 45 would have [40, 47] as the shortest containing interval, giving b as the matching result. Two intervals that are disjoint or nested are called nested intervals.
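To make the string-to-integer mapping concrete, the following C sketch (an illustration added here, not code from the paper; the function names are invented) computes ι(s) and the interval I(s) of a prefix for the binary alphabet with m = 8, reproducing the interval [40,47] of the prefix 00101000/5 from Figure 1.

    #include <stdio.h>

    /* iota_val(): integer value of a digit string s of length k in base alpha
       (the mapping written iota(s) in Section 2). */
    static unsigned long iota_val(const int *s, int k, unsigned long alpha) {
        unsigned long v = 0;
        for (int i = 0; i < k; i++)
            v = v * alpha + s[i];
        return v;
    }

    /* interval(): endpoints [iota(s0...0), iota(s delta...delta)] of a length-k
       prefix s, padded on the right to full length m. */
    static void interval(const int *s, int k, int m, unsigned long alpha,
                         unsigned long *lo, unsigned long *hi) {
        unsigned long v = iota_val(s, k, alpha), span = 1;
        for (int i = k; i < m; i++) span *= alpha;   /* alpha^(m-k) */
        *lo = v * span;                /* pad with 0's              */
        *hi = v * span + span - 1;     /* pad with delta = alpha-1  */
    }

    int main(void) {
        int p[] = { 0, 0, 1, 0, 1 };   /* the length-5 prefix 00101000/5 of Figure 1 */
        unsigned long lo, hi;
        interval(p, 5, 8, 2, &lo, &hi);
        printf("I(00101000/5) = [%lu,%lu]\n", lo, hi);   /* prints [40,47] */
        return 0;
    }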
    while (lo <= hi) {
        /* Find the largest power A[i] = alpha^i such that the block
           [lo, lo + A[i] - 1] is aligned at lo and fits inside [lo, hi]. */
        for (i = 0; i < m; ++i)
            if ((lo % A[i+1]) != 0 || (lo + A[i+1] - 1) > hi)
                break;
        itv2pfx(lo, lo + A[i] - 1);   /* emit the prefix covering this block */
        lo += A[i];
    }

Figure 2: Constructing the prefixes covering interval [lo, hi].
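As a small sanity check on the covering computation (again an added illustration, not the paper's code), the standalone program below re-implements the loop of Figure 2 for α = 2 and m = 8 and prints the covering subintervals of [40,63]; the output, [40,47] and [48,63], matches the disjoint prefixes 00101000/5 and 00110000/4 of Figure 1(c).

    #include <stdio.h>

    int main(void) {
        unsigned long A[9];                 /* A[i] = alpha^i for alpha = 2, m = 8 */
        int m = 8, i;
        unsigned long lo = 40, hi = 63;

        A[0] = 1;
        for (i = 1; i <= m; i++) A[i] = A[i - 1] * 2;

        while (lo <= hi) {
            /* largest aligned block of size alpha^i starting at lo and inside [lo,hi] */
            for (i = 0; i < m; i++)
                if ((lo % A[i + 1]) != 0 || (lo + A[i + 1] - 1) > hi)
                    break;
            printf("[%lu,%lu] -> prefix of length %d\n", lo, lo + A[i] - 1, m - i);
            lo += A[i];
        }
        return 0;
    }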
Theorem 2.1 enables treating the LPM problem as that of managing a collection of mutually nested intervals with the following basic operations.

Insert(a, b, v). Insert a new interval [a, b] with associated data value v; [a, b] must contain or be contained in any interval it intersects.

Retract(a, b). Delete the existing interval [a, b].
Get(p). Determine the value associated with the shortest interval, if any, that contains the integer p.

When m and α are small, standard computer integer types suffice to store the integers arising from strings and interval endpoints. This allows construction of practical data structures for LPM based on integer arithmetic.
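One elementary way to realize Get with plain integer arithmetic, once the intervals have been made disjoint (the conversion is described next), is binary search over the sorted endpoints. The sketch below is an added illustration of this idea using the disjoint intervals of Figure 1(c); it is not the retrie of Section 3.

    #include <stdio.h>

    /* Disjoint intervals of Figure 1(c), sorted by left endpoint, with values. */
    static const unsigned lo[]  = { 32, 40, 48, 192, 208, 224 };
    static const unsigned hi[]  = { 39, 47, 63, 207, 223, 255 };
    static const char     val[] = { 'a', 'b', 'a', 'c', 'd', 'c' };
    static const int      n     = 6;

    /* get(): value of the interval containing p, or '-' if no prefix matches.
       With disjoint intervals, binary search on the endpoints suffices. */
    static char get(unsigned p) {
        int l = 0, r = n - 1;
        while (l <= r) {
            int m = (l + r) / 2;
            if (p < lo[m])      r = m - 1;
            else if (p > hi[m]) l = m + 1;
            else                return val[m];
        }
        return '-';
    }

    int main(void) {
        printf("%c\n", get(45));   /* 00101101 -> 45, inside [40,47]: prints b */
        printf("%c\n", get(100));  /* no matching prefix: prints -             */
        return 0;
    }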
Using single values in an interval to generate prefixes is inefficient. Let [lo, hi] be some interval where lo < α^m and hi < α^m. Figure 2 shows an algorithm (in C) for constructing prefixes from [lo, hi]. Simple induction on the quantity hi - lo + 1 shows that the algorithm constructs the minimal set of subintervals covering [lo, hi] such that each subinterval corresponds to a single prefix in A^{≤m}. We assume an array A[] such that A[i] has the value α^i. The function itv2pfx() converts an interval into a prefix by inverting the process described earlier of mapping a prefix into an interval. Such a prefix will have length m - i.

Given a nested set of intervals I, we can construct a minimal set of prefixes P(I) such that (1) I(p) and I(q) are disjoint for p ≠ q; and (2) finding the shortest interval in I containing some integer i is the same as finding the longest prefix in P(I) matching σ(i):

1. Sort the intervals so that every interval precedes the intervals nested inside it.
2. Build a new set of intervals by adding the sorted intervals in order. When an interval [i, j] is added, if it is contained in some existing interval [k, l], then in addition to adding [i, j], replace [k, l] with at most two new disjoint intervals, [k, i - 1] and [j + 1, l], whenever they are of positive length.
3. Merge adjacent intervals that have the same data values.
4. Apply the algorithm in Figure 2 to each of the resulting intervals to construct the new prefixes.

Figure 1(c) shows how the nested intervals are split into disjoint intervals. These intervals are then transformed into a new collection of prefixes. A property of the new prefixes is that none of them can be a prefix of another. From now on, we assert that every considered prefix set P shall represent disjoint intervals. If not, we convert it into the equivalent set of prefixes P(I(P)) as discussed.

2.2 Equivalence among LPM instances and prefix sets. A data structure solving an (A, m) instance can sometimes be used for other instances, as follows. Let (B, n) be another instance of the LPM problem with β the size of B and n the maximal string length. Suppose that α^m ≥ β^n. Since the integers corresponding to strings in B^{≤n} are less than α^m, they can be represented as strings in A^{≤m}. Furthermore, let J(p) be the interval corresponding to a prefix p in B^{≤n}. Each integer in J(p) can be considered an interval of length 1, so it is representable as a prefix of length m in A^{≤m}. Thus, each string and prefix set in (B, n) can be translated into some other string and prefix set in (A, m). We have shown:

THEOREM 2.2. Let (A, m) and (B, n) be two instances of the LPM problem in which the sizes of A and B are α and β respectively. Then any data structure solving LPM on (A, m) can be used to solve (B, n) as long as α^m ≥ β^n.

Theorem 2.2 asserts that any LPM data structure for one type of string can be used for other LPM instances as long as alphabet sizes and string lengths are within bounds. For example, 15-digit international telephone numbers fit in 64-bit integers, so data structures for fast look-ups of IPv4 32-bit addresses are potentially usable, with proper extensions, for telephone number matching. Unfortunately, many of these are too highly tuned for IP routing to be effective in the other applications that we consider, such as network address clustering and telephone number matching (Section 4). We next describe the retrie data structure for fast LPM queries in A^{≤m}. We compare it to prior art in Section 5.
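As an added illustration of the integer encoding behind Theorem 2.2 (the helper name and constants are ours, not the paper's), the sketch below maps a U.S. telephone prefix, with alphabet {0,...,9} and fixed length m = 10, to its interval of 10-digit numbers; since 10^10 < 2^64, such instances fit comfortably into 64-bit integer arithmetic.

    #include <stdio.h>

    /* Interval of a telephone prefix under the mapping of Section 2, with
       alphabet {0,...,9} and fixed length m = 10 (U.S. numbers). */
    static void phone_interval(const char *prefix, unsigned long long *lo,
                               unsigned long long *hi) {
        unsigned long long v = 0, span = 1;
        int k = 0;
        while (prefix[k]) { v = v * 10 + (prefix[k] - '0'); k++; }
        for (int i = k; i < 10; i++) span *= 10;   /* 10^(10-k) */
        *lo = v * span;
        *hi = v * span + span - 1;
    }

    int main(void) {
        unsigned long long lo, hi;
        phone_interval("908876", &lo, &hi);   /* a Morris County, NJ prefix */
        printf("[%llu,%llu]\n", lo, hi);      /* prints [9088760000,9088769999] */
        return 0;
    }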
3 The Retrie Data Structure

3.1 The basic retrie scheme. Let P be a prefix set over A^{≤m}. Each prefix in P is associated with some data value, an integer in a given range [0, D]. We could build a table of size α^m that covers the contents of the intervals in P. Such a flat table supports matching with a single index operation but is usually far too large, so a retrie instead uses a hierarchy of smaller tables. The top-level table is indexed by some number of left digits of a given string. Each entry in this table points to another table, indexed by some of the following digits, and so on. As such, there are two types of tables: internal and leaf. An entry of an internal table has a pointer to the next-level table and indicates whether that table is an internal or leaf table. An entry of a leaf table contains the data associated with the prefix matched by that entry. All internal tables are kept in a single array Node, and all leaf tables are kept in a single array Leaf. We show later how to minimize the total space used.

The size of a leaf entry depends on the maximum data value associated with any given prefix. For example, if the maximum data value is < 2^8, then a leaf entry can be stored in a single byte, while a maximum data value between 2^8 and 2^16 means that leaf data must be stored using 2 bytes, etc. For fast computation, the size of an internal-table entry is chosen so that the entry would fit in some convenient integer type. We partition the bits of this type into three fields: index, type, and offset. Index specifies the number of digits used to index the next-level table, which thus has α^index entries; type is a single bit, which if 1 indicates that the next level is internal, and if 0 indicates that the next level is leaf; offset specifies the offset into the Node or Leaf array at which the next-level table begins.

Let w be the word size in bits of an internal-entry type. If b_o bits are reserved for offset, then the maximum size for the Node and Leaf arrays is α^{⌊b_o/lg α⌋}, the largest power of α no greater than 2^{b_o}, which thus upper bounds the size of any table. Now if b_i bits are reserved for index, then the maximum table size is also α^{2^{b_i}-1}. Therefore, b_i and b_o should be chosen so that these two values are about equal, i.e., so that b_i is about lg(1 + b_o/lg α), keeping in mind that b_i + b_o + 1 = w. In practice, we often use 32-bit integers for internal-table entries. For α = 2, we thus choose b_o = 26 and b_i = 5, since lg(1 + 26/lg 2) is about 4.8. These choices allow the sizes of the Node and Leaf arrays to be up to 2^26, which is ample for our applications.

Given a retrie built for some prefix set P ⊆ A^{≤m}, let root be a single internal-table entry that describes how to index the top-level table. Let A[] be an integer array such that A[i] = α^i. Now let s be a string in A^m with integer value sv = ι(s). Figure 3 shows the algorithm to compute the data value associated with the LPM of s.

    for (node = root, shift = m; ; sv %= A[shift]) {
        shift -= node >> (obits+1);              /* index field: digits consumed at this level    */
        if (node & (1 << obits))                 /* type bit set: next level is an internal table */
            node = Node[(node & ((1 << obits) - 1)) + sv/A[shift]];
        else                                     /* type bit clear: next level is a leaf table    */
            return Leaf[(node & ((1 << obits) - 1)) + sv/A[shift]];
    }

Figure 3: Searching a retrie for a longest prefix; obits is the number of bits reserved for offset.
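To make the field layout concrete, the following added sketch (not the paper's code; the retrie builder may pack entries differently) packs an internal-table entry with b_o = 26 offset bits, one type bit, and five index bits in a 32-bit word, and unpacks it with shifts and masks analogous to those used by the search loop of Figure 3.

    #include <stdio.h>

    #define OBITS 26   /* offset bits (b_o); 1 type bit; 5 index bits remain in a 32-bit word */

    /* Pack an internal-table entry: 'index' digits are consumed at the next level,
       'internal' says whether the next level is an internal (1) or leaf (0) table,
       and 'offset' is its starting position in the Node or Leaf array. */
    static unsigned mk_entry(unsigned index, unsigned internal, unsigned offset) {
        return (index << (OBITS + 1)) | (internal << OBITS) | offset;
    }

    int main(void) {
        unsigned e = mk_entry(3, 1, 1024);
        printf("index  = %u\n", e >> (OBITS + 1));          /* 3    */
        printf("type   = %u\n", (e >> OBITS) & 1u);         /* 1    */
        printf("offset = %u\n", e & ((1u << OBITS) - 1));   /* 1024 */
        return 0;
    }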
Figure 4: A retrie data structure.
Figure 4 shows a 3-level retrie for the prefix set shown in Figure 1(c). The Node array begins with the top-level internal table. Indices to the left of table entries are in binary and with respect to the head of the corresponding table within the Node or Leaf array. Each internal-table entry has three fields as discussed. All internal-table entries with offset = nil indicate some default value for strings without any matching prefix. For example, the string 00101101 is matched by first stripping off the starting 2 bits, 00, to index entry 0 of the top-level table. The type of this entry is 1, indicating that the next level is another internal table. The offset of the entry points to the base of this table. The index of the entry indicates that one bit should be used to index the next level. Then the indexed entry in the next-level table points to a leaf table. The entries of this table are properly defined so that the fourth and fifth bits of the string, 01, index the entry with the correct data: b.

A retrie with k levels enables matching with at most k indexing operations. This guarantee is important in applications such as IP forwarding. Smaller k's mean larger look-up tables, so it is important to ensure that a retrie with k levels uses minimal space. We next discuss how to do this using dynamic programming.

3.2 Optimizing the basic retrie scheme. Given a prefix set P, define len(P) = max{len(p) : p ∈ P}.
Figure 5: Dynamic program to compute the optimal size of a retrie.

For 1 ≤ i < len(P), let P^{≤i} be the subset of prefixes with lengths ≤ i. Then let L(P, i) be the partition of P - P^{≤i} into equivalence classes induced by the left i digits. That is, each part Q in L(P, i) consists of all prefixes longer than i and having the same first i digits. Now, let strip(Q, i) be the prefixes in Q with their left i digits stripped off. Such prefixes represent disjoint intervals in the LPM instance (A, m - i). Finally, let c_d be the size of a data (leaf-table) entry and c_t the size of an internal-table entry. The dynamic program in Figure 5 computes S(P, k), the optimal size of a retrie data structure for a prefix set P using at most k levels. The first case states that a leaf table is built whenever the prefix set is empty or only a single level is allowed. The second case recurses to compute the optimal retrie and its size. The first part of the minimization considers the case when a single leaf table is constructed. The second part of the minimization considers the cases when internal tables may be constructed. In these cases, the first term c_t α^i expresses the size of the constructed internal table to be indexed by the first i digits of a string. The second term c_d |P^{≤i}| expresses the fact that each prefix short enough to end at the given level requires a single leaf-table entry for its data value. The last term Σ_{Q ∈ L(P,i)} S(strip(Q, i), k - 1) recurses down each set strip(Q, i) to compute a retrie with optimal size for it.
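The formula of Figure 5 itself is not reproduced in this text. From the term-by-term description above, the recurrence has roughly the following shape (a reconstruction; in particular, the handling of an empty prefix set is folded into the leaf case rather than copied from the figure):

    S(P,k) = \begin{cases}
      c_d\,\alpha^{len(P)} & \text{if } P=\emptyset \text{ or } k=1,\\[4pt]
      \min\Big(c_d\,\alpha^{len(P)},\ \min_{1\le i<len(P)}\big[\,c_t\,\alpha^{i} + c_d\,|P^{\le i}|
        + \textstyle\sum_{Q\in L(P,i)} S(strip(Q,i),\,k-1)\,\big]\Big) & \text{otherwise.}
    \end{cases}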
Each set strip(Q, i) is uniquely identified by the string formed from the digits stripped off from the top level of the recursion until strip(Q, i) is formed. The number of such strings is bounded by |P| len(P). Each such set contributes at most one term to any partition of the len(P) bits in a prefix. For k < len(P), the dynamic program examines all partitions of [1, len(P)] with at most k parts. The number of such partitions is O(len(P)^{k-1}). Thus, we have:

THEOREM 3.1. S(P, k) can be computed in O(|P| len(P)^k) time.

In practice, len(P) is bounded by a small constant, e.g., 32 for IP routing and 10 for U.S. phone numbers. Since there cannot be more than len(P) levels, the dynamic program essentially runs in time linear in the number of prefixes.

3.3 Superstring lay-out of leaf tables. The internal and leaf tables are sequences of elements. In the dynamic program, we consider each table to be instantiated in the Node or Leaf array as it is created. We can reduce memory usage by exploiting redundancies, especially in the leaf tables.
For example, in IP routing, the leaf tables often contain many similar, long runs of relatively few distinct data values. Computing a short superstring of the tables reduces space very effectively. Since computing shortest common superstrings (SCS) is MAX-SNP hard [2], we experiment with three heuristics.

1. The trivial superstring is formed by catenating the leaf tables.
2. The left and right ends of the leaf tables are merged in a best-fit order.
3. A superstring is computed using a standard greedy SCS approximation [2].

Both methods 2 and 3 effectively reduce space usage. (See Section 4.) In practice, however, method 2 gives the best trade-off between computation time and space. Finally, it is possible to add superstring computation of leaf tables to the dynamic program to estimate more accurately the actual sizes of the leaf tables. This would better guide the dynamic program to select an optimal overall lay-out. The high cost of superstring computation makes this impractical, however. Thus, the superstring lay-out of leaf tables is done only after the dynamic program finishes.

4 Applications and Performance
We consider three applications: IP routing, network clustering, and telephone service marketing. Each exhibits different characteristics. In current IP routing, the strings are 32 bits long, and the number of distinct data values, i.e., next-hops, is small. For network clustering, we merge several BGP tables together and use either the prefixes or their lengths as data values so that after a look-up we can retrieve the matching prefix itself. In this case, either the number of data values is large or there are many runs of data values. For routing and clustering, we compared retries to data structures with publicly available code: LC-tries [25], which are conceptually quite similar, and the compressed-table data structure of Crescenzi, Dardini, and Grossi (CDG) [8], which is among the fastest IP look-up data structures reported in the literature. (See Section 5.) We used the authors' code for both benchmarks and included in our test suite the FUNET router table and traffic trace used in the LC-trie work [25]. These data structures are designed specifically for IP prefix matching. The third application is telephone service marketing, in which strings are telephone numbers.
4.1 IP routing. Table 1 and Figure 6 summarize the routing tables we used. They report how many distinct prefixes and next-hops each contained and the sizes of the data structures built on each. Retrie-FL (resp. -LR, -GR) is a depth-2 retrie with catenated (resp. left-right/best-fit merge, greedy) record layout. For routing, we limited depth to 2 to emphasize query time. We use deeper retries in Section 4.2. ATT is a routing table from an AT&T BGP router; FUNET is from the LC-trie work [25]; TELSTRA comes from Telstra Internet [30]; the other tables are described by Krishnamurthy and Wang [20].

We timed the LPM queries engendered by router traffic. Since we lacked real traffic traces for the tables other than ATT and FUNET, we constructed random traces by choosing, for each table, 100,000 prefixes uniformly at random (with replacement) and extending them to full 32-bit addresses. We used each random trace in the order generated and also with the addresses sorted lexicographically to present locality that might be expected in a real router. We generated random traces for the ATT and FUNET tables as well, to juxtapose real and random data. We processed each trace through the respective table 100 times, measuring average LPM query time; each data structure was built from scratch for each unique prefix table. We also recorded data structure build times. We performed this evaluation on two machines: an SGI Challenge (400 MHz R12000) with split 32 KB L1 data and instruction caches, 8 MB unified L2 cache, and 12 GB main memory, running IRIX 6.5; and a 1 GHz Pentium III with split 16 KB L1 data and instruction caches, 256 KB unified L2 cache, and 256 MB main memory, running Linux 2.4.6. Each time reported is the median of five runs.

Table 2 reports the results, and Figures 7-9 plot the query times. LC-tries were designed to fit in L2 cache and do so on all the tables on the SGI but none on the Pentium. Retries behave similarly, although they were not designed to be cache resident. CDG fits in the SGI cache on AADS, FUNET, MAE-WEST, and PAIX. Retries uniformly outperformed LC-tries, sometimes by an order of magnitude, always by a factor exceeding three. CDG significantly outperformed retries on the real and sorted random traces for FUNET on the Pentium, but this advantage disappeared for the random trace and also for all the FUNET traces on the SGI. This suggests the influence of caching effects. Also, the numbers of prefixes and next-hops for FUNET were relatively low, and CDG is sensitive to these sizes. On the larger tables (ATT, MAE-WEST, OREGON, and TELSTRA), retries significantly outperformed CDG, even for the real and sorted random traces for ATT (on both machines). As routing tables are continually growing, with 250,000 entries expected by the year 2005, we expect that retries will outperform CDG on real data. Finally, the FUNET trace was filtered to zero out the low-order 8 bits of each address for privacy purposes [24] and is likely not a true trace for the prefix table, which contains some prefixes longer than 24 bits.

The data suggest that the non-trivial superstring retrie variations significantly reduce retrie space. As might be expected, the greedy superstring approximation is comparatively slow, but the best-fit merge runs with little time degradation over the trivial superstring method and still provides significant space savings. The FUNET results, in particular on the real and sorted random traces, suggest that CDG benefits from the ordering of memory accesses more than retries benefit from the superstring layouts. A first-fit superstring merging strategy might be useful in testing this hypothesis. There is a nearly uniform improvement in look-up times from retrie-FL to retrie-LR to retrie-GR even though each address look-up performs exactly the same computation and memory accesses in all three cases. This suggests beneficial effects from the hardware caches. We believe that this is due to the overlapping of leaf tables in retrie-LR and retrie-GR, which both minimizes space usage and increases the hit rates for similar next-hop values.

The data also suggest that LPM queries on real traces run significantly faster than on random traces. Again this suggests beneficial cache effects, from the locality observed in our IP traffic traces. Real trace data is thus critical for accurate measurements, although random data seem to provide an upper bound to real-world performance. Finally, while retries take longer to build than LC-tries (and sometimes CDG), build time (for -FL and -LR) is acceptable, and query time is more critical to routing and on-line clustering, which we assess next.

Table 1: Number of entries, next-hops, and data structure sizes for tables used in IP routing experiment.

                                          Data struct. size (KB)
  Routing table   Entries   Next-hops   retrie-FL   retrie-LR   retrie-GR     lctrie        CDG
  AADS              32505          38     1069.49      866.68      835.61     763.52    4446.37
  ATT               71483          45     2508.79     2231.89     2180.21    1659.52   15601.92
  FUNET             41328          18      506.57      433.00      411.36     967.36     682.93
  MAE-WEST          71319          38     1241.14     1040.37     1000.52    1654.06    5520.26
  OREGON           118190          33     3828.93     3107.78     3035.73    2711.16   12955.85
  PAIX              17766          28      912.94      741.92      723.85     417.74    3241.31
  TELSTRA          104096         182     2355.03     2023.08     1971.78    2384.66    9863.96
Table 2: Build and query times for routing. Build times are in ms, query times in ns; dashes mark Traffic columns for tables with no real traffic trace (all but ATT and FUNET).

                                   SGI                                  Pentium
  Routing    Data         Build   Query (ns)                    Build   Query (ns)
  table      struct.      (ms)    Traffic  Sort.rand.  Rand.    (ms)    Traffic  Sort.rand.  Rand.
  AADS       retrie-FL      195        -       15        20       150        -       25        48
             retrie-LR      225        -       15        20       180        -       24        42
             retrie-GR     4157        -       15        20      3010        -       22        40
             lctrie          113       -      160       215       100        -      153       380
             CDG             214       -       17       144       220        -       25        69
  ATT        retrie-FL      459       18       16        31       370       20       32        83
             retrie-LR      518       17       16        22       440       18       31        68
             retrie-GR     9506       17       16        22      7220       19       31        66
             lctrie          245     163      159       214       270      146      181       452
             CDG            1011      64       39       224      1010       42       43        85
  FUNET      retrie-FL      166       14       14        18       120       14       20        28
             retrie-LR      190       14       14        17       130       14       21        24
             retrie-GR     1650       14       14        17      1020       14       19        24
             lctrie          136     134      149       199       140      111      153       381
             CDG             102      15       14        20        70        8       14        26
  MAE-WEST   retrie-FL      340        -       14        21       270        -       28        65
             retrie-LR      388        -       15        20       310        -       26        52
             retrie-GR     8241        -       15        21      6950        -       26        49
             lctrie          233       -      155       210       250        -      176       454
             CDG             325       -       19       152       310        -       33        73
  OREGON     retrie-FL      447        -       15        67       350        -       34        76
             retrie-LR      512        -       16        23       390        -       31        59
             retrie-GR     8252        -       16        23      7740        -       31        56
             lctrie          383       -      155       341       420        -      206       464
             CDG             949       -       26       142      1050        -       41        81
  PAIX       retrie-FL      118        -       15        20        90        -       24        36
             retrie-LR      132        -       14        18        90        -       22        30
             retrie-GR     2100        -       14        18      1360        -       22        28
             lctrie           66       -      162       213        60        -      146       284
             CDG             136       -       15       123       150        -       22        58
  TELSTRA    retrie-FL      508        -       16        27       420        -       37        86
             retrie-LR      572        -       16        21       480        -       31        62
             retrie-GR    10000        -       15        21      8510        -       31        56
             lctrie          343       -      155       311       390        -      201       458
             CDG             599       -       26       180       610        -       47        83
4.2 Network clustering. For clustering, we combined the routing tables used above. There were 168,161 unique prefixes in the tables. The goal of clustering is to recover the actual matching prefix for an IP address, thereby partitioning the addresses into equivalence classes [20]. PREF assigns each resulting prefix itself as the data value. LEN assigns the length of each prefix as its data value, which is sufficient to recover the prefix, given the input address. PREF, therefore, has 168,161 distinct data values, whereas LEN has only 32. We built depth-2 and -3 retries and LC-tries for PREF and LEN. Table 3 and Figure 10 detail the data structure sizes. Note the difference in retrie sizes for the two tables.
Table 3: Data structure sizes for tables used in clustering experiment.

                             Data struct. size (KB)
                      depth-2 retrie                     depth-3 retrie
  Table          -FL        -LR        -GR          -FL        -LR        -GR       lctrie
  PREF      13554.87   13068.88   13054.80      3181.63    2878.24    2889.90      3795.91
  LEN        5704.37    4938.74    4785.66      1400.84    1045.53     990.33      3795.91
Table 4: Build and query times for clustering. Build times are in ms, query times in ns.

                                          depth-2 retrie              depth-3 retrie
  Machine   Table   Operation          -FL     -LR     -GR         -FL     -LR     -GR     lctrie
  SGI       PREF    build             1732    2028   25000        6919    7478   43000        801
                    query (Apache)      20      19      19          36      35      35        136
                    query (EW3)         20      19      19          36      36      36        139
            LEN     build             1299    1495   28000        4299    4476   25000        588
                    query (Apache)      15      16      16          32      32      32        135
                    query (EW3)         15      16      16          33      32      32        139
  Pentium   PREF    build             1300    1570   19000        4670    5060   32000        750
                    query (Apache)      26      26      26          41      40      40        121
                    query (EW3)         27      27      27          43      42      42        129
            LEN     build              990    1170   25000        2860    3070   18000        640
                    query (Apache)      21      21      21          35      34      34        117
                    query (EW3)         23      21      21          37      35      35        124
The relative sparsity of data values in LEN produces a much smaller Leaf array, which can also be more effectively compressed by the superstring methods. Also note the space reduction achieved by depth-3 retries compared to depth-2 retries. Depth-3 retries are smaller than LC-tries for this application, yet, as we will see, outperform the latter. CDG could not be built on either PREF or LEN. CDG assumes a small number of next-hops and exceeded memory limits for PREF on both machines. CDG failed on LEN, because the number of runs of equal next-hops was too large. Here the difference between the IP routing and clustering applications of LPM becomes striking: retries work for both applications, but CDG cannot be applied to clustering.

We timed the clustering of Apache and EW3 server logs. Apache contains IP addresses recorded at www.apache.org. The 3,461,361 records gathered in late 1997 had 274,844 unique IP addresses. EW3 contains IP addresses of clients visiting three large corporations whose content was hosted by AT&T's Easy World Wide Web. Twenty-three million entries were gathered during February and March, 2000, representing 229,240 unique IP addresses. The experimental setup was as in the routing assessment. In an on-line clustering application, such as in a Web server, the data structures are built once (e.g., daily), and addresses are clustered as they arrive.
Thus, query time is paramount. Retries significantly outperform LC-tries for this application, even at depth 3, as shown in Table 4 and Figure 11.

4.3 Telephone service marketing. In our telephone service marketing application, the market is divided into regions, each of which is identified by a set of telephone prefixes. Given daily traces of telephone calls, the application classifies the callers by regions, updates usage statistics, and generates a report. Such reports may help in making decisions on altering the level of advertisement in certain regions. For example, the set of prefixes identifying Morris County, NJ, includes 908876 and 973360. Thus, a call originating from a telephone number of the form 973360XXXX would match Morris County, NJ.

Table 5 shows performance results (on the SGI) from using a retrie to summarize telephone usage in different regions of the country for the first half of 2001. The second column shows the number of call records per month used in the experiment. Since this application is off-line, we consider the total time required to classify all the numbers. The third column shows this time (in seconds) for retries, and the fourth column contrasts the time using a binary search approach for matching. This shows the benefit of retries for this application.
Table 5: Time to classify telephone numbers on the SGI.

  Month      Counts       Retrie (s)   Bsearch (s)
  1        27,479,712        24.35        83.51
  2        25,510,814        22.37        74.73
  3        28,993,583        25.49        84.60
  4        28,452,823        24.94        80.76
  5        29,786,302        26.11        84.86
  6        28,874,669        25.27        80.79
Previous IP look-up data structures, which we review next, do not currently extend to this alphabet, although they could be extended using the correspondence between bounded strings and natural numbers.

5 Comparison to Related Work
The popularity of the Internet has made IP routing an important area of research. Several LPM schemes for binary strings were invented in this context. The idea of using ranges induced by prefixes to perform IP look-ups was suggested by Lampson, Srinivasan, and Varghese [21] and later analyzed by Gupta, Prabhakar, and Boyd [16] to guarantee worst-case performance. Ergun et al. [11] considered biased access to ranges. Feldmann and Muthukrishnan [12] generalized the idea to packet classification. We generalized this idea to non-binary strings and showed that LPM techniques developed for strings based on one alphabet can also be used for strings based on another. Thus, under the right conditions, the data structures invented for IP routing can be used for general LPM. Encoding strings over arbitrary alphabets as reals and searching in that representation goes back at least to arithmetic coding; see, e.g., Cover and Thomas [7].

Retries are in the general class of multi-level table look-up schemes used for both hardware [15, 18, 23] and software [9, 10, 25, 28, 29] implementations for IP routing. Since modern machines use memory hierarchies with sometimes dramatically different performance levels, some of these works attempt to build data structures conforming to the memory hierarchies at hand. Both the LC-trie scheme of Nilsson and Karlsson [25] and the multi-level table of Srinivasan and Varghese [29] attempt to optimize for L2 caches by adjusting the number of levels to minimize space usage. Efficient implementations, however, exploit the binary alphabet of IP addresses and prefixes.

Cheung and McCanne [6] took a more general approach to dealing with memory hierarchies that includes the use of prefix popularity. They consider a multi-level table scheme similar to retries and attempt to minimize the space usage
of popular tables so that they fit into the fast caches. Since the cache sizes are limited, they must solve a complex constrained optimization problem to find the right data structure. L1 caches on most machines are very small, however, so much of the gain comes from fitting a data structure into L2 caches. In addition, the popularity of prefixes is a dynamic property and not easy to approximate statically. Thus, we focus on bounding the number of memory accesses and minimizing memory usage. We do this by (1) separating internal tables from leaf tables so that the latter can use small integer types to store data values; (2) using dynamic programming to optimize the lay-out of internal and leaf tables given some bound on the number of levels, which also bounds the number of memory accesses during queries; and (3) using a superstring approach to minimize space usage of the leaf tables. The results in Section 4 show that we achieve both good look-up performance and small memory usage.

Crescenzi, Dardini, and Grossi [8] introduced a compressed-table data structure for IP look-up. The key idea is to identify runs induced by common next-hops among the 2^32 implicit prefixes to compress the entire table with this information. This works well when the number of distinct next-hops is small and there are few runs, which is mostly the case in IP routing. The compressed-table data structure is fast, because it bounds the number of memory accesses per match. Unfortunately, in network clustering applications, both the number of distinct next-hop values and the number of runs can be quite large. Thus, this technique cannot be used in such applications. Section 4 shows that retries often outperform the compressed-table data structure for IP routing and use much less space.

6 Conclusions
We considered the problem of performing LPM on short strings with limited alphabets. We showed how to map such strings into the integers so that small strings would map to values representable in machine words. This enabled the use of standard integer arithmetic for prefix matching. We then presented retries, a novel, multi-level table data structure for LPM. Experimental results were presented showing that
retries outperform other comparable data structures.

A number of open problems remain. A dynamic LPM data structure that performs queries empirically fast remains elusive. Build times for static structures are acceptable for present applications, but the continual growth of routing tables will likely necessitate dynamic solutions in the future. As with general string matching solutions, theoretically appealing approaches, e.g., based on interval trees [26], do not exploit some of the peculiar characteristics of these applications. Feldmann and Muthukrishnan [12] report partial progress. We have a prototype based on nested-interval maintenance but have not yet assessed its performance. Our results demonstrate that LPM data structures perform much better on real trace data than on randomly generated data. Investigating the cache behavior of LPM data structures on real data thus seems important. Compiling benchmark suites of real data is problematic given the proprietary nature of such data, so work on modeling IP address traces for experimental purposes is also worthwhile.

Acknowledgments
We thank John Linderman for describing the use of prefixes in telephone service marketing. We also thank the anonymous reviewers, who made several helpful comments.

References

[1] K. C. R. C. Arnold. Screen Updating and Cursor Movement Optimization: A Library Package. 4.2BSD UNIX Programmer's Manual, 1977.
[2] A. Blum, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. J. ACM, 41(4):630-47, 1994.
[3] D. Breslauer. Fast parallel string prefix-matching. Theor. Comp. Sci., 137(2):268-78, 1995.
[4] D. Breslauer, L. Colussi, and L. Toniolo. On the comparison complexity of the string prefix-matching problem. J. Alg., 29(1):18-67, 1998.
[5] Y.-F. Chen, F. Douglis, H. Huang, and K.-P. Vo. TopBlend: An efficient implementation of HtmlDiff in Java. In Proc. WebNet '00, 2000.
[6] G. Cheung and S. McCanne. Optimal routing table design for IP address lookup under memory constraints. In Proc. 18th IEEE INFOCOM, volume 3, pages 1437-44, 1999.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
[8] P. Crescenzi, L. Dardini, and R. Grossi. IP address lookup made fast and simple. In Proc. 7th ESA, volume 1643 of LNCS, pages 65-76. Springer-Verlag, 1999.
[9] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink. Small forwarding tables for fast routing lookups. In Proc. ACM SIGCOMM '97, pages 3-14, 1997.
[10] W. Doeringer, G. Karjoth, and M. Nassehi. Routing on longest-matching prefixes. IEEE/ACM Trans. Netwk., 4(1):86-97, 1996. Erratum: 5(1):600, 1997.
[11] F. Ergun, S. Mittra, S. C. Sahinalp, J. Sharp, and R. K. Sinha. A dynamic lookup scheme for bursty access patterns. In Proc. 20th IEEE INFOCOM, volume 3, pages 1444-53, 2001.
[12] A. Feldmann and S. Muthukrishnan. Tradeoffs for packet classification. In Proc. 19th IEEE INFOCOM, volume 3, pages 1193-202, 2000.
[13] P. Ferragina and R. Grossi. A fully-dynamic data structure for external substring search. In Proc. 27th ACM STOC, pages 693-702, 1995.
[14] V. Fuller, T. Li, J. Yu, and K. Varadhan. Classless Inter-Domain Routing (CIDR): An Address Assignment and Aggregation Strategy. Internet Engineering Task Force (www.ietf.org), 1993. RFC 1519.
[15] P. Gupta, S. Lin, and N. McKeown. Routing lookups in hardware at memory access speeds. In Proc. 17th IEEE INFOCOM, volume 3, pages 1240-7, 1998.
[16] P. Gupta, B. Prabhakar, and S. Boyd. Near-optimal routing lookups with bounded worst case performance. In Proc. 19th IEEE INFOCOM, volume 3, pages 1184-92, 2000.
[17] D. S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4):664-75, 1977.
[18] N.-F. Huang, S.-M. Zhao, J.-Y. Pan, and C.-A. Su. A fast IP routing lookup scheme for gigabit switching routers. In Proc. 18th IEEE INFOCOM, volume 3, pages 1429-36, 1999.
[19] G. Jacobson and K.-P. Vo. Heaviest increasing/common subsequence problems. In Proc. 3rd CPM, volume 644 of LNCS, pages 52-65. Springer-Verlag, 1992.
[20] B. Krishnamurthy and J. Wang. On network-aware clustering of Web clients. In Proc. ACM SIGCOMM '00, pages 97-110, 2000.
[21] B. Lampson, V. Srinivasan, and G. Varghese. IP lookups using multiway and multicolumn search. IEEE/ACM Trans. Netwk., 7(3):324-34, 1999.
[22] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262-72, 1976.
[23] A. Moestedt and P. Sjodin. IP address lookup in hardware for high-speed routing. In Proc. Hot Interconnects VI, Stanford Univ., 1998.
[24] S. Nilsson. Personal communication, 2001.
[25] S. Nilsson and G. Karlsson. IP-address lookup using LC-tries. IEEE J. Sel. Area. Comm., 17(6):1083-92, 1999.
[26] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, 1988.
[27] D. Sankoff and J. B. Kruskal. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparisons. Addison-Wesley, Reading, MA, 1983.
[28] K. Sklower. A tree-based routing table for Berkeley UNIX. In Proc. USENIX Winter 1991 Tech. Conf., pages 93-104, 1991.
[29] V. Srinivasan and G. Varghese. Fast address lookup using controlled prefix expansion. ACM Trans. Comp. Sys., 17(1):1-40, 1999.
[30] Telstra Internet. http://www.telstra.net/ops/bgptab.txt.
Figure 6: Data structure sizes for tables used in IP routing experiment.
Figure 7: Query times for routing using real traffic data.
Figure 8: Query times for routing using sorted random traffic data.
Figure 9: Query times for routing using random traffic data.
Figure 10: Data structure sizes for tables used in clustering experiment.
Figure 11: Query times for clustering.
AUTHOR INDEX
Anderegg, L., 106
Applegate, D. L., 1
Arge, L., xi, 82
Atalay, F. B., 56
Buchsbaum, A. L., 128
Buriol, L. S., 1
Danner, A., 82
Demaine, E. D., xiii
Devillers, O., 37
Dillard, B. L., 1
Eidenbenz, S., 106
Fowler, G. S., 128
Gantenbein, M., 106
Gkantsidis, C., 16
Hershberger, J., 26
Johnson, D. S., 1
Krishnamurthy, B., 128
Kumar, P., 45
Maxel, M., 26
Mihail, M., 16
Mitchell, J. S. B., 45
Mount, D. M., 56
Navarro, G., 69
Paredes, R., 69
Pion, S., 37
Resende, M. G. C., 119
Shor, P. W., 1
Sinha, R., 93
Stamm, C., 106
Suri, S., 26
Taylor, D. S., 106
Teh, S.-M., 82
Vo, K.-P., 128
Wang, J., 128
Weber, B., 106
Werneck, R. F., 119
Widmayer, P., 106
Yildirim, E. A., 45
Zegura, E., 16
Zobel, J., 93