PERFORMANCE EVALUATION AND PLANNING METHODS FOR THE NEXT GENERATION INTERNET
GERAD 25th Anniversary Series
Essays and Surveys in Global Optimization, Charles Audet, Pierre Hansen, and Gilles Savard, editors
Graph Theory and Combinatorial Optimization, David Avis, Alain Hertz, and Odile Marcotte, editors
Numerical Methods in Finance, Hatem Ben-Ameur and Michèle Breton, editors
Analysis, Control and Optimization of Complex Dynamic Systems, El-Kébir Boukas and Roland Malhamé, editors
Column Generation, Guy Desaulniers, Jacques Desrosiers, and Marius M. Solomon, editors
Statistical Modeling and Analysis for Complex Data Problems, Pierre Duchesne and Bruno Rémillard, editors
Performance Evaluation and Planning Methods for the Next Generation Internet, André Girard, Brunilde Sansò, and Felisa Vázquez-Abad, editors
Dynamic Games: Theory and Applications, Alain Haurie and Georges Zaccour, editors
Logistics Systems: Design and Optimization, André Langevin and Diane Riopel, editors
Energy and Environment, Richard Loulou, Jean-Philippe Waaub, and Georges Zaccour, editors
PERFORMANCE EVALUATION AND PLANNING METHODS FOR THE NEXT GENERATION INTERNET
Edited by
ANDRÉ GIRARD
GERAD and INRS-Énergie, Matériaux et Télécommunications

BRUNILDE SANSÒ
GERAD and École Polytechnique de Montréal

FELISA VÁZQUEZ-ABAD
GERAD and Université de Montréal
Springer
André Girard, GERAD & INRS-Télécommunications, Montréal, Canada
Brunilde Sansò, GERAD & École Polytechnique de Montréal, Montréal, Canada
Felisa Vázquez-Abad, GERAD and Université de Montréal, Montréal, Canada
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10: 0-387-25550-8
ISBN-13: 978-0387-25550-7
ISBN: 0-387-25551-6 (e-book)

Printed on acid-free paper.

© 2005 by Springer Science+Business Media, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1

springeronline.com
SPIN 11053163
Foreword
GERAD celebrates its 25th anniversary this year. The Center was created in 1980 by a small group of professors and researchers of HEC Montréal, McGill University and the École Polytechnique de Montréal. GERAD's activities achieved sufficient scope to justify its conversion in June 1988 into a joint research centre of HEC Montréal, the École Polytechnique de Montréal and McGill University. In 1996, the Université du Québec à Montréal joined these three institutions. GERAD has fifty members (professors), more than twenty research associates and postdoctoral students, and more than two hundred master's and Ph.D. students.

GERAD is a multi-university center and a vital forum for the development of operations research. Its mission is defined around the following four complementary objectives:
• the original and expert contribution to all research fields in GERAD's areas of expertise;
• the dissemination of research results in the best scientific outlets as well as to society in general;
• the training of graduate students and postdoctoral researchers;
• the contribution to the economic community by solving important problems and providing transferable tools.

GERAD's research thrusts and fields of expertise are as follows:
• development of mathematical analysis tools and techniques to solve the complex problems that arise in management sciences and engineering;
• development of algorithms to solve such problems efficiently;
• application of these techniques and tools to problems posed in related disciplines, such as statistics, financial engineering, game theory and artificial intelligence;
• application of advanced tools to the optimization and planning of large technical and economic systems, such as energy systems, transportation/communication networks, and production systems;
• integration of scientific findings into software, expert systems and decision-support systems that can be used by industry.
One of the landmark events in the celebration of GERAD's 25th anniversary is the publication of ten volumes covering most of the Center's research areas of expertise. The list follows: Essays and Surveys in Global Optimization, edited by C. Audet, P. Hansen and G. Savard; Graph Theory and Combinatorial Optimization,
edited by D. Avis, A. Hertz and O. Marcotte; Numerical Methods in Finance, edited by H. Ben-Ameur and M. Breton; Analysis, Control and Optimization of Complex Dynamic Systems, edited by E.K. Boukas and R. Malhamé; Column Generation, edited by G. Desaulniers, J. Desrosiers and M.M. Solomon; Statistical Modeling and Analysis for Complex Data Problems, edited by P. Duchesne and B. Rémillard; Performance Evaluation and Planning Methods for the Next Generation Internet, edited by A. Girard, B. Sansò and F. Vázquez-Abad; Dynamic Games: Theory and Applications, edited by A. Haurie and G. Zaccour; Logistics Systems: Design and Optimization, edited by A. Langevin and D. Riopel; Energy and Environment, edited by R. Loulou, J.-P. Waaub and G. Zaccour.

I would like to express my gratitude to the editors of the ten volumes, to the authors who accepted with great enthusiasm to submit their work, and to the reviewers for their volunteer work and timely response. I would also like to thank Mrs. Nicole Paradis, Francine Benoit and Louise Letendre and Mr. André Montpetit for their excellent editing work.

The GERAD group has earned its reputation as a worldwide leader in its field. This is certainly due to the enthusiasm and motivation of GERAD's researchers and students, but also to the funding and infrastructure available. I would like to seize this opportunity to thank the organizations that, from the beginning, believed in the potential and value of GERAD and have supported it over the years: HEC Montréal, the École Polytechnique de Montréal, McGill University, the Université du Québec à Montréal and, of course, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds québécois de la recherche sur la nature et les technologies (FQRNT).

Georges Zaccour
Director of GERAD
Avant-propos
Le Groupe d'études et de recherche en analyse des décisions (GERAD) fête cette année son vingt-cinquième anniversaire. Fondé en 1980 par une poignée de professeurs et chercheurs de HEC Montréal engagés dans des recherches en équipe avec des collègues de l'Université McGill et de l'École Polytechnique de Montréal, le Centre comporte maintenant une cinquantaine de membres, plus d'une vingtaine de professionnels de recherche et stagiaires post-doctoraux et plus de 200 étudiants des cycles supérieurs. Les activités du GERAD ont pris suffisamment d'ampleur pour justifier en juin 1988 sa transformation en un Centre de recherche conjoint de HEC Montréal, de l'École Polytechnique de Montréal et de l'Université McGill. En 1996, l'Université du Québec à Montréal s'est jointe à ces institutions pour parrainer le GERAD.

Le GERAD est un regroupement de chercheurs autour de la discipline de la recherche opérationnelle. Sa mission s'articule autour des objectifs complémentaires suivants :
• la contribution originale et experte dans tous les axes de recherche de ses champs de compétence;
• la diffusion des résultats dans les plus grandes revues du domaine ainsi qu'auprès des différents publics qui forment l'environnement du Centre;
• la formation d'étudiants des cycles supérieurs et de stagiaires postdoctoraux;
• la contribution à la communauté économique à travers la résolution de problèmes et le développement de coffres d'outils transférables.

Les principaux axes de recherche du GERAD, en allant du plus théorique au plus appliqué, sont les suivants :
• le développement d'outils et de techniques d'analyse mathématiques de la recherche opérationnelle pour la résolution de problèmes complexes qui se posent dans les sciences de la gestion et du génie;
• la confection d'algorithmes permettant la résolution efficace de ces problèmes;
• l'application de ces outils à des problèmes posés dans des disciplines connexes à la recherche opérationnelle telles que la statistique, l'ingénierie financière, la théorie des jeux et l'intelligence artificielle;
• l'application de ces outils à l'optimisation et à la planification de grands systèmes technico-économiques comme les systèmes énergétiques, les réseaux de télécommunication et de transport, la logistique et la distributique dans les industries manufacturières et de service;
• l'intégration des résultats scientifiques dans des logiciels, des systèmes experts et dans des systèmes d'aide à la décision transférables à l'industrie.

Le fait marquant des célébrations du 25e du GERAD est la publication de dix volumes couvrant les champs d'expertise du Centre. La liste suit : Essays and Surveys in Global Optimization, édité par C. Audet, P. Hansen et G. Savard; Graph Theory and Combinatorial Optimization, édité par D. Avis, A. Hertz et O. Marcotte; Numerical Methods in Finance, édité par H. Ben-Ameur et M. Breton; Analysis, Control and Optimization of Complex Dynamic Systems, édité par E.K. Boukas et R. Malhamé; Column Generation, édité par G. Desaulniers, J. Desrosiers et M.M. Solomon; Statistical Modeling and Analysis for Complex Data Problems, édité par P. Duchesne et B. Rémillard; Performance Evaluation and Planning Methods for the Next Generation Internet, édité par A. Girard, B. Sansò et F. Vázquez-Abad; Dynamic Games: Theory and Applications, édité par A. Haurie et G. Zaccour; Logistics Systems: Design and Optimization, édité par A. Langevin et D. Riopel; Energy and Environment, édité par R. Loulou, J.-P. Waaub et G. Zaccour.

Je voudrais remercier très sincèrement les éditeurs de ces volumes, les nombreux auteurs qui ont très volontiers répondu à l'invitation des éditeurs à soumettre leurs travaux, et les évaluateurs pour leur bénévolat et ponctualité. Je voudrais aussi remercier Mmes Nicole Paradis, Francine Benoit et Louise Letendre ainsi que M. André Montpetit pour leur travail expert d'édition.

La place de premier plan qu'occupe le GERAD sur l'échiquier mondial est certes due à la passion qui anime ses chercheurs et ses étudiants, mais aussi au financement et à l'infrastructure disponibles. Je voudrais profiter de cette occasion pour remercier les organisations qui ont cru dès le départ au potentiel et à la valeur du GERAD et nous ont soutenus durant ces années. Il s'agit de HEC Montréal, l'École Polytechnique de Montréal, l'Université McGill, l'Université du Québec à Montréal et, bien sûr, le Conseil de recherche en sciences naturelles et en génie du Canada (CRSNG) et le Fonds québécois de la recherche sur la nature et les technologies (FQRNT).

Georges Zaccour
Directeur du GERAD
Contents
Foreword  v
Avant-propos  vii
Contributing Authors  xi
Preface  xiii

1  Design of IP Networks with End-to-End Performance Guarantees
   I. Atov and R.J. Harris  1
2  Design of IP Virtual Private Networks under End-to-end QoS Constraints
   E.C.G. Wille, M. Mellia, E. Leonardi, and M.A. Marsan  35
3  Design of Protected Working Capacity Envelopes Based on p-Cycles: An Alternative Framework for Survivable Automated Lightpath Provisioning
   G. Shen and W.D. Grover  63
4  Network Traffic Engineering with Varied Levels of Protection in the Next Generation Internet
   S. Srivastava, S.R. Thirumalasetty, and D. Medhi  99
5  Balancing Traffic Flows in Resilient Packet Rings
   P. Kubat and J. MacGregor Smith  125
6  Game-Theoretic Resource Pricing for the Next Generation Internet
   B.M. Ninan and M. Devetsikiotis  141
7  A New Approach to Policy-Based Routing in the Internet
   B.R. Smith and J.J. Garcia-Luna-Aceves  165
8  Advanced Methods for the Estimation of the Origin Destination Traffic Matrix
   S. Vaton, J.S. Bedo and A. Gravey  189
9  Energy and Cost Optimizations in Wireless Sensor Networks: A Survey
   V. Mhatre and C. Rosenberg  227
10 Duality-Based TCP Congestion Control with Error Analysis
   M. Mehyar, D. Spanos, and S.H. Low  249
11 Fast Algorithmic Solutions to Multi-dimensional Birth-Death Processes with Applications to Telecommunication Systems
   L.D. Servi  269
12 A New Paradigm for On-Line Management of Communication Networks with Multiplicative Feedback Control
   H. Yu and C.G. Cassandras  297
13 Comparing Locality of Reference - Some Folk Theorems for the Miss Rate and the Output of Caches
   A.M. Makowski and S. Vanichpun  333
Contributing Authors

IRENA ATOV
Swinburne University of Technology, Australia
[email protected]

JEAN-SEBASTIEN BEDO
École Polytechnique, Palaiseau, France

CHRISTOS G. CASSANDRAS
Boston University, USA
[email protected]

MICHAEL DEVETSIKIOTIS
North Carolina State University, USA
[email protected]

JOSE JOAQUIN GARCIA-LUNA-ACEVES
University of California, Santa Cruz, USA
[email protected]

ANNIE GRAVEY
ENST Bretagne, France
[email protected]

WAYNE D. GROVER
TRLabs and University of Alberta, Canada
[email protected]

RICHARD J. HARRIS
Massey University, New Zealand
[email protected]

PETER KUBAT
Verizon Laboratories, USA
[email protected]

EMILIO LEONARDI
Politecnico di Torino, Italy
[email protected]

STEVEN H. LOW
California Institute of Technology, USA
[email protected]

JAMES MACGREGOR SMITH
University of Massachusetts, USA
[email protected]

ARMAND M. MAKOWSKI
University of Maryland
[email protected]

MARCO AJMONE MARSAN
Politecnico di Torino, Italy
[email protected]

DEEP MEDHI
University of Missouri-Kansas City, USA
[email protected]

MORTADA MEHYAR
California Institute of Technology, USA
[email protected]

MARCO MELLIA
Politecnico di Torino, Italy
[email protected]

VIVEK PRAKASH MHATRE
Purdue University, USA
[email protected]

BOBBY M. NINAN
North Carolina State University, USA
[email protected]

CATHERINE ROSENBERG
University of Waterloo, Canada
[email protected]

L.D. SERVI
MIT Lincoln Laboratory, USA
[email protected]

GANGXIANG SHEN
TRLabs and University of Alberta, Canada
[email protected]

BRADLEY R. SMITH
University of California, Santa Cruz, USA
[email protected]

DEMETRI SPANOS
California Institute of Technology, USA
[email protected]

SHEKHAR SRIVASTAVA
University of Missouri-Kansas City, USA
[email protected]

SRINIVASA RAO THIRUMALASETTY
Ciena Corporation, USA
[email protected]

SARUT VANICHPUN
University of Maryland
[email protected]

SANDRINE VATON
ENST Bretagne, France
[email protected]

EMILIO C.G. WILLE
Centro Federal de Educação Tecnológica do Paraná, Brazil
[email protected]

HAINING YU
Boston University, USA
[email protected]
Preface
Optimization techniques have been used for a long time in the planning of telecommunication networks. There is abundant work on the design of telephone and transmission networks going back at least half a century. The recent evolution towards an integrated, multi-service network based on the Internet and the IP protocol has occurred in a very different way and without much recourse to the traditional traffic engineering methods. Because the current core network is heavily over-provisioned and all services operate on a best-effort model, there has been little real need for sophisticated planning, modeling and analysis.

This situation is bound to change for a number of reasons. New services such as voice are being introduced with strong requirements for a definite Quality of Service, the cost of over-provisioning private networks is becoming an issue, and wireless access cannot be over-provisioned at all since the bandwidth is limited by the available spectrum. For all these reasons, we believe that the classical problems of telecommunication network design will become more essential in the coming years. One such problem is network design itself, which in turn rests on performance evaluation, an essential element of all design algorithms.

This trend is well illustrated by the contents of this book. We find a large number of areas where significant work is being done to address the issue of network design in the context of the new IP-based networks. The topics selected here will give the reader some idea of what is going on, but the selection is far from exhaustive, for obvious reasons of space and time.

The design of IP networks will have to take into account the requirements of many applications for guaranteed performance. This problem is examined in "Design of IP Networks with End-to-end Performance Guarantees" by Atov and Harris. Their model takes into account the proposed QoS standards and can handle multiple QoS requirements.
The numerical method, based on multicommodity flows, can handle networks of realistic size. A similar problem of network design is the subject of "Design of IP Virtual Private Networks under End-To-End QoS Constraints" by Wille, Mellia, Leonardi and Ajmone Marsan. This time, the network to be designed is a Virtual Private Network, but the model takes into account both the transport- and network-layer performances. The design itself is then done at the IP layer while offering all the required QoS guarantees at the transport layer.
The concept of resilience or reliability is gaining importance as another aspect of the Quality of Service that will have to be provided by the Internet. The notion of survivability using protection cycles, as applied to the underlying optical transport infrastructure, is examined in "Design of Protected Working Capacity Envelopes Based on p-Cycles: An Alternative Framework for Survivable Automated Lightpath Provisioning" by Shen and Grover. The model integrates both the IP and the transport layers to provide a unified design methodology for capacity allocation, service provisioning and the reserve network. A similar problem of network design, where the notion of QoS is extended to include reliability, is examined in "Network Traffic Engineering with Varied Levels of Protection in the Next Generation Internet" by Srivastava, Thirumalasetty and Medhi. In addition to traffic engineering, reliability is taken care of via a number of protection levels. The model is based on protection cycles and is solved via heuristics. The notion of resilience in transmission networks, an essential complement to IP-level techniques, is studied in "Balancing Traffic Flows in Resilient Packet Rings" by Kubat and Smith. Here, fast restoration techniques based on ring topologies are used in conjunction with Ethernet technology to provide a robust network. The model optimizes the allocation of traffic to the ring directions, both for deterministic and stochastic demands, either via an integer programming formulation or using heuristics for large cases. Pricing and routing are two issues that can be closely tied to the management of QoS. In "Game-Theoretic Resource Pricing for the Next Generation Internet", Ninan and Devetsikiotis use pricing and billing to manage the bandwidth allocated to users competing for a share of the network resources. They show applications of their model to a variety of networks.
In addition to a thorough review of existing QoS-based routing, the chapter "A New Approach to Policy-Based Routing in the Internet" by Smith and Garcia-Luna-Aceves describes a new routing algorithm based on distributed label-swapping that can support QoS more efficiently than the present techniques. Network design is based on a good estimation of the traffic matrix. This problem is examined in the context of IP networks in "Advanced Methods for the Estimation of the Origin Destination Traffic Matrix" by Vaton, Bedo and Gravey. Current tools such as SNMP allow traffic measurement only on individual links. Statistical methods are used to take into account the time variation of the traffic measured on the network links to construct an estimate of the end-to-end demands.
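The structure of the traffic-matrix estimation problem can be sketched in a few lines: SNMP link counts y relate to the unknown origin-destination demands x through the 0-1 routing matrix A, and the system is under-determined because there are far fewer links than OD pairs. The toy topology and numbers below are invented for illustration; a minimum-norm least-squares solution stands in for the more refined statistical methods of the chapter.

```python
import numpy as np

# Hypothetical 3-node chain: links L1=(1->2) and L2=(2->3); OD pairs
# (1->2), (2->3), (1->3), with the (1->3) demand routed over both links.
# SNMP gives per-link loads y; the OD demands x satisfy y = A @ x.
A = np.array([[1, 0, 1],    # link 1 carries OD(1->2) and OD(1->3)
              [0, 1, 1]])   # link 2 carries OD(2->3) and OD(1->3)

true_x = np.array([10.0, 20.0, 5.0])   # the demands we pretend not to know
y = A @ true_x                         # observed link loads: [15, 25]

# Two equations, three unknowns: this is why the chapter brings in priors
# and the temporal structure of the measurements.  A minimum-norm
# least-squares estimate is the simplest baseline:
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(A @ x_hat, y))       # the estimate reproduces the counts
```

Any x̂ consistent with the link counts is feasible; the statistical methods of the chapter are what select a plausible one among them.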
Wireless networks are growing even more rapidly than the Internet and bring up a new set of design problems. The recent development of sensor networks is a case in point, where the management of the energy budget of the batteries in the devices is a dominant issue. A thorough survey of the routing and design problems in this new context is the subject of "Energy and Cost Optimizations in Wireless Sensor Networks: A Survey" by Mhatre and Rosenberg. If they are to provide the QoS required by applications, network operators must be able to control congestion whenever it occurs in the network. An approach to congestion control based on the maximization of utility is used in "Duality-Based TCP Congestion Control with Error Analysis" by Mehyar, Spanos and Low. The model provides a unifying framework for a large number of congestion control algorithms. The authors also show that, even in the presence of imperfect information, the control will converge to a region close to the operating point that would be optimal if perfect information were available. Network design is based on decomposition techniques where the performance of output queues has to be computed separately before being recombined into a network performance measure. Thus, queueing analysis is a fundamental component of network planning. With multi-service networks, this analysis requires the solution of large birth-death systems. An example of an efficient solution algorithm is given in "Fast Algorithmic Solutions to Multi-Dimensional Birth-Death Processes with Applications to Telecommunication Systems" by Servi, where a new class of solution algorithms is presented that can be applied to systems of arbitrary dimensions. In "A New Paradigm for On-Line Management of Communication Networks with Multiplicative Feedback Control" by Yu and Cassandras, stochastic flow models are used for control. The model obviates the need to solve difficult queueing analysis problems by using fluid models.
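The duality-based view of congestion control mentioned above can be illustrated with a minimal primal-dual iteration in the spirit of that framework (this sketch is not taken from the chapter; the capacity, weights and step size are invented): each source picks the rate maximising its utility minus the path price, while each link raises its price exactly when it is overloaded.

```python
# Two flows share one link of capacity c.  With log utilities
# U_i(x) = w_i * log(x), the rate maximising U_i(x) - q*x is x_i = w_i/q,
# where q is the price of the (single-link) path.
c = 10.0
w = [1.0, 2.0]
p = 1.0                       # link price (dual variable), arbitrary start
gamma = 0.01                  # price step size
for _ in range(20000):
    x = [wi / p for wi in w]            # sources react to the current price
    y = sum(x)                          # aggregate load on the link
    p = max(1e-6, p + gamma * (y - c))  # price rises iff link is overloaded

# At equilibrium the link is exactly full and the rates split in
# proportion to the weights.
```

The fixed point here is p = (w1 + w2)/c, so the rates converge to w_i·c/(w1 + w2); the chapter's contribution is the error analysis when such updates are driven by imperfect measurements.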
It provides an estimation of the gradients of the performance, which in turn can be used to adjust the flow rates via a multiplicative feedback algorithm. The Internet has spawned a number of new applications with previously unknown operating characteristics. Caching techniques are often used to reduce the delay in accessing databases. Many such techniques have been designed based on empirical knowledge. The validity of this knowledge is formalized and evaluated in "Comparing Locality of Reference - Some Folk Theorems for the Miss Rate and the Output of Caches" by Makowski and Vanichpun, where the authors model the operation of caches and examine the efficiency of various cache replacement policies.

The wide range of topics covered in this book attests to the diversity of problems that must be faced if the Next Generation IP Networks are to meet their expected performance. This is a rich field where optimization techniques can provide significant gains.

ANDRÉ GIRARD
BRUNILDE SANSÒ
FELISA VÁZQUEZ-ABAD
Chapter 1

DESIGN OF IP NETWORKS WITH END-TO-END PERFORMANCE GUARANTEES

Irena Atov
Richard J. Harris
Abstract

In this paper, we examine the issues that surround IP network design with quality of service (QoS) guarantees and propose a new network design methodology. The proposed network design model takes account of the new QoS technologies (i.e., DiffServ/MPLS) and allows for multiple delay constraints so that guaranteed performance can be achieved for each of the traffic classes. After discussing the most crucial planning issues that must be addressed when QoS mechanisms are used in an IP-based network, a non-linear multicommodity optimisation problem is formulated and heuristics for its approximate solution are described. The network design model is evaluated in terms of accuracy and scalability for each of the main components that it employs. The computational results for each of the building blocks demonstrate that realistic-size problems can be solved with the proposed method.

1. Introduction
The development of various Internet technologies, such as DiffServ and MPLS, has enabled support for various traffic classes with different QoS requirements on an integrated IP network (Wang, 2001). Now, with the inclusion of QoS considerations, the paradigm for network design and planning must change to include multiple delay constraints, so that differentiated performance can be achieved for the various traffic classes. The delay QoS metric is additive (i.e., the QoS along a path is the sum of the QoS of its constituent links) and such a QoS class may cover a delay bound or the random variation of delay (jitter).

Designing a network to meet varying QoS constraints has traditionally been difficult. Assuming that the network topology is given, the objective of network design is to determine link capacities, combined with traffic routing, such that the total network cost is minimised while meeting the demands and QoS constraints for each of the traffic classes. The introduction of Asynchronous Transfer Mode (ATM) broadband networks in the late 80s triggered a lot of research in this area. However, the network design models developed for ATM use a loss-based approach. That is, network planning and design is carried out so that the blocking (loss) probabilities of the various types of traffic (classes) remain below a specified threshold; see, for example, Liang and Ross (1999); Puah (1999). These models exploit the connection-oriented nature of ATM and transform the multi-level traffic problem into a multirate circuit-switched problem by using the notion of equivalent bandwidth (Guerin et al., 1991). However, these multirate loss models are not suitable tools for the design of IP-based broadband networks. In IP-based broadband networks (i.e., DiffServ/MPLS), due to the concept of class aggregates and static resource reservation, network design is primarily concerned with developing performance guarantees in terms of packet delay or packet delay variation for the various service types (i.e., class aggregates). In addition, the conventional methods for capacity planning of IP networks are limited, in that they only consider best-effort service, or else a single delay constraint for all traffic (Gavish and Neuman, 1989). Thus, there is a need for new methods for capacity planning and design that take account of the technologies and mechanisms that enable QoS in IP networks and, thus, allow for multiple delay constraints, so that guaranteed performance can be achieved for each of the traffic classes. This paper investigates the issues surrounding IP network design with QoS and proposes a new design methodology applicable to such networks.
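The notion of equivalent bandwidth referred to above admits a well-known Gaussian approximation due to Guerin et al. (1991): the capacity needed so that the aggregate demand exceeds it only with a small probability ε is roughly the mean rate plus a multiple of its standard deviation. The numbers below are illustrative, not from this chapter.

```python
import math

def equivalent_bandwidth(mean_rate, std_dev, eps):
    """Gaussian approximation of equivalent bandwidth (Guerin et al., 1991):
    C = m + alpha * sigma with alpha = sqrt(-2*ln(eps) - ln(2*pi)),
    so that aggregate demand exceeds C with probability about eps."""
    alpha = math.sqrt(-2.0 * math.log(eps) - math.log(2.0 * math.pi))
    return mean_rate + alpha * std_dev

# e.g. 100 Mb/s mean aggregate rate, 20 Mb/s standard deviation,
# overflow probability 1e-6 gives alpha ~ 5.08, i.e. C ~ 201.6 Mb/s:
c = equivalent_bandwidth(100.0, 20.0, 1e-6)
```

Tightening ε increases the required capacity, which is the mechanism by which the multirate loss models cited above trade capacity against blocking probability.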
Since the proposed design model deals in a unified way with both the flow and the capacity assignment issues, we shall refer to it in the following as a capacity and flow assignment (CFA) problem, as it is of the same generic form as the early ARPANET models identified by Kleinrock (1975). There are major challenges involved in the development of an IP network design methodology that supports guaranteed services on an end-to-end basis. The technologies that provide QoS introduce new constraints and require that certain features be addressed by any generic design methodology. In Section 2, we first discuss the most crucial planning issues that must be addressed when QoS mechanisms are used in an IP-based network. Then we outline a network model and a cost model, which define the set of underlying assumptions used in the development of the proposed IP network design model with end-to-end performance guarantees. Section 3 provides the notation and formulation of the mathematical programming problem. The CFA problem is formulated as a non-linear multicommodity optimisation problem, which is hard to solve to optimality for practical-sized networks. To be able to solve large problem instances efficiently, we develop a framework for the solution of the network design problem which employs a heuristic approach. Section 4 describes the disaggregation of the problem into simpler optimisation problems (components) that form the basis for a heuristic solution to the original problem. Finally, Sections 5 and 6 summarize computational results and provide concluding remarks.
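For orientation, the single-class ancestor of the CFA problem in Kleinrock's ARPANET work has a closed-form solution: with linear link costs d_i per unit capacity and a single average-delay constraint on a network of M/M/1 links, the optimal capacities follow the classical square-root assignment. This sketch, with invented numbers, illustrates that special case only; the multi-class, multi-constraint problem formulated in Section 3 has no such closed form.

```python
import math

def sqrt_capacity_assignment(lam, d, gamma, T_max):
    """Kleinrock's square-root capacity assignment: minimise sum(d_i*C_i)
    over link capacities C_i subject to the M/M/1 network average delay
    (1/gamma) * sum(lam_i / (C_i - lam_i)) <= T_max, where lam_i are the
    link flows and gamma is the total external traffic."""
    S = sum(math.sqrt(li * di) for li, di in zip(lam, d))
    return [li + math.sqrt(li / di) * S / (gamma * T_max)
            for li, di in zip(lam, d)]

lam = [8.0, 3.0, 5.0]       # per-link flows (packets/s), illustrative
d = [1.0, 2.0, 1.5]         # per-link unit capacity costs, illustrative
gamma, T_max = 10.0, 0.5    # total offered traffic, delay budget (s)
C = sqrt_capacity_assignment(lam, d, gamma, T_max)

# At the optimum the delay budget is met with equality:
T = sum(li / (ci - li) for li, ci in zip(lam, C)) / gamma
```

Each link gets its flow plus excess capacity proportional to sqrt(lam_i/d_i); heavily loaded, cheap links receive the largest share of the slack.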
2. QoS mechanisms and implications for network planning
QoS mechanisms employed in IP-based core networks can be broadly classified into three main categories: traffic control, resource management and traffic engineering. In the following, we briefly overview these categories and discuss the implications that have to be considered in the planning process.

Traffic control. Traffic control encompasses all mechanisms for the handling and forwarding of packets within the edge and core routers of the network. These mechanisms include traffic classification and aggregation, scheduling, and active queue management. Typically, based on the type of scheduling mechanism deployed at the routers, which can range from simple priority queueing to a bandwidth allocation scheme (WFQ, WF2Q, WRR; Floyd and Jacobson, 1995), the network provider can offer two types of service quality. The first type is prioritised service, where a certain class of traffic receives priority over the others as it is processed and routed over the network. The second type is guaranteed service, where the traffic classes are guaranteed a certain share of resources, for example bandwidth, or a given performance level, e.g., delay. It is the latter meaning that we consider in this paper, as it represents the more common implementation of DiffServ in current service offerings. The use of a bandwidth-allocation type of QoS queueing mechanism implies that fixed bandwidth partitioning, i.e., a predictable capacity assignment to individual classes, can be assumed in the network design process. Since the DiffServ technology is based on the aggregation of individual flows into classes at the ingress of the network and on the provisioning of QoS to the service class instead of a single flow, it is important to model and characterise the external and the internal (or internode) traffic flows on a per-class basis in order to plan and manage these networks to meet the performance objectives (QoS) required by the various traffic classes.
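The fixed bandwidth partitioning implied by bandwidth-allocation scheduling can be made concrete as follows: each class sees a dedicated queue whose service rate is its WFQ share of the link, and can then be analysed as an independent single-server queue. The class names, weights and rates below are invented for illustration; an M/M/1 mean-delay formula stands in for whatever queueing model the planner adopts.

```python
# Fixed-partitioning view of class-based WFQ for planning purposes:
# class i is guaranteed the fraction phi_i of the link rate and is
# modelled as an independent FIFO M/M/1 queue at that rate.
link_capacity = 1000.0                  # packets/s, illustrative
weights = {"voice": 0.3, "video": 0.4, "best_effort": 0.3}
arrivals = {"voice": 200.0, "video": 300.0, "best_effort": 250.0}

mean_delay = {}
for cls, phi in weights.items():
    c = phi * link_capacity             # class's guaranteed service rate
    lam = arrivals[cls]
    assert lam < c, f"class {cls} overloaded under its WFQ share"
    mean_delay[cls] = 1.0 / (c - lam)   # M/M/1 mean sojourn time

# Ignoring the capacity that other classes leave unused is pessimistic:
# a work-conserving scheduler can only do better than this bound.
```

This is exactly the conservatism noted later in Section 2.1: the partitioned model disregards statistical-multiplexing gains, so the designed network meets or exceeds the planned QoS.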
Resource management. Resource management refers to mechanisms that manage access to the network resources in order to prevent service degradation due to traffic overload. DiffServ employs a static resource control approach, in which the network operator establishes several traffic classes and allocates an adequate level of resources (in terms of bandwidth and buffer space) to each along the respective data paths through the network. In addition, traffic conditioning is implemented at the network edge in order to control the amount of traffic entering the network. Thus, resource management functions require that the network design process determine the amount of bandwidth that needs to be allocated to each traffic class on every link in the network. Traffic conditioning functions have to be accounted for in the traffic modelling of the class-based traffic aggregates offered to the network.

Traffic engineering. Traffic engineering is the commonly used term for the process of network performance optimisation during network operation. Most commonly, this approach involves routing optimisation. It can be based on standard routing protocols, such as OSPF, or on a more advanced protocol, such as MPLS. The type of routing protocol plays an important role in network design. In the first case, routing optimisation can be achieved by allowing link metrics to be customized (Fortz and Thorup, 2000). The latter case, on the other hand, allows for explicit routes, providing maximum flexibility in building a specific path through the network based on differentiated QoS. We consider an MPLS routing mechanism deployed in an IP-based core network and, thus, in the design process we are concerned with the problem of determining static paths based on the differentiated QoS requirements of the classes.
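The contrast between the two routing regimes above can be sketched briefly: under OSPF-style routing, paths are whatever a shortest-path computation over the (customizable) link metrics produces, whereas an MPLS label-switched path can pin an arbitrary explicit route. The toy topology and weights below are invented.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over administratively set link metrics: the OSPF-style
    routing that the text contrasts with MPLS explicit routes."""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:                    # walk predecessors back to src
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[dst]

# Toy topology: with these metrics OSPF picks A-B-D; an MPLS LSP could
# instead pin the explicit route A-C-D regardless of the weights.
g = {"A": [("B", 1.0), ("C", 4.0)], "B": [("D", 2.0)], "C": [("D", 1.0)]}
print(shortest_path(g, "A", "D"))   # (['A', 'B', 'D'], 3.0)
```

Metric tuning can only steer all traffic between an OD pair along the resulting shortest paths; explicit routes are what make per-class path selection, and hence the static-path design problem considered here, possible.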
2.1
Network model
We consider a network design methodology for IP core networks based on DiffServ/MPLS technologies, and therefore our main modelling assumptions comprise (1) a guaranteed service model for the classes (based on a prescribed performance level) and (2) a fixed routing mechanism for the traffic classes in the network. Fixed (or static) routing policies are implemented by providing each origin-destination (OD) pair in the network with an ordered set of routes, and we concentrate here on the choice of the primary route, i.e., the recommended one in the candidate set. The network underlying the design problem is modelled as a graph G(V,E), where |V| = n is the number of nodes and |E| = m is the number of links (edges) in the network. Each link consists of a separate queueing facility for each class of traffic and a scheduler. In the
1 Design of IP Networks with End-to-End Performance Guarantees
standard service offerings, class-based WFQ has been used to enable guaranteed services. It enables the different delay criteria for each class to be met, by allocating a specified proportion of the service capacity to each class. For analysis of this system, each queueing facility can be treated as an independent FIFO queue, with a fixed capacity equal to its allocation, which enables us to confine the design problem to a network of single-server queues for each class of traffic, respectively. Note that, by modelling the network in this way, we effectively disregard the gains from the statistical multiplexing associated with the work-conserving scheduling mechanisms. Since this is a model for a network planning tool, this approximation is acceptable, as it would imply that the performance of the designed network would be as good or better than the QoS criteria required for the traffic classes. The purpose of modelling the network in the above fashion is to enable traffic to be categorized into classes based on their sensitivity to the delay performance of the network. In our model, we consider the delay QoS constraints for a traffic class to be specified in terms of the mean delay and the variance of the delay (or jitter) following ITU-T G.1010 Recommendation. Depending on the type of delay QoS constraint, the following sets of delay-sensitive classes of traffic are defined: (1) Delay sensitive class (DSC) - contains classes of traffic sensitive to delay (i.e., their mean delays are required to be less than or equal to their specified end-to-end delay limits), (2) Jitter sensitive class (JSC) - contains classes of traffic sensitive to variations in the delay (i.e., their delay variances are required to be less than or equal to their specified end-to-end delay variance limits), and (3) Delay and jitter sensitive class (DJSC) - contains classes of traffic that are sensitive to both delay and its variation.
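To make the per-class queueing model concrete: if each class's queue is approximated as an independent M/M/1 queue served at its allocated rate, the mean delay and delay variance that enter the DSC/JSC/DJSC constraints follow in closed form. The M/M/1 assumption is ours, chosen purely for illustration; the chapter's dimensioning models (Section 4.2) are more general.

```python
def mm1_delay_stats(arrival_rate, service_rate):
    """Mean sojourn time and its variance for an M/M/1 queue.

    Stands in for one per-class FIFO queue with a fixed bandwidth
    allocation (service_rate). For M/M/1 the sojourn time is
    exponential with rate (mu - lambda), hence:
      mean     = 1 / (mu - lambda)
      variance = 1 / (mu - lambda)^2   (a simple jitter proxy)
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    gap = service_rate - arrival_rate
    return 1.0 / gap, 1.0 / gap ** 2
```

A DSC class would then be checked against the first value, a JSC class against the second, and a DJSC class against both.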
2.2
Performance dependent cost model
We consider two distinct types of costs associated with the design of a multiservice IP network that provides guaranteed services end-to-end: (1) queueing costs and (2) capacity costs. The first type is the cost of provisioning a certain delay and jitter queueing performance for the delay-sensitive traffic classes in the network. The second type is the setup cost of a link with a specific capacity. For multiservice IP networks, the dominant factor in the cost of provisioning a particular delay-sensitive service class depends on the performance requirement (QoS) of that class. For example, the more stringent the delay requirement for a traffic class, the higher the cost of provisioning this service class. In order to model the queueing costs associated with the design of such a network, we assume a QoS
framework where each link l ∈ E can offer several delay (d_l) and jitter (v_l) QoS guarantees, each associated with a different cost Q_l(d_l) (Q_l(v_l)). Specifically, in this framework, a performance dependent cost function is associated with each link in the network. Furthermore, these performance dependent cost functions, i.e., the cost/delay and cost/jitter functions {Q_l(d), Q_l(v)}_{l∈E}, are assumed to be non-increasing and of a general integer type. Such cost functions are assumed because they better fit practical purposes, as discussed in Raz and Shavitt (2000) and references therein. Furthermore, considering that, in practice, the capacities of the links in the network are restricted to a discrete set of values, we assume discrete capacity costs associated with the links. Let I_l denote the index set of line types available for link l, l ∈ E. The capacity and the cost [$/month] of line type k, k ∈ I_l, are denoted by θ_l^k and γ_l^k, respectively.
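A non-increasing, integer-valued cost/delay function Q_l(d) can be represented as a step function over delay breakpoints. The sketch below (breakpoint values invented for illustration) returns the cost of guaranteeing a delay of at most d on a link; delays below the first breakpoint are treated as infeasible.

```python
import bisect

def make_link_cost(breakpoints):
    """Build a non-increasing integer cost/delay function Q_l(d).

    breakpoints: sorted list of (delay, cost) pairs with cost decreasing
    as the tolerated delay grows. A delay tighter than the first
    breakpoint cannot be offered (infinite cost).
    """
    delays = [d for d, _ in breakpoints]
    costs = [c for _, c in breakpoints]

    def Q(d):
        # index of the largest breakpoint delay <= d
        i = bisect.bisect_right(delays, d) - 1
        return float("inf") if i < 0 else costs[i]

    return Q

# e.g. guaranteeing <= 1 ms costs 10 units, <= 2 ms costs 6, and so on
Q = make_link_cost([(1, 10), (2, 6), (5, 3), (10, 1)])
```

The same construction serves for the cost/jitter functions Q_l(v), with jitter breakpoints in place of delay breakpoints.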
3.
Problem formulation
In order to be able to offer performance guarantees (in terms of delay), we build our model around the consideration that switches support class-based WFQ queueing and that routing constraints are imposed by the MPLS routing mechanism. We incorporate these features into an efficient network design model which includes primary decisions for determining class-based bandwidth allocations on the links, total link capacities and how traffic of each class is to be routed through the network. Note that determining bandwidth allocations for the various traffic classes on the links is an important design issue, as standard WFQ service disciplines can only provide tight end-to-end delay guarantees for the classes if an adequate level of bandwidth is allocated along their respective data paths through the network. Our modelling approach exploits a novel framework in which capacity provisioning is based on the partitioning of the end-to-end QoS constraints of each class-based traffic demand into local QoS constraints at each link in the network. In this context, a link delay (jitter) partition for a given delay-sensitive class represents the maximum allowable queueing and propagation delay on a link, in terms of mean delay (variance of the delay), that can be tolerated by the class, so that the end-to-end delay (jitter) constraints of all traffic flows of the same class traversing that link are still satisfied.¹ We shall use a path-based formulation for the underlying multicommodity network flow model. Our experience shows the path-based model

¹The delay is an additive QoS metric and allows the end-to-end delay to be partitioned across the various links in the network.
is both simpler to solve for realistic networks and explicitly allows service-based routing constraints. Let Π be the set of all OD pairs (or commodities) with a demand between them. Each OD pair (u,v) ∈ Π may have up to C different classes of traffic, of which C̃ = C − 1 are delay-sensitive classes and the remaining one is the best-effort class. The set of candidate routes for OD pair (u,v) of class c is denoted by R_uv^c, and the j-th route from this set is denoted by r_uv^cj, j = 1,...,|R|, with |R| representing the cardinality of this set. The end-to-end delay (jitter) constraint for class c of OD pair (u,v) is denoted by D_uv^c (V_uv^c). Λ^c = {Λ_uv^c}_{(u,v)∈Π} represents the traffic demand matrix for class c, c ∈ C. An entry Λ_uv^c in this matrix is itself a vector that defines the traffic demand for OD pair (u,v) in terms of a specific set of parameters corresponding to the selected traffic descriptors for the dimensioning process, which is discussed in Section 4.2. The sets of decision variables are defined as follows: d_cl and v_cl are integer decision variables representing a delay partition and a jitter partition for traffic of class c on link l, respectively; x_uv^cj is a binary decision variable that has value 1 if the j-th route is chosen from the given set of routes R_uv^c for OD pair (u,v) of class c, and 0 otherwise; y_l^k is a binary decision variable which has value 1 if line type k is assigned to link l and 0 otherwise. We introduce a weight variable δ_c to provide the user with flexibility in specifying the type of the considered delay-sensitive class with respect to delay and jitter. It is defined as 0 ≤ δ_c ≤ 1, having value 1 if the considered class is of type DSC, value 0 if the considered class is of type JSC, and a value between 0 and 1 for the DJSC type class. The symbol a_lj^uv is an indicator parameter with value 1 if link l lies on the j-th route from the set R_uv^c and 0 otherwise.
r^c = {r_uv^c}_{(u,v)∈Π} denotes the routing matrix for class c, c ∈ C. The primary route for OD pair (u,v), r_uv^c, is the set of links that comprise the route from the set R_uv^c for which x_uv^cj = 1. The corresponding vector of the delay (jitter) partitions for class c on all links is denoted by d^c (v^c). The propagation delay of link l is denoted by d_l^prop. f^c = {f_cl}_{l∈E} represents a vector of link flows for class c, c ∈ C. The vector f^c is determined from the traffic demand and routing matrices (Λ^c, r^c) by applying an appropriate traffic-based decomposition model. In network design procedures, a maximum link utilisation ρ is often specified at input and must also be satisfied. These constraints allow variations in the load not accounted for at the network planning level to occur without affecting the performance of the network.
The actual capacity required for a link l, l ∈ E, is denoted by µ_l. Due to the fixed bandwidth partitioning assumption, the required capacity on a link is simply obtained as the sum of the individual bandwidth allocations for the classes on the given link, which we denote by b_cl. The bandwidth required for a delay-sensitive class on a link can be determined from the traffic descriptors of the total class flow on the link and its link QoS constraints by applying an appropriate link dimensioning model, i.e., b_cl = F(f_cl, d_cl, v_cl). The best-effort service class does not require any service guarantees, and therefore its required bandwidth on a link is readily determined from the mean rate of the best-effort link flow, i.e., b_cl = F(f_cl). Finally, θ = {θ_l^k}_{l∈E} represents the vector of link capacities for which y_l^k = 1. Accordingly, the considered CFA problem can be stated as follows:

Problem CFA: Given a network G(V,E), link cost/delay and cost/jitter functions {Q_l(d), Q_l(v)}_{l∈E}, a set of available link capacities {θ_l^k}_{k∈I_l, l∈E}, their corresponding costs {γ_l^k}_{k∈I_l, l∈E}, a set of candidate routes {R_uv^c}_{c∈C, (u,v)∈Π}, end-to-end delay and jitter QoS requirements {D_uv^c, V_uv^c}_{c∈C̃, (u,v)∈Π} and traffic demands {Λ^c}_{c∈C}. Find a set of primary routes {r^c}_{c∈C}, a set of delay and jitter allocations on the links {d^c}_{c∈C̃}, {v^c}_{c∈C̃} and a set of link capacities θ, such that the network cost is minimised while all OD pair traffic demands and QoS constraints for the traffic classes are satisfied. Mathematically, the optimisation problem can be formulated as finding the x_uv^cj, y_l^k, d_cl, v_cl values that satisfy the following:
Minimise:

Σ_{c∈C̃} Σ_{l∈E} n_cl δ_c Q_l(d_cl) + Σ_{c∈C̃} Σ_{l∈E} n_cl (1 − δ_c) Q_l(v_cl) + Σ_{l∈E} Σ_{k∈I_l} γ_l^k y_l^k,

where n_cl = Σ_{(u,v)∈Π} Σ_{j∈R_uv^c} a_lj^uv x_uv^cj is the number of class-c flows traversing link l.

Subject to:

Σ_{l∈E} a_lj^uv d_cl ≤ x_uv^cj D_uv^c + (1 − x_uv^cj) M   ∀j ∈ R_uv^c, ∀(u,v) ∈ Π, ∀c ∈ C̃   (1.1a)

Σ_{l∈E} a_lj^uv v_cl ≤ x_uv^cj V_uv^c + (1 − x_uv^cj) M   ∀j ∈ R_uv^c, ∀(u,v) ∈ Π, ∀c ∈ C̃   (1.1b)

Σ_{j∈R_uv^c} x_uv^cj = 1   ∀(u,v) ∈ Π, ∀c ∈ C   (1.1c)

Σ_{k∈I_l} y_l^k = 1   ∀l ∈ E   (1.1d)

Σ_{c∈C} b_cl ≤ ρ Σ_{k∈I_l} θ_l^k y_l^k   ∀l ∈ E   (1.1e)

x_uv^cj ∈ {0,1}   ∀j ∈ R_uv^c, ∀(u,v) ∈ Π, ∀c ∈ C   (1.1f)

y_l^k ∈ {0,1}   ∀k ∈ I_l, ∀l ∈ E   (1.1g)

d_cl ≥ d_l^prop   ∀l ∈ E, ∀c ∈ C̃   (1.1h)

v_cl ≥ 0   ∀l ∈ E, ∀c ∈ C̃   (1.1i)
In the above formulation, the network cost represents the sum of all link costs, where the total cost of a link consists of two components: (1) the queueing cost and (2) the cost resulting from the total capacity required for the link. The queueing cost of a link is the sum of the costs incurred by the delay-sensitive classes only, as the best-effort class does not require any performance guarantees. The cost incurred by a specific delay-sensitive class on a link consists of a weighted sum of the delay and jitter costs (depending on the type of class considered), and is a function of the delay and jitter partitions on that link and the number of flows of the considered class that traverse the link. Accordingly, the first two terms in the objective function capture the total queueing cost in the network, while the third term captures the total cost associated with the capacities of the links in the network. The sets of constraints (1.1a) and (1.1b) guarantee that the end-to-end delay and jitter requirements of the delay-sensitive traffic classes are satisfied. The constant M is chosen such that if the j-th route from the set R_uv^c is not selected (i.e., x_uv^cj = 0), then the sum of the delays (jitter) can be unbounded. This can be achieved by setting M to a sufficiently large value, such as the sum of all end-to-end delay requirements, i.e., M = Σ_{(u,v)∈Π} D_uv^c. The constraints in (1.1c) and (1.1d) guarantee that only one route is chosen for each OD pair and only one line type for each link, respectively. The constraints in (1.1e) guarantee the feasibility of the flow on each link in terms of the capacity assigned to it. Finally, the constraints given in (1.1f)-(1.1i) define the solution space for the decision variables.
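The role of the big-M constant can be checked numerically. The sketch below (toy numbers, our own function names) shows that with M set to the sum of all end-to-end delay bounds, a constraint of type (1.1a) is vacuous on a route that is not selected and binding on the chosen route:

```python
def big_m(delay_bounds):
    """A safe big-M: the sum of all end-to-end delay requirements."""
    return sum(delay_bounds.values())

def qos_constraint_holds(route_delay_sum, chosen, bound, M):
    """One instance of (1.1a).

    route_delay_sum: sum of link delay partitions along route j
    chosen: the binary route-selection variable x (0 or 1)
    bound:  the end-to-end requirement D for this OD pair and class
    """
    return route_delay_sum <= chosen * bound + (1 - chosen) * M

# two OD pairs with end-to-end bounds 10 and 25 (illustrative values)
M = big_m({("u", "v"): 10.0, ("u", "w"): 25.0})
```

Any M at least as large as the worst feasible route delay works; the sum of all bounds is simply a convenient, provably sufficient choice.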
The formulated mathematical program is a non-linear multicommodity flow problem in which the complexity arises from the nature of the decision variables that describe the possible solutions (i.e., x_uv^cj, y_l^k are 0-1, or binary, variables and d_cl, v_cl are integer variables), as well as from the non-linearity of the associated set of constraints. The problem is defined with |E||I| binary line type selectors
(y_l^k), |C̃||E| integer delay partition variables as well as |C̃||E| jitter partition variables (d_cl, v_cl), and |C||Π||R| binary flow variables (x_uv^cj). Since each flow associated with a delay-sensitive class defines a forcing constraint for the end-to-end QoS requirement, the number of constraints in each of (1.1a) and (1.1b) is |C̃||Π||R|. The set of constraints (1.1c) is defined for each service class and OD pair, thus resulting in |C||Π| constraints. The number of constraints of each of (1.1d) and (1.1e) is determined by the number of links in the network, |E|. In addition, there is a total of 2|C̃||E| + |C||Π||R| + |E||I| defining constraints (1.1f)-(1.1i) in the model. It can be seen that the considered problem involves the interaction of three significant factors: the flow assignment (i.e., routing), the maximum delay and jitter allocation to each link, and the capacity allocation on the links. Determining all these factors, i.e., all the decision variables (x_uv^cj, y_l^k, d_cl, v_cl), simultaneously for a network of any reasonable size represents a very challenging task. In fact, this problem is NP-complete, as it contains as special cases problems that are known to be NP-complete. Namely, the problem studied in Holmberg (2000) represents a multicommodity minimal cost network flow problem with fixed charges on the links and is well known to be NP-complete. In this problem, there are two kinds of costs associated with the links in the network. The routing costs increase linearly with the amount of flow, and a flow cost per unit of commodity on a link is predefined. Additionally, a fixed charge is incurred whenever a link is used (by any amount of flow). Furthermore, each link has a limited capacity on the total flow. Two sets of variables are introduced in this model: continuous flow variables, which reflect routing decisions for each commodity, and binary design variables, which define the set of links that are used in the network.
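The variable and constraint counts just derived can be tabulated for a given instance. A small helper (the function and argument names are ours) assuming a uniform candidate-route set size |R| and a uniform line-type set size |I| across links:

```python
def cfa_problem_size(n_links, n_line_types, n_classes, n_od_pairs, n_routes):
    """Variable/constraint counts for the CFA formulation.

    n_classes is the total number of classes C (one of them best-effort),
    so C_tilde = C - 1 classes are delay-sensitive.
    """
    c_tilde = n_classes - 1
    variables = {
        "line_type_selectors": n_links * n_line_types,         # y_l^k
        "delay_partitions": c_tilde * n_links,                 # d_cl
        "jitter_partitions": c_tilde * n_links,                # v_cl
        "route_selectors": n_classes * n_od_pairs * n_routes,  # x_uv^cj
    }
    constraints = {
        "qos_1a_1b": 2 * c_tilde * n_od_pairs * n_routes,
        "single_route_1c": n_classes * n_od_pairs,
        "single_line_type_1d": n_links,
        "capacity_1e": n_links,
        "defining_1f_1i": 2 * c_tilde * n_links
                          + n_classes * n_od_pairs * n_routes
                          + n_links * n_line_types,
    }
    return variables, constraints
```

Even modest inputs make the combinatorial growth visible, which motivates the decomposition heuristic of Section 4.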
The multicommodity capacitated versions of this general model, such as the model considered in Gendron et al. (1999), are thus also NP-complete. The capacitated version of the model allows, whenever a link is used, additional facilities with fixed capacities to be installed on it, if needed. In this case, the design variables are of integer type, representing the number of facilities installed on the links. Now, if we reduce our model to the case of a single service class, it becomes a generalised version of the model of Gendron et al. (1999), where the cost per unit flow on a link is a function of a delay partition (and/or jitter partition) variable, which further adds to the complexity. Furthermore, if we consider only the first term of the objective function, we are left with two sets of decision variables defining the delay partitions and flows in the network. The problem becomes that of determining the routes and delay partitions on the links given the cost/delay functions, such that the end-to-end delay requirements are satisfied. This combined problem of QoS partitioning and routing, when the cost/delay functions associated with the links are of general integer type, has been shown to be NP-complete even for the case of single-commodity flows (i.e., a single OD pair) (Lorenz et al., 2000).
4.
Solution procedure and heuristics
In this section, we outline our proposed framework for the solution of the CFA problem. When devising an effective algorithm for solving problems of this sort, it might be best to exercise some careful "strategic" planning. From experience, we know that a general-purpose integer programming code will fail in all but the easiest problem instances. Thus, we are interested in a fast method that can generate near-optimal solutions for large problems. From the discussion in the previous section, it is apparent that for a network of any reasonable size, determining all the decision factors simultaneously (i.e., routing information, link delay and jitter allocations, link capacities) represents a nearly impossible problem. Therefore, we choose a heuristic approach for solving the CFA problem, in which these factors are considered sequentially. Specifically, a heuristic solution method to the original problem is devised by performing disaggregation of the problem into the following three (simpler) optimisation problems (Figure 1.1):
Optimal QoS Partition and Routing (OPQR-G) Problem. First, for each delay-sensitive class, determine the primary paths between the OD pairs and the partitions of the global end-to-end QoS constraints into local QoS constraints (on the links), by solving a combined QoS partition and routing optimisation problem.

Capacity Allocation (CA) Problem. Then, based on the local delay QoS constraints and the total amount of traffic for each class routed on each link, determine bandwidth allocations for the delay-sensitive traffic classes, as well as the total bandwidth (capacities) of the links, by solving a capacity allocation problem.

Cost Minimisation (CM) Problem. Finally, once the links are sized, perform a cost optimisation to account for the modularity of the link capacities.

In this approach, we first interpolate the discrete costs with continuous costs and thus simplify the model in favour of one with fewer integer decision variables. That is, we first determine the flow and QoS (delay, jitter) partition variables for each delay-sensitive traffic class, by
[Figure 1.1 depicts the capacity planning framework: Step 1 determines the primary routes and link QoS partitions for all classes by running the OPQR-G algorithm once per class; Step 2 determines the class-based bandwidth allocations and total link capacities with the CA algorithm; Step 3 performs an optimisation to account for the discrete set of link capacities with the CM algorithm.]
Figure 1.1. Capacity planning framework
applying an algorithm for the solution of the OPQR-G problem. This provides the basis for the solution of the CA problem, i.e., for determining the required bandwidth for the delay-sensitive traffic classes on the links. The continuous link capacities, obtained as a linear sum of the bandwidth allocations for the delay-sensitive classes, are then rounded up to the next (feasible) discrete values of the link capacities. Once the links are sized for capacity (based on application of the CA algorithm), a cost optimisation is performed, where we concentrate on a flow assignment optimisation problem for the best-effort (BE) service class only. That is, we want to achieve a minimum cost placement of all OD pair BE traffic demands on the existing capacitated links and/or to install additional link capacities in the network if required. The optimisation subproblems, although more manageable than the original problem, also pose significant modelling and algorithmic challenges. Each of the main steps and algorithms employed in the proposed framework is discussed in the following subsections.
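The hand-off from continuous capacities to discrete line types can be sketched as follows: sum the per-class allocations on a link, inflate by the maximum utilisation ρ, and pick the cheapest line type that is large enough. The line-type data and function names are illustrative, not the chapter's.

```python
def select_line_type(class_bandwidths, line_types, max_util):
    """Round a continuous link capacity up to the cheapest feasible line type.

    class_bandwidths: per-class allocations b_cl on the link
    line_types: list of (capacity, monthly_cost) options for the link
    max_util: maximum allowed link utilisation rho (0 < rho <= 1)
    Returns (capacity, cost) of the chosen line type.
    """
    required = sum(class_bandwidths) / max_util
    feasible = [(cost, cap) for cap, cost in line_types if cap >= required]
    if not feasible:
        raise ValueError("no line type large enough for this link")
    cost, cap = min(feasible)   # cheapest feasible option
    return cap, cost
```

Note that rounding each link independently is itself a heuristic; the subsequent CM step exists precisely because these local roundings need not be jointly cost-optimal.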
4.1
Problem OPQR-G
The algorithm for the solution of the OPQR-G (Optimal QoS Partition and Routing in General topology networks) problem, when applied to each class of traffic independently, will determine the required input for the capacity allocation problem; that is, the primary routes between the OD pairs and the QoS partitions on the links for each class of traffic, respectively. The combined QoS partition and routing optimisation
problem for the DSC and JSC classes will be solved by considering delay and jitter as the QoS metric, respectively. For the DJSC class, the QoS requirement is defined by a combination of two QoS metrics, i.e., a delay and a jitter bound; thus, we need to find the partitions of the two metrics on the links simultaneously. This is achieved by applying the following two-step heuristic approach: (1) first solve the OPQR-G problem by considering delay as the QoS metric, to determine the routes and the delay partitions on the links for the class, and subsequently, (2) given the routes, solve the problem of optimal QoS partitioning only (Problem OPQ-G) to determine the jitter partitions on the links. The combined routing and QoS partitioning optimisation problem has been addressed in the literature in the context of unicast connections and multicast trees only; see Lorenz et al. (2000); Lorenz and Orda (2002) and references therein. In addition, several other studies have considered the QoS partitioning part of the problem only, where the routing information is given at input (Ibaraki and Katoh, 1988; Monma et al., 1990). Again, these studies solve the problem for the case of unicast connections and multicast trees only. Moreover, they only provide solutions for convex cost functions. We consider the more general case of a multicommodity network flow problem. In addition, it is assumed that the performance dependent cost functions are non-increasing and of a general integer type. Problem OPQR-G needs to be solved for each delay-sensitive traffic class independently and, therefore, we provide its formulation in the context of a single service class. As a result, in the following we omit the subscript c for clarity. Furthermore, without loss of generality, the definition of the OPQR-G problem is described by considering a DSC type of service class, i.e., we use delay as the QoS metric.
Problem OPQR-G: Given a network G(V,E), a cost/delay function for each link {Q_l(d)}_{l∈E}, end-to-end delay requirements for all OD pairs {D_uv}_{(u,v)∈Π}, and a set of candidate routes for all OD pairs in the network {R_uv}_{(u,v)∈Π}. Find a delay partition d = {d_l}_{l∈E} and a set of routes r = {r_uv}_{(u,v)∈Π} such that the network cost is minimised while the end-to-end delay requirements of all OD pairs are satisfied:

Minimise:

Σ_{l∈E} ( Σ_{(u,v)∈Π} Σ_{j∈R_uv} a_lj^uv x_uv^j ) Q_l(d_l)

Subject to:

Σ_{l∈E} a_lj^uv d_l ≤ x_uv^j D_uv + (1 − x_uv^j) M   ∀j ∈ R_uv, ∀(u,v) ∈ Π   (1.2a)

Σ_{j∈R_uv} x_uv^j = 1   ∀(u,v) ∈ Π   (1.2b)

x_uv^j ∈ {0,1}   ∀j ∈ R_uv, ∀(u,v) ∈ Π   (1.2c)

d_l ≥ d_l^prop   ∀l ∈ E   (1.2d)
The network cost represents the sum of all link costs, where the total cost of a link is a function of the delay allocation on that link and the number of flows that traverse the link. Each set of constraints is self-explanatory and follows directly from the definitions introduced in Section 3. This problem is NP-complete, as this has been shown to be the case even in the (simpler) case of unicast connections (Lorenz et al., 2000). As a result, we concentrate on the development of an efficient heuristic algorithm and derive a pseudo-polynomial time solution for the problem. The proposed greedy algorithm OPQR-G first considers each OD pair in isolation, by exploiting an algorithm for optimal QoS partition and routing for unicast connections (the OPQR problem) developed in Lorenz et al. (2000), and then performs a re-adjustment of the delay partitions in the network in order to reduce the overall network cost. In addition, two Linear Programming (LP) based heuristic algorithms were developed, LP1 and LP2, that use the optimisation tool ILOG CPLEX 7.1 LP for solving the OPQR-G problem. A brief description of each of these algorithms is given next; the reader is referred to the earlier work (Atov et al., 2003, 2004) for further details. 4.1.1 Algorithm OPQR-G. In our approach to the solution of the OPQR-G problem, we first compute, for each OD pair in isolation, an optimal route and the delay partitions on its links by using the algorithm for the unicast problem, i.e., algorithm OPQR provided in Lorenz et al. (2000). We then apply a reallocation heuristic algorithm called STRETCH to determine a single near-optimal set of routes for all OD pairs and delay partitions on all the links, so that the network cost is minimised while the end-to-end delay requirements of every OD pair demand in the network are still satisfied. The OPQR problem for unicast is a generalization of the RSP (Restricted Shortest Path) problem for integer cost functions (Hassin, 1992).
Consequently, the general idea that the algorithm OPQR (Lorenz et al., 2000) exploits is to represent each link l, based on the given link delay/cost functions, as a set of links {l_1, l_2, ..., l_U} corresponding to all possible costs on the link. The delay associated with each of these links is the minimum delay which achieves the specified cost. Thus, each link in this set is associated with a single cost/delay pair (j, d_l(j)). Once the set of links is created for each link in the network, one can then run the
restricted shortest path algorithm to find the optimal solution (Lorenz et al., 2000; Lorenz and Orda, 2002). The complexity of the OPQR algorithm is O(mU(U + log D)), where U is an upper bound on the cost of the solution (Lorenz et al., 2000). The general idea of the greedy OPQR-G algorithm can be explained as follows. When the OPQR algorithm is run for a given OD pair demand with an end-to-end delay requirement, it returns an optimal route and optimal delay partitions on the links along that route. After the OPQR algorithm has been run for all OD pairs in isolation, this produces a set of routes in the network, where a single unique route is allocated to each OD pair, and a set of delay partitions for each link in the network, where the elements in this set represent the delay partitions on the link from all the routes in the network that traverse that link. As each link may be traversed by multiple routes, the delay partition on a link must be set to the minimum value of all delay partitions associated with the link, in order to satisfy the end-to-end delay requirements of all routes that traverse that link. A set of delay partitions on the links obtained in this way results in a feasible solution. However, this solution will be too stringent, as it may be possible to relax the delay partitions on some links in order to further reduce the network cost while still satisfying the end-to-end delay requirements of all OD pairs in the network. In order to achieve this, a reallocation heuristic algorithm known as STRETCH is applied. Thus, the heuristic algorithm OPQR-G consists of two main parts: the initialisation step and an iteration step, which employs the reallocation algorithm STRETCH.
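The link-expansion idea can be sketched as a small dynamic program over a cost budget: each link is given as its list of (cost, minimum delay) options, and we search for the least total cost whose best-delay path meets the deadline. This is only a toy restatement of the restricted-shortest-path approach, not the algorithm of Lorenz et al. (2000); all data structures and names are ours.

```python
def cheapest_feasible_cost(graph, nodes, src, dst, deadline, max_cost):
    """Least total cost c <= max_cost such that some src->dst path,
    built from the expanded (cost, delay) link options, has delay <= deadline.

    graph: {u: [(v, [(cost, delay), ...]), ...]}, one option list per link.
    Returns the cost, or None if no budget up to max_cost suffices.
    """
    INF = float("inf")
    dp = []  # dp[c][v] = min delay of an src->v path of total cost <= c
    for c in range(max_cost + 1):
        level = dict(dp[c - 1]) if c else {v: INF for v in nodes}
        level[src] = 0.0
        # Bellman-Ford style relaxation; simple paths have < |V| edges
        for _ in range(len(nodes) - 1):
            for u in nodes:
                for v, options in graph.get(u, []):
                    for cost, delay in options:
                        if cost <= c:
                            prev = level if cost == 0 else dp[c - cost]
                            cand = prev[u] + delay
                            if cand < level[v]:
                                level[v] = cand
        dp.append(level)
        if level[dst] <= deadline:
            return c   # first (lowest) feasible budget
    return None
```

The pseudo-polynomial flavour is visible here: the running time grows with the cost bound U (here max_cost), exactly as in the O(mU(U + log D)) complexity quoted above.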
Step 1 - Initialisation In this step, the OPQR (or ε-OPQR) algorithm is run for each OD pair (u,v) in isolation with the given end-to-end delay requirement D_uv, to determine an optimal path r_uv and optimal delay partitions {d_l^{r_uv}}_{l∈r_uv} on the links along that path for each OD pair in the network. This produces a set of routes in the network r = {r_uv}_{(u,v)∈Π} and a set of delay partitions for each link in the network d_l^r = {d_l^{r_uv}}_{(u,v)∈Π} (∀l ∈ E). In order to satisfy the end-to-end delay requirements of all routes that traverse a given link, the initial delay allocation for a link is set to the minimum of all delay partitions from the routes that traverse that link, i.e., d_l = min{d_l^{r_uv}}_{(u,v)∈Π} (∀l ∈ E). This set of initial delay allocations on the links and the set of routes for all OD pairs in the network defines the input to the next algorithmic step. Step 2 - Iteration STRETCH In this step, the algorithm first examines all the routes in the network and performs classification of the
Step 2 - ITERATION STRETCH
Input: G(V,E), {r_uv}_{(u,v)∈Π}, {D_uv}_{(u,v)∈Π}, {d_l}_{l∈E}, {Q_l(d)}_{l∈E}

2a  for all pairs (u,v) ∈ Π: tight_uv ← false; n_r ← 0
    for all routes r_uv ∈ r:
        s ← D_uv − Σ_{l∈r_uv} d_l
        if (s > 0) slack_uv ← s; n_r ← n_r + 1
        else if (s = 0) tight_uv ← true
        else RETURN error
    update the residual cost functions for all l ∈ E

2b  if (n_r = 0) RETURN {d_l}_{l∈E}
    (u,v) ← arg min{slack_uv | tight_uv = false}
    p_LSF ← r_uv;  s_LSF ← min{slack_uv | tight_uv = false}
    {s_l^LSF}_{l∈p_LSF} ← OPQ(G(V,E), p_LSF, {Q_l(d)}_{l∈E}, s_LSF)
    for all l ∈ p_LSF: d_l ← d_l + s_l^LSF
    tight_uv ← true;  n_r ← n_r − 1
    for each (u',v') with tight_{u'v'} = false AND r_{u'v'} ∩ p_LSF ≠ ∅:
        slack_{u'v'} ← slack_{u'v'} − Σ_{l ∈ r_{u'v'} ∩ p_LSF} s_l^LSF
    update the residual cost functions for all l ∈ E
    go to Step 2b

Figure 1.2. Algorithm STRETCH
routes as being of either "tight" or "slack" type. A tight route is one for which the sum of the delay partitions on its links is equal to its end-to-end delay requirement, so that the allocated delay on this route cannot be further relaxed, i.e., Σ_{l∈r_uv} d_l = D_uv. A slack route is one for which the sum of the delay partitions on its links is less than its end-to-end delay requirement, so that the allocated delay can be further relaxed, i.e., Σ_{l∈r_uv} d_l < D_uv. Once the classification is done, the algorithm performs delay reallocation on the slack routes, resulting in an increased overall delay allocation in the network and thus in a reduced network cost. Specifically, in Step 2a, the tight and slack routes are determined based on the value of the slack factor for each route, which is calculated
as follows: s_uv^r = D_uv − Σ_{l∈r_uv} d_l. The slack routes are ordered in LSF (Least Slack First) order, i.e., in increasing order of their slack factors. All the links that belong to routes for which the slack factor is equal to zero (i.e., tight routes) are marked as "tight", as the delay partitions on those links cannot be further relaxed. Finally, this step updates the link cost functions (as the residual link cost functions) to account for the delays that are already allocated on the links due to the initialisation process. Step 2b runs the OPQ algorithm (Optimal QoS Partition for Unicast) for each slack route in LSF order, where the target delay that the algorithm uses as input is now the route slack factor instead. The OPQ algorithm performs delay partitioning given a route and an end-to-end delay requirement, taking into account the tight links: it will not try to allocate any delay to tight links on the route. The implementation of the OPQ algorithm requires a simple modification of the OPQR algorithm to allow route information to be specified at input and thus to perform delay partitioning on a specified route. When OPQ is run for the LSF route p_LSF, it returns a slack factor partition {s_l^LSF}_{l∈p_LSF} along this route. These slack factor partitions on the links belonging to the LSF route indicate that the delay partitions on these links can be relaxed by the value of their respective slack factor partitions without violating the delay constraints on other routes (including those that traverse one or more links of the current route). After this adjustment is made to the link delays belonging to the LSF route, the slack factors of all other slack routes that traverse any of the constituent links of the LSF route are also updated.
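The slack-factor computation and LSF ordering of Step 2a can be sketched as follows; the data structures are our own simplification of the route and partition sets defined above.

```python
def classify_routes(routes, bounds, link_delay):
    """Split routes into tight and slack, ordering the slack ones LSF.

    routes: {(u, v): [link ids on the primary route r_uv]}
    bounds: {(u, v): end-to-end delay bound D_uv}
    link_delay: {link id: current delay partition d_l}
    Returns (lsf_order, slack_factors, tight_routes).
    """
    slack, tight = {}, []
    for od, links in routes.items():
        s = bounds[od] - sum(link_delay[l] for l in links)
        if s < 0:
            raise ValueError("infeasible delay allocation on route %s" % (od,))
        if s == 0:
            tight.append(od)   # cannot be relaxed any further
        else:
            slack[od] = s      # candidate for the STRETCH relaxation
    lsf_order = sorted(slack, key=slack.get)   # Least Slack First
    return lsf_order, slack, tight
```

Each STRETCH iteration would then pop the head of lsf_order, redistribute its slack along its links via OPQ, and recompute the slack of any route sharing a link with it.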
Finally, in this step, the link cost functions are updated to account for the delays that are already allocated on the links. This step is repeated until all slack routes have been considered.

Complexity of the algorithm OPQR-G. The initialisation step requires the OPQR algorithm to be run for each OD pair in the network. Thus, the overall running time of this step is O(n²mU(U + log D)). The iteration STRETCH step first examines all routes to determine those which are slack, and then for each slack route the algorithm OPQ is run to re-adjust the delay partitions. The running time of this operation is determined by the number of links on a slack route and the running time of the OPQ algorithm. The worst-case running time for this step is O(m(n−1)U(U + log D)). Hence, the complexity of the heuristic algorithm OPQR-G is O((n² + n − 1)mU(U + log D)), where U is an upper bound on the cost of the solution and D is the maximum delay requirement over all OD pairs. It can be seen that the
run time of the iteration STRETCH is much less than the run time of the INITIALISATION step; therefore, the complexity of the algorithm OPQR-G can be rewritten as O(n²mU(U + log D)).

4.1.2 Algorithm LP1. The greedy fashion in which the algorithm OPQR-G re-adjusts the delay partition route-by-route does not ensure that the delay partition found by STRETCH will be the best for the given set of routes. Taking this observation into account, the algorithm LP1 improves upon STRETCH and guarantees optimality of the delay partitioning for the set of routes pre-determined by INITIALISATION. Given the routes for all OD pairs, the problem OPQR-G can be reformulated as an LP problem (problem OPQR-G-LP1) as shown below. It will become clear when we present algorithm LP2 that, without any knowledge of the routes, it is impossible to describe the problem OPQR-G as an LP problem². Minimise:
Σ_{(u,v)∈Π} Σ_{l∈r_uv} Q_l(d_l)                              (1.3)

Subject to:

Σ_{l∈r_uv} d_l ≤ D_uv        ∀(u,v) ∈ Π                      (1.3a)
d_l ≥ d_l^prop               ∀l ∈ E                          (1.3b)
The algorithm LP1 and the heuristic OPQR-G are conceptually the same. Both LP1 and the heuristic use INITIALISATION to compute the routes for all OD pairs and then re-adjust the delay partition such that the new partition does not violate the end-to-end delay constraints. The only thing that differentiates the two algorithms is how the re-adjustment of the delay partition is achieved. In the case of LP1, the delay partition achieved on the given set of routes is guaranteed to be optimal. However, there is a trade-off when using the LP approach for solving the problem, in that it will more than likely require a longer time to run than the heuristic.

4.1.3 Algorithm LP2. A limitation of both of the previous heuristics is that the routes are kept unchanged after the INITIALISATION step. In the case where these routes have many common links

² Note that the formulation of the OPQR-G-LP1 problem defines the optimal QoS partitioning problem in the context of a multicommodity flow network (the OPQ-G problem). The heuristic algorithm OPQR-G can be easily modified to provide the solution for the OPQ-G problem. The only modification required is that the INITIALISATION algorithm runs the OPQ algorithm instead of OPQR for unicast connections for each given route between the node pairs in the network.
Design of IP Networks with End-to-End Performance Guarantees
LP2
1   for all l ∈ E
2       n_l ← 1
3   for all (u, v) ∈ Π
4       generate a set of routes R_uv
5   determine {x_j^uv} and {d_l} by solving OPQR-G-LP2
6   for all (u, v) ∈ Π
7       if x_j^uv = 1
8           r_uv ← r_j^uv (the j-th route from R_uv)
9   for all l ∈ E
10      n'_l ← 0
11  for all (u, v) ∈ Π
12      for all l ∈ r_uv
13          n'_l ← n'_l + 1
14  Δ ← Σ_{l∈E} (n_l − n'_l)²
15  if Δ > τ
16      for all l, n_l ← n'_l
17      go back to line 5
18  return the set of routes r and partition d

Figure 1.3. Algorithm LP2
and/or the cost function of some common links is steeply decreasing, then keeping the routes fixed may not be a good idea. The LP2 heuristic manages to solve the overall OPQR-G problem with standard LP techniques. In the original formulation (1.2), both x_j^uv and d_l are unknowns; therefore, this formulation is not linear and cannot be solved by LP solvers. This problem is overcome by employing an iterative algorithm, as described in Figure 1.3. We have reformulated the OPQR-G problem as an LP problem (problem OPQR-G-LP2), defined below, which the algorithm LP2 must solve at each iteration step. Minimise:
Σ_{l∈E} n_l Q_l(d_l)                                         (1.4)

Subject to:

Σ_{l∈r_j^uv} d_l ≤ x_j^uv D_uv + (1 − x_j^uv) M    ∀r_j^uv ∈ R_uv, ∀(u,v) ∈ Π    (1.4a)
Σ_{j∈R_uv} x_j^uv = 1,  x_j^uv ∈ {0,1}             ∀j ∈ R_uv, ∀(u,v) ∈ Π         (1.4b)
d_l ≥ d_l^prop                                     ∀l ∈ E                        (1.4c)
The variable n_l in the formulation above represents the total number of routes that traverse link l. The algorithm starts by initialising the vector n = {n_l}_{l∈E} to a unit vector. Then it determines the set of routes and delay partitions in the network by solving the OPQR-G-LP2 problem. In addition, the vector n is updated with the number of routes that traverse each link, as computed in the previous step. The algorithm iterates until the vector n converges.

4.1.4 Discussion. It is easy to see that a lower bound, LB, on the optimal cost can be computed as LB = Σ_{(u,v)∈Π} Σ_{l∈r_uv} Q_l(d_l^{r_uv}), where r_uv and {d_l^{r_uv}}_{l∈r_uv} are the route and the delay partitions on its links found by running OPQR for an OD pair (u, v). We shall use the LB as a benchmark for the performance analysis of the three proposed heuristic algorithms in the next section. The heuristics can always provide a solution (if a feasible solution exists), as mentioned before. However, like other heuristics, they cannot guarantee optimal solutions. When the routes found in INITIALISATION are link-disjoint, the heuristic will not change either the routes or the delay partitions; hence, the results are guaranteed to be optimal. If the routes have common links, the solution may not be optimal. The solution is guaranteed not to exceed the optimal cost by more than Σ_{(u,v)∈Π} Σ_{l∈r_uv} [Q_l(d_l) − Q_l(d_l^{r_uv})], where {d_l}_{l∈E} is the delay partition computed by the heuristics. According to the above analysis, the heuristic is not expected to work very well when the routes found by running the OPQR algorithm separately for each OD pair have many common links, or when Q_l(d_l) − Q_l(d_l^{r_uv}) is large. For example, the heuristic may have trouble with a network that consists of two parts connected via a single link where the (non-increasing) cost function of that link is very steep.
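The outer iteration of LP2 (initialise n to a unit vector, solve, recount routes per link, repeat until n converges) can be sketched as follows; the solver callback is a stand-in for an actual OPQR-G-LP2 solve, and the convergence test uses the squared difference Δ between successive n vectors with a threshold τ, as in Figure 1.3:

```python
# Sketch of LP2's outer loop. solve_opqr_g_lp2 stands in for any LP/MIP
# solver of problem (1.4): given the weights n it returns the chosen routes.

def lp2_iterate(links, solve_opqr_g_lp2, tau=0, max_iter=50):
    n = {l: 1 for l in links}              # initialise n by a unit vector
    for _ in range(max_iter):
        routes = solve_opqr_g_lp2(n)       # {(u,v): [links of chosen route]}
        n_new = {l: 0 for l in links}
        for route in routes.values():
            for l in route:
                n_new[l] += 1              # recount routes traversing each link
        delta = sum((n[l] - n_new[l]) ** 2 for l in links)
        n = n_new
        if delta <= tau:                   # n has converged
            break
    return routes, n

# Toy solver: the route choice here does not depend on n, so the loop
# converges after one recount.
def toy_solver(n):
    return {(1, 2): ['a', 'b'], (1, 3): ['a']}

routes, n = lp2_iterate(['a', 'b', 'c'], toy_solver)
```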
4.2 Problem CA
The CA problem poses the challenge of performing appropriate link dimensioning in order to balance quality of service against costly overprovisioning. From a mathematical modelling point of view, a major challenge is to incorporate a tractable mathematical characterisation of IP data traffic and a tractable stochastic model of the QoS queueing mechanisms used at the routers. Traditionally, for ease of tractability, a Poisson traffic descriptor has been used to represent the external traffic sources and, consequently, Jackson's network performance model has been used in network design. However, it is very important that the chosen traffic descriptor is able to capture the burstiness of the packet arrivals,
as real IP traffic exhibits this characteristic. The dimensioning model applied for the solution of the CA problem in the proposed framework incorporates procedures that allow the burstiness of multi-class IP traffic to be effectively modelled. The choice of a renewal traffic model, i.e., a GI arrival process, for this purpose represents a reasonable balance between the accuracy and the efficiency required by a network design tool, especially when large networks, as well as large aggregates of individual flows into service classes, are considered. This is due to the multiplexing which occurs on a very large scale in this case and, as a result, the correlations reduce significantly due to the inter-mixing of packets from different traffic streams. We model a class flow (i.e., a single class of traffic between an OD pair) as a GI arrival process characterised by the following set of parameters: {λ_c, c²_{a,c}, X_c, X_c²}_{r_uv∈r_c}. The first two parameters denote the mean packet arrival rate and the squared coefficient of variation (SQV) of the packet inter-arrival times of the class c flow (c = 1, 2, ..., C), which has been assigned a route r_uv. The third and fourth parameters denote the first two moments of the packet size of the class c flow, respectively. For this renewal process, the coefficient of variation is used to characterise traffic burstiness, i.e., the variability of the arrival stream. IP traffic is bursty in the sense that its squared coefficient of variation is always greater than or equal to unity. In Atov and Harris (2002a), we have presented models which can be used to translate the OD pair traffic demands for each class of traffic, as obtained from traffic measurement data, into equivalent GI arrival process parameters for direct application in the dimensioning procedures. Due to the aggregation of individual flows into traffic classes at the ingress of the network, individual flows are first modelled by distinguishing between TCP-based and UDP-based flows.
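For illustration, the two arrival-process descriptors can be estimated from measured inter-arrival times as follows; this is a hypothetical helper, not the translation models of Atov and Harris (2002a):

```python
# Estimate the GI-process descriptors from inter-arrival samples: the mean
# packet arrival rate and the squared coefficient of variation (SQV) of the
# inter-arrival times, the latter being the burstiness measure used here.

def gi_parameters(interarrivals):
    k = len(interarrivals)
    mean = sum(interarrivals) / k
    var = sum((t - mean) ** 2 for t in interarrivals) / k
    lam = 1.0 / mean           # mean packet arrival rate
    scv = var / mean ** 2      # SQV of inter-arrival times
    return lam, scv

lam, scv = gi_parameters([1.0, 1.0, 1.0, 1.0])     # deterministic stream
lam2, scv2 = gi_parameters([0.1] * 9 + [9.1])      # bursty stream: SQV > 1
```

A perfectly regular stream has SQV 0, a Poisson stream SQV 1, and the clustered second sample an SQV well above 1, matching the burstiness notion used in the text.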
The model also takes account of the traffic aggregation and conditioning functions at the ingress of the network. From the class flows offered to the network and the given routing information (obtained from the OPQR-G problem solution), a characterisation of the internal class flows can be obtained by applying the methods for superposition, departure and splitting of GI traffic arrival processes provided by the well-known QNA analysis (Whitt, 1983). However, the internal class flows cannot be derived in a single iteration step (i.e., after a single analysis of each node in the network), as their traffic descriptors depend on the service capacities allocated on the links, which are not known and need to be determined in the CA procedure. In order to deal with this interdependence, we calculate the internal flows for the classes, as well as their bandwidth allocations on the links, iteratively as part of the CA procedure.
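The QNA flow calculus referred to above can be illustrated with its simplest two-moment approximations (rate-weighted superposition, Markovian splitting, and a Marshall-type departure SCV); these are simplified versions of the formulas in Whitt (1983), shown only to indicate the kind of calculation involved:

```python
# Two-moment approximations for internal flows, QNA-style (Whitt, 1983).
# Each flow is described by (rate, scv), scv being the SQV of inter-arrival
# times; the formulas below are the basic stationary-interval versions.

def superpose(flows):
    """Superposition of independent streams: rates add, SQV is approximated
    by the rate-weighted average of the component SQVs."""
    total = sum(r for r, _ in flows)
    scv = sum(r * c for r, c in flows) / total
    return total, scv

def split(rate, scv, p):
    """Markovian splitting: each packet routed away with probability p."""
    return p * rate, p * scv + 1.0 - p

def departure_scv(rho, scv_arrival, scv_service):
    """Marshall-type approximation for the SQV of a GI/G/1 departure stream
    at utilisation rho."""
    return rho ** 2 * scv_service + (1.0 - rho ** 2) * scv_arrival

rate, scv = superpose([(2.0, 1.0), (2.0, 3.0)])
```

Chaining these three operations node by node is what yields the internal class flows; since the utilisation rho depends on the allocated capacities, the chain must be recomputed inside the CA iteration, exactly as described above.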
Input: A_c = {(λ_c, c²_{a,c}, X_c, X_c²)}_{r_uv∈r_c} for c ∈ C.
Output: b_c = {b_cl}_{l∈E} for all QoS-sensitive classes.
1   for class c = 1 to C − 1
2       for all links (l ∈ E)
3           initialise the service-time moments (τ_cl, c²_{s,cl})
4   for class c = 1 to C − 1
5       switch (q_c)
6       case q_c = {d_cl}_{l∈E}:
6a          for all links (l ∈ E)
6b              calculate internal flows: {(λ_cl, c²_{a,cl})}_{l∈E}
6c              invert the delay formula to derive b_cl
6d          if max_l { |b_cl − b'_cl| / b'_cl } > ε
6e              b'_cl ← b_cl for all l and repeat steps 3, 6
7       case q_c = {d_cl, v_cl}_{l∈E}:
7a          ...
Figure 1.4. Algorithm to determine class-based bandwidth allocations
The bandwidth required for each delay-sensitive class c on a link l, b_cl, is determined from the traffic characteristics of the class c flow on link l and its delay (QoS) constraint for that link, q_cl = {d_cl, v_cl}, by inverting an approximate delay formula F; that is, find b_cl where q_cl = F(λ_cl, c²_{a,cl}, b_cl) for c = 1, 2, ..., C. For this purpose, we analyse the GI/G/1 delay performance model (Whitt, 1983). The CA algorithm, shown in Figure 1.4, based on the given topology, the offered class flows and the link delay QoS constraints for each class of traffic, returns class-based bandwidth allocations for all links in the network (for details see Atov and Harris, 2002b). The algorithm is initialised with starting values for the service times at the nodes, which are specified by their first two moments (τ_cl, c²_{s,cl}). Their initial values are computed from the first two moments of the packet size for the classes and their initial fixed capacity shares on the links, which are initially set to sufficiently large values. Once the initialisation step is completed, the algorithm cycles through the classes and determines their required bandwidth allocations by numerically inverting a delay
formula, a variance of delay formula, or both, depending on the type of class considered. The algorithm calculates the internal flows and the class-based bandwidth allocations iteratively until the capacities on the links converge. Having obtained the bandwidth required for each delay-sensitive traffic class on a link, b_cl (c = 1, 2, ..., C), the total capacity of the actual link (i.e., the link setup capacity) can be determined as the minimum capacity of all available capacities on the link that is higher than the linear sum of the individual bandwidth allocations for the classes, i.e., b_l^setup = mod[Σ_c b_cl], where mod[·] is the smallest value of all available capacities on the link (θ_l^k, k ∈ I_l) which is greater than or equal to the argument value. The set of these link setup capacities b_l^setup (l ∈ E) determines the required input for the solution of the cost minimisation problem considered next.
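The numerical inversion of the delay formula, and the rounding of the summed class allocations up to a link setup capacity, can be sketched as follows. The Kingman-style GI/G/1 waiting-time approximation below is a stand-in for the QNA delay model actually used, and all data values are illustrative:

```python
# Sketch: find the smallest bandwidth b whose approximate GI/G/1 mean delay
# meets the per-link target (numerical inversion by bisection), then round
# the summed class allocations up to the next available line capacity.

def mean_delay(b, lam, ca2, x_mean, cs2=0.0):
    """Approximate mean packet delay (wait + service) at capacity b,
    Kingman-style; ca2/cs2 are the arrival/service SQVs."""
    s = x_mean / b                      # mean service time
    rho = lam * s
    if rho >= 1.0:
        return float('inf')             # unstable queue: infinite delay
    wq = ((ca2 + cs2) / 2.0) * (rho / (1.0 - rho)) * s
    return wq + s

def invert_delay(target, lam, ca2, x_mean, lo=1e-6, hi=1e9, iters=100):
    """Bisect on capacity: delay decreases in b, so keep the smallest b
    whose delay does not exceed the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_delay(mid, lam, ca2, x_mean) > target:
            lo = mid                    # too slow: need more bandwidth
        else:
            hi = mid
    return hi

def setup_capacity(class_bw, available):
    """mod[.]: smallest available capacity >= sum of class allocations."""
    need = sum(class_bw)
    return min(c for c in available if c >= need)

b = invert_delay(target=0.01, lam=100.0, ca2=1.0, x_mean=1.0)
```

With these parameters the inversion settles near b ≈ 170.7, the exact root of the underlying quadratic, and setup_capacity then plays the role of the mod[·] rounding above.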
4.3 Problem CM
Cost minimisation is performed after the links are sized for capacity based on the total bandwidth requirements for the delay-sensitive traffic classes, since, in practice, the capacities of the links in the network are restricted to a set of discrete values. In order to achieve cost minimisation, here we concentrate on a flow assignment optimisation problem for the BE service class only. That is, we consider a multicommodity minimum cost network flow problem where it is required to send BE traffic flows that follow static routes to satisfy all OD pair BE traffic demands, given links with existing capacities, or to install, in discrete amounts, additional facilities (i.e., line types) with fixed capacities. The set of available line types for each link, along with the set of candidate routes for each OD pair, is given. In addition, only the design costs of the links are considered in the problem. The allocation of the BE traffic flows on the links should be done in such a way that the network cost is minimised while the maximum utilization constraint for the links is still satisfied. This problem provides the basis for the solution of many interesting and practical engineering problems, of which one notable example is the link capacity expansion problem of existing (or capacitated) communication networks. In the following, we give a formal definition of the CM problem as an LP problem. For each link l, l ∈ E, there is an available set of line types I_l. These line types have different capacities, θ_l^k (k = 1, 2, ..., N), starting from θ_l^1 = b_l^setup, the setup capacity for the link as computed in the CA procedure. The interval between two available capacities for the link, (θ_l^k, θ_l^{k+1}], is associated with a fixed cost, γ_l^{k+1} (k = 1, 2, ..., N), starting
from γ_l^1 = γ_l^setup, the setup cost for the link, i.e., the cost associated with θ_l^setup. Accordingly, we define a fixed setup cost for the network as the sum of all link setup costs:

K = Σ_{l∈E} γ_l^setup                                        (1.5)
The cost function which we need to minimise represents the sum of the fixed setup cost of the network and the costs of all links in the network which require installation of additional capacity in order to accommodate the BE traffic demands. The total average traffic demand for the best-effort service class of OD pair (u, v) is denoted λ_uv. After the best-effort flows are assigned in the network, the actual capacities required for the links, U_l (for l ∈ E), can be computed as follows:
U_l = Σ_{c∈C} b_cl + Σ_{(u,v)∈Π} Σ_{j∈R_uv : l∈r_j^uv} λ_uv x_j^uv                    (1.6)

Mathematically, the CM problem can be formulated as finding the {x_j^uv, y_l^k} values that satisfy the following:
Minimise:

K + Σ_{l∈E} Σ_{k∈I_l} γ_l^k y_l^k                            (1.7)

Subject to:

U_l ≤ ρ θ_l^k y_l^k + (1 − y_l^k) θ_l^N    ∀k ∈ I_l, ∀l ∈ E          (1.7a)
Σ_{j∈R_uv} x_j^uv = 1                      ∀(u,v) ∈ Π                (1.7b)
Σ_{k∈I_l} y_l^k = 1                        ∀l ∈ E                    (1.7c)
x_j^uv ∈ {0,1}                             ∀j ∈ R_uv, ∀(u,v) ∈ Π     (1.7d)
y_l^k ∈ {0,1}                              ∀k ∈ I_l, ∀l ∈ E          (1.7e)
Note that, by minimising the objective cost function, we ensure that the minimal available capacities are chosen for the links (as we assume that the cost of the available capacities increases with capacity), which is not explicitly specified in the constraints above. Furthermore, in this way load balancing is achieved in the network, as this optimisation model will try to maximise the utilization of the capacitated links before installing additional units of capacity. We have developed an LP-based algorithm that uses the optimisation tool ILOG™ CPLEX 7.1 for finding an optimal solution for the CM problem (Atov and Harris, 2003). A discussion of the scalability of the LP-based algorithm is provided in the next section.
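A minimal sketch of the per-link expansion decision that the model encodes with the y variables, for a fixed flow assignment, is given below; the data and the helper name are hypothetical, and the actual problem is of course solved jointly over all links and routes with CPLEX:

```python
# Illustration of the link-expansion choice: for a fixed best-effort flow
# assignment (hence a fixed load U_l), pick the cheapest line type whose
# capacity keeps utilisation below rho_max. Not the full multicommodity LP.

def expand_link(load, line_types, rho_max=0.8):
    """line_types: list of (capacity, cost) pairs, cost increasing with
    capacity. Returns (capacity, cost) of the cheapest feasible type."""
    feasible = [(cost, cap) for cap, cost in line_types
                if load <= rho_max * cap]
    if not feasible:
        raise ValueError("no line type can carry this load")
    cost, cap = min(feasible)           # cheapest feasible line type
    return cap, cost

# Setup capacity 45 (first entry); a load of 40 forces one expansion step,
# since 0.8 * 45 = 36 < 40.
cap, cost = expand_link(40.0, [(45, 10), (90, 18), (180, 30)])
```

Because cost increases with capacity, picking the minimum-cost feasible type is exactly the behaviour the objective (1.7) induces on each link once the flows are fixed.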
5 Computational results
Although heuristic methods are very difficult to assess for performance, we evaluated the network design model in terms of the accuracy and scalability of each of the main components employed by the model. We now briefly summarize the results of our numerical study involving each of the building blocks in the proposed design model; see Atov et al. (2003), Atov et al. (2004), and Atov and Harris (2003) for more details. The first component (problem OPQR-G) represents, arguably, the most crucial step in this design model, as it greatly influences the results in the second and third steps. We have developed a greedy algorithm, OPQR-G, which provides a pseudo-polynomial time solution for this problem, and extensive numerical analysis showed that the algorithm is quite powerful with respect to optimality, scalability and solution time. The complexity of the algorithm has been proven to be O(n²mU(U + log D)). In addition, the two LP-based heuristic algorithms, LP1 and LP2, perform slightly better with respect to accuracy; however, our tests revealed their lack of scalability, and they cannot be used for networks larger than 15 nodes. The performance of the three proposed heuristic algorithms was examined using the lower bound on the optimal cost as a benchmark and, thus, the RE for the solution costs obtained by the heuristics represents a worst-case scenario. In the following we show results from two different sets of tests, test-1 and test-2, performed for the analysis of the heuristics for the OPQR-G problem. Test-1 consisted of five different test scenarios performed on a 15 node network, shown in Figure 1.5 (an ATM network obtained from Lee et al., 2000), where for each test scenario, delay requirements ({D_uv}_{(u,v)∈Π}) for all OD pairs were randomly assigned from the following sets: Scenario 1 = (10-15) ms, Scenario 2 = (15-25) ms, Scenario 3 = (25-40) ms, Scenario 4 = (40-70) ms and Scenario 5 = (70-100) ms.
Links were randomly assigned a cost/delay function from a set of five different cost/delay functions, which are plotted in Figure 1.6. We tested the heuristic algorithms with the test network and the settings described above on a Pentium 1.7 GHz PC with 512 MB RAM, and the results obtained from test-1 are summarized in Table 1.1. Table 1.1 shows the difference between the solution costs obtained by the algorithms greedy OPQR-G, LP1 and LP2 against the lower bound (LB) of the optimal cost for all five test scenarios. The relative error (RE), defined as RE = 100(heuristic − LB)/LB, for the various scenarios is shown immediately below the cost value for OPQR-G, LP1 and LP2, respectively. In addition to the relative error of the solution costs, the last column in the table shows a discrepancy for the route allocation obtained from the greedy OPQR-G, in terms of the number of routes, out of the total number of routes in the network, that were different from the routes obtained using LP2. For most scenarios we did not record any discrepancy in the route allocations, except in Scenario 4, where there was only one route out of twelve that was different from the routes found by LP2. It can be seen that, in terms of optimality, LP1 performs best, followed by LP2 and OPQR-G. LP2, at best, provides the same cost solution as LP1 when there is no route discrepancy recorded. However, when there was a route discrepancy recorded, LP2 did not provide a better solution than LP1.

Figure 1.5. 15 Node Test Network

Figure 1.6. Cost/delay test functions: c1(d) = 50/d, c2(d) = 35/d, c3(d) = 65/(8+d), c4(d) = 150/(12+d), c5(d) = (150+d)/(d²+1)

Table 1.1. Difference between solution costs of the algorithms OPQR-G, LP1, LP2 and LB

Test-1   LB       OPQR-G           LP1              LP2              Route %
Sce 1    225.33   239.66 (6.35%)   237.00 (5.17%)   239.66 (6.35%)   N/A
Sce 2    163.51   172.17 (5.29%)   167.84 (2.64%)   171.60 (4.94%)   N/A
Sce 3    121.49   127.52 (4.96%)   126.77 (4.34%)   126.77 (4.34%)   N/A
Sce 4    107.95   113.95 (5.56%)   110.96 (2.79%)   113.08 (4.75%)   8.33

Test-2 was performed with the same settings as test-1 on a network of larger size, i.e., a 28 node network, shown in Figure 1.7 (a USA network obtained from Gavish and Neuman, 1989), in order to analyse the scalability of the heuristics.

Figure 1.7. 28 Node Test Network

The tests performed on this network showed that the greedy OPQR-G algorithm was able to obtain good solutions in a very short time (about 15 seconds), whereas LP1 and LP2 failed after running for 30 minutes due to a shortage of memory. Due to the scalability problems of the algorithms LP1 and LP2, we performed more extensive analysis of the algorithm OPQR-G only for this and larger size networks. From the analysis in Section 4.1.4, it is expected that the performance of the algorithm OPQR-G will depend upon several factors. Its accuracy decreases with an increase in the number of traversing routes per link, particularly when the routes' delay requirements vary quite dramatically and when the cost/delay function of the traversed link is very steep around the value of the delay that is allocated to the link. In order to verify this, we have increased the number of OD pairs in the
test scenarios, starting from 15 OD pairs, to 40 OD pairs and finally 60 OD pairs, thus achieving a higher number of traversing routes per link. Having three different sets of OD pair demands, we performed 15 test scenarios. Moreover, each of these 15 scenarios was run 16 times, by randomising the OD pair delay assignment as well as the cost/delay function assignment on the links, resulting in a total of 240 tests. Table 1.2 gives a summary of the results obtained from test-2. Each column in the table corresponds to one of the five scenarios for the OD pair delay assignment. The results are partitioned into three main rows in the table, each corresponding to a different number of OD pairs considered in the network, i.e., 15, 40 or 60, respectively. In each main row we report the minimum, the maximum and the average percentage RE obtained from all runs for each test scenario. From Table 1.2 we can observe how the RE increases with the number of OD pair demands in the network; e.g., the RE reported in the third main row is higher than the RE in the first main row across all five scenarios, thus confirming the assertion that an increase in the number of traversing routes per link can affect the performance of the algorithm OPQR-G. In addition, it can be noticed that the RE increases across the columns, especially from Scenario 2 to Scenario 5. When the OD pair delays assigned from Scenario 2 to Scenario 5 are partitioned onto the links, it is likely that the resulting link delay partitions will fall into the range where the cost/delay functions are not very steep. Therefore, in this case, the RE is not so much determined by the steepness of the cost functions of the most traversed links in the network, but is mainly due to the large variation of the routes' end-to-end delay requirements.
For a given link, the more the end-to-end delay requirements of the traversing routes vary, the higher the likelihood that the routes' optimal delay partitions for that link will vary against the delay partition found by the heuristic. The range in which the link delay partitions are likely to fall increases across the five scenarios from around 3 delay units to 25 delay units (as routes traverse approximately four to five hops in the network). Notice that, due to the similar range of link delay partitions in Scenario 4 and Scenario 5 (i.e., the link delay partitions fall in a range where the cost/delay functions are not very steep), the RE values for the two scenarios are very close. As for Scenario 1, the RE is mainly due to the steepness of the cost/delay functions of the most frequently traversed links in the network. As can be seen from Figure 1.6, the steepness of the cost/delay functions is most pronounced in the range
Table 1.2. Solution cost RE (%) of algorithm OPQR-G for a 28 node network

Test-2        OD pairs   Sce 1   Sce 2   Sce 3   Sce 4   Sce 5
Minimum          15       3.25    2.68    5.63    6.17    6.79
Maximum          15      17.42   11.97   14.51   15.89   14.03
Average RE       15      11.21    8.50   10.17   11.55   10.72
Minimum          40       4.85    6.39    9.27    9.75   11.12
Maximum          40      19.57   18.59   19.13   17.96   18.36
Average RE       40      14.53   12.82   13.90   14.18   15.91
Minimum          60       5.43    7.64   10.07   11.69   12.80
Maximum          60      22.46   20.46   22.03   21.30   22.19
Average RE       60      16.82   15.59   14.93   16.71   17.98
from 1 to 5 delay units, in which the link delay partitions for the case of Scenario 1 are likely to fall. The various tests presented in this section showed that the greedy algorithm OPQR-G provides the fastest solution of all the algorithms proposed for problem OPQR-G and performs well in terms of cost when compared to the lower bound on the optimal cost. The tests conducted for networks larger than 15 nodes showed a lack of scalability of the two LP-based algorithms, whereas the algorithm OPQR-G scaled well and was consistent in providing good solutions in a very short running time. In addition to the tests presented in this section, we tested the algorithm OPQR-G on randomly generated networks of size larger than 30 nodes - specifically, networks of size 50, 100, and 150 nodes were considered - and, for all tests, the worst-case RE was not higher than 35%, with the longest running time being several minutes. We have performed tests for various problem sizes in order to analyse the scalability of the LP-based algorithm devised for the solution of the CM problem and to answer the question as to what size problem we could hope to solve using the standard solver ILOG™ CPLEX 7.1. Specifically, the tests involved randomly generated networks of varying size between 15 and 120 nodes. For example, networks of various structures and sizes between 15 and 30 nodes required from 10 to 70 seconds to solve. A sample network of 50 nodes and 186 links took less than 5 minutes to solve. At the point of infeasibility (the worst case) the run time was 85 minutes. This was a problem instance from a 120 node network with 412 links, a set of 20 available line types for the links and a set of 5 candidate routes for the OD pairs. The computational results showed that realistic size problems can be solved with the proposed LP model; however, our current efforts are focused on the development of more
efficient heuristics for the solution of this problem. Note that this was not the case with the LP-based algorithms (LP1 and LP2) devised for the OPQR-G problem. Here it is worth commenting on what caused the poor scalability of these LP-based algorithms. The OPQR-G problem involves cost/delay functions, one for each link, which are of general integer type. These functions had to be translated into equivalent piece-wise linear functions in the LP code, which considerably slowed the run time of these algorithms. Moreover, the OPQR-G problem is defined with integer decision variables (i.e., delay and jitter partitions), as opposed to the binary variables defining the problem CM, which drastically reduce the problem size. Finally, to assess the capability of the proposed CA algorithm in delivering the end-to-end delay QoS guarantees for the traffic classes, we employ a simulation study using a queueing network simulator. The test scenario (see Figure 1.8) used in this case study is simple, but sufficiently illustrative to indicate the quality that can be expected from the proposed dimensioning tool across a wide range of traffic input parameters and traffic intensities at the nodes. For this study, each traffic stream (TS) is modelled as a hyperexponential process, which is a suitable model for bursty renewal traffic since its SQV of inter-arrival times is always greater than unity. A single service class is considered, and the traffic offered to the network is comprised of 10 TSs of type i between the OD pair (1,4) and 10 TSs of type j between the OD pair (2,4), where i, j are chosen from a set of five different TSs, each having different values for the mean packet arrival rate and the SQV of inter-arrival times. In this case, all combinations of (i,j) were considered, which results in 10 different traffic load scenarios.
Since the network dimensioning model effectively does not take account of the statistical multiplexing at the nodes, the capability of the designed network in delivering the end-to-end delay QoS guarantees for a given service class will always be better than the performance of the individual service class case. The cases we use have been set up to test the proposed method over a fairly extreme set of circumstances, i.e., over extremes of combinations of arrival rate and variability. The combinations of arrival rate which have been considered for the two component processes start from equal rates for the two processes and extend to combinations where one process has two to twenty times the mean arrival rate of the other. The combinations of SQV of the inter-arrival times that we consider are all combinations of different pairs from {1.5, 2.5, 5, 8.5, 10}. The packet size for the traffic streams is set to a constant value of 1000 bytes; thus, our model results in a network of GI/D/1 queues. The packet delay requirements for the traffic aggregates between the OD pairs are 110 msec and 95 msec, respectively.
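A hyperexponential stream with a prescribed mean rate and SQV can be parameterised with the standard balanced-means H2 fit; this is our illustration of how such bursty test streams can be constructed, not necessarily how the study's TSs were generated:

```python
# Balanced-means fit of a two-phase hyperexponential (H2) to a target mean
# arrival rate lam and target SQV of inter-arrival times (requires SQV >= 1,
# i.e., bursty renewal traffic). An arrival draws Exp(lam1) with probability
# p, and Exp(lam2) otherwise.
import math

def fit_h2(lam, scv):
    p = 0.5 * (1.0 + math.sqrt((scv - 1.0) / (scv + 1.0)))
    return p, 2.0 * p * lam, 2.0 * (1.0 - p) * lam

def h2_moments(p, lam1, lam2):
    """Mean and SQV of the fitted H2 inter-arrival distribution."""
    mean = p / lam1 + (1.0 - p) / lam2
    m2 = 2.0 * p / lam1 ** 2 + 2.0 * (1.0 - p) / lam2 ** 2
    return mean, m2 / mean ** 2 - 1.0

p, l1, l2 = fit_h2(lam=1.0, scv=5.0)   # SQV 5 is one of the values tested
mean, scv = h2_moments(p, l1, l2)
```

Recomputing the moments of the fitted distribution recovers the target mean and SQV exactly, which is what makes this fit convenient for generating streams across the whole {1.5, 2.5, 5, 8.5, 10} SQV grid.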
Figure 1.8. Simple test network (OD pair delay requirements: (1,4): 110 msec; (2,4): 95 msec)
The link delay requirements are as follows: d_{1-3} = 85 msec, d_{2-3} = 70 msec and d_{3-4} = 25 msec. The link bandwidth allocations obtained from the CA algorithm for all traffic load scenarios considered in this test resulted in traffic intensities at the nodes (links) in the range from 0.30 to 0.85. For each test scenario (i.e., each (i,j) pair), a network simulation experiment was set up based on the link capacities obtained from the dimensioning model. For each experiment we performed nine simulations with different seeds for the random number generator. The estimates of the mean packet delays were computed based on the replication method, and 95-percent confidence intervals were obtained assuming a Student-t distribution. The results for this case study are summarized in Table 1.3. Each row in the table corresponds to one of the ten different traffic load scenarios considered in this study. The first and second columns show the results obtained from the simulations for the mean packet delay of the traffic aggregates between the node pairs (1,4) and (2,4), respectively, in the capacitated network. Columns three and four show the relative percentage errors for the mean packet delays between the two node pairs with respect to the target delays, which are defined as RE = 100(Simulation delay − Target delay)/Target delay. At the bottom of the table are the average absolute relative percentage errors (ARE) for each OD pair, defined as ARE = Σ_{i=1}^{10} |RE_i|/10. In all cases, the dimensioning model slightly overprovisions the network, as the mean delays obtained in the capacitated network are always less than the specified end-to-end delay constraints. Thus, the performance of the network is guaranteed to be better than what is required by the delay-sensitive traffic classes.
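The error measures used in the table are straightforward; a small sketch with made-up delay values (not the published results):

```python
# RE and ARE as defined in the text: RE is the signed percentage deviation
# of the simulated delay from the target, ARE the mean absolute RE.

def relative_error(simulated, target):
    """RE = 100 * (simulated - target) / target, in percent."""
    return 100.0 * (simulated - target) / target

def average_abs_re(res):
    """ARE: mean of the absolute relative errors over all scenarios."""
    return sum(abs(r) for r in res) / len(res)

# Illustrative values against the 110 msec target for OD pair (1,4).
res = [relative_error(s, 110.0) for s in (107.8, 99.0)]
are = average_abs_re(res)
```

Negative RE values, as throughout Table 1.3, mean the simulated delay is below the target, i.e., the dimensioning overprovisions.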
NEXT GENERATION INTERNET
Table 1.3. End-to-end packet delay percentage RE for various traffic inputs and traffic intensities at the nodes from 0.30 to 0.85

Simulation d(1,4) [msec]   Simulation d(2,4) [msec]   CA model RE for d(1,4)   CA model RE for d(2,4)
107.75 ± 3.2e-5            92.51 ± 2.0e-5             -2.04                    -2.62
107.04 ± 2.9e-4            89.88 ± 1.3e-5             -2.69                    -5.38
106.05 ± 2.3e-5            87.31 ± 1.1e-5             -3.59                    -8.08
104.82 ± 4.6e-5            83.12 ± 2.4e-5             -4.70                   -12.50
105.80 ± 7.5e-5            88.82 ± 4.7e-5             -3.81                    -6.50
105.01 ± 3.3e-5            86.89 ± 4.2e-5             -4.53                    -8.53
104.04 ± 6.3e-5            82.38 ± 5.4e-5             -5.41                   -13.28
102.58 ± 2.8e-5            85.91 ± 1.9e-4             -6.74                    -9.56
101.99 ± 1.3e-4            82.27 ± 9.2e-5             -7.28                   -13.40
99.35 ± 4.5e-4             81.73 ± 5.3e-4             -9.68                   -13.96
ARE                                                    5.04                     9.37
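As a sanity check, the RE and ARE definitions above can be reproduced directly from the simulated mean delays of Table 1.3 (values transcribed from the table; small rounding differences against the printed ARE figures are expected, since the table prints rounded REs):

```python
# Simulated mean delays [msec] for OD pairs (1,4) and (2,4); targets
# are 110 msec and 95 msec, respectively (Table 1.3).
sim_14 = [107.75, 107.04, 106.05, 104.82, 105.80,
          105.01, 104.04, 102.58, 101.99, 99.35]
sim_24 = [92.51, 89.88, 87.31, 83.12, 88.82,
          86.89, 82.38, 85.91, 82.27, 81.73]

def rel_errors(sim, target):
    """RE = 100 * (simulated - target) / target, one value per scenario."""
    return [100.0 * (d - target) / target for d in sim]

def are(res):
    """ARE = average of |RE_i| over the ten scenarios."""
    return sum(abs(r) for r in res) / len(res)
```

All REs come out negative, confirming the overprovisioning observation: simulated delays never exceed the targets.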
Conclusions
With the need to consider QoS, the paradigm for IP network design and planning must change in order to take account of the new technologies and mechanisms that are being implemented to provide QoS. Specifically, it must include multiple delay constraints so that differentiated performance can be achieved for the various traffic classes. In this paper, we have examined the issues involved in IP network design with QoS and proposed a network design methodology. The technologies that provide QoS introduce new constraints and require that certain features be considered in the design methodology. The following generic features allow the design of IP QoS networks based on DiffServ/MPLS technologies: (1) various QoS classes, (2) differentiated routing based on class, and (3) queueing models that provide differentiated delays. We have incorporated these features into an efficient network design model which includes primary decisions for determining class-based bandwidth allocations on the links, the total link capacities, and how traffic of each class is to be routed through the network. Furthermore, the model incorporates procedures that allow the known burstiness of IP traffic to be effectively modelled. To be able to efficiently solve large problem instances, we have proposed a framework for the solution of the network design problem which employs a heuristic approach. In this approach, we decompose the problem into simpler, more manageable, optimisation problems: (1) the OPQRG problem, (2) the CA problem, and (3) the CM problem. These are in turn
solved by heuristic algorithms, for which we provide a brief overview and a reference pointer. Finally, computational results for each of the building blocks demonstrated that realistic-size problems can be solved very efficiently with the proposed network design model, and that the model performs well with respect to guaranteeing the end-to-end delay requirements for the traffic classes.
References

Atov, I., Tran, H.T., and Harris, R.J. (2003). Efficient QoS partition and routing in multiservice IP networks. In: Proceedings of IPCCC 2003, Phoenix, Arizona.
Atov, I., Tran, H.T., and Harris, R.J. (2004). OPQR-G: Algorithm for efficient QoS partition and routing in multiservice IP networks. Forthcoming in Journal of Computer Communications.
Atov, I. and Harris, R.J. (2003). LP-based algorithm for optimization in multiservice IP networks. WSEAS Transactions on Communications, 2:432-438.
Atov, I. and Harris, R.J. (2002). Characterization of class-based traffic flows in multiservice IP networks. In: Proceedings of ICCS 2002, Singapore.
Atov, I. and Harris, R.J. (2002). Dimensioning method for multiservice IP networks to satisfy delay QoS constraints. In: Proceedings of IFIP-TC6 Interworking 2002, Perth, Australia.
Floyd, S. and Jacobson, V. (1995). Link-sharing and resource management models for packet networks. IEEE/ACM Transactions on Networking, 3:365-386.
Fortz, B. and Thorup, M. (2000). Internet traffic engineering by optimizing OSPF weights. In: Proceedings of INFOCOM 2000, Tel-Aviv, Israel.
Gavish, B. and Neuman, I. (1989). A system for routing and capacity assignment in computer communication networks. IEEE Transactions on Communications, 37:360-366.
Gendron, B., Crainic, T.G., and Frangioni, A. (1999). Multicommodity capacitated network design. In: B. Sanso and P. Soriano (eds), Telecommunications Network Planning, pp. 1-19. Kluwer Academic Publishers, Norwell, MA.
Guerin, R., Ahmadi, H., and Naghshineh, M. (1991). Equivalent capacity and its application to bandwidth allocation in high-speed networks. IEEE Journal on Selected Areas in Communications, 9:968-981.
Hassin, R. (1992). Approximation schemes for the restricted shortest path problem. Mathematics of Operations Research, 17:36-42.
Holmberg, K. (2000). A Lagrangean heuristic based branch-and-bound approach for the capacitated network design problem. Operations Research, 48:461-481.
Ibaraki, T. and Katoh, N. (1988). Resource Allocation Problems: Algorithmic Approaches. MIT Press, Cambridge, MA.
Kleinrock, L. (1976). Queueing Systems, Vol. II: Computer Applications. Wiley, New York.
Lee, H., Song, H.-G., Chung, J.-B., and Chung, S.-J. (2000). Preplanned rerouting optimization and dynamic path rerouting for ATM restoration. Telecommunications Systems, 14:243-259.
Liang, W. and Ross, K.W. (1999). Loss models for broadband networks with nonlinear constraint functions. In: B. Sanso and P. Soriano (eds), Telecommunications Network Planning, pp. 121-134. Kluwer Academic Publishers, Norwell, MA.
Lorenz, D.H. and Orda, A. (2002). Optimal partition of QoS requirements on unicast paths and multicast trees. IEEE/ACM Transactions on Networking, 10:102-114.
Lorenz, D.H., Orda, A., Raz, D., and Shavitt, Y. (2000). Efficient QoS partition and routing of unicast and multicast. In: Proceedings of IWQoS 2000, Pittsburgh, USA.
Monma, C.L., Schrijver, A., Todd, M.J., and Wei, V.K. (1990). Convex resource allocation problems on directed acyclic graphs: Duality, complexity, special cases, and extensions. Mathematics of Operations Research, 15:736-748.
Puah, L.K. (1999). Capacity Dimensioning Methods for Multi-service Networks. Ph.D. Thesis, RMIT University, Melbourne.
Raz, D. and Shavitt, Y. (2000). Optimal partition of QoS requirements with discrete cost functions. In: Proceedings of INFOCOM 2000, Tel-Aviv, Israel.
Wang, Z. (2001). Internet QoS: Architectures and Mechanisms for Quality of Service. Morgan Kaufmann Publishers.
Whitt, W. (1983). The queueing network analyzer. Bell System Technical Journal, 62:2779-2815.
Chapter 2

DESIGN OF IP VIRTUAL PRIVATE NETWORKS UNDER END-TO-END QOS CONSTRAINTS

Emilio C.G. Wille
Marco Mellia
Emilio Leonardi
Marco Ajmone Marsan

Abstract
Traditional approaches to optimal design and planning of packet networks focus on the network-layer infrastructure. The next generation Internet will be faced with problems concerning end-to-end Quality of Service and Service Level Agreement guarantees. In this chapter, we propose a new packet network design and planning approach, for Virtual Private Networks, that is based on user-layer QoS parameters. Our proposed approach maps the end-user performance constraints into transport-layer performance constraints first, and then into network-layer performance constraints. The latter are then considered together with a realistic representation of traffic patterns at the network layer to design the IP network. Examples of application of the proposed design methodology to different networking configurations show the effectiveness of our approach.
1. Introduction
The pioneering works of Kleinrock (1976) spurred many research activities in the field of optimal design and planning of packet networks, and a vast literature is available on this subject. Almost invariably, however, packet network design focused on the network-layer infrastructure, so that the designer is faced with a trade-off between total cost and average performance (network-wide packet delay, packet loss ratio, link utilization, network reliability, etc.). This approach adopts the viewpoint of network operators, who quite naturally aim at the optimization of some aggregate performance measures that describe the general behavior of their network, averaging over all traffic relations. This may lead to situations where the average performance is good, but, while some traffic relations obtain very good QoS, some others suffer unacceptable performance levels. Today, with the enormous success of the Internet, packet networks have reached their maturity and they are used for very critical services. Accordingly, researchers as well as operators are concerned with end-to-end Quality of Service (e2e QoS) issues and Service Level Agreement (SLA) guarantees for IP networks. In this new context, average network-wide performance can no longer be taken as the sole metric for network design and planning, especially in the case of corporate virtual private networks (VPNs). From the end user's point of view, QoS is driven by end-to-end performance parameters, such as data throughput, web page latency, transaction reliability, etc. Matching the user-layer QoS requirements to the network-layer performance parameters is not a straightforward task. Indeed, the QoS perceived by end users in their access to Internet services is mainly driven by TCP, the reliable transport protocol of the Internet, whose congestion control algorithms dictate the latency of information transfer. It is well known that TCP accounts for a great amount of the total traffic volume in the Internet, and that among all TCP flows a vast majority are short-lived flows (also called mice), while the rest are long-lived flows (also called elephants); see for example Gribble and Brewer (1997), Claffy et al. (1998), Mellia et al. (2002). In this chapter, we propose for the first time (to the best of our knowledge) a packet network design and planning approach that is based on user-layer QoS parameters and explicitly accounts for the impact of the TCP protocol.¹
Our proposed approach maps the end-user performance constraints into transport-layer performance constraints first, and then into network-layer performance constraints. The mapping process is then considered together with a realistic representation of traffic patterns at the network layer to design the IP network. The representation of traffic patterns inside the Internet is a particularly delicate issue, since it is well known that IP packets do not arrive at router buffers following a Poisson process (see Paxson and Floyd, 1995); rather, a higher degree of correlation exists, which can be partly due to the TCP control mechanisms. This means that the usual approach of modeling packet networks as networks of M/M/1 queues, as discussed in Gavish and Neuman (1989), Kamimura and Nishino (1991), Cheng and Lin (1995), Gavish (1992), Mai Hoang and Zorn (2001), is not acceptable. In this chapter we adopt a refined IP traffic modeling technique, already presented in Garetto and Towsley (2003), that provides an accurate description of the traffic dynamics in multi-bottleneck IP networks loaded with TCP mice and elephants. The resulting analytical model is capable of producing accurate performance estimates for general topology IP networks loaded by realistic TCP traffic patterns, while still being analytically tractable. In summary, in this chapter we propose a new approach to the packet network design problem, which considers as constraints the e2e QoS perceived by users. Given (i) the network topology, (ii) the average traffic exchanged by all source/destination pairs (i.e., the traffic matrix), and (iii) a routing algorithm (e.g., shortest path), we solve the capacity assignment problem, minimizing the link capacity cost, subject to the e2e QoS constraints expressed by users, i.e., either the average data throughput or the file transfer latency, obtained by considering TCP as the transport protocol. In addition, our approach is also capable of solving either the droptail buffer dimensioning problem, or the AQM (Active Queue Management) parameter dimensioning problem in the case of AQM buffers (e.g., RED). While the buffer cost is usually considered to be negligible, it is important to have a procedure to dimension the correct buffer size, to limit the impact of queueing delay on the performance. Moreover, the availability of buffers in high-capacity routers is limited by the cost of high-speed static RAM. The rest of the chapter is organized as follows. Section 2 describes the general network design methodology.

¹ Fraleigh et al. (2003) account for user-layer QoS constraints, but focus mainly on voice traffic and do not consider the impact of TCP at the transport layer.
The e2e QoS mapping into transport- and network-layer performance constraints, and some translation examples, are described in Section 2.1. Section 3 provides the formulation of the general optimization problem, and lists the assumptions needed for the modeling phase. Afterwards, the Capacity Assignment (CA) and the Buffer Assignment (BA) problems are presented. Results obtained for both problems are tabulated and compared with results of ns-2 simulations in Section 4. Conclusions are given in Section 5.
2.
The IP network design methodology
Of course, in any realistic network problem, finding an "optimum design" is an extremely difficult task. The IP network design methodology that we propose in this chapter is based on a "Divide and Conquer" approach, in the sense that it consists of several subproblems. Thus, the
[Figure: flow diagram with function blocks "Application-layer QoS translator", "Transport-layer QoS translator", "Capacity Assignment", and "Buffer Assignment". Per S-D pair constraints (page latency, data throughput, file completion time, file throughput) enter the translators; inputs (physical topology, routing algorithm, traffic matrix, capacity cost) feed the optimization; Round Trip Time and loss probability link the translators to the optimization; outputs are the link capacities and the buffer sizes / AQM parameters.]

Figure 2.1. Schematic flow diagram of the network design methodology
subproblems are solved separately in a way to obtain a heuristic solution to the general problem. Figure 2.1 shows the flow diagram of the design methodology. Shaded, rounded boxes represent function blocks, while white parallelograms represent inputs/outputs of functions. There are three main blocks, which correspond to the classic blocks in constrained optimization problems: constraints (on the left), inputs (on the bottom right), and the optimization procedure (on the top right). As constraints we consider, for every source/destination pair, the specification of user-layer QoS parameters, e.g., download latency for web pages or perceived quality for real-time applications. Thanks to the definition of QoS translators, all the user-layer QoS constraints are then mapped into lower-layer performance constraints, down to the network layer, where performance metrics are typically expressed in terms of average delay and loss probability. The optimization procedure needs as inputs the description of the physical topology, the traffic matrix, the routing algorithm, and the expression of the cost as a function of the link capacities. The objective of the optimization is to find the minimum cost solution that satisfies the user-layer QoS constraints. The solution identifies link capacities and buffer sizes (or AQM parameters). In our methodology we decouple the CA problem from the BA problem. The optimization then starts with the CA subproblem, solved considering infinite buffers. A second optimization is then performed to solve the BA subproblem. Motivations for this choice are given in the
following sections, where we briefly comment on the main steps of the design methodology, and we provide a formal description of the optimization problem.
2.1
QoS translators
The process of translating QoS specifications between different layers of the protocol stack is called QoS translation or QoS mapping. Several parameters can be translated from layer to layer, for example: delay, jitter, throughput, or reliability (see Knoche and de Meer, 1997 and the references therein). According to the Internet protocol architecture, at least two QoS mapping procedures should be considered in our case: the first translates the application-layer QoS constraints into transport-layer QoS constraints, and the second translates transport-layer QoS constraints into network-layer QoS constraints, such as Round Trip Time (RTT) and Packet Loss Probability (P_loss). Matching the user-layer QoS requirements to the network-layer performance parameters is not a straightforward task. In this section we present some examples of QoS constraint translation and propose a new QoS translator tailored for the TCP protocol case.

2.1.1 Application-layer QoS translator. This module takes as inputs the application-layer QoS constraints, such as web page transfer latency, data throughput, audio quality, etc. Assuming that for each application we know which transport protocol is used, i.e., either TCP or UDP, this module maps the application-layer QoS constraints into transport-layer QoS constraints. Given the multitude of Internet applications, it is not possible to devise a generic procedure to solve this problem, and we do not focus on generic translators, since ad-hoc solutions should be used, depending on the application. For real-time applications over UDP, the output of the application-layer translator is given in terms of packet loss probability and maximum network e2e delay. For elastic applications exploiting TCP, the output of the application-layer translator is still a set of high-level constraints, expressed as file transfer latency (L_t) or throughput (T_h).

Example - Voice over UDP.
In this case, the application-layer QoS translator is in charge of translating the high-level QoS constraint, such as the Mean Opinion Score (MOS), into transport-layer performance constraints, expressed in terms of packet loss probability and maximum network e2e delay. Several studies have been conducted on this subject; see Markopoulou et al. (2002). For example, good vocal perceived quality
is associated with an average packet loss probability of the order of 1%, and a maximum e2e delay smaller than 200 ms.

Example - Web page download. In this case, the input of the application-layer QoS translator is a desired download time, expressed as a function of the page size, the protocol type, the number of objects in the page, etc. As output, the TCP latency constraint is evaluated. For example, given a desired web page download time smaller than 1.5 s for a web page which contains 20 objects, downloaded using at most 4 parallel TCP connections, each object must be transferred with a TCP connection of average duration smaller than 0.3 s.

2.1.2 Transport-layer QoS translator. The Transport-layer QoS translator maps transport-layer performance constraints into network-layer performance constraints; the translator in this case must be tailored to the transport protocol used: either UDP or TCP.

Real-time applications - UDP. The translation from transport-layer performance constraints into network-layer performance constraints in the case of real-time UDP applications is rather straightforward, since the transport-layer performance constraints are usually expressed in terms of packet loss probability and maximum e2e network delay, which can be directly used also as network-level performance parameters. Jitter and delay variation may also be considered. The only effect of UDP that must be taken into account is related to the protocol overhead, which increases the offered load to the network. This effect may be significant, especially for applications like voice, that use small packets.

Elastic traffic - TCP. The translation from transport-layer QoS constraints to network-layer QoS parameters, such as Round Trip Time (RTT) and packet loss probability (P_loss), in this case is more difficult. This is mainly due to the complexity of the TCP protocol, and in particular to the error, flow and congestion control algorithms.
The TCP QoS translator accepts as inputs either the maximum file transfer latency (L_t) or the minimum file transfer throughput (T_h). We impose that all flows shorter than a given threshold (i.e., TCP mice) meet the maximum file transfer latency constraint, while longer flows (i.e., TCP elephants) are subject to the throughput constraint. For example, from measurements of the file length distribution over the Internet, presented in Mellia et al. (2002), it is possible to say that 85% of all TCP flows are shorter than 20 segments. For these flows, we impose that the latency constraint must hold. Instead, for flows longer than 20 segments we impose that the throughput constraint must be met. Obviously, the most stringent constraint must be considered in the translation. The maximum RTT and P_loss that satisfy both constraints constitute the output of this translator. To solve the translation problem, we exploit recent research results in the field of TCP modeling (see Garetto and Towsley, 2003 and the references therein). Usually, TCP models take network-layer parameters as inputs, i.e., RTT and packet loss probability, and give as output either the average throughput or the file transfer latency. Our approach is based on the inversion of known TCP models, taking as input either the connection throughput or the file transfer latency, and obtaining as outputs RTT and P_loss. Among the many models of TCP presented in the literature, when considering file transfer latency, we use the TCP latency model described in Cardwell et al. (2000), which offers a good tradeoff between computational complexity and accuracy of performance predictions. We will refer to this model as CSA (from the last names of the authors). When considering throughput, we instead exploit the well-known PFTK formula, from Padhye et al. (2000). Our methodology can however be modified to incorporate more complex/accurate TCP models. The inversion of TCP models is not simple, since there are at least two parameters that impact TCP throughput and latency, i.e., RTT and P_loss. An infinite number of possible solutions for these two parameters satisfies a given constraint at the TCP level. We decided therefore to fix the P_loss parameter, and leave RTT as the free variable. This choice is due to the considerations that the loss probability has a larger impact on the latency of very short flows, and that it impacts the network load due to retransmissions. Furthermore, P_loss is also constrained by real-time applications.
Finally, fixing the value of the loss probability allows us to decouple the CA problem from the BA problem. Therefore, after choosing a value for Pi0Ss) a se^ °f curves can be derived, showing the behavior of RTT versus file latency and throughput. From these curves it is then possible to derive the maximum allowable RTT. The inversion of the CSA and PFTK formulas is obtained using numerical algorithms. For example, given a maximum file transfer latency and a minimum throughput T/i — 512 Kbps constraint, the curves of Figure 2.2 report the maximum admissible RTT which satisfies the most stringent constraint for different values of Pi0Ss>
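The numerical inversion for the throughput case can be sketched as a bisection on RTT with P_loss held fixed, using the PFTK steady-state throughput approximation. The segment size, RTO, and b values below are illustrative assumptions, not the chapter's settings:

```python
import math

def pftk_throughput(p, rtt, mss=1460, rto=1.0, b=2):
    """Approximate steady-state TCP throughput [bytes/s] from the PFTK
    formula (Padhye et al., 2000): loss probability p, round trip time
    rtt [s], segment size mss [bytes], retransmission timeout rto [s],
    b segments acknowledged per ACK (parameter values are assumptions)."""
    denom = (rtt * math.sqrt(2 * b * p / 3)
             + rto * min(1.0, 3 * math.sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
    return mss / denom

def max_rtt_for_throughput(target_bps, p, lo=1e-4, hi=10.0, iters=60):
    """Invert the PFTK formula by bisection: the largest RTT whose
    predicted throughput still meets the target, with P_loss fixed
    (throughput is monotonically decreasing in RTT)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if pftk_throughput(p, mid) >= target_bps:
            lo = mid          # constraint still met: RTT may grow further
        else:
            hi = mid
    return lo
```

For instance, with P_loss = 1% and a 512 Kbps (64000 bytes/s) target, the bisection returns the maximum admissible RTT for that loss value, one point of a curve like those in Figure 2.2.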
Figure 2.2. RTT constraints as given by the transport layer QoS translator
3.
Optimization formulation and solutions
Designing a packet network today may have quite different meanings, depending on the type of network that is being designed. If we consider the design of the physical topology of the network of a large Internet Service Provider (ISP), the design must very carefully account for the existing infrastructure, for the costs associated with the deployment of a new connection or the upgrade of an existing link, and for the very coarse granularity in the data rates of high-speed links. Instead, if we consider the design of a corporate VPN (Virtual Private Network), where the capacity is leased from a long distance carrier, the set of leased lines is not a critical legacy, costs are directly derived from the leasing fees, and the data rate granularity is much finer. While the general methodology for packet network design and planning that we describe here can be applied to both contexts, as well as others, in this chapter we concentrate on the design of corporate VPNs. The reason we single out VPNs is that a specific optimization technique must be applied to each type of problem. Several different formulations of the packet network design problem can be found in the literature; generally, they correspond to different choices of performance measures, design variables, and constraints. Here, we consider the following general problem:

Given: physical topology, routing algorithm, traffic estimates between node pairs, capacity and buffer costs;
Minimize: total capacity cost, total buffer cost;
With respect to: link capacities, buffer sizes;
Subject to: packet delay constraints, packet loss probability.

In the solution of the CA and BA problems, we need to evaluate the packet delay and loss probability to verify that the constraints are met. We thus first introduce the network model and discuss the relations between performance measures, input parameters, design variables, and constraints that appear in the general design problem. Then we define and solve the CA and BA problems.
3.1
Traffic model
The network model is an open network of queues, where each queue represents an output interface of an IP router, with its buffer. The routing of customers on this queueing network reflects the actual routing of packets within the IP network. In the description of the network model, we assume that all router buffers exhibit a droptail behavior. Traditionally, M/M/1/B queueing models were considered good representations of packet networks. However, given the well-known correlation of actual IP traffic, we choose to model the increased traffic burstiness induced by TCP using the arrival of packets in groups (batch arrivals), hence using M[X]/M/1/B queues. The batch size varies between 1 and W with distribution [X], where W is the maximum TCP window size expressed in segments. The distribution [X] is obtained considering the number of segments that TCP sources send in one RTT, as discussed in Garetto and Towsley (2003). Our choice of using batch arrivals following a Poisson process has the advantage of combining the nice characteristics of Poisson processes (analytical tractability in the first place) with the possibility of capturing the burstiness of the TCP traffic. The decision to model the router output interfaces with M[X]/M/1/B queues is the result of a careful and detailed study of a wide gamut of performance investigations of queue lengths in IP networks, conducted with the ns-2 simulator in Garetto and Towsley (2003). The Markovian assumption for the batch arrival process is mainly due to the Poisson assumption for the TCP connection generation process (when dealing with TCP mice), as well as the fairly large number of TCP connections simultaneously present in the network. The average packet loss probability, and the average time spent by packets in the router buffer, are obtained directly from the solution of the M[X]/M/1/B queue. Given the flow length distribution, a stochastic model of TCP
(described in Garetto and Towsley, 2003) is used to obtain the batch size distribution [X].
3.2
Delay analysis
The packet length is assumed to be exponentially distributed with mean 1/μ, the transmission time for each packet over a link is 1/(μC), and thus the utilization factor is given by ρ = λ/(μC) (C is the link capacity, and f = λ/μ is the average data flow on the link). The average packet delay in the M[X]/M/1/∞ queue is given in Chao et al. (1999):

    E[T] = K / (μC - λ)    (2.1)

where K = (m′ + m″)/(2m′), m′ and m″ being the first and second moments of the batch size distribution [X].
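Numerically, the batch factor K is easy to evaluate; the sketch below assumes the form K = (m′ + m″)/(2m′) given above, and reduces to the plain M/M/1 delay 1/(μC - λ) when all batches have size one:

```python
def batch_delay(lam, mu_c, m1, m2):
    """Average packet delay E[T] = K / (mu*C - lambda) in the
    M[X]/M/1/inf queue, with batch factor K = (m1 + m2) / (2*m1).
    lam is the total packet arrival rate, mu_c the service rate mu*C,
    m1 and m2 the first two moments of the batch size distribution [X]."""
    assert mu_c > lam, "stability requires mu*C > lambda"
    k = (m1 + m2) / (2.0 * m1)
    return k / (mu_c - lam)
```

For unit batches (m1 = m2 = 1) K = 1; burstier batch distributions increase K, and hence the delay, at the same utilization.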
3.3
Network model, traffic, and routing
In the mathematical model, the network infrastructure to be designed is represented by a directed graph G = (V, E), in which V is a set of nodes and E is a set of edges. A node represents a router, and an edge represents a physical link connecting one router to another. For each link l we consider: C_l, the capacity of the link; f_l, the average data flow; d_l, the physical length; and B_l, the buffer size. Different formulations of the CA problem result by selecting i) the cost functions f_l(C_l), ii) the routing model, and iii) the capacity constraints; different methodologies can be applied to solve them. In this chapter we focus on the VPN case, in which common assumptions are i) linear cost, i.e., f_l(C_l) = d_l C_l, ii) non-bifurcated routing, and iii) continuous capacities. For each source-destination pair, traffic is transmitted over exactly one directed path in the network. Each path p_sd from source s to destination d (that is an input to the problem) is determined by a minimum-cost algorithm. Considering that TCP is a closed-loop control protocol, we define the transport path (route) r_sd = p_sd ∪ p_ds. For each path r_sd and link l ∈ E, let δ_l(r_sd) ∈ {0,1} denote the indicator function which is one if link l is in path r_sd and zero otherwise. This allows the direct evaluation of the average data flow f_l on a link l as a function of the traffic requirements. The average (busy-hour) traffic requirements between nodes can be represented by a requirement matrix Γ = {γ_sd}, where γ_sd is the average packet transfer rate from source s to destination d. The Γ matrix can be derived from a higher-level description of the (maximum) traffic requests,
expressed in terms of "pages per second" or "flows per second" for a given source/destination pair. We consider as traffic offered to the network γ′_sd = γ_sd / (1 - P_loss(r_sd)), thus accounting for the retransmissions due to the losses that flows experience along their path to the destination. P_loss(r_sd) is the desired e2e loss probability for path r_sd.
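As a small worked example of the inflation (assuming the form γ′ = γ / (1 - P_loss) used here): a demand of 100 pkts/s with a target e2e loss of 1% yields an offered load of about 101 pkts/s.

```python
def offered_traffic(gamma_sd, p_loss_sd):
    """Inflate the average demand gamma_sd [pkts/s] to account for
    retransmissions of packets lost along path r_sd:
    gamma'_sd = gamma_sd / (1 - P_loss(r_sd))."""
    assert 0.0 <= p_loss_sd < 1.0
    return gamma_sd / (1.0 - p_loss_sd)
```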
3.4
The Capacity Assignment problem
As previously said, we solve the Capacity Assignment (CA) problem by considering infinite buffers. The only constraint that has to be met is therefore the e2e packet delay, which is evaluated thanks to the adoption of the M[X]/M/1/∞ model for links. Given the network topology, the traffic requirements, and the link flows in the general problem, it is possible to formulate the CA problem as follows.

Minimize:

    Σ_{l∈E} f_l(C_l)    (2.2)

Subject to:

    Σ_{l∈E} δ_l(r_sd) K / (μ(C_l - f_l)) ≤ RTT_sd - τ_sd,  ∀ s,d ∈ V    (2.3)

    f_l = (1/μ) Σ_{s,d∈V} δ_l(r_sd) γ′_sd,  ∀ l ∈ E    (2.4)

    C_l > f_l,  ∀ l ∈ E    (2.5)

The objective function (2.2) represents the total link cost, which is the sum of the cost functions of the links, f_l(C_l). Equation (2.3) is the packet delay constraint for each source/destination node pair, where RTT_sd is the desired Round Trip Time for (aggregated) TCP traffic from node s to node d, and τ_sd is the propagation delay for path r_sd. Equation (2.4) defines the average data flow on each link. Constraints (2.5) are non-negativity constraints. The only design variables are the link capacities C_l. We notice that the objective function and the constraint functions are (weakly) convex, therefore the CA problem is a convex optimization problem.
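For intuition about why this convex problem is tractable, the single-path special case with the linear costs f_l(C_l) = d_l C_l admits a Kleinrock-style closed-form "square-root" assignment. This is only an illustrative sketch, not the constraint-reduction/barrier procedure actually used in the chapter:

```python
import math

def ca_single_path(d, f, K, mu, delay_budget):
    """Minimize sum(d_l * C_l) over one path, subject to the delay budget
    sum_l K / (mu * (C_l - f_l)) <= delay_budget.  Lagrangian
    stationarity makes the excess capacity e_l = C_l - f_l proportional
    to 1/sqrt(d_l): cheap links get more spare capacity."""
    s = sum(math.sqrt(K * dl / mu) for dl in d)   # sum_j sqrt(K d_j / mu)
    return [fl + (s / delay_budget) * math.sqrt(K / (mu * dl))
            for dl, fl in zip(d, f)]
```

At the optimum the delay constraint holds with equality, and the resulting cost is lower than that of any other feasible allocation, e.g., giving every link the same excess capacity.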
3.5
The Buffer Assignment problem
Given the network topology, the traffic requirements and the link flows, by fixing the link capacities in the general problem, it is possible
to formulate the Buffer Assignment (BA) problem as follows.

Minimize:

    Σ_{l∈E} g_l(B_l)    (2.6)

Subject to:

    Σ_{l∈E} δ_l(r_sd) p(B_l, C_l, f_l, [X]) ≤ P_loss(r_sd),  ∀ s,d ∈ V    (2.7)

    B_l ≥ 0,  ∀ l ∈ E    (2.8)
The objective function (2.6) represents the total buffer cost, which is the sum of the cost functions of the buffers, g_l(B_l) = B_l. Equation (2.7) is the loss probability constraint for each source/destination node pair, where p(B_l, C_l, f_l, [X]) is the average loss probability for the M[X]/M/1/B queue, which is evaluated by solving its Continuous Time Markov Chain (CTMC). Constraints (2.8) are non-negativity constraints. In the previous formulation we have considered the following upper bound on the value of P_loss (constraint (2.7)):

    P_loss(r_sd) = 1 - Π_{l∈E} (1 - δ_l(r_sd) p(B_l, C_l, f_l, [X])) ≤ Σ_{l∈E} δ_l(r_sd) p(B_l, C_l, f_l, [X])    (2.9)
Notice also that the first part of equation (2.9) is based on the assumption that link losses are independent. Therefore, the solution of the BA problem is a conservative solution to the full problem. Notice also that, to evaluate the packet dropping probability, we explicitly consider the bidirectional transport path r_sd, taking into account the fact that the performance of TCP is affected by data segments lost on the forward path p_sd, and by ACKs lost on the reverse path p_ds. While the second event has less impact on TCP performance, it is not negligible for short file transfers. The proof that the BA problem is a convex optimization problem is not a straightforward task. The difficulty in this proof derives from the need of showing that p(B, C, f, [X]) is convex. Since, to the best of our knowledge, no closed-form expression for the M[X]/M/1/B stationary distribution is known, no closed-form expression for p(B, C, f, [X]) can be derived. However, we conjecture that the BA problem is a convex optimization problem by considering that: (i) for an M/M/1/B queue, p(B, C, f) is a convex function (see Nagarajan and Towsley, 1992); and (ii) approximating p(B, C, f, [X]) \approx \sum_{k \ge B} \pi_k, where π_k is the stationary
2 Design of IP VPNs under End-to-end QoS Constraints
47
distribution of an M[X]/M/1/∞ queue, the loss probability is a convex function of B. We can thus classify both the CA and BA problems as multivariable constrained convex minimization problems; therefore, the global minimum (for each subproblem) can be found using convex programming techniques. We solve the minimization problems by first applying a constraint reduction procedure which reduces the set of constraints by eliminating redundancies. Then the solution of the CA and BA problems is obtained using the logarithmic barrier method, see Wright (1992).
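As an illustration of the barrier approach, the sketch below applies a logarithmic barrier with plain gradient descent to the toy single-path CA instance introduced earlier (M/M/1 links, linear cost, delay budget D). It is a minimal pedagogical version, not the production solver, and all numerical values are illustrative:

```python
import math

def barrier_solve(flows, D, t_schedule=(1e0, 1e1, 1e2, 1e3, 1e4, 1e5)):
    """Logarithmic-barrier sketch for the toy CA instance
        min sum(x_l)  s.t.  sum(1/x_l) <= D,  x_l > 0,
    where x_l = C_l - f_l are the excess capacities of M/M/1 links.
    Barrier objective:
        phi_t(x) = sum(x) - (1/t) * (log(D - sum(1/x)) + sum(log(x)))
    """
    n = len(flows)
    x = [3.0 * n / D] * n                       # strictly feasible start

    def phi(x, t):
        slack = D - sum(1.0 / xi for xi in x)
        if slack <= 0.0 or min(x) <= 0.0:
            return float("inf")                 # outside the barrier domain
        return sum(x) - (math.log(slack) + sum(math.log(xi) for xi in x)) / t

    for t in t_schedule:                        # follow the central path
        for _ in range(2000):
            slack = D - sum(1.0 / xi for xi in x)
            grad = [1.0 - (1.0 / (xi * xi * slack) + 1.0 / xi) / t for xi in x]
            f0, step = phi(x, t), 1.0
            while step > 1e-12:                 # backtracking line search
                cand = [xi - step * gi for xi, gi in zip(x, grad)]
                if phi(cand, t) < f0:
                    x = cand
                    break
                step *= 0.5
    return [f + xi for f, xi in zip(flows, x)]

caps = barrier_solve([16.0, 12.0, 8.0], D=0.5)
# approaches the closed-form optimum C_l = f_l + n/D = f_l + 6
```

Increasing t tightens the barrier around the constraint boundary; the iterates stay strictly feasible throughout, which is the property exploited by interior methods such as Wright (1992).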
3.6
Setting the AQM parameters
The output of the BA problem is the buffer size B_l for each router interface, assuming droptail behavior. If more advanced AQM schemes are deployed by network providers to enhance TCP performance, it is possible to derive guidelines for the configuration of the AQM parameters as well. In this chapter, we consider Random Early Detection (RED), see Floyd and Jacobson (1993), as an example, and discuss how to set its parameters. The basic RED algorithm has three static parameters, min_th, max_th, and max_p, and one state variable, avg. When the average queue length avg exceeds min_th, an incoming packet is dropped with a probability that is a linear function of the average queue length. In particular, the packet dropping probability increases linearly from 0 to max_p as avg increases from min_th to max_th. When avg exceeds max_th, all incoming packets are dropped. Ideally, the buffer size should be sufficiently large to avoid packets being dropped at the queue due to buffer overflow. Therefore, we choose B_l = α · max_th_l, α > 1, e.g., α = 2 as suggested in the "gentle" variation of RED (Rosolen et al., 1999).² The RED parameter dimensioning problem can then be solved by imposing that:

    p(B_l, C_l, f_l, [X]) = \frac{E[N_l] - min\_th_l}{max\_th_l - min\_th_l} \, max\_p_l        (2.10)
Note that (2.10) fixes max_p_l by imposing that the average RED dropping probability, evaluated at the average queue length E[N_l] (obtained considering the M[X]/M/1/B queue), satisfies the P_loss(r_sd) constraint

² Gentle-RED is a modification of RED that allows a smoother transition of the dropping probabilities when the average queue length exceeds the maximum threshold, making it more robust to the setting of the parameters.
in (2.7). Finally, we set min_th_l = β · max_th_l, β < 1. In the numerical examples that follow, we selected α = 2 and β = 1/16.
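To make the dimensioning rule concrete, the sketch below numerically solves the CTMC of an M[X]/M/1/B queue with geometrically distributed batch sizes (an assumption made here for illustration; the chapter does not fix the batch distribution at this point), obtains E[N] and the loss probability, and then derives the RED parameters with α = 2 and β = 1/16 as above. All arrival/service values are illustrative, not taken from the chapter:

```python
def mx_m1b_stationary(lam, mu, q, B):
    """Stationary distribution of an M[X]/M/1/B queue with geometric
    batch sizes P(X=k) = q*(1-q)**(k-1) and partial batch acceptance.
    Solves pi*Q = 0, sum(pi) = 1 for the CTMC generator Q directly."""
    n = B + 1
    Q = [[0.0] * n for _ in range(n)]
    for s in range(n):
        if s >= 1:
            Q[s][s - 1] += mu                        # service completion
        if s < B:
            for j in range(1, B - s):                # batch fully accepted
                Q[s][s + j] += lam * q * (1 - q) ** (j - 1)
            Q[s][B] += lam * (1 - q) ** (B - s - 1)  # batch overflows to B
        Q[s][s] = -sum(Q[s][c] for c in range(n) if c != s)
    # Linear system: A = Q^T with one (redundant) balance equation
    # replaced by the normalization constraint sum(pi) = 1.
    A = [[Q[r][c] for r in range(n)] for c in range(n)]
    b = [0.0] * n
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    for col in range(n):                             # Gauss-Jordan, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        pv = A[col][col]
        A[col] = [a / pv for a in A[col]]
        b[col] /= pv
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [ar - f * ac for ar, ac in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return b

# Illustrative values: mean batch size 1/q = 4, load rho = lam/(mu*q) = 0.64,
# buffer B = 80 packets (close in spirit to the bottleneck example below).
lam, mu, q, B = 0.16, 1.0, 0.25, 80
pi = mx_m1b_stationary(lam, mu, q, B)
EN = sum(i * p for i, p in enumerate(pi))            # average queue length
# Packet loss = E[(X - free)^+]/E[X]; for a geometric batch this tail
# expectation collapses to (1-q)**free by memorylessness.
ploss = sum(p * (1 - q) ** (B - i) for i, p in enumerate(pi))

# RED parameters as in the text: B = alpha*max_th, min_th = beta*max_th,
# and max_p from (2.10) evaluated at the average queue length E[N].
alpha, beta, ploss_target = 2.0, 1.0 / 16.0, 0.01
max_th = B / alpha
min_th = beta * max_th
max_p = ploss_target * (max_th - min_th) / (EN - min_th)
```

The same stationary vector also yields E[T] via Little's law, so one CTMC solve supplies every queue-level quantity used by the design procedure.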
4.
Numerical examples and simulations
In this section we present some selected numerical results, showing the accuracy of the IP network designs produced by our methodology. We first applied our optimization method to design some network topologies, and next used a simulation procedure to evaluate whether the QoS constraints were actually respected. The tool used for simulation is ns version 2. For all simulations, the "batch means" technique, with 30 batches, was used. We assume that New Reno is the TCP version of interest. In addition, we assume that TCP connections are established by choosing a server-client pair at random, and are opened at instants described by a Poisson process. Connection opening rates are determined so as to set the link flows f_l to their desired values. The packet size is assumed constant, equal to the maximum segment size (MSS); the maximum window size is assumed to be 32 segments. The amount of data to be transferred by each connection (i.e., the file size) is expressed in number of segments. We consider a mixed traffic scenario where the file size follows the distribution shown in Figure 2.3, which is derived from one-week-long measurements, conducted in Mellia et al. (2002), in three different time periods. In particular, we report the discretized CDF, obtained by splitting the flow distribution into 15 groups with the same number of flows per group, from the shortest to the longest flow, and then computing the average flow length in each group. The large plot reports the discretized CDF using bytes as unit, while the inset reports the same distribution taking today's most common MSS of 1460 bytes as unit. We use the most recent measurements in the following simulations.
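The calibration of the connection opening rate mentioned above amounts to dividing the target flow by the mean data carried per connection; a minimal sketch (the 16 Mbps flow and the 30-segment mean file size are illustrative numbers, not the measured distribution of Figure 2.3):

```python
def connection_opening_rate(flow_bps, mean_file_segments, mss_bytes=1460):
    """Poisson connection opening rate [connections/s] that produces the
    desired average link flow, given the mean file size in MSS segments."""
    bits_per_connection = mean_file_segments * mss_bytes * 8
    return flow_bps / bits_per_connection

rate = connection_opening_rate(16e6, mean_file_segments=30)
# roughly 46 connection openings per second for a 16 Mbps flow
```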
4.1
Single-bottleneck topology
We start by considering a very simple single bottleneck topology. We assume one-way TCP New Reno connections with uncongested backward path. The topology comprises the bottleneck link, and a number of peripheral links, whose capacities are equal to 25 Mbps and whose propagation delays are uniformly distributed between 0.01 and 0.03ms. Table 2.1 reports the capacity and buffer size of the bottleneck link obtained with our method. In order to obtain some comparisons, we also implemented a design procedure using the classical formula, see Kleinrock (1976), which considers an M/M/l queue model in the CA problem. We also extended the classical approach to the BA problem,
49
2 Design of IP VPNs under End-to-end QoS Constraints
Figure 2.3. TCP connection length cumulative distributions
which is solved considering M/M/1/B queues. We choose as target parameters the following: latency Lt < 0.3 s for flows shorter than 20 segments, throughput Th > 512 Kbps for flows longer than 20 segments, and P_loss = 0.01. Using the transport layer QoS translator, we obtain the equivalent constraint RTT < 0.03 s (for the sake of simplicity, in the examples we will consider RTT_sd = RTT, ∀ s,d ∈ V), which corresponds to the most stringent latency constraint (Figure 2.2). We imposed these same constraints also in the classical approach. Looking at the CA solution, we observe that our methodology requires a much higher data rate than the classical approach, as shown by the average link utilization ρ_l = 0.64. Also when considering the buffer size design, we observe that the adoption of the M[X]/M/1/B model leads to larger buffer requirements than the simpler M/M/1/B model. Table 2.2 shows the average packet delay E[T], the average queue size E[N], and the packet loss probability P_loss, from the M[X]/M/1/B queue model and from ns-2 simulations (considering droptail and RED buffers). We can observe good agreement between model and simulation results. Notice also that the assumption of exponential packet lengths does not affect the performance evaluation: recalling that in the simulation all data packets have a fixed length of 1460 bytes, no significant differences are noticed. This is an indication that the packet length distribution is not a critical factor. Table 2.3 reports file transfer latencies for different flow sizes (in number of segments), as estimated by the CSA model (second column) and
Table 2.1. Design results for bottleneck network

                M[X]/M/1/B   M/M/1/B
  f_l [Mbps]        16           16
  C_l [Mbps]        25           17
  ρ_l               0.64         0.93
  B_l [pkts]        79           28

Table 2.2. Model and simulation results for bottleneck network (network layer)

            M[X]/M/1/B model   droptail sim.   RED sim.
  E[T]           0.0095           0.010         0.0091
  E[N]           13.2             13.4          12.4
  P_loss         0.0098           0.0044        0.0039
as observed by simulations. Results are shown considering both our approach and the classical methodology, and by considering either droptail or RED buffers. We can observe that the accuracy of the network design obtained with our methodology is extremely good, with flow latencies always meeting the QoS constraints. Note also that longer flows obtain a much higher throughput than the target, because the flow transfer latency constraint is more stringent (as also shown in Figure 2.2). On the contrary, the network design obtained with the classical formula fails to meet the QoS constraints. This is mainly due to the adoption of an M/M/l queue model, which fails to capture the high burstiness of IP traffic. No major differences are visible when RED buffers are present in the network, if our methodology is adopted, while a degradation of performance is observed if the classical approach is used. This is due to the very small buffer sizes resulting from the classical design, which do not allow RED to work properly, and therefore cause a large packet dropping probability.
4.2
Multi-bottleneck topologies
As a second example, we present results obtained considering the multi-bottleneck mesh network shown in Figure 2.4, with 5 nodes and 12 links. In this case, link propagation delays are all equal to 0.5 ms, which corresponds to a link length of 150 km. Figure 2.4 shows link identifiers, link weights (in parentheses), and the traffic requirements matrix Γ. Link weights are chosen in order to have one single path (by using a minimum cost routing algorithm) for every source/destination pair. A
Table 2.3. Model and simulation results for bottleneck network (Lt)

                 M[X]/M/1/B design        M/M/1/B design
  seg.    CSA     droptail    RED        droptail    RED
   1     0.05s     0.08s     0.05s        1.84s     1.56s
   2     0.08s     0.09s     0.08s        2.12s     2.31s
   4     0.12s     0.12s     0.11s        2.44s     3.71s
   6     0.16s     0.13s     0.13s        2.56s     5.26s
  10     0.20s     0.15s     0.16s        2.84s     6.59s
  19     0.26s     0.18s     0.19s        3.16s    10.41s
Figure 2.4. 5-node network: topology and traffic requirements

  Traffic matrix Γ [Mbps]:

  O/D    1   2   3   4   5
   1     0   7   9   8   3
   2     9   0   3   9   2
   3     4   1   0   8   7
   4     8   1   6   0   2
   5     3   8   4   9   0
number of peripheral links (not shown in the picture) are attached to each node. These links are not congested, their capacities being equal to 30 Mbps, and their propagation delays are uniformly distributed between 0.01 and 0.03 ms. We considered the same QoS target constraints for all source/destination pairs, namely: (i) file latency Lt < 0.5 s for TCP flows shorter than 20 segments, and (ii) throughput Th > 512 Kbps for TCP flows longer than 20 segments. Selecting P_loss = 0.01, we obtain as design constraint RTT < 0.07 s, as can be seen in Figure 2.2. The CA and BA problems associated with this network have 12 unknown variables and 11 constraint functions (we have discarded 9 redundant constraint functions). Table 2.4 reports the link capacities, link utilizations, and buffer sizes obtained with the proposed method. We also report in the same table the average packet delay E[T], average queue size E[N], and the average packet loss probability P_loss, computed using the M[X]/M/1/B queue model.
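As a rough consistency check on the figures in Table 2.4, the model delays can be summed along the most congested path used later in the transport-layer test (traffic from node 4 to node 1 over links 8, 7, 6): the forward-path queueing-plus-transmission delay stays below the RTT < 0.07 s budget. This sketch neglects propagation (0.5 ms per link) and the reverse path, so it is only indicative, not the chapter's exact constraint:

```python
# E[T] per link from Table 2.4 (M[X]/M/1/B model, 5-node network) [s]
ET = {6: 0.021, 7: 0.021, 8: 0.022}
path_delay = sum(ET[link] for link in (8, 7, 6))
# 0.064 s, below the 0.07 s design budget
```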
Table 2.4. Design results for 5-node network (M[X]/M/1/B model)

  Link    C [Mbps]    ρ      B     E[T]    E[N]   P_loss
   1        18.9     0.85   196   0.033   47.8    0.006
   2        23.9     0.88   261   0.035   65.2    0.004
   3        26.9     0.89   265   0.034   73.2    0.006
   4        11.9     0.75   137   0.031   25.3    0.004
   5         4.4     0.67    88   0.061   16.1    0.010
   6        25.4     0.82   184   0.021   40.2    0.004
   7        18.4     0.76   160   0.021   26.8    0.002
   8        23.4     0.81   188   0.022   37.0    0.003
   9        23.9     0.88   243   0.034   63.3    0.005
  10        13.9     0.79   154   0.032   31.1    0.005
  11         9.4     0.85   163   0.062   43.9    0.010
  12         3.4     0.58    72   0.061   10.8    0.009
  tot./avg. 204.16   0.794  175   0.037   40.0
It can be noticed that the link utilization factors are in the range [0.67, 0.89], with an average of about ρ̄ = 0.8. Buffer sizes are in the range [70, 270], with average B̄ = 175, which is about 4 times the average number of packets in the queue (E[N] = 40). This is due to the bursty arrival process of IP traffic, which is well captured by the M[X]/M/1/B model. To complete the evaluation of the new methodology, we compare the link utilization factors and buffer sizes obtained when considering the classical algorithm, i.e., by using an M/M/1/B queueing model. Figure 2.5 shows the link utilizations (first plot) and buffer sizes (second plot) obtained with our method and with the classical approach. It can be immediately noticed that considering the burstiness of IP traffic radically changes the network design. Indeed, the link utilizations obtained with our methodology are much smaller than those produced by the classical approach, and buffers are much longer.

Figure 2.5. Link utilization factor and buffer size for a 5-node network

To evaluate the quality of the design results, we ran ns-2 simulations for droptail and RED buffers. Considering first the network layer QoS parameters (E[T] and P_loss), Table 2.5 reports ns-2 simulation results for the average packet delay and the average packet loss probability on every link (considering droptail and RED buffers). It can be noticed that the resulting delay and loss probability are very close to the desired ones; in fact, simulated E[T] has few results larger than the targets, while simulated P_loss is smaller than the target (considering that the simulation margins of error for E[T] and P_loss, for each link, are about ±13.5% and ±20%, respectively). In order to verify the e2e QoS constraints at the transport layer, we report detailed results for traffic from node 4 to node 1, which is routed over one of the most congested paths (three hops, over links 8, 7, 6). Figure 2.6 plots the file transfer latency for all flow sizes for the selected source/destination pair (95% confidence intervals are shown). The QoS constraint of 0.5 s for the maximum latency is also reported.
Figure 2.6. Model and simulation results for latency; 3-link path from the 5-node network (CSA model vs. droptail and RED simulations; latency [s] versus flow length [pkt])
In this case we can see that model results and simulation estimates are in agreement with the specifications, the constraints being fully satisfied for all flows shorter than 20 segments. Note also that longer flows obtain a much higher throughput than the target, because the flow transfer latency constraint is more stringent (as also shown in Figure 2.2). It is important to observe that the test of the QoS perceived by end users in a network dimensioned using the classical approach cannot be performed, since the simulations fail even to run: the dropping probability experienced by TCP flows is so high that retransmissions cause the offered load to become larger than 1 on some links, i.e., the network designed with the classical approach is not capable of supporting the offered load and therefore cannot satisfy the QoS constraints. As a second example of a multi-bottleneck topology, we chose a network of 10 nodes and 24 links. For all (90) source/destination pairs, traffic is routed over a single path. Link propagation delays are uniformly distributed between 0.05 and 0.5 ms, i.e., link lengths vary between 15 km and 150 km. The traffic requirement matrix is set to obtain an average link flow of about 15 Mbps. The CA and BA problems associated with this network have 24 unknown variables and 66 constraint functions (we have discarded 24 redundant constraint functions). We considered the same design target parameters as for the previous example. In order to observe the impact
Table 2.5. Simulation results for 5-node network (network layer)

               droptail                      RED
  Link   E[T]    E[N]   P_loss     E[T]    E[N]   P_loss
   1    0.029    40.3   0.002     0.027    38.3   0.004
   2    0.015    26.8   0.001     0.019    35.9   0.002
   3    0.019    40.3   0.001     0.029    61.6   0.003
   4    0.033    25.5   0.004     0.033    25.6   0.004
   5    0.079    20.4   0.005     0.086    22.3   0.008
   6    0.018    33.4   0.003     0.021    37.8   0.005
   7    0.020    24.4   0.004     0.026    31.2   0.002
   8    0.023    38.8   0.005     0.033    54.8   0.004
   9    0.024    44.5   0.002     0.020    36.6   0.002
  10    0.031    29.2   0.003     0.031    29.3   0.003
  11    0.055    38.2   0.002     0.052    36.3   0.005
  12    0.096    16.5   0.010     0.090    15.4   0.010
of traffic load and performance constraints on our design methodology, we consider different numerical experiments. Figure 2.7 shows the range of network link utilizations versus traffic load (first plot). Looking at how traffic requirements impact the CA problem, we observe that the larger the traffic load, the higher the utilization factor. This is intuitively explained by a higher statistical multiplexing gain, and by the fact that the RTT is less affected by the transmission delay of packets at higher speeds. The behavior of buffer sizes versus traffic requirements is shown in the second plot. As expected, the larger the traffic load, the more queue space (buffer size) is needed. The impact of more stringent QoS requirements is considered in Figure 2.8 (P_loss = 0.01, link traffic load = 15 Mbps). Notice that, in order to satisfy a very tight constraint (file latency Lt < 0.2 s), a utilization factor close to 20% is necessary on some particularly congested links (first plot). Tight constraints mean small packet delays and thus capacities much larger than the link flows. On the contrary, relaxing the QoS constraints, we note a general increase in the link utilization, up to 90%. The behavior of buffer sizes versus file transfer latency requirements is shown in the second plot. Finally, Figure 2.9 shows link utilizations and buffer sizes considering different packet loss probability constraints, while keeping fixed the file transfer latency Lt < 2s and throughput Th > 512 Kbps (link traffic load
Figure 2.7. Link utilization factor and buffer length for a 10-node network (considering different source/destination traffics; average, min, and max curves versus source/destination average traffic [Mbps])
= 15 Mbps). Obviously, an increase of the P_loss values forces the transport layer QoS translator to reduce the RTT to meet the QoS constraints. As a consequence, the utilization factor decreases (first plot). More interesting is the effect of selecting different values of P_loss on buffer sizes (second plot). Indeed, to obtain P_loss < 0.005, buffer sizes longer than 350 packets are required, while P_loss < 0.02 can be guaranteed with buffers shorter than 70 packets. This result stems from the correlation of TCP traffic and is not captured by a Poisson model.
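The size of that gap can be illustrated with the closed-form blocking probability of the Poisson-arrival M/M/1/B queue, p = (1 − ρ)ρ^B / (1 − ρ^(B+1)): at an illustrative utilization of ρ = 0.8 (the actual per-link loads in the design differ), a buffer of only about 17 packets would already appear to meet P_loss ≤ 0.005, versus the 350+ packets the batch-arrival model requires:

```python
def mm1b_loss(rho, B):
    """Blocking probability of an M/M/1/B queue (Poisson arrivals)."""
    return (1.0 - rho) * rho ** B / (1.0 - rho ** (B + 1))

def min_buffer(rho, ploss_target):
    """Smallest buffer B meeting the loss target under the M/M/1/B model."""
    B = 1
    while mm1b_loss(rho, B) > ploss_target:
        B += 1
    return B

B_poisson = min_buffer(0.8, 0.005)
# about 17 packets: an order of magnitude below the batch-model buffer sizes
```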
Figure 2.8. Link utilization factor and buffer length for a 10-node network (considering different target file transfer latencies)
Running simulations in ns-2 with more than about 1000 TCP connections becomes very expensive with standard computers, so in this case we performed path simulations rather than simulating the complete network, i.e., we selected a single path, and simulated only that one, using ns-2. TCP connection opening rates are determined so as to set the link flows to their determined values. The results obtained by path simulations are in general a worst case with respect to what would be obtained by running simulations of the entire network, because the "interfering"
Figure 2.9. Link utilization factor and buffer length for a 10-node network (considering different target packet loss probabilities)
traffic is more aggressive, since it does not cross all links along its path, and hence experiences no loss and no traffic shaping. As an example, by choosing a 4-link path, we obtained the average packet delay E[T] and average packet loss probability P_loss results reported in Table 2.6, considering three scaled versions of the traffic matrix (and network designs). It can be observed that, for all the traffic scenarios, simulated E[T] is in accordance with the targets, and simulated P_loss is smaller than the targets (considering that the simu-
Table 2.6. Simulation results for a 4-link path from the 10-node network (RED)

             3Γ        Γ       Γ/3
  E[T]      0.070    0.071    0.069
  P_loss    0.0052   0.0060   0.0079
Figure 2.10. Latency simulation for a 4-link path from the 10-node network (RED)
lation margins of error for E[T] and P_loss are about ±15% and ±20%, respectively). File transfer latency results for the chosen 4-link path are reported in Figure 2.10 versus the file size, for the scaled versions of the traffic matrix (in the case of RED buffers). It can be noted that the target QoS constraints are met in all cases (95% confidence intervals are shown).
4.3
Computational complexity
Finally, we briefly discuss the computation times needed to solve the CA problem. The solver algorithm was implemented in the C language, and the computation was carried out on a 1 GHz processor under the Linux OS. We considered networks with different numbers of nodes (from 10 to 100) and different numbers of ingoing/outgoing links per node (from 3 to 9). For each number-of-nodes/number-of-links pair, we obtain problems
with different numbers of variables and constraints. CPU times range from less than 1 second (to solve a 10-node/30-link network design problem) to about 40 minutes (to solve a 100-node/900-link network design problem).
5.
Conclusion
In this chapter, we have proposed a new packet network design and planning approach that is based on user-layer QoS parameters. The main novelty of our approach is that it considers the end-to-end performance constraints at the application layer, mapping them first into transport layer QoS constraints, and finally into network layer performance constraints. Traditional packet network design approaches model a communication network as a Jackson queueing network, thus assuming packet flows to be Poisson. A second important improvement with respect to traditional approaches lies in the fact that we have considered more realistic packet traffic models, accounting for both long-lived and short-lived TCP connections, and considering more complex systems of queues which have recently been proved to effectively represent the performance of modern IP networks (Garetto and Towsley, 2003). Examples of application of the proposed design methodology to different networking configurations have shown the effectiveness of our approach.

Acknowledgments The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.
References
Cardwell, N., Savage, S., and Anderson, T. (2000). Modeling TCP latency. In: Proceedings of Infocom 2000, pp. 1742-1751, Tel Aviv, Israel.
Chao, X., Miyazawa, M., and Pinedo, M. (1999). Queueing Networks, Customers, Signals and Product Form Solutions. John Wiley.
Cheng, K.T. and Lin, F.Y.S. (1995). Minmax end-to-end delay routing and capacity assignment for virtual circuit networks. In: Proceedings of IEEE Globecom 1995, pp. 2134-2138.
Claffy, K., Miller, G., and Thompson, K. (1998). The nature of the beast: Recent traffic measurements from an Internet backbone. In: Proceedings of INET'98, Geneva, CH.
Floyd, S. and Jacobson, V. (1993). Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1(4):397-413.
Fraleigh, C., Tobagi, F., and Diot, C. (2003). Provisioning IP backbone networks to support latency sensitive traffic. In: Proceedings of IEEE Infocom 2003, pp. 375-385, San Francisco, CA.
Garetto, M. and Towsley, D. (2003). Modeling, simulation and measurements of queuing delay under long-tail Internet traffic. In: Proceedings of ACM SIGMETRICS
2003, pp. 47-57, San Diego, CA.
Gavish, B. and Neuman, I. (1989). A system for routing and capacity assignment in computer communication networks. IEEE Transactions on Communications, 37(4):360-366.
Gavish, B. (1992). Topological design of computer communication networks - the overall design problem. European Journal of Operational Research, 58:149-172.
Gribble, S.D. and Brewer, E.A. (1997). System design issues for Internet middleware services: Deductions from a large client trace. In: USITS'97.
Kamimura, K. and Nishino, H. (1991). An efficient method for determining economical configurations of elementary packet-switched networks. IEEE Transactions on Communications, 39(2):278-288.
Kleinrock, L. (1976). Queueing Systems, Volume II: Computer Applications. Wiley Interscience, New York.
Knoche, H. and de Meer, H. (1997). Quantitative QoS mapping: A unifying approach. In: Proceedings of the 5th Int. Workshop on Quality of Service (IWQoS97), pp. 347-358, New York, NY.
Mai Hoang, T.T. and Zorn, W. (2001). Genetic algorithms for capacity planning of IP-based networks. In: Proceedings of the 2001 Congress on Evolutionary Computation CEC2001, pp. 1309-1315.
Markopoulou, A., Tobagi, F., and Karam, M. (2002). Assessment of VoIP quality over Internet backbones. In: Proceedings of IEEE Infocom 2002, pp. 747-760, New York, NY.
Mellia, M., Carpani, A., and Lo Cigno, R. (2002). Measuring IP and TCP behavior on edge nodes. In: Proceedings of IEEE Globecom 2002, pp. 2533-2537, Taipei, TW.
Nagarajan, R. and Towsley, D. (1992). A note on the convexity of the probability of a full buffer in the M/M/1/K queue. CMPSCI Technical Report TR 92-85.
Padhye, J., Firoiu, V., Towsley, D., and Kurose, J. (2000). Modeling TCP Reno performance: A simple model and its empirical validation. IEEE/ACM Transactions on Networking, 8(2):133-145.
Paxson, V. and Floyd, S. (1995). Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226-244.
Rosolen, V., Bonaventure, O., and Leduc, G. (1999). A RED discard strategy for ATM networks and its performance evaluation with TCP/IP traffic. ACM Computer Communication Review, 29(3):23-43.
Wright, M. (1992). Interior methods for constrained optimization. Acta Numerica, 1:341-407.
Chapter 3
DESIGN OF PROTECTED WORKING CAPACITY ENVELOPES BASED ON P-CYCLES: AN ALTERNATIVE FRAMEWORK FOR SURVIVABLE AUTOMATED LIGHTPATH PROVISIONING
Gangxiang Shen
Wayne D. Grover
Abstract
1.
A recently proposed concept for dynamic provisioning of survivable lightpath services, called the protected working capacity envelope (PWCE), is considered from a network capacity design and blocking standpoint. The PWCE concept offers several attractive properties for a next-generation Internet based on "IP over optical" transport, in which transport level connections can be independently and rapidly requested and released by service layer nodes. The main advantages of PWCE are simplification of the state databases and signaling involved for protection considerations in such dynamically operating transport networks. This chapter reviews the concept and related background. It then develops and tests models for the design of PWCEs involving the optimized partitioning of installed capacity into a working envelope, for simple and rapid service provisioning, and a separate reserve network which is configured into a set of p-cycles that provides 100% restorability to the envelope.
Introduction
IP over Wavelength Division Multiplexing (WDM) is the most promising architecture to transmit IP services for the next generation Internet (Rajagopalan, 2003). In this two-layer architecture, an underlying optical transport network provides lightpath connectivity between routers and other source-sinks of aggregated IP traffic flows as needed. Service-
layer nodes and applications function as clients of the transport network, requesting and releasing high capacity paths to various destinations and specifying certain features such as protection requirements for such paths. Survivability of the transport layer is crucial. The simple cut of a thumb-sized fiber-optic cable can disrupt millions of web applications, phone calls, banking services, flight bookings, and so on. At the same time new networked devices and applications such as sensor networks, grid computing, mobile telephony, wireless PDAs, peer-to-peer file-sharing, videoconference, gaming, virtual reality applications contribute to a more dynamic and unpredictable environment of time- and space-varying demand for transport connectivity and capacity. In this chapter we describe a general strategy to support arbitrarily quickly varying patterns of demand on the transport network, including protection arrangements, and we focus on the related capacity design theory and resulting performance in terms of blocking. The basic approach involves the logical partitioning of total physical capacity into working and protection channels under the recently proposed "Protected Working Capacity Envelope" (PWCE) concept (Grover, 2004a, 2004b). A major advantage is that the scheme greatly simplifies the problem of arranging protection for any statistically stationary pattern of random demand. Compared to the only other scheme currently proposed for handling dynamic protected demands (to be detailed), major advantages ensue because the arrangement of protection becomes invisible to end-user applications or devices. Major issues of database integrity, state dissemination, and scalability with current methods are improved in this new paradigm. 
Instead of requiring explicit network-wide configuration changes for protection arranged on the time scale of individual connections themselves, under PWCE changes of any type are required only on the time scale on which the statistics of the random demand evolve as a non-stationary process. The problem of efficiently managing and configuring a transport network to both route and protect services in the face of traffic uncertainty in time and space is a challenging and important current problem (Gerstel and Ramaswami, 2000; Zang and Mukherjee, 2001). In recent years virtually all research on the problem has been framed within what is called the Shared Backup Path Protection (SBPP) approach to dynamic survivable service routing (Kini et al., 2000; Kawamura et al., 1994). In the SBPP approach, responsibility for database management, signaling and routing control for path establishment, and network coordination for protection is put in the end-users' hands. SBPP is mainly driven by the Internet Engineering Task Force (IETF) and conforms to the Internet tradition of "end-to-end" control. But the extensive dependency on
3 Design of Protected Working Capacity Envelopes global state information and related issues about scalability of the traditional Internet paradigm are increasingly questioned. The U.S. National Science Foundation has even said that finding radically new paradigms for resilient Internetworking is one of the "grand challenges" (Ammar et al., 2003). Others have coined the phrase "disappearance of telecommunications" (Saracco et al., 2000), which really means that the complexity of accessing service needs to become invisible to the end-users or applications. The PWCE approach falls in line with this ethic. The end-to-end view that users see is simplified to requesting a connection and specifying its protection class.
1.1
Objective and scope
The PWCE concept is an alternative for dynamic survivable service provisioning. The concept was initially developed in Chapter 5 of Grover (2004a) and then proposed in summary form to a wider audience in Grover (2004b). Subsequent work has studied the implementation, signaling and database requirements, and the prospect of adaptive PWCE (Shen and Grover, 2003; Shen and Grover, 2004). The role of this chapter is to recap and introduce the issues and ideas behind the PWCE concept and to provide treatment of some of the Operations Research (O.R.)-related design theory associated with PWCE design. Thus, parts of the chapter are a synthesis drawn from the few other works already published on the topic, and other parts, mainly the PWCE design formulations, are new to this chapter. While the PWCE concept applies with any form of span-protection mechanism, we consider p-cycles as the specific protection mechanism of interest. We look into various strategies for maximizing the "volume" and "shaping" (both terms to be defined) of the operational envelope available for dynamic provisioning of random demand. The optimized designs are then tested by simulation to compare the blocking performance of the PWCE method to the SBPP method. The present work is also limited in scope to considering so-called "static" PWCE designs, intended to operate in the presence of stationary statistical load patterns. With the methods here, a succession of static envelope configurations can be computed as the demand pattern evolves or, as we are pursuing in ongoing research, it may be possible for the envelope configuration to be continually adapted in a self-organizing way.
1.2 Outline
A challenge in writing about the PWCE approach based on p-cycles is that not only are both PWCE and p-cycles recent developments that
will have to be recapped, but we also need to give sufficient background about the "status quo" approach to dynamic survivable routing (the SBPP approach) to explain the issues that we are trying to improve upon. Some background on protection schemes in general also seems advisable in passing. We will also be drawing from standard methods of stochastic process simulation to produce comparative test case results, and we will need to comment on the role of those methods as well. We will lay this groundwork piece by piece before reaching the new content, which is the PWCE design models. To do this, Section 2 is devoted to recapping the p-cycle concept and Section 3 reviews the SBPP approach to dynamic survivable service provisioning. These sections could be skipped by a reader already familiar with recent work in optical networking. Section 4 then describes the PWCE concept and the apparent advantages it provides over SBPP, which motivate this work. Section 5 is where the specific contribution of the chapter lies, giving various formulations for the design of PWCEs using p-cycles as the protection mechanism. Section 6 is then concerned with explaining the routing algorithms and stochastic methods used to simulate dynamic survivable provisioning in test cases where PWCE and SBPP blocking performance are compared. Section 7 presents and discusses the results. Section 8 provides some concluding comments.
2. The p-cycle concept
p-Cycles, introduced in Grover and Stamatelakis (1998), are a novel approach to network protection. p-Cycles are in some ways like Bidirectional Line-Switched Rings (BLSR), but with support for the protection of straddling span failures as well as the usual protection of spans on the ring itself. An important property is that p-cycles are fully preplanned and pre-connected, so that when a failure happens, only the two end nodes of the span do any real-time switching. Unlike generalized span or path restoration, no switching actions are required at any intermediate nodes. This property is otherwise found only in Automatic Protection Switching (APS) systems or self-healing rings. So p-cycles have an inherent speed advantage over other mesh restoration schemes in general. Less obviously, however, it has been found that by admitting the protection of so-called "straddling" span failures, 100% restorable networks can be designed with essentially the same capacity-efficiency as a span-restorable mesh network, which can be 3 to 6 times more capacity-efficient than ring-based networks. Figure 3.1 illustrates the basic operation of p-cycles. In Figure 3.1 (a) a single p-cycle of one channel of
protection capacity is shown on a small network. All working channels on a span are protected either in an "on-cycle" manner as in (b) or as "straddling" spans as in (c). For the failure of an on-cycle span, the p-cycle will offer one protection path, as in a BLSR, whereas for the failure of a straddling span, two protection paths are available per unit of p-cycle capacity. This apparently small technical difference enables a jump from ring-like redundancy (over 100 percent) to mesh-like redundancy (well under 100 percent). But we still retain the structural simplicity and speed of a ring, because only two nodes do any real-time switching, and they are fully preplanned for such actions. An additional source of capacity efficiency compared to rings is that when straddling spans are admitted, demands can take shortest-path routes over the graph, as opposed to ring-constrained routing. Another important property of p-cycles is that the cycles are fully preconfigured. As mentioned, this brings p-cycle restoration speed to essentially that of BLSR rings. In fact, where propagation distances factor into the BLSR speed, p-cycles will beat BLSR restoration times for straddling span failures, because the protection path length averages only half the cycle circumference, not its full circumference as in rings. This is an important restoration speed advantage over the SBPP scheme to follow, where the protection routes are pre-planned but all the switches at the intermediate nodes of these routes need to seize and cross-connect spare capacity in real time upon a failure.
Figure 3.1. Basic concept of p-cycles: (b) protecting an "on-cycle" failure; (c) protecting a "straddling span" failure
p-Cycles are also of special interest with respect to survivability in the next generation Internet based on IP over optical transport. As fully pre-connected protection structures, they have an inherent predictability of transmission performance in an optical network. Schemes that have to cross-connect optical channels in real time upon failure cannot hope to match this kind of "first-time, every-time" certainty about the performance of a dynamically assembled optical path without extensive pre-failure monitoring of optical power levels, dispersion, cross-talk, noise, and polarization impairments, coupled with transmission quality prediction tools. For this chapter, it will serve as background to review the basic model used for conventional p-cycle network design. The parameters and variables are as follows:

S is the set of spans of the network; |S| is the number of spans.

P is the set of all cycles of the network. These are the eligible cycles that may be chosen as p-cycles. The number of p-cycles in a design is normally much smaller than the number of eligible cycles.

x_{i,j} is two if span i is a straddler on cycle j, one if span i is an on-cycle span of cycle j, zero otherwise.

p_{j,k} is one if cycle j uses span k, zero otherwise.

w_k is the number of working channels on span k. In a basic spare-capacity assignment model (such as here) these are input parameters, typically arising from shortest-path routing of the demand matrix over the graph prior to determining the placement of spare capacity and p-cycles. In other so-called "joint" optimization models the w_k are variables arising from the internal decisions of how working demands should be routed. See Chapter 10 of Grover (2004a) or Grover and Doucette (2002) for details of joint optimization models.

s_k is the number of spare channels on span k. (Later, in PWCE models, s_k can become a parameter as well.)
(integer variable)

n_j is the number of unit-capacity (i.e., single-channel) copies of cycle j chosen to form a p-cycle in the design. (integer variable)

Objective:

minimize \sum_{k \in S} s_k

Constraints:

\sum_{j \in P} x_{i,j} \, n_j \ge w_i, \quad \forall i \in S \qquad (3.1)

s_k \ge \sum_{j \in P} p_{j,k} \, n_j, \quad \forall k \in S \qquad (3.2)
The objective is to minimize the total required spare capacity. Constraint (3.1) asserts the restorability condition on the design by ensuring that for every span failure scenario, the lost working capacity of the span is covered by the set of p-cycles that have either an on-cycle or a straddling protection relationship to that span. Constraint (3.2) generates spare capacity on each span sufficient to form the set of p-cycles needed to satisfy constraints (3.1) on each span. In the context of an optical network where capacity is managed at the lightpath level, all variables are integer, representing the exact number of lightwave channels of spare capacity required on each span and, for each p-cycle, its capacity in terms of the number of lightwave channels from which it is formed. The capacity efficiency of p-cycle designs obtained from this simple model can be remarkably attractive; in the special case of semi-homogeneous networks it can actually reach the long-recognized lower bound on redundancy of 1/(d − 1), where d is the network average nodal degree. In networks of typical nodal degree, this implies redundancies as low as ~40% (Sack and Grover, 2004). Heretofore, redundancies as low as 40% or so for 100% span survivability were only ever approachable through end-to-end path restoration technologies or even global reconfiguration solutions, which are far more complex to design and operate and very much slower-acting than p-cycles. p-Cycles also allow great flexibility and adaptability to growth and changing patterns of demand when cross-connects are used to create and manage p-cycles. Existing p-cycles can be recalculated and installed to cover new spans or new traffic patterns as they occur. The pre-failure configuration of p-cycles itself is a non-real-time activity occurring in the spare capacity only. The literature on p-cycles is already fairly extensive, so we need not say too much more.
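As a concreteness check, the model above can be solved by brute force on a toy instance. The following sketch assumes a hypothetical four-node "square with one diagonal" network (not one of the chapter's examples); the coefficients x_{i,j} and p_{j,k} are derived directly from the cycle's node sequence.

```python
from itertools import product

def cycle_spans(cycle):
    """On-cycle spans p_{j,k}=1 of a cycle given as a node sequence."""
    return {frozenset((cycle[i], cycle[(i + 1) % len(cycle)]))
            for i in range(len(cycle))}

def x_coeff(span, cycle):
    """Protection relationship x_{i,j}: 2 if span i straddles cycle j,
    1 if it lies on the cycle, 0 otherwise."""
    if span in cycle_spans(cycle):
        return 1
    u, v = tuple(span)
    if u in cycle and v in cycle:
        return 2  # straddling span: both endpoints on cycle, span not on it
    return 0

# Toy instance: a square with one diagonal, one working channel per span.
spans = [frozenset(s) for s in [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]]
w = {s: 1 for s in spans}                      # working channels w_i
cycles = [(1, 2, 3, 4), (1, 2, 3), (1, 3, 4)]  # eligible cycles P

def spare_of(n):
    """Spare channels s_k implied by p-cycle copies n_j (constraint 3.2)."""
    return {k: sum(nj for cyc, nj in zip(cycles, n)
                   if k in cycle_spans(cyc)) for k in spans}

best = None
for n in product(range(3), repeat=len(cycles)):  # brute force, small instance
    # restorability (constraint 3.1): every span's working capacity covered
    if all(sum(x_coeff(i, cyc) * nj for cyc, nj in zip(cycles, n)) >= w[i]
           for i in spans):
        total = sum(spare_of(n).values())
        if best is None or total < best[0]:
            best = (total, n)

print(best)   # one copy of the full square protects everything: (4, (1, 0, 0))
```

The diagonal (1, 3) is a straddler of the four-node cycle, so a single copy of that cycle covers all five spans with only four spare channels; covering the network with the two triangles instead would cost six.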
For a comprehensive introduction to p-cycles, see Chapter 10 in Grover (2004a). An earlier overview is in Grover and Stamatelakis (2000). In addition, a basic theory of p-cycles as pre-configured structures in the spare capacity of a mesh network has been developed (Stamatelakis, 1997; Stamatelakis and Grover, 2000a), and there have been studies on self-organization of the p-cycle sets (Stamatelakis and Grover, 1998), application of p-cycles to the MPLS/IP layer (Stamatelakis and Grover, 2000b), application to DWDM networking (Schupke et
al., 2002), and studies on joint optimization of working paths and spare capacity (Grover and Doucette, 2002). With reference to the current project, it is important to note, however, that all literature on p-cycle network design has so far assumed a "static" demand matrix (i.e., a forecast of the exact number of essentially permanent lightpath requirements for each node pair). It is sometimes thought as a result that published design methods do not apply in a dynamic demand environment. Here it will become apparent that by combining the PWCE provisioning method with some extensions to existing p-cycle design models, we obtain a class of networks that provide for low-blocking dynamic survivable service provisioning.
3. SBPP for dynamic survivable service provisioning
The simplest (and still surprisingly widely used) method for survivable optical networking is 100% duplication of the transmitted signal on two physically disjoint routes over the physical layer graph (the facilities network). This is known as Automatic Protection Switching with Diverse Protection (1+1 APS/DP). The obvious issue with this is its wastefulness of optical transmission resources. The backup or protection path is entirely dedicated to one working path, and over 100% redundancy (in terms of the capacity-distance product) is always implied, relative to a single path over the shortest route. Figure 3.2 illustrates two 1+1 APS setups for node pairs (0, 12) and (1, 11). On each of the protection paths, a dedicated copy of spare capacity is assigned, which therefore requires span (4-8) to reserve two spare channels. The Shared Backup Path Protection (SBPP) method can be easily appreciated in light of this example. SBPP can be thought of as arranging a set of 1+1 APS diverse protection paths but keeping the spare links on the backup paths unconnected and shared with other 1+1 setups, with connection occurring as needed upon failure. The aim is to increase capacity efficiency by allowing sharing of protection channels between "primary" working paths that do not have any single-failure modalities in common. SBPP is thus logically like disjoint path-pair provisioning for 1+1 APS except that we exploit opportunities for sharing of spare channels on backup routes. In Figure 3.2, the backup paths (0-4-8-12) and (1-4-8-11) may share a spare channel on the span (4-8) because their corresponding primary paths have no common-cause failure scenarios (and so would not ever both have a simultaneous need to use the shared backup channel). Technically, such primary paths are said to have no Shared Risk Link Groups (SRLG) in common (Rajagopalan et al.,
Figure 3.2. Illustration of 1+1 APS and SBPP provisioning methods (working or "primary" lightpaths and protection or "backup" lightpaths)
2003). As a result, one channel of spare capacity is saved on span (4-8) relative to 1+1 APS for the same two primary paths. In conjunction with Generalized Multi-Protocol Label Switching (GMPLS) signaling protocols (Mannie et al., 2003), SBPP lends itself to survivable dynamic routing in a way that has been very widely studied and adopted as the current paradigm for dealing with dynamic random demand for lightpath services. When extrapolated to realistic-size transport networks, with demands exchanged at varying levels between all node pairs, opportunities such as the simple one illustrated in Figure 3.2 are much more numerous, and overall efficiencies rivaling those of adaptive path restoration are approached. However, the overall complexity of the network state that has to be tracked by each node participating in dynamic provisioning operations becomes the issue. SBPP has the desirable property of being a shared spare capacity scheme, so it is efficient. It is also an end-to-end scheme, so it is amenable to end-user (or application) control for establishment of both primary and backup paths. This is seen as desirable by some, while in other contexts there seems no reason to burden the end-user or application with the task of establishing protection. In the latter view, protection can be just a service attribute guaranteed by the carrier. Another of SBPP's advantages arises in optical networks where fault sectionalization is thought to be slow or difficult. In that case, operation is simplified because the switchover is always only to the same pre-planned backup path, and activation is controlled independently by the receiving end node. No matter where the failure occurs, when end nodes see an alarm, they just trigger the switchover. Under SBPP one usually tries to route the working path over the shortest or least-cost path over the graph. This is called the "primary" path. Usually, but not always, there will be one or more possible disjoint backup routes between the same end nodes of the primary path. To be eligible as a backup route, a route must have no nodes or spans in common with the route of the primary path, and no spans or nodes in common with any other primary path whose backup route has any spans in common with the route being considered. Together these considerations ensure that when a primary path fails (under any single failure scenario):

(a) No span or node along its backup route is simultaneously affected. This means it will be possible to assemble a backup path along that route if sufficient spare channels have been pre-planned. This self-disjointness requirement of the backup route is the fairly obvious condition for survivability.

(b) No other primary path that is affected by the same failure has a pre-planned backup path that assumes the use of the same spare channel(s) on any span of the first primary's backup route. This is the failure-disjointness requirement and is needed to enable the sharing of spare channels over different backup routes. This more complicated set of considerations makes sure that if primaries A and B are both pre-planned to use a certain spare channel on span X in their backup routes, then there is no (single) failure where primaries A and B would ever both need that spare channel at the same time.

The preferred choice for the backup route (assuming there are several possibilities) is the one that requires the fewest new channels of spare capacity to be placed (or committed from the available capacity). In other words, one tries to choose a backup route on which a backup path can be formed for primary A that, to the greatest extent possible, assumes the use only of spare channels already associated with other primaries that have no "shared risk" in common with primary A.
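The two eligibility conditions can be sketched as simple set checks. In the following minimal illustration the primary routes are hypothetical (Figure 3.2 shows only the backup routes), and paths are represented as node lists:

```python
def spans_of(path):
    """Set of spans (undirected edges) along a route given as a node list."""
    return {frozenset(e) for e in zip(path, path[1:])}

def self_disjoint(primary, backup):
    """Condition (a): the backup route shares no span and no intermediate
    node with its own primary (the end nodes are necessarily common)."""
    if spans_of(primary) & spans_of(backup):
        return False
    return not (set(primary[1:-1]) & set(backup[1:-1]))

def can_share(primary_a, primary_b):
    """Condition (b): two primaries may share a spare channel on their
    backup routes only if no single span failure affects them both."""
    return not (spans_of(primary_a) & spans_of(primary_b))

# Hypothetical primary routes for node pairs (0, 12) and (1, 11); the backup
# routes (0-4-8-12) and (1-4-8-11) overlap on span (4-8) as in Figure 3.2.
primary_a, backup_a = [0, 5, 6, 12], [0, 4, 8, 12]
primary_b, backup_b = [1, 9, 10, 11], [1, 4, 8, 11]

print(self_disjoint(primary_a, backup_a))  # backup route eligible
print(can_share(primary_a, primary_b))     # may share the channel on (4-8)
```

A full SBPP provisioning engine would additionally track how many sharing relationships already exist on each spare channel; this sketch only captures the disjointness logic itself.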
Depending on network details, it has been found that if at most three to five such sharing relationships are allowed to be established by diverse primaries on each spare channel, then the theoretical minimum total investment in spare capacity can be approached. To coordinate all these conditions in provisioning a pair of primary and shared backup paths for a new service, SBPP provisioning assumes the maintenance of synchronized global databases of network state at each node, keeping track of all capacity usage and the sharing relationships between primary paths already in place. This data is essential, in a global and current form, at every node that is to originate the dynamic establishment of protected service paths. One concern about this pertains to the scalability of SBPP-based dynamic service provisioning to large or rapidly dynamic networks. In essence, SBPP proposes to operate entirely on traditional Internet-style dissemination of global
state for routing decisions. Capacity and shareability-state changes must (ideally) be disseminated network-wide and known immediately in all nodes for every connection setup and takedown in the network. If not, further provisioning must be blocked until global state coherence is confirmed. The state data that every node must have is not only global in extent but is updated on the same time scale as the connection changes. If millions of transactions must be handled day after day with essentially zero errors, any scheme that has to disseminate global state changes for every connection change would seem inherently limited relative to a scheme where the state information depended upon for provisioning does not change at all, or changes only on a time scale much longer than that of individual connections. Thus, the SBPP paradigm seems to be, by its nature, vulnerable to lags or errors in global database integrity—the same phenomena that already cause "brown-outs," slowdowns, route-flapping, and other pathological behaviors in the existing Internet. The opposing view is that with mature routing and signaling protocols it is not that difficult for each node to obtain the full information on the spare capacity sharing (Li et al., 2002). Assuming such "full information" for SBPP routing decisions, Li et al. (2002) showed that one can achieve much better capacity efficiency than SBPP provisioning with partial information as in Kodialam and Lakshman (2000). What is significant in this regard is that in the PWCE approach that follows, we do not require full information in the sense of Li et al. (2002), but we still achieve efficiency (measured by blocking performance in the presence of the same total capacity) that is essentially as good as, and often better than, SBPP with "full information routing."
Other literature on SBPP is mainly concerned with routing heuristics to try to achieve low blocking (Ou et al., 2003; Yang et al., 2003). More details about how SBPP operates follow in Section 6 where we detail the routing methods for SBPP and PWCE simulation trials.
4. The PWCE concept and advantages
The concept of PWCE starts with considering a conventional survivable network design for static traffic demands. Given a demand forecast, a survivability scheme is applied where the restorability is guaranteed for any one failure at a time. In networks based on span protection this results in a division of capacity into working and spare channels. The working capacity serves the demands in the forecasted matrix and the spare capacity provides protection for the working capacity. This design process was reviewed for p-cycles in Section 2.
At first glance, such design methods seem limited to problems with a static demand matrix; in other words, they are intended to produce optimal designs only for cases where an exactly known set of point-to-point lightpath requirements is given. The PWCE concept involves adaptation of these static design models to create not an exact solution for one single demand matrix, but an envelope of protected working channels, well suited to a large family of random demand instances that may be somehow related to a single representative demand pattern. To begin with, we realize that even when a model such as that in Section 2 is solved for a specific demand matrix, the result is a set of working channels on each span that can actually support many different demand patterns, and a distribution of spare capacity that protects them all, as long as none exceeds the (w_i) quantities of the initial design problem. Figure 3.3 illustrates one of the many possible partitionings of total capacity into working and spare channel sets. What all such partitionings have in common is that the working channel count (w_i) on each span is fully restorable within the corresponding graph of spare channel capacities <s_i> on all other spans. Few arbitrary partitionings of capacity on each span will satisfy this property, but the theory for such fully-restorable partitionings is easily based on existing knowledge about span-restorable mesh network design (Herzberg et al., 1995). Even while satisfying the restorability property, there are a very large number of different partitionings, i.e., protected working envelopes, that are feasible under an initial distribution of total capacity. In the top of Figure 3.3, a set of spare channels on each span defines a reserve network of spare capacities <s_i>.
Under span restoration, any distribution of spare channels provides for a certain corresponding number of protected working channels (w_i) on each span below. In effect, (w_i) is the answer to the question: "If span i fails, then by rerouting through the spare capacity of the surviving graph between the end nodes of i, what is the maximum number of replacement path segments that we can create?" A capable restoration algorithm will achieve (w_i) equal to the capacity of the minimum cut between the end nodes of the failed span through the reserve network <s_i>. Once the partitioning is defined, any number or combination of working paths can be routed through the envelope, up to the point where all (w_i) channels are used on some span, without any attention to protection arrangements, because the channels used for provisioning in the working layer are themselves protected by the reserve network (and some embedded restoration or protection mechanism). As long as the quantities (w_i) on the spans support routing of the demand, it is inherently protected end-to-end with no further action. Once a connection is served, local marking on each
Figure 3.3. Partitioning of total installed capacity into working and spare to define one possible protected working capacity envelope: a "reserve" network of spare channels above and the protected working capacity envelope below, within the total deployed capacity (Shen and Grover, 2004)
span indicates to subsequent path setup processes that the individual channel is no longer available. No other nodes need an accounting of individual channel states as they do in SBPP, where sharing relationships are defined between individual paths and individual backup channels. Under PWCE, other nodes need only know that the span continues to have one or more provisionable channels available. This default case requires no signaling for state update dissemination. Moreover, it can be appreciated that as long as span occupancies remain under (w_i) (i.e., within the envelope), then path setups and tear-downs can be going on arbitrarily frequently and it makes no difference—there is no signaling needed to arrange protection per-path and no global state update dissemination whatever. The only signaling is for the source-routed establishment and tear-down of each path by its originator node, and it does not involve any nodes other than those on the paths themselves. Only if the pattern of random dynamic demand evolves—in a way such that a span approaches the envelope—is any updated network state dissemination required. A single Link State Advertisement (LSA) then either withdraws the highly utilized span from further routing or issues an updated cost for OSPF-type routing over that span. Nodes operating under PWCE need only participate in a simple Open Shortest Path First (OSPF)-type of topology orientation (not OSPF-TE)¹ to support distributed end-node provisioning via a constrained source-routing protocol (such as RSVP-TE or CR-LDP). At this level of transport, the basic topology is almost never changing, so this is an almost one-time learning of the basic graph topology. Full-blown OSPF-TE dissemination of detailed changes in actual capacity and shareability state on each span is not needed, because every edge of the graph will remain available for routing as long as its current in-use channel count is below the maximum number of working channels (w_i) that can be protected on that span. Nodes can be told via a centralized NMS what the dimensions of the current PWCE are (i.e., the (w_i) value on each of their incident spans). This is a database of one number per span to be maintained within each node and does not need global dissemination. Alternatively, the information may be infrequently discovered by each node by running a mock restoration trial using a distributed restoration algorithm executed in the background within the spare capacity (Grover, 1997).

For now, let us consider a network in which the spare capacity of each span is pre-assigned and fixed. This defines a "static" PWCE. An important property is that actions of any type related to ensuring protection occur only on the time-scale of the statistical evolution of the network load pattern itself, not on the time-scale of individual connections. Thus, any need for network management actions or state change dissemination is far less dynamic than the traffic itself. It takes a shift in the statistics of the demand pattern to require a logical change in the working envelope. Importantly, such adjustment actions also occur on a time scale where traffic exhibits correlated observable trends that can be taken into account in capacity-configuration planning. Variations in total demand and in the pattern of demand have strong day-over-day correlations, which would allow the advance planning of several envelope configurations within the installed total capacities, each of which is known to suit the characteristic time of day so as to minimize blocking.

¹ OSPF-TE is the extension of the basic OSPF topology-orientation protocol to include details of capacity and other data on each link.
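The min-cut characterization of (w_i) described above can be sketched as follows. In this illustration the reserve network around a failed span (A, B) is a hypothetical toy (node names and spare capacities are invented), and an Edmonds-Karp max-flow computation stands in for a "capable restoration algorithm":

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow; cap maps u -> {v: capacity} and lists every
    edge in both directions (an undirected reserve network)."""
    flow = {u: dict.fromkeys(cap[u], 0) for u in cap}
    total = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:          # BFS for an augmenting path
            u = q.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:                   # no augmenting path left
            return total
        path, v = [], t
        while parent[v] is not None:          # recover the path s -> t
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:                     # augment along the path
            flow[u][v] += aug
            flow[v][u] -= aug
        total += aug

# Spare channels <s_i> on the spans surviving the failure of span (A, B).
spares = {('A', 'C'): 2, ('C', 'B'): 1, ('A', 'D'): 1,
          ('D', 'B'): 2, ('C', 'D'): 1}
cap = {}
for (u, v), s_i in spares.items():
    cap.setdefault(u, {})[v] = s_i
    cap.setdefault(v, {})[u] = s_i

w_AB = max_flow(cap, 'A', 'B')   # protectable working channels on span (A, B)
print(w_AB)                      # 3
```

By max-flow/min-cut duality the result equals the minimum-cut capacity between the end nodes of the failed span, which is why at most w_AB working channels on (A, B) can be guaranteed restorable through this reserve network.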
In contrast, SBPP works at the call-by-call time-scale, where individual departures and arrivals always appear essentially random and routing is individually controlled by end-users. This is an environment of inherently incremental local reaction to the next arrival, with no opportunity for optimizing capacity use or routing strategies at the global network level taking all demand into account at once. The protected envelope requirement is, however, very slowly changing or static over long periods of time, even in the most frenetically dynamic network. No matter how rapidly individual lightpath demands come and go at random, the envelope requirement will not change at all if the demand process is at statistical equilibrium. The envelope is only sensitive to non-stationary drift in the underlying pattern of random arrival/departure processes. As indicated, PWCE is based on a locally-acting span restoration or protection mechanism. Examples of mechanisms that protect bearer capacity (i.e., channels, not paths) are span restoration (Grover, 1997), pre-planned span protection (Herzberg et al., 1995), cycle covers (Ellinas et al., 2000), generalized loopback networks (Medard et al., 2002), and p-cycles (Grover, 2004a; Grover and Stamatelakis, 2000). BLSR rings also inherently protect bearer capacity and, under a suitable transformation (that represents the constraints on routing between available rings), they are also amenable to the PWCE strategy. If network-level protection against node failure is required (in addition to span failure protection), the corresponding mechanisms can be node-inclusive span restoration (Doucette and Grover, 2003), node-encircling p-cycles (Grover and Stamatelakis, 2000), or a centralized multi-commodity maximum-flow rerouting solution for the transiting flows. All these options can be pre-planned for a very fast localized response against single span failures, complemented by a slower, but highly effective, adaptive response to multiple failures including node failures (Clouqueur and Grover, 2002).
4.1 PWCE concept in the context of p-cycles
A PWCE protected by p-cycles is similar to a span-restorable mesh network. The difference is that the spare capacity is pre-connected into a set of p-cycles which have pre-defined protection relationships with the individual channels of working capacity. Figure 3.4 illustrates a PWCE designed based on a conventional p-cycle survivable network design. A six-node network and a demand matrix are shown on the left side, and the working and protection capacities based on the ILP model in Section 2 for the p-cycle network are shown on the right. The protection network corresponds to the reserve network in Figure 3.3, the working network corresponds to a PWCE, and between them a p-cycle layer is inserted, which consists of four cycles: (1-2-4-3-1), (1-2-4-5-6-3-1), (1-2-5-6-4-3-1), and (1-2-5-6-3-1). All these p-cycles are pre-planned and pre-configured using the spare capacity of the protection network to directly offer protection for the working network.
5. Design of protected working capacity envelopes using p-cycles
The way in which existing theory for "static demand" design problems relates to networks with dynamic provisioning can be fairly simple: we need only extend such methods to address the question of designing the best PWCE for operational use. The best PWCE will involve two general notions: For a given total capacity constraint, we want the operating envelope to be (in a sense to be defined) as large as possible. This we call volume maximization. We will also want the PWCE to be (in another sense to be defined), structured or shaped reasonably well to support the characteristic pattern of demand intensity, even though instantaneous
Figure 3.4. An example of PWCE for the p-cycle network based on the conventional survivable network design (panels: network topology with node indices, demand matrix, working capacity, span-based p-cycles, and protection capacity)
demands will be random. We refer to this notion as structuring or shaping.
5.1 Volume maximization of a PWCE
In a p-cycle network, we can maximize the volume of PWCE channels by fully exploiting the protection potential of each pre-configured p-cycle. This corresponds exactly to the process of filling out all the working capacities protected by the p-cycles as in Sack and Grover (2004). Figure 3.5 illustrates such a procedure. Under the conventional design, we are given the working capacities of spans as shown in Figure 3.4. To protect them, four p-cycles are needed, as shown in detail in Figure 3.5. Three p-cycles each require two units of spare capacity and one p-cycle needs one unit. To maximize the PWCE volume, for each p-cycle with a certain spare capacity we fill in working capacity equal to the p-cycle capacity on each on-cycle span, and working capacity equal to double the p-cycle capacity on each straddling span. These filled working capacities are fully protected by the same p-cycles. For example, in Figure 3.5, as p-cycle (1-2-4-5-6-3-1) has two units of spare capacity, we can fill two units of working capacity on each of the on-cycle spans (1-2), (2-4), (4-5), (5-6), (6-3), and (3-1) and four units of working capacity on each of the straddling spans (2-5), (3-4), and (4-6). For the remaining three p-cycles,
Figure 3.5. PWCE volume maximization based on the p-cycle filling procedure. (The figure shows the filling of the four p-cycles: 1-2-4-3-1, 1-2-4-5-6-3-1, and 1-2-5-6-3-1 with two units of spare capacity each, and 1-2-5-6-4-3-1 with one unit, yielding the volume-maximized PWCE.)
similar filling processes can be carried out.[2] Subject to using no more spare capacity than the initial design, this forms a volume-maximized PWCE. The new PWCE has a total of 54 working channels for service provisioning, compared to 46 for the initial design. Thus, when constructing a PWCE, we can "volume-maximize" the PWCE with respect to the spare capacity that is required in any case for an initial design to some static forecast demand matrix.
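The filling rule can be checked numerically. The short sketch below assumes the topology and p-cycle set as we read them from Figures 3.4-3.5 (our own transcription, so treat the specific span list as illustrative); it reproduces the 54-channel total quoted above:

```python
from itertools import combinations

# Span set and p-cycle set transcribed from Figures 3.4-3.5 (assumed data).
spans = {(1, 2), (2, 4), (3, 4), (1, 3), (4, 5), (5, 6), (3, 6), (2, 5), (4, 6)}

p_cycles = [  # (node sequence, spare capacity units)
    ([1, 2, 4, 3], 2),
    ([1, 2, 4, 5, 6, 3], 2),
    ([1, 2, 5, 6, 3], 2),
    ([1, 2, 5, 6, 4, 3], 1),
]

def edge(a, b):
    return (min(a, b), max(a, b))

def fill(cycle, c, total):
    """Add the protected working channels contributed by one p-cycle:
    c per on-cycle span, 2c per straddling span."""
    on_cycle = {edge(cycle[i], cycle[(i + 1) % len(cycle)])
                for i in range(len(cycle))}
    # A straddling span joins two cycle nodes but is not on the cycle itself.
    straddling = {edge(a, b) for a, b in combinations(set(cycle), 2)
                  if edge(a, b) in spans and edge(a, b) not in on_cycle}
    for s in on_cycle:
        total[s] = total.get(s, 0) + c
    for s in straddling:
        total[s] = total.get(s, 0) + 2 * c

pwce = {}
for cycle, c in p_cycles:
    fill(cycle, c, pwce)

print(sum(pwce.values()))  # -> 54 protected working channels in total
```
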
5.2 Shaping principle in PWCE design
But PWCE volume maximization alone simply creates the greatest total number of protected working channels over the entire network. By itself, there is no concern about placing the protected working channels in the "right place." The idea of target pattern matching is therefore to improve the design of a PWCE by influencing the distribution of the spare capacity so that the corresponding maximal numbers of protected working channels on spans are at least reflective of some basic pattern of relative demand intensities expected of future demand. The philosophy is that although future traffic cannot be exactly known, a plausible forecast of relative demand intensities between node pairs can at least give an indication of which spans will be traversed by many shortest routes between all node pairs and hence should preferably have a large

[2] In O.R. terms this process is equivalent to raising the right-hand-side (R.H.S.) values of constraint (3.1) in Section 2, so that all become simultaneously binding constraints on the spare capacity solution.
working capacity assigned to them in the PWCE volume maximization. Mathematically this is expressed by a shaping template that helps structure the PWCE as it is simultaneously maximized in volume. Given a total capacity constraint, we can then design a PWCE with maximized volume and a form of structuring that is reflective of the relative demand intensity pattern. Such shaping should improve blocking performance when the network is later subjected to random demand having an overall pattern of intensities similar to the pattern used in the PWCE design. Later, to test this hypothesis, we quantify the statistical correlation between the PWCE channel distribution (w_k) and the set of relative demand intensity accumulations on spans (l_k) that arise from shortest path routing of the demand intensity matrix. While structuring by itself will make the "shape" of the PWCE reflective of an expected relative pattern, at some point asserting shape-matching may come into conflict with the joint goal of volume maximization. An interesting question is therefore to see which factor is more dominant in influencing network blocking performance: sheer volume or shaping of the PWCE. For this trade-off, we use a factor α to mediate the two aspects in the following PWCE design models.
5.3 ILP design models
In Shen and Grover (2003), we developed three ILP models to design volume-maximized PWCEs given certain spare capacity constraints. Here, we extend these models to construct PWCEs under more general capacity constraints, and we also introduce the structuring aspect to the design models. The different PWCE design models reflect different real-world situations that may exist. This gives eight possible combinations to compare, summarized in Figure 3.6. In addition to those of the previous conventional p-cycle design model, the new parameters and variables are as follows:

l_k is a vector with one value per span, relating to the strategies for "structuring" the PWCE. For example, l_k may reflect target (not necessarily required) capacity levels to match a given demand intensity matrix undergoing shortest-path-based routing. It is a feature through which we can influence or promote certain envelope structural properties, but without pinning the design down to optimality for only one single demand matrix.

T_k is the total number of deployed channels on span k.

B_s is a total network-wide spare capacity constraint.

B_{w+s} is a total network-wide capacity constraint (i.e., on the sum of all working and spare capacity).
Figure 3.6. Taxonomy of PWCE design models and capacity contexts. (Eight models, A to H, arise from the combinations of span-based versus network-wide budget, spare-capacity versus total-capacity budget, and non-structured versus structured design.)
α is a factor which mediates the trade-off between structuring and volume maximization of a PWCE. λ is a shape-asserting factor (a real variable) which structures the PWCE relative to a certain structuring pattern, such as a demand intensity matrix. Now the ILP models under the eight combinations are as follows.

Volume-Maximized Design Models: Every second one of the eight models (starting with A) in Figure 3.6 is concerned only with sheer bulk maximization of the total number of working channels protected within the envelope. As a group these are all called the volume-only or non-structured models. The objective function for all of them (models A, C, E, G) is:

maximize Σ_{k∈S} w_k

Combined Structuring and Volume Maximization: The other four models, B, D, F, H, are bi-criterion in nature, involving both volume-maximizing and shaping considerations. These problems aim to form a set of p-cycles that protects the largest possible number of working channels as a whole, but with guidance from the added term in the objective to seek a distribution of the working channels that is similar to the desired shaping. The objective function for these models is:

maximize λ + α · Σ_{k∈S} w_k
Constraints common to all models: All models are subject to constraints (3.1) and (3.2) from the basic p-cycle design model above. These assert restorability and spare capacity requirements to support restorability.
Constraints for shaping: In each of the structured design problems we also have the constraints:

w_k ≥ λ · l_k    ∀ k ∈ S    (3.3)
(l_k) is the structuring pattern, which we think would normally be based on shortest path routing of the demand intensities between node pairs to accumulate "target" (w_k) quantities to use as (l_k). As used in the test results here, α is set to be very small so that maximization of the shape conformance is the primary objective. As a result we tend to see λ maximized until the spare capacity can no longer guarantee restorability of the PWCE if λ is further increased. The secondary objective is then to maximize the PWCE volume in general, after it cannot be further enlarged under the exact shape of the structuring pattern.

Different Capacity Constraints: Finally, the eight possible design problems are completed with varying contexts of what the ultimate resource limitation is. In models A and B, there are constraints on the maximum amount of spare capacity that can be allocated to each span. In C and D, there are constraints on the total capacity of each span (working plus spare capacity must not exceed the total on each span). In E and F there is a network-wide limitation on total spare capacity, and in G and H there is simply a network-wide limitation on the total capacity of the network. Models C and D therefore represent the perhaps more realistic context of a network with physically deployed transmission systems on each span: whatever PWCE strategy is employed must be realized within those fixed as-built total capacities. Models G and H also have a direct real-world interpretation: a network is being designed from scratch and there is a single total budget limit for the network capacity investment. For the span-wise, total-capacity-limited cases we thus add the constraints:

T_k ≥ w_k + s_k    ∀ k ∈ S    (3.4)

For the models with a network-wide total spare capacity constraint, we add:

B_s ≥ Σ_{k∈S} s_k    (3.5)

For the models with a network-wide total capacity constraint, we add:

B_{w+s} ≥ Σ_{k∈S} (w_k + s_k)    (3.6)
Note that in any of the total capacity-constrained cases, s_k becomes a variable.
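Collecting the pieces, the fully structured, network-wide total-capacity-limited variant (model H) can be restated compactly. This is only a summary of the objective and constraints given above, with (3.1)-(3.2) standing for the restorability constraints of Section 2:

```latex
\begin{align*}
\text{maximize}\quad & \lambda + \alpha \sum_{k \in S} w_k\\
\text{subject to}\quad & \text{restorability and spare-capacity constraints (3.1)--(3.2)},\\
& w_k \ge \lambda\, l_k \qquad \forall k \in S,\\
& B_{w+s} \ge \sum_{k \in S} \left( w_k + s_k \right),
\end{align*}
```

where both w_k and s_k are non-negative integer decision variables, λ is a real variable, and a small α makes the shape-conformance term dominate.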
6. Simulation test methods
In this section we detail how the SBPP and PWCE schemes were implemented for comparison of blocking performances under random demand. We used standard approaches for simulation of statistically stationary, memoryless random arrival and departure processes. A network was regarded as a discrete-event system with two types of random event: lightpath arrival and lightpath release. Arrivals followed a Poisson process with an arrival rate of ρ per unit time. Each established lightpath had a mean holding time of 1/μ and a negative-exponential distribution. We normalize time measurements using 1/μ = 1, so that the lightpath traffic load between each node pair can be considered, in units of Erlangs, as ρ. The arrival and release event sequences run independently and concurrently on each node pair. The blocking probability is defined as the ratio of the total number of blocked lightpath requests in the network to the total number of arriving requests over the simulation period. We compute the blocking from simulations of both PWCE and SBPP-based survivable provisioning processes operating under equivalent capacity in the same networks, with identical time-sequences of random arrivals and departures. The next sections explain in detail how the PWCE and SBPP schemes were implemented for these comparisons.
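The event-driven procedure just described can be sketched as follows; `try_provision` and `release` are hypothetical hooks standing in for the PWCE or SBPP provisioning logic, not part of the original text:

```python
import heapq
import random

def simulate_blocking(node_pairs, rho, n_arrivals, try_provision, release):
    """Estimate blocking under Poisson arrivals (rate rho per node pair)
    and exp(1) holding times (1/mu = 1, so the offered load is rho Erlangs).

    try_provision(pair) returns a lightpath handle, or None if blocked;
    release(handle) frees the lightpath's resources."""
    rng = random.Random(1)
    events = []          # min-heap of (time, seq, kind, payload)
    seq = 0
    for pair in node_pairs:                   # first arrival on every pair
        heapq.heappush(events, (rng.expovariate(rho), seq, "arr", pair))
        seq += 1
    arrivals = blocked = 0
    while events and arrivals < n_arrivals:
        t, _, kind, payload = heapq.heappop(events)
        if kind == "arr":
            arrivals += 1
            lp = try_provision(payload)
            if lp is None:
                blocked += 1                  # no feasible route: dropped
            else:                             # departure after exp(1) holding
                heapq.heappush(events, (t + rng.expovariate(1.0), seq, "dep", lp))
                seq += 1
            # arrival streams run independently and concurrently per pair
            heapq.heappush(events, (t + rng.expovariate(rho), seq, "arr", payload))
            seq += 1
        else:
            release(payload)
    return blocked / arrivals
```
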
6.1 Simulating lightpath routing for PWCE provisioning
Under the PWCE scheme, once a PWCE is defined, provisioning within it is the same as service provisioning with no protection, and we employ a hop-based shortest path algorithm to search for a feasible route for each newly arriving service request. If the search is successful, then the path is established and the status of the available network resources is updated at each span with the consumed resources set as unavailable. This is a centrally computed result but exactly emulates the operation in practice with constrained source routing based on a simple OSPF-type view of the topology of currently non-exhausted spans. Only when there is no free capacity left on a span, is an LSA issued and the exhausted span is effectively removed from the graph seen by the routing algorithms for new arrivals. This happens only when the operating envelope is reached. Note that in itself such a withdrawal is not associated with a blocking event. The last path to use that span is accommodated, and subsequent routings inherently take the fully used
state of that span into account. In future work this may be improved upon by using OSPF-type edge costs in shortest path calculations and allowing the edge costs to be updated more often, in inverse response to the current fill levels on each span. This can, however, only serve to further improve PWCE blocking performance, so here SBPP is being compared against the simplest form of PWCE routing. If no route is currently feasible at the time of a given lightpath request, the request is blocked and dropped. For lightpath release events we simply unmark all the working channels of the released path to make them available for new use. In the general concept, when a release restores a certain minimum number of unused channels to a span that was previously withdrawn, a single LSA is issued reinstating that span for provisioning. In our recent simulations that threshold is only one free channel.
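A minimal sketch of this provisioning step, assuming hop-count (BFS) shortest path routing over only the non-exhausted spans; all names are illustrative:

```python
from collections import deque

def provision_pwce(adj, free, src, dst):
    """Hop-count shortest path (BFS) over spans with free PWCE channels.
    adj: node -> list of neighbours; free: span (u, v) with u < v -> free
    envelope channels.  Returns the route as a node list (and consumes one
    channel per hop), or None if the request must be blocked."""
    span = lambda a, b: (min(a, b), max(a, b))
    prev = {src: None}
    q = deque([src])
    while q:                       # BFS sees only non-exhausted spans,
        u = q.popleft()            # emulating the LSA-pruned topology view
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev and free.get(span(u, v), 0) > 0:
                prev[v] = u
                q.append(v)
    if dst not in prev:
        return None                # blocked: no feasible route in the envelope
    route, node = [], dst
    while node is not None:
        route.append(node)
        node = prev[node]
    route.reverse()
    for a, b in zip(route, route[1:]):
        free[span(a, b)] -= 1      # consume one protected working channel
    return route
```

A release event would simply increment `free` on each span of the released route, mirroring the unmarking described above.
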
6.2 Simulating lightpath routing for SBPP provisioning
SBPP provisioning establishes a working lightpath and a disjoint shared backup lightpath simultaneously for each survivable lightpath service. The routing algorithm needs to search for a working route and a protection route as a pair for each protected lightpath request. The protection route must be link-disjoint from its own primary and from the backup routes of any other working paths whose backup paths share spare channels with the present working route. To achieve high spare capacity sharing, the protection route should be selected from candidates which have the highest sharing potential. The SBPP algorithm then also needs to record any new lightpath protection sharing relationships. The overall database of network state that must be kept coherent records which primary is protected by which backup, and sharing information for each spare channel (i.e., which primaries are protected by which protection capacity unit). The implementation of SBPP is based on an all-distinct route searching algorithm to find the working and protection route pairs for consideration. The algorithm is also referred to as First Fit (FF) because the search is terminated as soon as the first pair of working and protection routes is found that are link-disjoint from each other and eligible to establish a pair of working and backup lightpaths. The algorithm is similar to the deterministic approach in Bouillet et al. (2002). The details are as follows: when a path request arrives, we use the all-distinct route searching algorithm to find the complete list of all-distinct routes between the node pair in ascending order of hop length. From the list, we examine the possible combinations of pairs of routes to find the first pair which
satisfies the conditions that: (i) the two routes are mutually link-disjoint, and (ii) the available channels on the spans of these routes suffice to establish both of them, with one functioning as a primary and the other as a shared backup. A pseudo-code description of the procedure is as follows:

1. Use the all-distinct route searching algorithm to find the complete list
   of distinct routes {R_i, i = 1, ..., n} between the node pair, in
   ascending order of hop length.
2. Set a flag f := false.
3. For (all the routes in the route list: R_i, i = 1, ..., n) {
       Select R_i as a candidate working route and, based on the current
       network resources, check if there is enough capacity to establish a
       working path via route R_i;
       If (YES) {
           For (all the routes in the route list: R_j, j = 1, ..., n) {
               Check if R_j is link-disjoint from the working route R_i;
               If (YES) {
                   Based on the current network resources (including free
                   resources and shared resources), check if there is enough
                   capacity to establish a backup path via route R_j;
                   If (YES) {
                       Establish the working and backup paths on the routes
                       R_i and R_j, respectively; the working path consumes
                       one capacity unit on each hop; the backup path shares
                       existing protection capacity on each hop en route
                       wherever possible, and if on some span no protection
                       capacity can be shared, a new unit of unused capacity
                       is occupied as protection capacity;
                       Update the network resource status;
                       Set the flag f := true;
                       Break the inner for loop.
                   }
               }
           }
       }
       Check flag f to see if the working and backup paths could be
       established successfully with R_i selected as the working route;
       If (f = true) Break the outer for loop.
   }
4. If (f = true)
       The survivable service request has been established successfully;
   Else
       The request has to be blocked.
   End.
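As a rough illustration, the FF procedure might be rendered in Python as below; the route enumeration and the `free`/`spares` data structures are simplified assumptions of ours, suitable only for small networks:

```python
def span(a, b):
    return (min(a, b), max(a, b))

def all_distinct_routes(adj, s, t):
    """All loop-free routes from s to t, sorted by hop count (small nets only)."""
    routes, stack = [], [(s, [s])]
    while stack:
        u, path = stack.pop()
        if u == t:
            routes.append(path)
            continue
        for v in adj[u]:
            if v not in path:
                stack.append((v, path + [v]))
    return sorted(routes, key=len)

def first_fit_pair(adj, free, spares, s, t):
    """First-fit SBPP: shortest eligible working route plus a link-disjoint
    shared backup.  free: span -> uncommitted channels; spares: span -> list
    of sets, each set holding the spans of the primaries already protected by
    that spare channel (sharable only if disjoint from the new primary)."""
    routes = all_distinct_routes(adj, s, t)
    for wr in routes:
        w_spans = {span(a, b) for a, b in zip(wr, wr[1:])}
        if any(free.get(k, 0) < 1 for k in w_spans):
            continue                      # not enough capacity for the primary
        for br in routes:
            b_spans = {span(a, b) for a, b in zip(br, br[1:])}
            if b_spans & w_spans:
                continue                  # backup must be link-disjoint
            sharable = lambda k: any(not (p & w_spans) for p in spares.get(k, []))
            if all(sharable(k) or free.get(k, 0) > 0 for k in b_spans):
                for k in w_spans:         # primary consumes whole channels
                    free[k] -= 1
                for k in b_spans:         # backup shares spares where it can
                    for p in spares.setdefault(k, []):
                        if not (p & w_spans):
                            p |= w_spans
                            break
                    else:                 # otherwise commit a new spare channel
                        free[k] -= 1
                        spares[k].append(set(w_spans))
                return wr, br
    return None                           # request blocked
```
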
During the protection route selection, the FF algorithm checks the spare channels en route one by one to see if any of them is sharable by the new backup lightpath. If so, we let the backup lightpath share it; otherwise, a currently uncommitted spare channel on that span needs to be associated with the backup for the current primary. Because the algorithm examines the routes from the shortest to the longest, the working route found is implicitly the shortest among the working routes of all the protection and working route pairs that satisfy conditions (i) and (ii). Upon lightpath release, the algorithm will release the resources consumed by the working lightpath and any spare channels solely used by its backup lightpath. Here "solely" means that the released backup lightpath is the only one currently assuming use of that spare channel if needed. A spare channel is only truly released back to an available state when all sharing relationships on it are removed. Although the FF algorithm is greedy, it consumes a near minimum of resources for each SBPP lightpath established. In a network with spare capacity sharing, a working lightpath normally consumes more network resources than a backup lightpath, because the working lightpath always consumes whole working channels, while the backup lightpath can share as much existing spare capacity with other backup lightpaths as possible. Thus, it is an economical priority to keep the working lightpath shortest, and based on this, to establish the shortest backup lightpath. Selecting the working and protection routes in this fashion does not guarantee that the sum of the hops of the two routes is the smallest, nor would this be optimal in any case. A minimum-hop cycle that traverses both of the end nodes with the fewest total hops can be found with the algorithm in Suurballe and Tarjan (1984).
However, the minimum-hop cycle does not guarantee that the working route is the shortest among all the eligible candidates, which is a very dominant principle in achieving SBPP efficiency, especially if extensive sharing of spare channels is being achieved. With shared protection, the minimum-hop cycle is not an optimal choice in general. If the working route that the minimum-hop cycle contains is only the second or third shortest route, then very often more resources will be consumed compared to the method we used, because the working lightpath consumes 100% of one channel on each of its hops.
6.3 Test networks and random demand patterns
We evaluated the various design cases in simulations on several test networks. With space in mind, we show only a sample of results for two
Figure 3.7. Sample test networks: (a) NSFNET and (b) COST239
typical test networks. These are a 14-node 21-span NSFNET and an 11-node 26-span COST239 network as shown in Figure 3.7. We dimensioned these networks to define the capacity constraints for different PWCE cases, with a conventional p-cycle spare capacity minimization model and flow-leveling shortest path routing of an initial demand matrix to determine an initial set of working and spare capacities. In NSFNET we assumed each node pair had on average three units of demand intensity (e.g., three lightpaths). COST239 had two demands on each node-pair. These were the initial demand matrices used to generate the capacity requirements of the conventional design, which provide baseline capacity constraints for the following volume-maximization PWCE designs and performance comparison between PWCE and SBPP. Given the above demand intensity matrices, the flow-leveled shortest path algorithm operated as follows: if there was only a single shortest route between a node pair, then all the demand between the node pair was carried by the route; otherwise, if there was more than one shortest route between the node pair, then the total demand was allocated as evenly as possible while remaining integer onto each of the routes. The required spare capacity was computed by the p-cycle model in Section 2. The resulting spare capacities were the basis for span-wise and network-total spare capacity constraints where applicable, and the working capacity was used as the shaping vector for these structuring designs. The sum of working and spare capacities was used where total capacity constraints applied. All the above problems were solved quickly to optimality by AMPL/CPLEX 7.1 on an Ultrasparc Sun Server at 450MHz with 4GB of RAM. To compare the performance of the PWCE and SBPP fairly, we ran simulations for both with equivalent total network capacity. For the survivable network based on the conventional p-cycle design, we used
the resulting protected working capacity to function as a PWCE. Correspondingly, we used the same total capacity on each span as being available to provision services with SBPP. SBPP therefore sees the total capacity of the network as available, without any a priori designation of channels as working or (shared) protection. The identical demand intensities in Erlangs between node pairs were presented to both the SBPP and PWCE provisioning methods, but the latter provisioning process only "sees" the PWCE capacities of each span as being available for service provisioning. A total of 10^5 arrival events was simulated for each blocking measurement.
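The flow-leveled shortest path allocation described earlier in this section can be sketched as follows (BFS over the shortest-path predecessor DAG to enumerate all minimum-hop routes, then as-even-as-possible integer splitting; names are illustrative):

```python
from collections import deque

def all_shortest_routes(adj, s, t):
    """Enumerate every minimum-hop route from s to t via BFS predecessors."""
    dist, parents = {s: 0}, {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], parents[v] = dist[u] + 1, [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                parents[v].append(u)    # another shortest predecessor
    if t not in dist:
        return []
    def build(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in parents[v] for p in build(u)]
    return build(t)

def flow_level(adj, s, t, demand):
    """Split an integer demand as evenly as possible over all shortest routes."""
    routes = all_shortest_routes(adj, s, t)
    if not routes:
        return []
    base, extra = divmod(demand, len(routes))
    # the first `extra` routes carry one extra unit so the split stays integer
    return [(r, base + (1 if i < extra else 0)) for i, r in enumerate(routes)]
```
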
7. Results and discussion

7.1 Structural properties and PWCE capacity
Let us first inspect design results for models B, E, and F in terms of how volume maximization, and combined shaping and volume maximization, affect the working and spare capacity structures of the networks. Figure 3.8 portrays the structure of PWCEs designed by the conventional design and the volume-maximized designs for the NSFNET network, with three units of forecasted demand on each node pair as the target shaping. Figure 3.8(a) is for the conventional p-cycle design model. The corresponding volume-maximized PWCE designs are shown in panels (b) to (d). The result of (b) was obtained by employing the spare capacity on each span as the constraints (i.e., model B). The results of (c) and (d) were obtained by employing the network-wide total spare capacity, i.e., 417 units, as the constraint (i.e., models E and F). The difference between them is that model E is a non-structuring design, while model F is structuring. We find that after implementing volume maximization, all three designs exploit significant volumes of extra protected working channels for their PWCEs (respectively 38.8%, 60.4%, and 44.2% relative to the original PWCE volume). Note that all these increases in PWCE envelope volume come without any additional spare capacity. Model B has the same spare capacity structure as the conventional design, and because the working capacity of the conventional survivable network design is assumed as the shaping vector in the model, the PWCE designed by model B will always be able to accommodate all the working capacity of the conventional design; on this basis, it then adds extra protected working capacity on each span until the spare capacities of the spans in the network become equally binding constraints on further volume maximization. Comparing the distributions of the two PWCEs in Figure 3.8(a) and (b), we see that their spare capacity is always the
same, but the PWCE capacity of model B is always greater than that of the conventional design on any span. We can thus say that the PWCE of the conventional design is inside the PWCE of model B. Also, because the working capacity of the conventional design is used as the shaping vector in model F, the PWCE of the conventional design is inside the PWCE of model F as well. However, model F does not need to follow the spare capacity distribution of model B; it needs only to ensure that the total spare capacity never exceeds the total constraint. Therefore, model F exhibits more flexibility in spare capacity arrangement and can construct a larger PWCE. However, in this case the PWCE of model B is not necessarily included inside the PWCE of model F; here the word "larger" refers only to volume. As shown in Figure 3.8(b) and (d), although the PWCE of model F is "larger" than that of model B, on span (8-11) model B has a larger PWCE capacity than model F. Finally, because no shaping is involved in model E, it always has the largest PWCE among all the design cases, but this again does not mean that its PWCE fully contains the PWCE of the conventional design, so its shaping may not be ideal. For example, in Figure 3.8(a) and (c) the conventional design has a larger PWCE capacity on span (2-5) than model E, although the former has a much smaller overall PWCE volume than the latter.

Figure 3.8. PWCEs of the NSFNET network with three units of uniform forecasted demand on each node pair: (a) PWCE of the conventional design, (b) PWCE of model B, (c) PWCE of model E, (d) PWCE of model F. (Bar charts of working and spare capacity per span; omitted.)

Another interesting observation for model E is that the design has an almost flat total capacity when summing the spare and working capacity on each span. Here in NSFNET, we can see that the network designed by model E has almost the same total capacity
3 Design of Protected Working Capacity Envelopes (i.e., 60 units) on each span. We conjecture that this tendency may bring model E some advantages when designing a network with modular capacity. Similar results were obtained for the COST239 network, but are not portrayed graphically, with space in mind. To summarize, however, in COST239 the design of model E again shows the largest PWCE volume and the design of model B having the smallest PWCE volume. Also, compared to the PWCE constructed by the conventional design, all the PWCEs of the three models are enlarged greatly (11.6%, 20.9%, and 16.3% greater volume than the conventional designs for models B, E and F respectively). For the structured designs, Table 1 shows the actual correlations between the structuring pattern and PWCE channel capacity distribution (i.e., the shape of the PWCE). The structuring patterns are just the working capacity vector of the conventional design as shown in Figure 3.8(a). We note that models B and F always have larger correlation than that of model E. This implies that the PWCEs designed by models B and F are indeed conforming to the shape target. The correlations are high in both models B and F, where a structuring constraint is applied, but not in model E. Table 3.1. Correlation coefficients between structuring pattern and actual PWCE capacity distributions in models B, E, and F (E is non-structured) Network NSFNET COST239
7.2
Model B 0.601 0.801
Model E 0.233 -0.381
Model F 0.520 0.805
Comparative blocking performance of SBPP and PWCE
Figure 3.9. Blocking performance of PWCE and SBPP based on hop-based Dijkstra's and FF algorithms in the NSFNET (a) and COST239 (b) networks. (Log-scale blocking probability versus traffic load in Erlangs per node pair; curves omitted.)

Figure 3.9 shows the simulation results for the test networks NSFNET and COST239 under the various design cases and the PWCE and SBPP provisioning methods. In the legend, "conv." denotes the network based on the conventional design, and "models B, E, F" denote the networks designed by models B, E, and F respectively. Figure 3.9(a) shows the blocking probabilities in the NSFNET network. It is found that SBPP strictly outperforms PWCE for each of the design cases. However, the results for the more highly connected COST239 network (average nodal degree 4.7), shown in Figure 3.9(b), are the other way around. Here, PWCE outperforms SBPP in all the design cases. We can explain these results as follows. Two differences between the SBPP and PWCE provisioning methods seem to lie at the heart of the experimental behaviors seen. One set
3 Design of Protected Working Capacity Envelopes of considerations relates to the basic protection mechanism that they employ. The other is countering considerations about the scope of optimality inherent to each architecture under random arrivals. SBPP is a path-oriented protection mechanism, while p-cycles are a span-oriented protection mechanism. A path-oriented protection mechanism normally has a wider spare capacity sharing opportunity than a span-oriented protection mechanism. Therefore, in this aspect the SBPP provisioning method can more widely share spare capacity during provisioning process than the PWCE provisioning method. On the other hand, the PWCE method always involves a more global optimized relationship between the spare capacity and protected working capacity even during the provisioning process. After a PWCE is constructed, the protection capacity is never involved in any direct operation (or change) related to the dynamic survivable service provisioning and so cannot evolve incrementally into a poor global configuration under random (statistically stationary) demand. But with SBPP, the optimization is greedy in nature, on an incremental pre-connection basis. For each incoming survivable service, the SBPP provisioning process finds the first shortest route to establish a working lightpath and second shortest route, which is link-disjoint from the first route, to establish a backup lightpath. While setting up the backup lightpath, although it is allowed to maximally share spare capacity en route, such sharing is still only a local optimization of the situation, and may be sub-optimal in the wider sense of how it relates to the next several arrivals or departures. It is impossible to reconfigure other existing connections or even foresee the future connections to make a network-wide optimization, PWCE is, however, by its nature free of this effect. 
Its working-to-spare relationships are fixed and built-in, determined once with a global optimization based on an at least representative view of the overall pattern of relative demand intensities. In summary, there are two counteracting effects, namely inherent optimality and scope of spare capacity sharing, that affect the overall network performance. SBPP shows an advantage in the scope of spare capacity sharing, while PWCE can ensure the network-wide optimality of spare capacity sharing throughout the provisioning process. Thus, all the results reported for the test networks are consequences of the interaction, or trade-off, of these two effects. In the COST239 network, where the relative benefit of the wider sharing scope of SBPP is weakened by the high connectivity of the network, the effect of network-wide optimality overwhelms the effect of sharing scope, which causes the p-cycle-based PWCE provisioning method to display a lower blocking probability. In contrast, in the NSFNET network, due to its low
connectivity, the wider sharing scope of SBPP has a stronger effect on the network blocking performance than the network-wide optimality of PWCE, which therefore leads the SBPP provisioning method to achieve a better blocking performance than the PWCE provisioning method. Based on the above observations, it seems that the blocking performances of SBPP and PWCE are related to the network connectivity. To confirm this, we conducted further experiments to study how the blocking performances are affected by the network connectivity. We designed a series of network topologies based on a 10-node ring network as an initial topology. Step by step we added spans to the network to gradually increase the network nodal degree. Because the number of nodes in the network is ten, each addition of a span increases the average nodal degree by 0.2. Starting from the ring network with nodal degree 2.0, we added ten spans to the network until the network nodal degree was equal to 4.0, generating a total of eleven network topologies. Based on these networks, we designed PWCEs based on the previous volume-maximized models and ran simulations for them. It was found that with the increase of the network nodal degree, the blocking performance of PWCE improves progressively; it outperforms the SBPP provisioning method once the network nodal degree reaches 3.2. This is quite in line with the results that we obtained for the specific test networks above. NSFNET has an average nodal degree of 3.0, for which the performance of SBPP is better than PWCE; COST239 has an average nodal degree larger than 4.0, for which the performance of PWCE is better than SBPP. Practically speaking, however, what is of most significance is simply that in no case is PWCE blocking very different from SBPP at reasonably low loads. If the blocking is even similar, it means that all the other advantages of PWCE can be accessed for their own purposes and benefits.
As conjectured before, the similarity between the forecasted demand intensity pattern and the actually assigned network capacity can strongly affect blocking performance, provided that the total PWCE volume does not suffer greatly. To confirm this, we also studied the effect of the shaping correlation on blocking performance. We found that the blocking probabilities of model F are always lower than those of model E for the PWCE provisioning method, even though the network designs of model E have larger PWCEs than those of model F in both test networks. This confirms that the structuring effect is useful for PWCE design.
8. Concluding comments
We have considered the concept of p-cycle-based PWCE as an alternative to SBPP for automated provisioning of dynamic survivable lightpath services. The combination of the p-cycle technique and the concept of PWCE has two main motivations. p-Cycles are the first protection technique so far to break the tradeoff between the restoration speed and spare capacity efficiency. They exhibit ring-like switching speed and mesh-like spare capacity efficiency. Using PWCE to provision dynamic protected lightpath services is also a novel technique to break the tradeoff between operational simplicity and spare capacity efficiency. PWCE provisioning is simpler than even 1+1 ASP provisioning, but operates in a shared mesh network. The main contribution of this work was to develop PWCE optimization models that exploit either given allocations of spare capacity, or total capacity, to define PWCEs (operating envelopes) with the largest possible number of working channels (volume), or to maximize this volume combined with an aspiration to shape the distribution of working channels in a desired characteristic way. To show the advantages of PWCE we compared it with SBPP, virtually the only dynamic lightpath provisioning scheme so far considered other than PWCE. The simulation results suggest that in a sparse network the current method of static PWCE operation may be somewhat poorer but still close to SBPP in blocking. With higher network nodal degree, PWCE achieves better blocking than SBPP. In summary, we think p-cycle-based PWCE design and operational concepts may provide considerable advantages for dynamic survivable service provisioning with fast restoration speed, good blocking performance, and simple network operation. 
Ongoing work is looking at adaptive self-organization of the PWCE configuration, and at more finely resolved reporting of span usage levels under PWCE so that, particularly in sparse networks, routing of new paths through the envelope can effect better flow-leveling and congestion avoidance. Preliminary results from recently submitted further work (Shen and Grover, 2004) indicate that adaptive PWCE with slightly increased reporting of span usage levels greatly reduces blocking relative to SBPP without significantly increasing update signaling volumes.

Acknowledgments

The authors would like to thank J. Doucette, University of Alberta/TRLabs, for proofreading and helpful comments on the manuscript.
References

Ammar, M., and 26 others (2003). Report of NSF Workshop on Fundamental Research in Networking. Report to the National Science Foundation Directorate for Computer and Information Science and Engineering (CISE). http://www.cs.virginia.edu/~jorg/workshop1.
Awduche, D., et al. (2001). RSVP-TE: Extensions to RSVP for LSP tunnels. IETF RFC 3209 (Standards Track).
Bouillet, E., Labourdette, J.F., Ellinas, G., Ramamurthy, R., and Chaudhuri, S. (2002). Stochastic approaches to compute shared mesh restored lightpaths in optical network architectures. INFOCOM'02, 2:801-807.
Clouqueur, M. and Grover, W.D. (2002). Dual failure availability analysis of span-restorable mesh networks. IEEE Journal on Selected Areas in Communications, 20(4):810-821.
Doucette, J. and Grover, W.D. (2003). Node-inclusive span survivability in an optical mesh transport network. In: NFOEC'03, pp. 634-643, Orlando, FL.
Ellinas, G., Hailemariam, A.G., and Stern, T.E. (2000). Protection cycles in mesh WDM networks. IEEE Journal on Selected Areas in Communications, 18(10):1924-1937.
Gerstel, O. and Ramaswami, R. (2000). Optical layer survivability: A services perspective. IEEE Communications Magazine, 38(3):104-113.
Grover, W.D. (1997). Self-organizing broadband transport networks. Proceedings of the IEEE, 85(10):1582-1611.
Grover, W.D. (2004a). Mesh-Based Survivable Networks: Options and Strategies for Optical, MPLS, SONET, and ATM Networking. Prentice Hall.
Grover, W.D. (2004b). The protected working capacity envelope (PWCE) concept: An alternate paradigm for automated service provisioning. IEEE Communications Magazine, 42(1):62-69.
Grover, W.D. and Doucette, J. (2002). Advances in optical network design with p-cycles: Joint optimization and pre-selection of candidate p-cycles. In: IEEE-LEOS Topical Meetings, pp. 49-50, Quebec, Canada.
Grover, W.D. and Stamatelakis, D. (1998). Cycle-oriented distributed preconfiguration: Ring-like speed with mesh-like capacity for self-planning network restoration. In: ICC'98, pp. 537-543, Atlanta.
Grover, W.D. and Stamatelakis, D. (2000). Bridging the ring-mesh dichotomy with p-cycles. In: DRCN'00, pp. 92-104, Munich, Germany.
Herzberg, M., Bye, S., and Utano, A. (1995). The hop-limit approach for spare-capacity assignment in survivable networks. IEEE/ACM Transactions on Networking, 3(6):775-784.
Kawamura, R., Sato, K., and Tokizawa, I. (1994). Self-healing ATM networks based on virtual path concept. IEEE Journal on Selected Areas in Communications, 12(1):120-127.
Kini, S., Kodialam, M., Lakshman, T.V., and Villamizar, C. (2000). Shared backup label switched path restoration. IETF Internet Draft, draft-kini-restoration-shared-backup-00.txt.
Kodialam, M. and Lakshman, T.V. (2000). Dynamic routing of bandwidth guaranteed tunnels with restoration. INFOCOM'00, 2:902-911, Tel Aviv, Israel.
Li, G., Wang, D., Kalmanek, C., and Doverspike, R. (2002). Efficient distributed path selection for shared restoration connections. INFOCOM'02, 1:140-149, New York, NY.
Mannie, E. (editor) (2003). Generalized multi-protocol label switching architecture. IETF Internet Draft, draft-ietf-ccamp-gmpls-architecture-07.txt (Standards Track).
Medard, M., Barry, R.A., Finn, S.G., He, W., and Lumetta, S.S. (2002). Generalized loop-back recovery in optical mesh networks. IEEE/ACM Transactions on Networking, 10(1):153-164.
Ou, C., Zhang, J., Zang, H., Sahasrabuddhe, L., and Mukherjee, B. (2003). Near-optimal approaches for shared-path protection in WDM mesh networks. In: ICC'03, pp. 1320-1324, Anchorage, Alaska.
Rajagopalan, B., et al. (2003). IP over optical networks: A framework. IETF Internet Draft, draft-ietf-ipo-framework-05.txt.
Sack, A. and Grover, W.D. (2004). Hamiltonian p-cycles for fiber-level protection in homogeneous and semi-homogeneous optical networks. IEEE Network, 18(2):49-56.
Saracco, R., Harrow, J.R., and Weihmayer, R. (2000). The Disappearance of Telecommunications. IEEE Press, New York.
Schupke, D.A., Gruber, C.G., and Autenrieth, A. (2002). Optimal configuration of p-cycles in WDM networks. IEEE ICC'02, 5:2761-2765, New York, NY.
Shen, G. and Grover, W.D. (2003). Exploiting forcer structure to serve uncertain demands and minimize redundancy of p-cycle networks. In: SPIE OPTICOMM'03, pp. 59-70, Dallas.
Shen, G. and Grover, W.D. (2004). Adaptive self-protecting transport networks based on p-cycles under the working capacity envelope concept. Submitted to IEEE Journal on Selected Areas in Communications, Special Issue on Autonomic Communication Systems.
Stamatelakis, D. (1997). Theory and Algorithms for Preconfiguration of Spare Capacity in Mesh Restorable Networks. M.Sc. Thesis, University of Alberta, Canada.
Stamatelakis, D. and Grover, W.D. (1998). OPNET simulation of self-organizing restorable SONET mesh transport networks. OPNETWORKS'98, (CD-ROM) paper 04, Washington, D.C.
Stamatelakis, D. and Grover, W.D. (2000a). Theoretical underpinnings for the efficiency of restorable networks using pre-configured cycles ("p-cycles"). IEEE Transactions on Communications, 48(8):1262-1265.
Stamatelakis, D. and Grover, W.D. (2000b). IP layer restoration and network planning based on virtual protection cycles. IEEE Journal on Selected Areas in Communications, 18(10):1938-1949.
Suurballe, J.W. and Tarjan, R.E. (1984). A quick method for finding shortest pairs of disjoint paths. Networks, 14(2):325-336.
Yang, Y., Zeng, Q., and Zhao, H. (2003). Dynamic survivability in WDM mesh networks under dynamic traffic. Photonic Network Communications, 6(1):5-24.
Zang, H. and Mukherjee, B. (2001). Connection management for survivable wavelength-routed WDM mesh networks. SPIE Optical Network Magazine, 2(4):17-28.
Chapter 4

NETWORK TRAFFIC ENGINEERING WITH VARIED LEVELS OF PROTECTION IN THE NEXT GENERATION INTERNET

Shekhar Srivastava
Srinivasa Rao Thirumalasetty
Deep Medhi

Abstract
In this paper, we consider the network traffic engineering problem for provisioning tunnels in a backbone network where services with varied levels of protection are offered. Network protection in the event of a failure continues to be a critical issue for the Next Generation Internet. Our modeling framework allows protection at various levels to be considered in a unified manner through the notion of cycles made up of a disjoint pair of paths, where the network may be capacitated by both bandwidth and tunnel constraints. We also consider a variety of network goals, including the ability to provide as much bandwidth as possible for best-effort services along with guaranteed protection services, and develop a composite objective function. We then present two heuristics for solving the models presented. Through our studies of different network topologies, we show the convergence as well as the effectiveness of our approach in considering multiple goals in a unified manner. For example, we have shown the tradeoff between accepting new requests of protection service classes and providing residual bandwidth for best-effort services. Finally, our results also show that capacity and tunnels can have equally important roles in ensuring effective traffic engineering of a network.

1. Introduction
The next generation Internet (NGI) is expected to carry a wide variety of services with differing service requirements. Often, work in this area falls primarily into the category of addressing different quality-of-service (QoS) requirements. While QoS is important, providing network protection against failures is a paramount issue as well. Our vision for
NGI is the ability to offer services with different levels of protection for different customers (or service classes) while still providing best-effort services. For example, consider the following service classifications: (i) guarantee of service quality only under normal network operating conditions, (ii) full guarantee of bandwidth/service quality under normal conditions plus a reduced level of service in the event of a major link failure, and (iii) fully guaranteed bandwidth/service quality both under normal conditions and under any major link failure. A customer may sign up for any of these service classes ahead of time. For brevity, we refer to them collectively as book-ahead guaranteed (BAG) services. Our goal is to consider the network traffic engineering problem of provisioning demands and routes for BAG service requests, along with varied protection requirements for survivability. A way to provide provisioned routes for BAG services is through the notion of tunnels. For example, multi-protocol label switching (MPLS) for the NGI is well-suited for this capability. MPLS technology (Davie and Rekhter, 2000) provides the ability to allocate bandwidth for different service classes through label-switched paths (LSPs). Recently, MPLS has been enhanced with the fast re-route capability, which makes it possible to provide protection for a tunnel through a provisioned back-up tunnel using LSPs (Aubin and Nasrallah, 2003). There is, however, an important limitation imposed by MPLS. Each LSP setup requires a label on each intermediate node, which is used for switching the input traffic to the destined output port. Thus, setting up each new LSP introduces additional labels at each intermediate node. To route each packet, a label-switched router (LSR) must search through its label-swapping table to find the entry matching the incoming label and port and determine the output label and port.
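The lookup-and-swap behaviour just described can be sketched with a plain dictionary; the ports and label values below are purely illustrative and not taken from any real router configuration.

```python
# Minimal sketch of label swapping at an LSR: the table maps an incoming
# (port, label) pair to an outgoing (port, label) pair. Every provisioned
# LSP adds one such entry per intermediate node, which is the per-tunnel
# state the text refers to. All entries are invented for illustration.
label_table = {
    (1, 100): (3, 205),  # hypothetical LSP A: arrives on port 1, label 100
    (1, 101): (2, 310),  # hypothetical LSP B
    (2, 150): (3, 206),  # hypothetical LSP C
}

def forward(in_port, in_label, packet):
    """Look up the matching entry, swap the label, and return the output
    port together with the relabeled packet."""
    out_port, out_label = label_table[(in_port, in_label)]
    return out_port, {"label": out_label, "payload": packet["payload"]}

out_port, pkt = forward(1, 100, {"label": 100, "payload": "data"})
# The packet leaves on port 3 carrying label 205.
```

The growth of this table with each new LSP is exactly why a per-link tunnel limit appears later in the formulation.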
It then appends the output label to the packet and sends the packet to the output port. Consequently, each activated LSP leads to more labels at LSRs, thereby requiring more processing to forward each packet. Thus, our interest here is to consider the network traffic engineering problem for varied levels of protection, taking into account any restriction on the number of tunnels imposed by label-switched routers, while leaving enough bandwidth available for best-effort services at the same time. Network protection and survivability have been addressed for a variety of communication networks over the years; for example, see Fumagalli et al. (1999); Kajiyama et al. (1994); Kawamura et al. (1994); Medhi (1994); Medhi and Khurana (1995); Ramamurthy and Mukherjee (1999); Xiong and Mason (1999); Wu (1992). Using MPLS for traffic engineering has received considerable attention in recent years (Awduche et al., 1999; Le Faucheur and Wai, 2003; Aubin and Nasrallah, 2003). The work that
is closest to ours is by Kodialam and Lakshman (2000), where they present optimization models and algorithms for guaranteed tunnels with restoration. However, there are several differences. For example, we focus on protection rather than restoration, especially considering book-ahead guaranteed protection services with varied levels of protection. The rest of the paper is organized as follows. In Section 2, we provide an optimization formulation of the BAG traffic engineering problem with tunneling constraints for varied levels of protection. The basic problem can be considered for a variety of different objectives; in Section 3, we show how different objectives can be unified into a single objective. In Section 4, we present two heuristic algorithms to solve the problem formulated. In Section 5, we present numerical results to show the effectiveness of our formulation.
2. Problem formulation
For each origin-destination node pair in the network, we assume that we need to satisfy the required level of protection for a set of customers, where each customer has a different protection grade-of-service requirement. Recall that we consider three protection grade-of-service classes; for simplicity, we refer to them as zero-, fractional-, and full-protection BAG services. Here, zero-protection means that the service is guaranteed under normal operating conditions but not under a failure; fractional-protection means providing a reduced level of service under a major link failure in addition to guaranteed service under normal operating conditions; finally, full-protection means providing a guarantee under both normal and failure conditions. Consider a demand pair $k$ between originating node $i(k)$ and destination node $j(k)$ (refer to Table 4.1 for a summary of all notations) where we have a set of service requests $\mathcal{S}_k$ from customers requiring BAG services at differing protection grades-of-service. We denote the bandwidth demand of customer $s \in \mathcal{S}_k$ for demand pair $k$ by $d_k^s$. First, consider a zero-protection BAG service demand request. In this case, only a path with bandwidth $d_k^s$ needs to be provisioned (or allocated) ahead of time. Considering only the shortest (e.g., in terms of hops) path may not address the overall traffic engineering goal. Thus, we need to consider a set of candidate paths for each demand $k$. For the full-protection BAG service class, a backup path needs to be available and bandwidth $d_k^s$ needs to be reserved on the backup path. We require that the backup path survive if the primary path is affected by any critical failure situation of interest to a provider, e.g., a single link failure at a time. Although the primary and backup path
Table 4.1. Summary of notations

$\mathcal{N}$: set of nodes in the network
$\mathcal{L}$: set of links in the network
$\mathcal{K}$: set of demand pairs with traffic demand in the network
$k$: demand pair identifier
$i(k), j(k)$: originating node $i(k)$ and destination node $j(k)$ for demand pair identifier $k$
$\mathcal{S}_k$: set of BAG service requests for demand $k \in \mathcal{K}$
$d_k^s$: demand volume of service request $s$ for demand pair $k$
$\mathcal{P}_k^s$: set of candidate cycles for service request $s \in \mathcal{S}_k$ for demand $k \in \mathcal{K}$
$C_\ell$: capacity of link $\ell \in \mathcal{L}$
$T_\ell$: maximum number of tunnels allowed on link $\ell \in \mathcal{L}$
$\alpha_k^s$: protection level of service request $s \in \mathcal{S}_k$ of $k \in \mathcal{K}$
$c_{km}^{sp}$ ($c_{km}^{sb}$): cost of primary path $p$ (back-up path $b$) associated with cycle $m$, request $s$, demand $k$
$\delta_{km}^{s\ell}$: 1 if candidate cycle $m \in \mathcal{P}_k^s$ for service $s \in \mathcal{S}_k$ of demand pair $k \in \mathcal{K}$ uses link $\ell \in \mathcal{L}$ in its primary path; 0 otherwise
$\beta_{km}^{s\ell}$: 1 if candidate cycle $m \in \mathcal{P}_k^s$ for service $s \in \mathcal{S}_k$ of demand pair $k \in \mathcal{K}$ uses link $\ell \in \mathcal{L}$ in its backup/secondary path; 0 otherwise
$u_{\{\alpha_k^s > 0\}}$: 1 if $\alpha_k^s > 0$; 0 otherwise

Variables:
$x_{km}^s$: 0/1 decision variable for choosing cycle $m$ for $s, k$
$w_k^s$: 0/1 artificial (slack) variable for $s, k$

Parameters:
$\gamma$: weighing factor for routing cost on the back-up path
$\eta_k^s$: penalty cost of artificial path $w_k^s$
$\theta_k^s$: cost normalization factor for $s, k$
$u_k^s$: utility of $s, k$
$R$: utility weighing factor
could be independently modeled, we use a pairing idea, i.e., we consider a pair of paths consisting of primary and backup paths. Similar to the case of zero-protection services, the selection of the shortest pair of primary and backup paths for a demand $d_k^s$ may not be in the best interest of a traffic engineering objective. Thus, we consider a candidate set of primary/backup path pairs for a flow demand $d_k^s$ for full-protection. Finally, in the case of fractional-protection BAG services, the backup path needs to be allocated bandwidth sufficient to carry a fraction of $d_k^s$ in order to address partial survivability. Thus, the fractional BAG service class also requires a pair of disjoint paths. The difference is that, on the backup path, only a fraction of the demand is required to be reserved. If we denote the fraction by $\alpha_k^s$ (where $0 < \alpha_k^s < 1$), then the primary path would reserve $d_k^s$ while the backup path would reserve $\alpha_k^s d_k^s$.
Figure 4.1. Illustration of cycles
When we consider all three cases, it is easy to see that by appropriately setting $\alpha_k^s$ we can cover each of the protection levels, i.e., $\alpha_k^s = 0$ refers to zero-protection, $\alpha_k^s = 1$ refers to full-protection, and $0 < \alpha_k^s < 1$ refers to fractional-protection. There are two benefits to the way we have introduced $\alpha_k^s$: (i) fractional-protection need not be of a specific pre-defined value; each customer can request a different level; (ii) for zero-protection, we can also consider a pair of disjoint paths where, on the backup path, we assign $\alpha_k^s = 0$; this means we can still consider a backup path, but it is not used. Consequently, the three service classes can be considered in a unified manner from a modeling framework: all we need to do is to consider a set of candidate pairs of disjoint paths (see the work presented in Suurballe (1974) and Suurballe and Tarjan (1984) on how to compute a pair of disjoint paths). For simplicity, we refer to a pair of disjoint paths as a cycle. For example, a provider may be interested in providing protection against a single link failure at a time. Thus, we need to consider a cycle that consists of a pair of link-disjoint paths; in fact, we need to consider a set of candidate cycles as input to the traffic engineering problem. As an illustration, consider Figure 4.1 where, for the demand pair connecting nodes 1 and 2, we have three different candidate cycles 1-2-3-1, 1-2-4-1, and 1-3-2-4-1, each consisting of link-disjoint paths; they are all considered as candidate cycles in our formulation, while the traffic engineering objective decides which cycle to pick based on a set of requirements. In general, we denote the set of candidate cycles for service $s$ for demand pair $k$ by $\mathcal{P}_k^s$. If we associate the decision variable $x_{km}^s$ with candidate cycle $m$, then for each $s \in \mathcal{S}_k$, $k \in \mathcal{K}$, the decision to select at most one cycle is governed by the following constraint:
$$\sum_{m \in \mathcal{P}_k^s} x_{km}^s \le 1, \qquad s \in \mathcal{S}_k,\; k \in \mathcal{K}. \tag{4.1}$$
The inequality allows for the case when a demand request may not be satisfied. By introducing the artificial (slack) variable $w_k^s$, the above
constraint can be re-written as
$$\sum_{m \in \mathcal{P}_k^s} x_{km}^s + w_k^s = 1, \qquad s \in \mathcal{S}_k,\; k \in \mathcal{K}. \tag{4.2}$$
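As a side illustration of how candidate cycles might be generated, the following sketch finds a primary path by breadth-first search and then a link-disjoint backup path after removing the primary's links. The topology is hypothetical, and this remove-and-reroute heuristic can fail on graphs where Suurballe-type algorithms still find a disjoint pair, so it is only a sketch, not the method the chapter relies on.

```python
from collections import deque

def bfs_path(adj, src, dst, banned=frozenset()):
    """Shortest path (in hops) from src to dst, avoiding banned links."""
    prev, seen = {}, {src}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for v in adj[u]:
            link = frozenset((u, v))
            if v not in seen and link not in banned:
                seen.add(v)
                prev[v] = u
                q.append(v)
    return None  # dst unreachable

def candidate_cycle(adj, src, dst):
    """Primary path plus a link-disjoint backup path (remove-and-reroute).
    May miss cycles that Suurballe's algorithm would find."""
    primary = bfs_path(adj, src, dst)
    if primary is None:
        return None
    used = {frozenset(e) for e in zip(primary, primary[1:])}
    backup = bfs_path(adj, src, dst, banned=used)
    return (primary, backup) if backup else None

# Illustrative 4-node topology, similar in spirit to Figure 4.1.
adj = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2], 4: [1, 2]}
print(candidate_cycle(adj, 1, 2))  # ([1, 2], [1, 3, 2])
```

Running this for several removals of the first-found primary would build up the candidate set $\mathcal{P}_k^s$ used in the formulation.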
Note that for each cycle (because of the way each of them is generated) we have a "primary" path and a backup path. Using the link-cycle indicators (see Table 4.1), the flow on link $\ell$ (denoted by $f_\ell$) carrying demand volume under both normal and failure situations (over primary and backup paths for the different BAG demand requests) can be captured by the amount
$$f_\ell = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \left( \delta_{km}^{s\ell} + \alpha_k^s \, \beta_{km}^{s\ell} \right) d_k^s \, x_{km}^s.$$
Note the inclusion of the parameter $\alpha_k^s$ with the second term to dictate the level of protection on the backup path; also, since we are considering a cycle consisting of disjoint paths, for a specific link $\ell$, if $\delta_{km}^{s\ell}$ takes the value 1, then the corresponding $\beta_{km}^{s\ell}$ must be zero. Given capacity $C_\ell$ for link $\ell$, we thus have the following capacity constraint:
$$\sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \left( \delta_{km}^{s\ell} + \beta_{km}^{s\ell} \, \alpha_k^s \right) d_k^s \, x_{km}^s \le C_\ell, \qquad \ell \in \mathcal{L}. \tag{4.3}$$
Finally, we have an additional constraint due to the restriction on the number of tunnels outgoing from any label-switched router. If we restrict the number of active tunnels on link $\ell$ to $T_\ell$, then we can write
$$\sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \left( \delta_{km}^{s\ell} + u_{\{\alpha_k^s > 0\}} \, \beta_{km}^{s\ell} \right) x_{km}^s \le T_\ell, \qquad \ell \in \mathcal{L}, \tag{4.4}$$
where $u_{\{\alpha_k^s > 0\}}$ indicates that back-up paths are not counted as tunnels in the case of a zero-protection request ($\alpha_k^s = 0$). The network traffic engineering problem (P) is to minimize a suitable objective $f(x, w)$ (to be discussed in detail in the next section) in the presence of both capacity and tunneling constraints for book-ahead guaranteed services while incorporating varied levels of protection requirements; this can be written as
$$F^* = \min f(x, w)$$
subject to constraints (4.2), (4.3), (4.4), and
$$x_{km}^s,\; w_k^s \in \{0, 1\}, \qquad m \in \mathcal{P}_k^s,\; s \in \mathcal{S}_k,\; k \in \mathcal{K}.$$
The above formulation does not include how to handle best-effort services; this will be discussed in the next section. Problem (P) is a multicommodity flow model using link-path representation where "path" is
replaced by cycles to allow for pairs of disjoint paths. The notion of using cycles in a link-path formulation setting was originally presented in Medhi (1991). Another advantage of the above formulation is that we can incorporate paths that are situation-disjoint from the primary path without changing the overall formulation, since this only involves the generation of cycles. The notion of situation-disjointness has recently been presented in Krithikaivasan et al. (2003); Pioro and Medhi (2004); in essence, the construction of situation-disjoint paths generalizes the idea of link-disjoint or node-disjoint paths.
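Before turning to the objective, the constraint side of (P) can be made concrete. The sketch below accumulates the link loads of (4.3) and the tunnel counts of (4.4) for a small hypothetical instance; all demands, protection levels, cycle choices, and budgets are invented for illustration.

```python
# Each request (k, s) carries demand d and protection level alpha; the chosen
# cycle lists its primary links and backup links (the delta and beta
# indicators in Table 4.1). All numbers below are made up for illustration.
requests = {
    ("k1", "s1"): {"d": 10, "alpha": 1.0},   # full protection
    ("k1", "s2"): {"d": 4,  "alpha": 0.5},   # fractional protection
    ("k2", "s1"): {"d": 6,  "alpha": 0.0},   # zero protection
}
choice = {  # (primary links, backup links) per request
    ("k1", "s1"): ([("a", "b")], [("a", "c"), ("c", "b")]),
    ("k1", "s2"): ([("a", "c")], [("a", "b"), ("b", "c")]),
    ("k2", "s1"): ([("b", "c")], [("a", "b"), ("a", "c")]),
}

def link_loads_and_tunnels(requests, choice):
    """Per-link bandwidth load (4.3 left-hand side) and tunnel count (4.4)."""
    load, tunnels = {}, {}
    for key, (primary, backup) in choice.items():
        d, alpha = requests[key]["d"], requests[key]["alpha"]
        for e in primary:                       # delta term: full demand
            load[e] = load.get(e, 0) + d
            tunnels[e] = tunnels.get(e, 0) + 1
        for e in backup:                        # beta term: alpha * demand
            load[e] = load.get(e, 0) + alpha * d
            if alpha > 0:                       # u_{alpha > 0} indicator
                tunnels[e] = tunnels.get(e, 0) + 1
    return load, tunnels

load, tunnels = link_loads_and_tunnels(requests, choice)
# Feasible against hypothetical budgets C = 20 and T = 3 on every link.
feasible = all(v <= 20 for v in load.values()) and all(v <= 3 for v in tunnels.values())
```

Note how the zero-protection request contributes no bandwidth and no tunnel on its backup links, exactly as the $u_{\{\alpha_k^s > 0\}}$ indicator prescribes.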
3. Objective function
There are several possible goals that can be considered: (i) to provide bandwidth for best-effort services, (ii) to minimize the routing cost for different service requests, (iii) to minimize the penalty for not meeting some service requests, and (iv) to maximize the revenue for service requests that are carried. In the following, we show how to combine these four goals in a single objective function. We start with how to handle best-effort services. Besides the allocation for BAG services, our goal is to provide maximum residual bandwidth to best-effort services so that this class can provide as good a service as possible. We incorporate this requirement in the objective function by considering the maximization of residual capacity, which can be written as
$$\max \sum_{\ell \in \mathcal{L}} \left( C_\ell - r_\ell \right)$$
where $r_\ell$ is the bandwidth consumed by the different BAG services on link $\ell$. This is equivalent to $\min \sum_{\ell \in \mathcal{L}} r_\ell$. That is, to provide as good a service as possible to best-effort services, we minimize the total bandwidth allocated on the links of the network. Thus, our first objective is
$$f_1 = \sum_{\ell \in \mathcal{L}} r_\ell. \tag{4.5}$$
In a network with BAG services, there are, however, two ways to determine $r_\ell$. In the first case, $r_\ell$ is computed assuming that the primary and backup paths chosen for each $s, k$ are allocated bandwidth $d_k^s$ and $\alpha_k^s d_k^s$, respectively, at the time of network provisioning; this is referred to as the hard requirement. In the second case, while a backup path is assigned for every primary path, the bandwidth is not actually reserved on the backup path during normal operating conditions; rather, a signalling mechanism is used to actually allocate
bandwidth once the failure occurs. In other words, since the bandwidth which would be needed on the backup path is not reserved during normal operation, best-effort services can use this bandwidth under normal operating conditions; this is referred to as the soft requirement. In the following, we discuss the construction of the objective function for various goals, separately for the hard and soft requirements.
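The difference between the two requirements can be seen on a single full-protection request; the demand value and link names below are purely illustrative.

```python
# One hypothetical request: demand d = 10, protection level alpha = 1.0,
# primary path over link "p", backup path over link "b".
d, alpha = 10, 1.0

# Hard requirement: the backup bandwidth alpha*d is reserved at provisioning
# time, so both links carry allocated load from the start.
hard_load = {"p": d, "b": alpha * d}

# Soft requirement: only the primary allocation counts under normal
# operation; the backup link's alpha*d is claimed via signalling only after
# a failure, and until then it is usable by best-effort traffic.
soft_load = {"p": d, "b": 0}

# Bandwidth freed for best-effort services under normal conditions.
best_effort_bonus = hard_load["b"] - soft_load["b"]
```

The price of the soft variant, as noted above, is the extra restoration delay introduced by the post-failure signalling.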
3.1 With hard requirement
For the hard requirement, (4.5) can be written as
$$f_1 = \sum_{\ell \in \mathcal{L}} \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \left( \delta_{km}^{s\ell} + \beta_{km}^{s\ell} \, \alpha_k^s \right) d_k^s \, x_{km}^s \tag{4.6}$$
for the first goal. Consider next the second goal: minimization of the routing cost of demand requests. If $c_{km}^{sp}$ is the routing cost on the primary path and $c_{km}^{sb}$ the routing cost on the backup path of candidate cycle $m \in \mathcal{P}_k^s$ of service request $s \in \mathcal{S}_k$ and demand pair $k \in \mathcal{K}$, then the objective of minimizing the total routing cost can be written as
$$f_2 = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \left( c_{km}^{sp} + \gamma \, c_{km}^{sb} \right) x_{km}^s \tag{4.7}$$
where $\gamma$ ($0 < \gamma < 1$) is a weighing factor for the routing cost of the backup path. For example, this parameter allows less weight to be given to routing on the backup path. If we use (4.7) in Problem (P) as it is, then the optimal solution is not to route any bandwidth at all, due to constraint (4.2); certainly, this is not desirable. Thus, for the third goal, if we introduce the penalty cost $\eta_k^s$ with the artificial variable $w_k^s$ to indicate the cost incurred for not routing a demand, then we can re-write the objective function (4.7) as
$$f_2^h = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} \left( c_{km}^{sp} + \gamma \, c_{km}^{sb} \right) x_{km}^s + \eta_k^s \, w_k^s \right). \tag{4.8}$$
In order to combine the cost due to the first goal, $f_1$, with the cost due to the second and third goals, $f_2^h$, we need to incorporate a new factor so that these two cost components can be normalized. That is, by introducing the normalizing factor $\theta_k^s$ ($> 0$) only for the routing-variable part in (4.8), i.e., for (4.7) (and not double counting the penalty cost for the artificial variables in (4.8)), we can write the combined objective as
$$f_{1+\theta 2} = \sum_{\ell \in \mathcal{L}} \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \left( \delta_{km}^{s\ell} + \beta_{km}^{s\ell} \, \alpha_k^s \right) d_k^s \, x_{km}^s + \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} \theta_k^s \left( c_{km}^{sp} + \gamma \, c_{km}^{sb} \right) x_{km}^s + \eta_k^s \, w_k^s \right). \tag{4.9}$$
In general, the unit routing costs on the primary and backup paths, $c_{km}^{sp}$ and $c_{km}^{sb}$, can be written in several ways. We show here three different ways. If the cost is based on the hop count of a tunnel, then we can write
$$c_{km}^{sp} = \sum_{\ell \in \mathcal{L}} \delta_{km}^{s\ell} \qquad \text{and} \qquad c_{km}^{sb} = \sum_{\ell \in \mathcal{L}} \beta_{km}^{s\ell} \, \alpha_k^s.$$
If the cost of a tunnel is based on the flow forwarding cost associated with the tunnel, then we can use instead
$$c_{km}^{sp} = d_k^s \qquad \text{and} \qquad c_{km}^{sb} = \alpha_k^s \, d_k^s.$$
Finally, if the unit cost is based on both flow and hop count, then we can write
$$c_{km}^{sp} = \sum_{\ell \in \mathcal{L}} \delta_{km}^{s\ell} \, d_k^s \qquad \text{and} \qquad c_{km}^{sb} = \sum_{\ell \in \mathcal{L}} \beta_{km}^{s\ell} \, \alpha_k^s \, d_k^s. \tag{4.10}$$
Suppose we incorporate the last one in the objective (4.9); then, by re-arrangement, we get
$$f_{1+\theta 2} = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} \left[ (1 + \theta_k^s) \, c_{km}^{sp} + (1 + \gamma \, \theta_k^s) \, c_{km}^{sb} \right] x_{km}^s + \eta_k^s \, w_k^s \right). \tag{4.11}$$
Finally, we consider the fourth goal: revenue maximization for accepting a demand. We address this factor in the form of a utility parameter. The utility may vary from one request to another, i.e., a request may be required to be allocated to a cycle even at a higher cost (that is, consuming more bandwidth) if the utility of that request is higher. Let $u_k^s$ ($0 < u_k^s < 1$) be the (normalized) utility of service class $s \in \mathcal{S}_k$ of request $k \in \mathcal{K}$. By incorporating utility, the objective function finally becomes
$$f_h = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} \left[ (1 + \theta_k^s) \, c_{km}^{sp} + (1 + \gamma \, \theta_k^s) \, c_{km}^{sb} - R \, u_k^s \right] x_{km}^s + \eta_k^s \, w_k^s \right) \tag{4.12}$$
where $R$ is the utility weighing factor. The value of $R$ dictates the importance of the utilities of requests relative to the costs. The special case of Problem (P) of minimizing the function $f_h$ will be referred to as (P$_h$). To summarize, in the above function we have incorporated four different goals; this is done through a combination of the parameters $\theta_k^s$, $\gamma$, $R$, $u_k^s$, and $\eta_k^s$. However, we need to distinguish that $u_k^s$ is an input choice due to a customer requirement for a network provider, much like $\alpha_k^s$. Thus, we are left with the key parameters $\theta_k^s$, $\gamma$, $R$, and $\eta_k^s$.
3.2 With soft requirement
The soft requirement pertains to the scenario where the backup paths are designated at the time of provisioning but not reserved with bandwidth, i.e., the capacity needed on the backup path is "allowed" for use by best-effort services as long as there is no failure; certainly, backup paths are immediately assigned the required protection bandwidth (depending on the BAG service class) as soon as a failure occurs, through a signalling message, thus bumping out best-effort services. Note that such a benefit may come at the expense of an increase in the time required to restore from a failure. Regardless, the load on link $\ell$ due to bandwidth on the primary paths alone can be captured as $r_\ell = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \delta_{km}^{s\ell} \, d_k^s \, x_{km}^s$ for the first goal, leading to the network load
$$f_1 = \sum_{\ell \in \mathcal{L}} \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \delta_{km}^{s\ell} \, d_k^s \, x_{km}^s. \tag{4.13}$$
With the soft requirement, the routing cost is due only to the primary paths; thus, incorporating the penalty cost, we have
$$f_2 = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} c_{km}^{sp} \, x_{km}^s + \eta_k^s \, w_k^s \right) \tag{4.14}$$
for the second and the third goals. Similar to the hard requirement scenario, we can construct the combined allocated-capacity and routing-cost function $f_{1+\theta 2}$ as
$$f_{1+\theta 2} = \sum_{\ell \in \mathcal{L}} \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \sum_{m \in \mathcal{P}_k^s} \delta_{km}^{s\ell} \, d_k^s \, x_{km}^s + \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} \theta_k^s \, c_{km}^{sp} \, x_{km}^s + \eta_k^s \, w_k^s \right).$$
Incorporating the routing cost in the same fashion as in (4.10) and re-arranging the terms, we get
$$f_{1+\theta 2} = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left( \sum_{m \in \mathcal{P}_k^s} (1 + \theta_k^s) \, c_{km}^{sp} \, x_{km}^s + \eta_k^s \, w_k^s \right).$$
Incorporating the utility for revenue for carried demands (the fourth goal), we have the final objective function as
$$f_s = \sum_{k \in \mathcal{K}} \sum_{s \in \mathcal{S}_k} \left[ \sum_{m \in \mathcal{P}_k^s} \left( (1 + \theta_k^s) \, c_{km}^{sp} - u_k^s \, R \right) x_{km}^s + \eta_k^s \, w_k^s \right]. \tag{4.15}$$
Note that (4.15) looks similar to (4.12); however, there is a subtle difference between the hard and soft requirements, which is captured through the cost components. Problem (P) with the above objective function will be referred to as problem (P$_s$).
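As a small numerical illustration, the per-cycle objective coefficients of (4.12) and (4.15) can be computed as follows; the cost, normalization, and utility values are invented, and a larger $R$ simply lets the utility term dominate the routing cost.

```python
def hard_coeff(c_p, c_b, theta, gamma, R, u):
    """Coefficient of x_{km}^s in the hard-requirement objective (4.12):
    (1 + theta)*c_p + (1 + gamma*theta)*c_b - R*u."""
    return (1 + theta) * c_p + (1 + gamma * theta) * c_b - R * u

def soft_coeff(c_p, theta, R, u):
    """Coefficient of x_{km}^s in the soft-requirement objective (4.15):
    (1 + theta)*c_p - u*R (no backup-path cost term)."""
    return (1 + theta) * c_p - u * R

# Hypothetical cycle: primary cost 3, backup cost 4, theta = gamma = 0.5,
# utility 0.9. Raising R from 1 to 10 makes this high-utility cycle far
# more attractive (a smaller coefficient in a minimization).
hard_low_R  = hard_coeff(c_p=3, c_b=4, theta=0.5, gamma=0.5, R=1,  u=0.9)
hard_high_R = hard_coeff(c_p=3, c_b=4, theta=0.5, gamma=0.5, R=10, u=0.9)
soft_low_R  = soft_coeff(c_p=3, theta=0.5, R=1, u=0.9)
```

The soft coefficient is smaller than the hard one for the same cycle, reflecting that no backup bandwidth cost is charged under normal operation.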
For either the hard or the soft requirement, we can see that four goals are combined in a single objective function. In the process, we find that there are four key parameters (drivers) that impact the final allocation of tunnels and bandwidth: $R$, $\theta_k^s$, and $\eta_k^s$ in the case of both the hard and the soft requirements, and $\gamma$ in the case of the hard requirement. In Section 5, we will discuss the implication of these parameters on the effectiveness of our formulation through appropriate network measures.
4. Solution approach
The network traffic engineering problem presented in this work is for book-ahead guaranteed services; thus, the tunnel selection can be determined off-line, ahead of time, so that tunnels can be provisioned for the different requests and levels of protection. First note that Problem (P) is an integer linear programming (ILP) problem. We describe here two solution approaches. Our first solution approach, gIP, is based on successive approximation by continuous relaxation of Problem (P), while our second approach, gSA, is based on the simulated allocation method (Pioro and Gajowniczek, 1997).
4.1 Heuristic I: gIP
The first approach is based on successive approximations by continuous relaxations of the integer linear programming problem (P), as shown in Algorithm 1. We define two sets, X_f and X_v, containing the variables x that have been fixed (their values are already decided) and the ones that are still kept as variables, respectively. The LP relaxation of Problem (P), given these two sets, will be denoted by Relaxed_gIP(X_v, X_f). Initially, X_f is empty and X_v contains all x variables. Note that Relaxed_gIP(X_v, X_f) is a continuous linear programming problem which can be solved by the simplex method. Upon solving the very first relaxation, we obtain a lower bound on the objective of the original problem. We then inspect the values of the variables x_{km}^s ∈ X_v at optimality. These values can be categorized into the following ranges: (a) x_{km}^s ≥ 1.0 − ε, (b) x_{km}^s ≤ ε, (c) ε < x_{km}^s < 1.0 − ε, where ε > 0 is the acceptable error margin. The values of x_{km}^s that fall in range (a) are set to 1, while those in range (b) are set to 0. With this setup, the sets X_v and X_f are updated as X_v ← X_v \ {x_{km}^s}, X_f ← X_f ∪ {x_{km}^s}. Variables whose values fall under type (c) are left as variables; that is, they remain in set X_v. We adjust capacities and tunnels on each link based on the values of the x_{km}^s which are now in X_f. We then solve the reduced problem Relaxed_gIP(X_v, X_f). We repeat the procedure till we find variables that fall only in range
Algorithm 1 gIP: Successive Approximation Approach
set X_v = {x_{km}^s : m ∈ V_k^s, s ∈ S_k, k ∈ K} ∪ {w_k^s : s ∈ S_k, k ∈ K}
set X_f = ∅, done ← 0, change ← 1
while done = 0 AND change = 1 do
  x ← solve Relaxed_gIP(X_v, X_f)
  for all x ∈ X_v do
    if x ≤ ε then
      X_f ← X_f ∪ {x}
      X_v ← X_v \ {x}
      x = 0, change = 1
    else if x ≥ 1.0 − ε then
      X_f ← X_f ∪ {x}
      X_v ← X_v \ {x}
      x = 1, change = 1
    else
      done = 0
    end if
  end for
end while
if done = 0 OR change = 1 then
  x ← solve Integer_gIP(X_v)
end if
return x

(c); when this occurs, we solve the remaining problem as an ILP problem. In our experience, we have found that the original ILP problem is reduced considerably through this procedure, and the final, reduced ILP can be manageably solved by a standard branch-and-bound or branch-and-cut method. It is possible that when the value of an x_{km}^s that falls in the range of type (a) is set to 1 and the problem is rerun, the solution might be infeasible because of the capacity constraint (4.3) being violated. To avoid this problem, we keep track of the current state of allocation on link capacities and tunnels. When the obtained value of x_{km}^s is of type (a), we assign it to be 1 only if no link violates the capacity and tunnel constraints; accordingly, we update sets X_v and X_f. Otherwise, we do not update these sets and let x_{km}^s remain a variable. Such an issue does not arise when deciding for types (b) and (c).
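The iterative structure of gIP can be summarized in a short sketch. Here `solve_relaxation` and `solve_ilp` are assumed callables wrapping an LP/ILP solver (placeholders, not part of the chapter), and the capacity/tunnel feasibility check discussed above is taken to live inside the solver wrapper.

```python
# Illustrative sketch of the gIP successive-approximation loop: solve the
# LP relaxation, fix variables within eps of 0 or 1, and repeat; the
# remaining (type-(c)) variables go to an exact ILP solve. The callables
# are assumptions standing in for a real LP/ILP library.

def gip(solve_relaxation, solve_ilp, variables, eps=1e-3):
    fixed = {}                    # X_f: variable -> fixed 0/1 value
    free = set(variables)         # X_v: still-fractional variables
    change = True
    while free and change:
        change = False
        # LP over the free variables, with the fixed ones bound to 0/1
        values = solve_relaxation(free, fixed)
        for v in list(free):
            if values[v] <= eps:                 # range (b): round down
                fixed[v] = 0; free.remove(v); change = True
            elif values[v] >= 1.0 - eps:         # range (a): round up
                fixed[v] = 1; free.remove(v); change = True
    if free:                                     # only type-(c) variables remain
        fixed.update(solve_ilp(free, fixed))     # reduced ILP (branch-and-bound)
    return fixed
```

With a stub relaxation that returns fixed fractional values, one pass fixes the near-integral variables and the stub ILP decides the rest.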
4.2 Heuristic II: gSA

The second heuristic is based on simulated allocation (Pioro and Gajowniczek, 1997). It works with partial allocation sequences x = (x_{km}^s, m ∈ V_k^s, s ∈ S_k, k ∈ K). During the execution of the algorithm, values of w are chosen in such a way that constraint (4.2) is always satisfied, i.e.,
  Σ_{m∈V_k^s} x_{km}^s = 0  ⟹  w_k^s = 1,        (4.16)

otherwise w_k^s = 0, for all s ∈ S_k, k ∈ K. Additionally, we define
  c(x, ℓ) = Σ_{k∈K} Σ_{s∈S_k} Σ_{m∈V_k^s} [δ_{ℓm}^s + α_k^s λ_{ℓm}^s] d_k^s x_{km}^s        (4.17a)

and

  t(x, ℓ) = Σ_{k∈K} Σ_{s∈S_k} Σ_{m∈V_k^s} [δ_{ℓm}^s + 1_{{α_k^s > 0}} β_{ℓm}^s] x_{km}^s.        (4.17b)
Note that c(x, ℓ) and t(x, ℓ) determine the present state of the constrained resources (allocated capacities and tunnels on link ℓ) for a given allocation sequence x. A path m′ of set V_k^s is said to be an accessible path from the present allocation sequence x if
  c(x, ℓ) + [δ_{ℓm′}^s + α_k^s λ_{ℓm′}^s] d_k^s ≤ C_ℓ,  ℓ ∈ L,        (4.18a)

and

  t(x, ℓ) + [δ_{ℓm′}^s + 1_{{α_k^s > 0}} β_{ℓm′}^s] ≤ T_ℓ,  ℓ ∈ L.        (4.18b)
Thus, setting the chosen x_{km′}^s = 1 does not violate constraints (4.3) and (4.4). We define M as the set of maximal allocation sequences, such that x ∈ M means that for allocation x there exists no unallocated demand (w_k^s = 1) with an accessible path m′. The algorithm starts with x = 0 and w = 1. At each step, we either choose to allocate (allocate(x)) or to deallocate (deallocate(x)), based on the current state of allocation x. For x ∉ M, we execute procedure allocate(x); otherwise, procedure deallocate(x). The routine allocate(x) collects all the unallocated demands (w_k^s = 1) and randomly chooses one of them, an s ∈ S_k of some k ∈ K. The paths in the candidate set V_k^s are examined in order of increasing cost (ĉ_{km}^s) and checked for accessibility. The first accessible path m′ that satisfies the capacity and tunnel constraints is chosen, and the corresponding variable x_{km′}^s is set to 1. Procedure deallocate(x) invokes procedure deallocate_1(x) with probability q(x); otherwise, it invokes procedure deallocate_2(x).
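A minimal sketch of the allocate move, under assumed data structures (each demand carries its candidate paths with costs, and `accessible` stands in for the checks (4.18a)-(4.18b)):

```python
import random

# Sketch of the gSA allocate step. Data layout is an assumption made for
# illustration, not the authors' code: demands maps demand id -> dict with
# candidate 'paths'; state holds x (chosen paths) and w (rejection flags).

def allocate(state, demands, accessible, rng=random):
    unallocated = [d for d in demands if state["w"][d] == 1]
    if not unallocated:
        return False                       # x is already maximal (x in M)
    d = rng.choice(unallocated)            # pick an unallocated demand at random
    for path in sorted(demands[d]["paths"], key=lambda p: p["cost"]):
        if accessible(path, state):        # first accessible path wins
            state["x"][(d, path["id"])] = 1
            state["w"][d] = 0
            return True
    return False                           # no accessible path for this demand
```

The deallocate moves would undo such assignments symmetrically, restoring w_k^s = 1 for the affected demand.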
Algorithm 2 gSA: Simulated Allocation Approach
step ← 0, count ← 0, F* ← ∞, (x, w) ← (0, 1)
while step < step_max AND F* > F_min do
  step ← step + 1
  if x ∉ M then
    allocate(x)
  else if random < q(x) then
    deallocate_1(x)
  else
    deallocate_2(x)
  end if
  if f(x, w) < F* then
    F* ← f(x, w), (x*, w*) ← (x, w)
  end if
end while
return (x*, w*)

The probability q(x) in Algorithm 2 plays an important role in the convergence and solution quality of the algorithm. It depends on the current value of the objective function compared to a lower bound on the optimal objective, F_min, which is obtained from the LP relaxation
of the entire problem. By considering the relative gap

  Δ(x, w) = (f(x, w) − F_min) / F_min,

we set q(x) as follows:

  q(x) = q_1, if Δ(x, w) > F̄;  q(x) = q_2, otherwise.

In practice, we have found the following values to work well: F̄ = 0.1, q_1 = 0.96, and q_2 = 0.8. We have set step_max = 10,000.
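The rule for q(x) reduces to a one-line function; the relative-gap expression used here is our reading of the text and should be treated as an assumption.

```python
# Sketch of the deallocation-probability rule: compare the current
# objective f(x, w) against the LP lower bound F_min and pick q1 when the
# relative gap exceeds the threshold (0.1 in the text), q2 otherwise.

def q_of_x(f_current, f_min, gap_threshold=0.1, q1=0.96, q2=0.8):
    gap = (f_current - f_min) / f_min      # relative optimality gap
    return q1 if gap > gap_threshold else q2
```

Far from the bound the search deallocates more aggressively (q1); near the bound it deallocates less often (q2).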
5. Results and discussion
The primary goal of this section is to show the effectiveness of our formulation in solving the network traffic engineering problem at varied levels of protection; in particular, we show the role of the key parameters γ, R, θ_k^s, η_k^s in capturing the different goals, for both the hard and the soft requirements. In addition, we are interested in the convergence behavior of the heuristics, and in the interplay between the capacity and tunnel constraints. For our study, we have considered four example networks EN-I, EN-II, EN-III, EN-IV (Figures 4.2-4.5). EN-I has 12 nodes, 18 links, and an average nodal degree (ratio of number of edges to number of nodes) of 1.5; EN-II has 6 nodes, 12 links, and an average nodal degree of 2.0; EN-III has 12 nodes, 25 links, and an average nodal degree of 2.08; EN-IV has 10 nodes, 26 links, and an average nodal degree of 2.6. The baseline link capacity is set to 622 Mbps for each link. Three BAG demand classes are considered for each demand pair in the network at 100 Mbps, one for each of the three levels of service protection: zero-, fractional-, and full-protection; we have then associated a utility cost appropriate for each of these classes. In our case, for each k ∈ K, we have set u_k^1 = 1.0 with α_k^1 = 0.0 for zero-protection (service class 's1'), u_k^2 = 3.0 with α_k^2 = 0.5 for partial-protection (service class 's2'), and u_k^3 = 5.0 with α_k^3 = 1.0 for full-protection (service class 's3'). The penalty cost for each demand and service class, η_k^s, is then computed as η*(1 + α_k^s).
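As a quick check of this parameter setup, with η* = 5 (the default used later in the experiments), the rule η_k^s = η*(1 + α_k^s) gives the following per-class penalties:

```python
# Per-class penalty eta_k^s = eta_star * (1 + alpha_k^s), using the
# protection fractions alpha stated in the text; eta_star = 5 is the
# default value used in the experiments.

def penalty(eta_star, alpha):
    return eta_star * (1.0 + alpha)

classes = {"s1": 0.0, "s2": 0.5, "s3": 1.0}   # protection fractions alpha
penalties = {c: penalty(5.0, a) for c, a in classes.items()}
```

So rejecting a full-protection demand costs twice as much as rejecting a zero-protection one.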
Figure 4.2. Experimental network I
Figure 4.3. Experimental network II
Figure 4.4. Experimental network III
Figure 4.5. Experimental network IV
An important component of setting up the problem is the generation of candidate cycle paths. It may be noted that Suurballe and Tarjan have developed an algorithm for generating a shortest pair of disjoint paths (Suurballe, 1974; Suurballe and Tarjan, 1986); this, however, helps in generating only the shortest cycle, not a set of candidate cycles. In our case, we have implemented a simple procedure by extending the k-shortest path algorithm (Lawler, 1976), where the k paths generated by the k-shortest path algorithm are compared to each other to filter out common links and generate a set of candidate cycles containing only disjoint path pairs. For the test networks we considered in our study, the determination of the candidate cycles through this procedure took only a few seconds of computing time. Note that the candidate set is only a feeder to the optimization model, and certainly, the eventual solution can depend on how many candidate cycles are included at the beginning. Based on our preliminary study, we have found that there was no difference in solution quality when the number of candidate cycles was five or more for EN-II, while some minor differences were found in solution quality for the other test networks until about 15 candidate cycles; thus, in all the studies reported below, we have set the number of candidate cycles to 5 for EN-II and to 15 for the others.
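The filtering step can be sketched as follows; the k-shortest-path routine itself is assumed given (following Lawler, 1976), and paths are represented simply as lists of links:

```python
from itertools import combinations

# Sketch of the candidate-cycle filter described in the text: take the k
# shortest paths between a node pair and keep pairs that share no links;
# each surviving (primary, backup) pair forms a candidate cycle. Path and
# link representations are assumptions for illustration.

def candidate_cycles(k_shortest_paths, max_cycles):
    """k_shortest_paths: list of paths, each a list of hashable links,
    ordered by increasing cost."""
    cycles = []
    for p, q in combinations(k_shortest_paths, 2):
        if not set(p) & set(q):            # filter out pairs with common links
            cycles.append((p, q))          # link-disjoint pair -> candidate cycle
            if len(cycles) == max_cycles:
                break
    return cycles
```

Because the input is cost-ordered, the cheapest disjoint pairs are found first, which matches the intent of keeping only a small candidate set (5 or 15 cycles above).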
5.1 Comparison of gIP and gSA
For a comparative study of heuristics gIP and gSA, we use Problem (P) with the hard requirement, i.e., Problem (P_h). In our study, we have set the parameter values as θ = 0.5, η* = 5, R = 100, and γ = 0.5. We have run cases with various tunnel limits (T_ℓ); here, we report results when the tunnel limit on each link was set to 50, 15, 25, 20 for EN-I, EN-II, EN-III, EN-IV, respectively. We have started the capacity at the baseline value (622 Mbps) and increased it up to 400%. The combination of tunnel values and the change in capacity allows us to understand the relation between them; we will explore the variation in the number of tunnels later in Section 5.3. Besides gIP and gSA, we also use a hybrid heuristic, gIP+gSA, where we first solve Problem (P_h) using gIP and then use the final solution of gIP as the starting point in gSA for further improvement. Note that the continuous relaxation of the binary variables of Problem (P_h) serves as a lower bound on the integer solution obtained; this is marked as "LP" in the results reported. We report our results in Figures 4.6 to 4.9. From the results, we can see that heuristic gIP is a very good method in practice and often gives results close to the LP lower bound; on the other hand, heuristic gSA was not as effective unless the network has plenty of capacity. Nevertheless, in several instances, we have found the hybrid method, gIP+gSA, to give a better objective value than gIP alone. Thus, in the rest of the paper, we report results obtained using the hybrid method, gIP+gSA.
5.2 Effectiveness of formulation (P_h)
We now discuss detailed results for Problem (P_h), especially to understand its effectiveness through the key parameters γ, R, θ, and η*. For this purpose, it is helpful to consider a set of measures or indicators, instead of considering the objective function value. We define the following measures for this purpose:

APR: Average Path Ratio, in terms of the length of the primary path to the backup path
MRC: Minimal Residual (normalized) Capacity, i.e., the minimum over links of the normalized residual capacity
FAD: Fraction of Accepted Demands (out of total demand requests)

The above measures are of common importance to network providers in regard to provisioning of services in a network; they will be discussed in relation to their dependency on the key parameters, which a network provider can tune depending on the importance of one or more
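Two of these indicators are straightforward to compute from a solution. The sketch below assumes link capacities and used bandwidths given as plain dictionaries; the exact normalization used for MRC is our assumption.

```python
# Illustrative computation of the MRC and FAD indicators. (APR would
# additionally require the chosen primary/backup path lengths.)

def mrc(capacity, used):
    """capacity, used: dicts link -> Mbps; minimum normalized residual."""
    return min((capacity[l] - used[l]) / capacity[l] for l in capacity)

def fad(accepted, total):
    """Fraction of accepted demands out of total demand requests."""
    return accepted / total
```

MRC is driven by the single most loaded link, which is why freeing capacity on a bottleneck can change it sharply.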
Figure 4.6. Value of f_h for EN-I with increasing capacity and T_ℓ = 50
Figure 4.7. Value of f_h for EN-II with increasing capacity and T_ℓ = 15
Figure 4.8. Value of f_h for EN-III with increasing capacity and T_ℓ = 25
Figure 4.9. Value of f_h for EN-IV with increasing capacity and T_ℓ = 20
(Each figure plots the objective value against link capacity, from 100% to 400% of baseline, comparing the heuristics with the LP lower bound.)
over other goals within their network. For brevity, we discuss our results for example network EN-III, where we have set T_ℓ = 50 (the impact of different values of the tunnel constraint will be discussed later in Section 5.3). The default values of the key parameters are γ = 0.5, R = 100, θ_k^s = 0.5, and η* = 5, except when one of these parameters is varied independently. First, we start with the impact on relative prioritization of the service classes: zero-protection, partial-protection, and full-protection.
Relative Prioritization of Service Classes: In Figure 4.10, we show the measure FAD for the network as a whole and also for each protection service class. We look at this measure as the capacity of the network is increased. For the baseline capacity, some demands are accepted; as capacity is increased, more demands are accepted into the network, which is expected. It is important to note that in all cases, most demands are accepted for the service class with full-protection ('s3'). It may, however, be surprising that not all demand requests with full-protection are accepted; this again depends on the value of the key parameters, in
Figure 4.10. FAD for EN-III when capacity is increased
Figure 4.11. Value of APR for EN-III with increasing γ
particular, η* (which can be increased to let more demands be carried, rather than penalized) and R together with u (the revenue of a particular service class is increased depending on the importance of the service class).

Dependence on γ: We have also evaluated the role of γ; recall that γ is the routing cost weighting factor for the backup path. When γ is set to zero, there is no routing cost for the backup path, whereas when it is set to one, the backup path has full routing cost based on the number of hops and the allocated flow. For experiments with increasing values of γ, we compute APR for all service classes and demands with a protection requirement (α > 0), for three different values of the network capacity. We present results in Figure 4.11 for EN-III. We can see that as γ increases, the value of the ratio APR increases as well. Since the routing cost of the primary path is always more than that of the backup path, in the presence of additional capacity our model, on average, results in provisioning that prefers a shorter primary path.

Dependence on R: We study the impact of the utility weighting parameter R on the allocation of demands; it determines the overall utility as compared to the routing costs and the bandwidth/tunnel requirement costs. Thus, we consider the measure MRC for increasing values of R. We present results in Figure 4.12 for EN-III. It may be noted that changing R does not seem to have much impact as R increases. As expected, with an increase in network bandwidth, MRC also increases.

Dependence on θ: We now consider the impact of changing θ; recall that θ controls the importance given to the allocation cost of a request as compared to its normalized routing cost. When set to zero, only the routing cost is accounted for in the optimization formulation; when set to one,
Figure 4.12. Value of MRC for EN-III with increasing R
Figure 4.13. Value of FAD for EN-III with increasing θ
both have the same weight and hence play equally important roles. Due to its role in determining the cost of a request, the measure FAD is the most relevant here. We present results for EN-III in Figure 4.13. We note that FAD decreases with an increase in the value of θ. Due to the increase in θ, for some demands the penalty cost becomes less than the minimum cost of the associated cycle; such a scenario forces the model to reject the demand. Moreover, the cycles with a higher number of hops pay a heavier penalty. Observe that the impact of increasing θ is realized only when the overall cost of the minimal-cost accessible path rises above the penalty cost for rejecting the demand (η); thus, we find regions of θ where FAD remains unchanged.

Dependence on η*: The parameter η* appears in the form of a penalty for not accepting a specific demand request. For a sufficiently high value of η*, the network would accept all the demands that it can carry, leaving minimal or no bandwidth for best-effort traffic. However, if we choose a very low value for η*, most or all of the demands will be rejected and the network would be largely underutilized. In this case, the measure FAD is of interest for different values of η*. We present results for EN-III in Figure 4.14. For a tightly capacitated network, there is not much difference as η* increases. However, for a network with moderate bandwidth that can accept more demands, we do see that MRC is low when η* is small and, as expected, increases dramatically as η* increases.
5.3 Impact of number of tunnels
In this section, we study the impact of the restriction on the number of tunnels on each link. For this study we use formulation (P_h). We consider two measures, MRC and FAD, for increasing network capacity and increasing numbers of tunnels (see Figures 4.15-4.18). We have set
Figure 4.14. Value of MRC for EN-III with increasing η*
the values of the other key parameters as follows: R = 100, η* = 10, θ_k^s = 0.5, and γ = 0.5. Note that the value of FAD increases with increasing capacity until it stabilizes, based on the allowed number of tunnels on the links. After that, the excess capacity is of no consequence, since the links are "congested" in terms of number of tunnels. Increasing the number of tunnels on each link affects the solution in two ways. One way is the direct increase in the value of FAD and a consequent decrease in the value of MRC: the acceptance of more demands pushes link utilization higher and leaves less bandwidth for best-effort traffic. The other way is that it allows demands to use paths with fewer hops which had excess capacity but no extra tunnels; with a higher limit on tunnels on links, many more shorter paths (mostly lower in overall cost too) become accessible. The above-mentioned behavior can lead to an increase in the value of MRC for increasing tunnels and increasing FAD. As demands begin to choose cycles with fewer hops, additional capacity is freed in the network. This capacity is used by other demands in a similarly efficient way. Similar behavior on the part of all the demands leads to solutions with a smaller value of MRC. When we consider increasing the capacity of links, we see a similar behavior. As we increase the amount of capacity, the value of MRC shows a sudden decrease (assuming the same FAD), since some of the demands using the longer paths can now be moved onto shorter paths, which leads to a significant decrease in the value of MRC. However, once all the demands are moved to minimum-hop cycles, the subsequent decrease in the value of MRC is only obtained by the increase in the capacity of each link. We specifically consider the behavior of EN-II (see Figure 4.16). Note that the value of FAD reaches 1.0 at 150% of the baseline capacity, at which
Figure 4.15. MRC and FAD for EN-I with increasing capacity
Figure 4.16. MRC and FAD for EN-II with increasing capacity
Figure 4.17. MRC and FAD for EN-III with increasing capacity
Figure 4.18. MRC and FAD for EN-IV with increasing capacity
value MRC is almost equal for both cases (T_ℓ = 10, 25). When we increase the capacity to 200% of the baseline, the value of MRC for T_ℓ = 25 falls dramatically as compared to T_ℓ = 10. This again is due to the freed-up capacity in the network as the demands shift from longer to smaller-hop cycles. At 200% of the baseline capacity, all the demands have preferentially chosen minimal-hop cycles, and hence the subsequent decrease in MRC is only due to increasing capacity (similar to the T_ℓ = 10 case). The results demonstrate that capacity and tunnels are equally important while provisioning a network. The presence of too few tunnels nullifies the presence of abundant capacity, leading to underutilized links. Moreover, having too many tunnels is useful only when sufficient capacity is available in the network. Thus, our inference is that accounting for both capacity and tunnels leads to effective traffic engineering solutions. Both can be viewed as resources that impact the amount of traffic carried by a network.
5.4 Effectiveness of formulation (P_s)
In this section, we study the soft requirement formulation and its capability to incorporate the various objectives in an integrated fashion. Note that the soft requirement is based on the circumstance that the backup paths are only found a priori but are not reserved; rather, they are used by best-effort traffic during normal operating conditions. In the event of failures, the affected demands are routed over the already chosen backup paths, preempting best-effort traffic. Here, there is no cost of reserving the backup paths, since the capacity is still used by best-effort traffic; consequently, parameter γ has no role to play. Such a difference fundamentally changes the way formulation (P_s) is affected by changes in the parameters. In Figures 4.19-4.21, we present results on the impact of changes in parameters R, θ, and η*. We have used the number of allowed tunnels T_ℓ = 15, 20, 50, 50, the number of candidate cycles 5, 15, 15, 15, and the link capacity 150%, 200%, 400%, 500% of the baseline for example networks EN-I, EN-II, EN-III, and EN-IV, respectively. The default values of the key parameters are R = 100, θ_k^s = 0.5, and η* = 5. The impact of an increase in R on the minimum residual capacity (MRC) is shown in Figure 4.19. The behavior observed is similar to the hard requirement formulation (4.12), although there are minor differences. Here, we have smaller values of the routing-related cost, since this cost component does not include the cost of the backup path; thus, smaller values of R provide the same overall cost to a demand. This leads to similar behavior of the soft requirement formulation at smaller values of R as that of the hard requirement formulation at higher values of R. Similarly, as θ increases, we note that the behavior shown in Figure 4.20 is similar to that of the hard requirement formulation, shown in Figure 4.13. On a closer look, we observe that the impact of θ has been diluted. The decrease in the value of FAD is much less than the one observed for the hard requirement. This can be attributed to the absence of routing cost for the backup paths. Note that θ determines the routing cost vis-a-vis the allocation cost of a cycle. Since for the soft requirement the cost of a cycle is reduced to that of the primary path, the impact of θ on the overall cost is reduced. The impact of an increase in η* on FAD is shown in Figure 4.21. The minor differences between the soft and hard requirements are due to the relative change of the value of the routing cost component. Here, we only have the cost of the primary path (no cost for the backup path), and hence smaller values of η* provide the same relative overall cost to a demand.
Figure 4.19. Value of MRC for example networks with increasing R
Figure 4.20. Value of FAD for example networks with increasing θ
Figure 4.21. Value of MRC for example networks with increasing η*
This leads to similar behavior of the soft requirement formulation at smaller values of η* as that of the hard requirement formulation at higher values. In general, the model with soft requirements, (P_s), leads to a dilution of the impact of the parameters on the solution as compared to the hard requirement model (P_h). Our results above confirm this behavior.
6. Summary
In this paper, we consider the problem of traffic engineering a backbone network supporting services with varied protection requirements. We have developed a novel modeling approach using a cycle path concept so that service classes with different protection requirements can all be modeled in the same framework. We also integrate different objectives, with seemingly different goals, into a unified, single objective function. The objectives considered account not only for the allocation of the service classes but also for the bandwidth available to the best-effort service class. Moreover, we have considered hard and soft requirements in regard to provisioning backup paths ahead of time or only during a failure.
The models formulated are integer linear programming problems, for which we have presented two heuristic methods. The first heuristic is iterative in nature and uses continuous relaxations of the problem to derive integer solutions. The second heuristic is based on the simulated allocation technique. We compare the two heuristics and show that a hybrid approach combining both methods works quite well. We then proceed to show the effect of the various key parameters that are integral to the integrated traffic engineering problem. We have presented extensive example results showing that solving the problem based on the aggregated objective function can satisfactorily meet a variety of measures typically of interest to a service provider. We also discuss the interplay between various parameters and resources and show their relative impact on each other. The tradeoff between accepting new requests of survivable classes and the residual bandwidth for best-effort services was also evaluated. The results also showed that capacity and tunnels can have equally important roles in ensuring effective traffic engineering of a network.

Acknowledgments

We thank Michal Pioro for his suggestions on how to apply the Simulated Allocation method for solving the formulation presented in this paper. This work is supported in part by DARPA and the Air Force Research Lab under agreement no. F30602-97-1-0257.
References

Aubin, R. and Nasrallah, H. (2003). MPLS fast reroute and optical mesh protection: A comparative analysis of the capacity required for packet link protection. In: Proceedings of Design of Reliable Communication Networks (DRCN 2003), pp. 349-355, Banff, Canada.
Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus, J. (1999). Requirements for traffic engineering over MPLS. Internet RFC 2702, http://www.ietf.org/rfc/rfc2702.txt.
Davie, B. and Rekhter, Y. (2000). MPLS: Technology and Applications. Morgan Kaufmann Publishers.
Fumagalli, A., Cerutti, I., Tacca, M., Masetti, F., Jagannathan, R., and Alagar, S. (1999). Survivable networks based on optimal routing and WDM self-healing rings. In: Proceedings of INFOCOM, pp. 726-733, IEEE Press.
Kajiyama, Y., Tokura, N., and Kikuchi, K. (1994). An ATM VP-based self-healing ring. IEEE Journal on Selected Areas in Communications, 12(1):171-187.
Kawamura, R., Sato, K., and Tokizawa, I. (1994). Self-healing ATM networks based on virtual path concept. IEEE Journal on Selected Areas in Communications, 12(1):120-127.
Kodialam, M. and Lakshman, T.V. (2000). Dynamic routing of bandwidth guaranteed tunnels with restoration. In: Proceedings of INFOCOM, pp. 902-911, IEEE Press.
Krithikaivasan, B., Srivastava, S., Medhi, D., and Pioro, M. (2003). Backup path restoration design using path generation technique. In: Proceedings of Design of Reliable Communication Networks (DRCN), pp. 77-84, Banff, Canada.
Lawler, E.L. (1976). Combinatorial Optimization: Networks and Matroids. Holt, Rinehart, and Winston.
Le Faucheur, F. and Lai, W. (2003). Requirements for support of differentiated services-aware MPLS traffic engineering. Internet RFC 3564, http://www.ietf.org/rfc/rfc3564.txt, July 2003.
Medhi, D. (1991). Diverse routing for survivability in a fiber-based sparse network. In: Proceedings of International Conference on Communications, pp. 672-676, IEEE Press.
Medhi, D. (1994). A unified approach to network survivability for teletraffic networks: Models, algorithms and analysis. IEEE Transactions on Communications, 42:534-548.
Medhi, D. and Khurana, R. (1995). Optimization and performance of network restoration schemes for wide-area teletraffic networks. Journal of Network and Systems Management, 3:265-294.
Pioro, M. and Gajowniczek, P. (1997). Solving multicommodity integral flow problems by simulated allocation. Telecommunication Systems, 7(1-3):17-28.
Pioro, M. and Medhi, D. (2004). Routing, Flow, and Capacity Design in Communication and Computer Networks. Morgan Kaufmann Publishers.
Ramamurthy, S. and Mukherjee, B. (1999). Survivable WDM mesh networks, part I - protection. In: Proceedings of INFOCOM, pp. 744-751, IEEE Press.
Srivastava, S., Krithikaivasan, B., Medhi, D., and Pioro, M. (2003). Traffic engineering in the presence of tunneling and diversity constraints: Formulation and Lagrangean decomposition approach. In: Proceedings of International Teletraffic Congress, Berlin, pp. 461-470, Elsevier Science.
Suurballe, J.W. (1974). Disjoint paths in a network. Networks, 4:125-145.
Suurballe, J.W. and Tarjan, R.E. (1986). A quick method for finding shortest pairs of disjoint paths. Networks, 14:325-336.
Wu, T.H. (1992). Fiber Network Service Survivability. Artech House.
Xiong, Y. and Mason, L.G. (1999). Restoration strategies and spare capacity requirements in self-healing ATM networks. IEEE/ACM Transactions on Networking,
Chapter 5
BALANCING TRAFFIC FLOWS IN RESILIENT PACKET RINGS
Peter Kubat
James MacGregor Smith

Abstract
Resilient Packet Ring (RPR) is a new telecommunication transport technology that combines (a) the high bandwidth utilization usually associated with Ethernet and (b) the 50 ms protection schemes (in the case of segment/node failures) associated with SONET rings. The RPR is, in essence, a distributed Ethernet switch in which the RPR nodes are connected with two counter-rotating rings (a clockwise and a counter-clockwise ring). The ring spans are either SONET or Gbit Ethernet. The (unidirectional) point-to-point traffic demands (10/100/1000 Ethernet and/or TDM) can be carried on either ring. In this paper, a ring-loading problem is considered which arises in the engineering and planning of RPR systems. Specifically, for a given set of non-splittable and unidirectional commodities (point-to-point demands), the objective is to find the routing for each commodity (i.e., the assignment of the commodity to either the clockwise or the counter-clockwise ring) so that the maximum link segment load is minimized. In the stochastic scenario, when the objective is to minimize the maximum packet delay, we show how to formulate an optimization model which can be solved analytically as well. In both the deterministic and the stochastic case, the RPR loading problem is formulated as an Integer Programming (IP) problem, and two simple heuristics (namely: (i) Greedy and (ii) LP relaxation) are proposed to solve it. Computational experience with these heuristics is reported, and results are compared with the optimal (integer programming) solution.
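To illustrate the deterministic ring-loading problem described above, here is a hypothetical greedy sketch: each commodity is assigned, in arrival order, to whichever ring (clockwise or counter-clockwise) yields the smaller resulting maximum segment load. The greedy rule and data representation are our own; the chapter's Greedy heuristic may differ.

```python
# Assumed setup: nodes 0..n-1 on a ring; a commodity (src, dst, demand)
# traverses direction-specific segments, since the two counter-rotating
# rings are separate resources. src != dst is assumed.

def ring_segments(n, src, dst, clockwise=True):
    """Segments traversed from src to dst on an n-node ring, tagged by direction."""
    segs, i = [], src
    step = 1 if clockwise else -1
    while i != dst:
        j = (i + step) % n
        segs.append((min(i, j), max(i, j), clockwise))
        i = j
    return segs

def greedy_ring_loading(n, commodities):
    load = {}                                  # segment -> accumulated load
    def max_after(segs, d):
        return max(load.get(s, 0.0) + d for s in segs)
    for src, dst, d in commodities:
        cw = ring_segments(n, src, dst, True)
        ccw = ring_segments(n, src, dst, False)
        # choose the direction giving the smaller resulting bottleneck load
        segs = cw if max_after(cw, d) <= max_after(ccw, d) else ccw
        for s in segs:
            load[s] = load.get(s, 0.0) + d
        # each commodity goes entirely on one ring: demands are not split
    return max(load.values()) if load else 0.0
```

On a 4-node ring, two equal 0-to-2 demands end up on opposite rings, keeping the maximum segment load at a single demand's worth.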
126
NEXT GENERATION INTERNET
1. Introduction
A. Acronyms
ATM - Asynchronous Transfer Mode
CBR - Constant Bit Rate
DSL - Digital Signal Level
DS1 - Digital Signal Level 1, 1.544 Mb/s
DS3 - Digital Signal Level 3, 44.736 Mb/s
GigE - Gigabit Ethernet
FR - Frame Relay
MAC - Media Access Control
MPLS - Multi-Protocol Label Switching
OSI - Open System Interconnect
OC-n - Optical Carrier level-n (n = 3, 12, 48, 192)
QoS - Quality of Service
SONET - Synchronous Optical Network
TDM - Time Division Multiplexing
UBR - Unspecified Bit Rate
VBR - Variable Bit Rate
B. Resilient packet ring
A Resilient Packet Ring (RPR) is de facto a distributed Ethernet switch, with the backplane spanning all the ring nodes. The nodes are connected by two unidirectional (clockwise and counter-clockwise) high-speed rings (see Figure 5.1). The RPR can carry both TDM and Ethernet traffic. The tributary ports at the nodes may be 10 Mbps, 100 Mbps, and 1 Gbps Ethernet; alternatively, DS1/DS3 ports and OC-n SONET ports are possible. TDM traffic is mapped into Ethernet frames; priority classes are considered. A customer's unidirectional point-to-point demand of certain specified
Figure 5.1. Resilient Packet Ring Architecture
5 Balancing Traffic Flows in Resilient Packet Rings
127
bandwidth is assigned to one and only one ring (either clockwise or counter-clockwise); the demand cannot be split. The high-speed node-connecting spans are unidirectional, running either 1 Gbps or 10 Gbps Ethernet or SONET OC-48/OC-192. The traffic on the rings consists of Ethernet (or MPLS) frames, and the highest priority class is given to the passing-through traffic. The high-speed rings function as the extended backplane of the Ethernet switch. Some RPR advantages relative to existing SONET transport facilities include:
• Superior capacity utilization and management, due to ring utilization in both directions, spatial reuse, statistical multiplexing, and non-blocking cross-connect and grooming.
• SONET-ring-like resilience, i.e., in the case of a segment failure, the traffic is rerouted in the other direction in less than 50 ms.
• Support of multiple levels of QoS, similar to those used in ATM, such as VBR, CBR and UBR (Variable, Constant, Unspecified Bit Rate).
• RPR currently supports Ethernet services and SONET; more services such as FR and ATM may be introduced later.
• Support of over-subscription (for Ethernet).
In the OSI reference model, RPR is positioned in the Data Link Layer (Layer 2). More specifically, it sits between MPLS (traffic engineering and service provisioning) and the physical layer (SONET). RPR is governed by the IEEE 802.17 standards, which were finalized in 2004. For more details and up-to-date documents one should check the official IEEE web site. A good description of the RPR technology and its applications is available in Aybay et al. (2001) and Green and Schlicht (2002). Because of its potentially high capacity utilization and superior performance, support for a variety of services, simplicity of provisioning and management, and carrier-class reliability, the RPR technology seems well suited to be a transport vehicle of choice in next generation IP networks.
C. Traffic allocation
To effectively exploit the RPR's potential, namely spatial reuse, statistical multiplexing and bi-directionality, it is necessary to route the demands efficiently, thereby achieving good utilization and the best performance. Namely, given a set of point-to-point unidirectional customer traffic demands of specified bandwidth, which demands should be assigned to the clockwise ring and which to the counter-clockwise ring to yield the best performance?
For the sake of the argument, customer demands cannot be split, and in the current implementation the RPR management system routes point-to-point traffic demands over either the clockwise or the counter-clockwise ring according to some (vendor-specific) internal optimization criteria, usually based on a shortest-path algorithm. This is a simple (and rather obvious) traffic demand assignment rule in which the demand traverses the smallest number of segments. This method is simple and uses the least amount of total ring capacity, but in many cases it leads to unbalanced traffic loads, which negatively affects the overall system performance. In this paper, a new strategy is developed which allocates the demands to the rings in a more global and optimized way. The objective is to assign the customer demands to either the clockwise or the counter-clockwise ring so that the maximum segment load over all segments and both directions is minimized. This method seeks uniformity in the load balancing, thus eliminating potential bottlenecks and, in turn, obtaining more effective utilization and better performance from the entire system.

D. Overview of the paper
In Section 2, a stochastic model with the objective of minimizing the maximum link delay is developed and then formulated as a linear program. The deterministic version of the "ring-loading" problem model for RPR is presented in Section 3, and it is shown that the stochastic and deterministic versions of the RPR loading problem are equivalent. In Section 4, two solution heuristics, greedy and LP relaxation, are suggested. In Section 5, the "precision" of the heuristics is discussed and upper bounds are derived for the heuristic solutions. Section 6 documents computational experience with the heuristic algorithm implementations, and Section 7 presents the conclusions.

E. Previous work
Traditionally, in packet-switched network design, one of the possible objectives is to minimize a function of packet or message delays. Many models have been considered over the years; a good description of the models and related references can be found, for instance, in Bertsekas and Gallager (1992), or in the more recent surveys Girard (1999) and Sanso (1999). The objective considered in the present paper is similar to the one described in Ramaswami and Sivarajan (1999), but the problem setting is very different: RPR is a new technology and the traffic-loading model applied to RPR is considered here for the first time. Multi-commodity flows in ring networks, and the consequent problem of balancing loads in SONET rings, were extensively studied by Cosares
and Saniee (1994); Schrijver et al. (1998); Okamura and Seymour (1983); Vachani et al. (1996). Carpenter et al. (1997) considered the "ring loading problem" with "slotting." The load balancing models for RPR considered in this paper differ from SONET ring loading in a number of significant aspects. Namely, SONET demands are bi-directional, and demands assigned to go clockwise compete for common span capacity with demands assigned to go counter-clockwise. In RPR there are two distinct rings (clockwise and counter-clockwise) and the demands do not compete for common capacity. Moreover, in SONET the demands are "circuit-switched" and deterministic, while RPR is based on "packet" stream technology with MPLS-like "tunnels," statistical multiplexing and different levels of service.
2. Stochastic model
In Ramaswami and Sivarajan (1999), the authors discussed an interesting but simplified virtual-topology optical network design problem arising in ATM network planning. Assuming a stochastic model for the ATM traffic (i.e., an arrival rate of packets for each source-destination pair, measured in packets per second), they formulated a model for selecting the optical (OC-n) links connecting the ATM switches so that the topology satisfies a node degree constraint and the worst congestion in the network is minimized. The formulation is a large-scale integer linear programming problem with flow conservation constraints at each node defining virtual circuits, and the program jointly minimizes the total virtual circuit flow on each link by limiting the maximum flow. A similar argument can be made for the RPR as well. Here the topology (dual ring) is given, and the problem reduces to the assignment of customers to either the clockwise or the counter-clockwise ring so that the traffic congestion is minimized. As mentioned in the previous section, the RPR technology is a packet technology (IP packets), and thus a transport segment of the ring (in either direction) carries an aggregation of Ethernet packets. Since the speed of the inter-node trunk is given (e.g., OC-48, OC-192), one can approximate the behavior of the node output buffers for the transport pipe by an M/M/1 queue. To capture the essence of the problem, a simple stochastic formulation is presented. This formulation does not fully capture the non-linear multiplexing effects of traffic aggregation, but it provides an intuitive understanding of the load balancing issues and is, de facto, a first-order approximation. Load balancing models with statistical multiplexing, over-subscription and priority classes will be considered in a subsequent paper.

A. Model assumptions
Consider a resilient packet ring with:
(i) N nodes;
(ii) Unidirectional demands D_k, k = 1, ..., M (in some units of bandwidth, e.g., 10 Mbps). Each demand is further characterized by its origin Orig(D_k) = i and destination Dest(D_k) = j, i, j = 1, ..., N;
(iii) Demands cannot be split. There may be more than one demand from i to j (different customers), so we really have a multicommodity flow problem;
(iv) One customer class only: all demands are considered "non-protected," i.e., in the case of a failure all demands are routed in the only feasible direction and the MAC protocol takes care of policing the traffic on a best-effort basis;
(v) The capacity of the ring is K in each direction (e.g., K = 1 Gbit/s); in general, however, we assume that the capacity of link l (from node l to node l+1) in the "+" direction is K_l^+, and in the "-" direction (from node l+1 to node l) it is K_l^-; this is very helpful in describing the algorithm.

B. Objective function
The goal is to route every demand in either the clockwise "+" or the counter-clockwise "-" direction so as to minimize the maximum individual link/segment delay. Note that there are 2N links: N segments on the "+" ring and N segments on the "-" ring.

C. Decision variables and loading parameters
The decision (assignment) variables are defined as

    x_k = 1 if demand D_k will go in the "+" direction; 0 otherwise.

Obviously, if x_k = 0, the demand will be routed (or "assigned") in the "-" direction. Construct loading parameters e_{kl}^+ and e_{kl}^- for all demands k, 1 ≤ k ≤ M, and all segments l, 1 ≤ l ≤ N. These loading parameters depend on the origin and destination of the demands, and describe which segments lie on the routing path in the "+" or "-" direction. Note that the loading parameters are not decision variables; they are uniquely derived from the demand direction. The parameters are formally
defined as:

    e_{kl}^+ = 1 if D_k will use link l in the "+" direction; 0 otherwise,

and

    e_{kl}^- = 1 if D_k will use link l in the "-" direction; 0 otherwise.
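As an illustration, the loading parameters of one demand can be computed directly from its origin and destination. The short Python sketch below is not from the chapter; it assumes a numbering (not fixed in the text) in which link l of either ring joins nodes l and (l+1) mod N:

```python
def loading_params(N, orig, dest):
    """Loading parameters for one demand on an N-node ring.

    Assumed numbering: link l joins nodes l and (l+1) % N on both rings.
    Returns 0/1 lists (e_plus, e_minus) marking the links used by the
    "+" (clockwise) and "-" (counter-clockwise) routes.
    """
    e_plus = [0] * N
    l = orig
    while l != dest:              # walk clockwise from orig to dest
        e_plus[l] = 1
        l = (l + 1) % N
    # The counter-clockwise route uses exactly the complementary links.
    e_minus = [1 - v for v in e_plus]
    return e_plus, e_minus

# Demand from node 1 to node 3 on a 5-node ring:
print(loading_params(5, 1, 3))    # ([0, 1, 1, 0, 0], [1, 0, 0, 1, 1])
```

Note that for a fixed demand the two parameter vectors are complementary: every link of the ring is crossed by exactly one of the two candidate routes.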
D. Model formulation
Assume that the behavior of transport segment l in the clockwise and counter-clockwise directions is approximated by an M/M/1 queue whose service rate equals the outgoing line bit rate, say μ_l^+ = K_l^+ and μ_l^- = K_l^-, respectively. For the sake of exposition, the "service" time is assumed to be exponential rather than deterministic. Furthermore, assume that the packets of demand k arrive according to a Poisson process with rate D_k. It is well known (Bertsekas and Gallager, 1992) that the average delay in an M/M/1 queuing system with arrival rate λ and service rate μ is 1/(μ − λ); the total offered traffic load on the "+" link l is λ_l^+ = Σ_k D_k e_{kl}^+ x_k (and similarly for the "-" direction), and it is easy to see that the following formulation minimizes the maximum average segment delay:

S2: Minimize y
s.t.
    1/(K_l^+ − Σ_k D_k e_{kl}^+ x_k) ≤ y,          for every l = 1, ..., N;   (5.1)
    1/(K_l^- − Σ_k D_k e_{kl}^- (1 − x_k)) ≤ y,    for every l = 1, ..., N;   (5.2)
    x_k ∈ {0, 1},                                   for every k = 1, ..., M;   (5.3)
    y > 0.                                                                      (5.4)
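For a fixed assignment, the S2 objective is simply the largest per-link M/M/1 delay 1/(μ − λ). A quick numeric sketch (the link loads and capacities below are made-up numbers, not from the chapter):

```python
def max_avg_delay(loads_plus, loads_minus, cap_plus, cap_minus):
    """Largest average M/M/1 segment delay 1/(K_l - lambda_l) over both
    rings; every load must be strictly below its link capacity."""
    loads = loads_plus + loads_minus
    caps = cap_plus + cap_minus
    if any(lam >= K for lam, K in zip(loads, caps)):
        raise ValueError("a link is overloaded: delay is unbounded")
    return max(1.0 / (K - lam) for lam, K in zip(loads, caps))

# Two-node ring, capacity 1000 units per link:
print(max_avg_delay([600, 200], [300, 700], [1000, 1000], [1000, 1000]))
# the busiest link (load 700) dominates: the value is 1/(1000 - 700) = 1/300
```

Minimizing this maximum over all 0/1 assignments is exactly problem S2.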
A simple change of variable z = 1/y transforms the above non-linear system into:

S2': Maximize z
s.t.
    K_l^+ − Σ_k D_k e_{kl}^+ x_k ≥ z,          for every l = 1, ..., N;              (5.5)
    K_l^- − Σ_k D_k e_{kl}^- (1 − x_k) ≥ z,    for every l = 1, ..., N;              (5.6)
    x_k ∈ {0, 1},                               for every commodity k = 1, ..., M;    (5.7)
    z ≥ 0,                                                                            (5.8)

which again is a linear system. If we set K_l^+ = K_l^- = K for all l (i.e., assuming the same capacity all around) and change variables to w = K − z, one can restate the above problem as:

S2'': Minimize w
s.t.
    Σ_k D_k e_{kl}^+ x_k ≤ w,          for every "+" link l = 1, ..., N;     (5.9)
    Σ_k D_k e_{kl}^- (1 − x_k) ≤ w,    for every "-" link l = 1, ..., N;     (5.10)
    x_k ∈ {0, 1},                       for every commodity k = 1, ..., M;    (5.11)
    w ≥ 0.                                                                    (5.12)
The formulation S2'' is exactly the same as formulation D1, the deterministic version of RPR loading derived in the next section.
3. Deterministic model
If we assume that the point-to-point demands D_k are fully deterministic, i.e., D_k represents a fixed bandwidth requirement, it is conceivable that a good RPR design will try to distribute the traffic load more or less equally over the two rings. Distributing the traffic load smartly may accommodate potential traffic surges when they occur and, in addition, positions the system to accept more future customer demands. It is easy to see that the following integer programming formulation minimizes the maximum segment load and thus, de facto, optimizes the overall network performance:

D1: Minimize y
s.t.
    Σ_k D_k e_{kl}^+ x_k ≤ y,          for all "+" links l = 1, ..., N;   (5.13)
    Σ_k D_k e_{kl}^- (1 − x_k) ≤ y,    for all "-" links l = 1, ..., N;   (5.14)
    x_k ∈ {0, 1},                       for all k = 1, ..., M;             (5.15)
    y ≥ 0.                                                                 (5.16)
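For small instances, D1 can be solved exactly by enumerating all 2^M assignments. The sketch below is illustrative only (it is exponential in M and is not the authors' solver); it assumes, as before, that link l joins nodes l and (l+1) mod N:

```python
from itertools import product

def solve_D1(N, demands):
    """Exact solution of problem D1 by enumeration.

    demands: list of (D_k, orig, dest).  Returns (y*, x*) where x*
    routes each demand on the "+" ring (1) or the "-" ring (0) and y*
    is the minimized maximum segment load.
    """
    # Precompute the clockwise link set of each demand.
    plus = []
    for _, o, d in demands:
        links, l = set(), o
        while l != d:
            links.add(l)
            l = (l + 1) % N
        plus.append(links)
    best = (float("inf"), None)
    for x in product((0, 1), repeat=len(demands)):
        t_plus, t_minus = [0] * N, [0] * N
        for k, (Dk, _, _) in enumerate(demands):
            for l in range(N):
                if x[k] == 1 and l in plus[k]:
                    t_plus[l] += Dk            # clockwise segments
                elif x[k] == 0 and l not in plus[k]:
                    t_minus[l] += Dk           # counter-clockwise segments
        y = max(max(t_plus), max(t_minus))
        if y < best[0]:
            best = (y, x)
    return best

# Five equal-origin demands on a two-node ring balance into 10 vs 10:
print(solve_D1(2, [(5, 0, 1), (5, 0, 1), (4, 0, 1), (4, 0, 1), (2, 0, 1)])[0])  # 10
```

The two-node usage example above also previews the NP-completeness argument of the next theorem: on two nodes, D1 is exactly a set partitioning question.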
The problem D1 is NP-complete (shown below), and thus efficient heuristics for its solution are desirable.

THEOREM 5.1 The RPR problem is NP-complete.
Proof. The proof proceeds by showing that there is a polynomial-time transformation from set partitioning to RPR. Other transformations from set cover and set packing are also possible. Let us posit the set partitioning problem:

Set Partitioning Problem (SPP): Given M items with integer sizes s_1, ..., s_M, is there a partition of the items into sets A_1 and A_2 such that Σ_{i∈A_1} s_i = Σ_{i∈A_2} s_i?

This problem is NP-complete (Garey and Johnson (1979), page 223). Consider an arbitrary instance of SPP. Furthermore, let us assume an RPR ring with two nodes 1 and 2, and let D_k = s_k, k = 1, ..., M, with Orig(k) = 1 and Dest(k) = 2 for all k. Note that the two-node special case reduces to:

D1-reduced (2 nodes): Minimize y
s.t.
    Σ_k D_k x_k ≤ y,          in the "+" direction;        (5.17)
    Σ_k D_k (1 − x_k) ≤ y,    in the "-" direction;        (5.18)
    x_k ∈ {0, 1},              for all k = 1, ..., M;       (5.19)
    y ≥ 0.                                                  (5.20)

Any feasible solution to the RPR problem (D1-reduced) must assign the demands either to the "+" ring (defining index set A_1) or to the "-" ring (defining index set A_2). This defines a partition of the index set A = {1, ..., M} into A_1 and A_2, with A_1 ∪ A_2 = A and A_1 ∩ A_2 = ∅. Thus, if the optimal solution of D1-reduced is y* = (1/2) Σ_k D_k, then the partition problem has a solution (i.e., a partition exists). If the optimal solution of D1-reduced is y* > (1/2) Σ_k D_k, then the partition problem does not have a solution (y* cannot be smaller than (1/2) Σ_k D_k, as we are dealing with the maximum). In other words, deciding whether the partition problem has a solution is the same as solving problem D1-reduced and checking whether the demands can be divided into two equal groups (i.e., y* = (1/2) Σ_k D_k). Thus an arbitrary SPP instance has been transformed into a particular instance of the RPR problem in polynomial time. Since SPP is NP-complete, so is the RPR problem. ∎

The above arguments roughly parallel the NP-completeness proof for the ring loading problem given by Cosares and Saniee (1994).
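The reduction can be exercised directly: an even split of the item sizes exists iff the two-node RPR instance balances exactly. A toy check (illustrative only, stdlib enumeration):

```python
from itertools import product

def partition_via_rpr(sizes):
    """Decide set partitioning via the two-node RPR instance of the
    proof: route every demand from node 1 to node 2 and minimize the
    larger of the two ring loads."""
    best = min(max(sum(s for s, x in zip(sizes, xs) if x),
                   sum(s for s, x in zip(sizes, xs) if not x))
               for xs in product((0, 1), repeat=len(sizes)))
    # y* equals half the total exactly when an equal partition exists.
    return best == sum(sizes) / 2

print(partition_via_rpr([5, 5, 4, 4, 2]))   # True  (10 vs 10)
print(partition_via_rpr([5, 4, 2]))         # False (total 11 is odd)
```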
4. Solution heuristics
Since the problem is NP-complete, it is practical to seek a heuristic solution. In this section, two simple heuristics are proposed to solve problem D1. For the sake of completeness, define t_l^+ and t_l^- to be the already existing traffic loads on links l^+ and l^-, respectively, and define T to be the maximum load of the entire system (over both the "+" and "-" rings), i.e., T = max_{1≤l≤N} max{t_l^+, t_l^-}.
Heuristic 1 (Greedy)
This heuristic adds demands one after another to the "+" ring or the "-" ring; the goal is to assign each demand to the ring for which the incremental increase in the load (capacity) of the entire RPR is smallest. This greedy-type heuristic is similar to one considered in Wu (1992) and Cosares and Saniee (1994), with one small but significant difference: in SONET rings the demands assigned to the "+" or "-" direction compete for common capacity on the same ring; in RPR there are two distinct rings (clockwise and counter-clockwise), the demands are assigned to one ring or the other, and each ring has its own capacity. Formally, the heuristic can be described as:

Heuristic 1
1. Initialization.
   (a) Initially set all t_l^+ = t_l^- = 0.
   (b) Order the demands in descending order, i.e., D_(1) ≥ D_(2) ≥ D_(3) ≥ ... ≥ D_(M).
2. Main. FOR k = 1, ..., M DO
   (i) Get temp_l^+ = t_l^+ + D_(k) e_{kl}^+ for all l, and T_1 = max_l {temp_l^+, t_l^-};
   (ii) Get temp_l^- = t_l^- + D_(k) e_{kl}^- for all l, and T_2 = max_l {t_l^+, temp_l^-};
   IF T_1 < T_2 THEN update t_l^+ = temp_l^+ and set x_k = 1;
   IF T_2 < T_1 THEN update t_l^- = temp_l^- and set x_k = 0;
   IF T_1 = T_2 THEN assign the demand to the shortest-path direction: either update t_l^+ = temp_l^+ and set x_k = 1, or update t_l^- = temp_l^- and set x_k = 0;
   NEXT k;
3. Write results. Output x_k and the final t_l^+, t_l^- for all k and l. Exit.
End Heuristic 1
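The steps above can be sketched in a few lines of Python. This is not the authors' implementation; the link numbering and the hop-count tie-break (as a stand-in for the shortest-path rule) are assumptions:

```python
def heuristic1(N, demands):
    """Greedy Heuristic 1 (sketch): demands sorted by decreasing size,
    each assigned to the ring giving the smaller system-wide max load.

    demands: list of (D_k, orig, dest); link l is assumed to join nodes
    l and (l+1) % N.  Returns (max_load, assignments) in input order.
    """
    def plus_links(o, d):
        links, l = set(), o
        while l != d:
            links.add(l)
            l = (l + 1) % N
        return links

    t_plus, t_minus = [0] * N, [0] * N
    x = [0] * len(demands)
    for k in sorted(range(len(demands)), key=lambda j: -demands[j][0]):
        Dk, o, d = demands[k]
        links = plus_links(o, d)
        trial_plus = [t + (Dk if l in links else 0) for l, t in enumerate(t_plus)]
        trial_minus = [t + (Dk if l not in links else 0) for l, t in enumerate(t_minus)]
        T1 = max(max(trial_plus), max(t_minus))   # system load if routed "+"
        T2 = max(max(t_plus), max(trial_minus))   # system load if routed "-"
        # Tie broken by hop count, approximating the shortest-path rule.
        if T1 < T2 or (T1 == T2 and len(links) <= N - len(links)):
            t_plus, x[k] = trial_plus, 1
        else:
            t_minus, x[k] = trial_minus, 0
    return max(max(t_plus), max(t_minus)), x
```

As the text notes below, the first (largest) demand is always placed by the tie-break alone, so the heuristic can be improved by a second run with D_(1) forced onto the "-" ring, keeping the better of the two results.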
This heuristic is intuitive, fast, and can easily be implemented in the RPR management system. Furthermore, if a new demand is added to the original set, it is not necessary to rearrange the entire set of assignments; only demands smaller than the new demand might be affected, so the heuristic can be re-executed for those alone. As the demands are handled sequentially, one by one, the best routing for demand D_(k) is determined by the current link loads, which are based on the larger demands routed so far, i.e., D_(1), ..., D_(k−1). Since at the start of the algorithm all t_l^+ = t_l^- = 0, the first demand D_(1) is always assigned according to the shortest-path rule, which may lead to a solution far from optimal. The heuristic can be substantially improved by running the algorithm twice: in the first run, D_(1) is assigned the "+" direction, and in the second run D_(1) is assigned the "-" direction, with the remaining demands assigned as described above. The better of the two solutions is then selected.

Heuristic 2 (LP relaxation)
In this heuristic, the integrality constraints are relaxed and the problem is solved as a Linear Program (LP). Fractional values are then systematically rounded to integral solutions. For our problem (in which the demand cannot be split) the LP solution provides a lower bound for the optimal (i.e., IP) solution or any other heuristic solution. To get an integer solution from the fractional LP variables, we propose a simple rounding heuristic which sequentially rounds the fractional variables (one by one) to either 1 or 0, depending on which value gives the better LP solution. Once decided, a variable remains fixed at its integer value and no backtracking is done. Although it may sometimes lead to sub-optimal solutions, this heuristic is reasonably fast (compared to Branch and Bound) and the results are generally quite good. A related "rounding" (though different, "branch and cut") approach, applied to another telecommunication network design problem, has been shown to work rather well (Kubat and Smith, 2001). The formal pseudo-code of the heuristic follows.
5. Upper and lower bounds
The value of the objective at the LP solution, y*_LP, is a lower bound for all other feasible solutions. For Heuristic 2, however, one can obtain an interesting upper bound. This bound is similar in character to the bound obtained by Carpenter et al. (1997) for the SONET ring loading
Heuristic 2
1. Initialization.
   (a) Solve problem D1 as an LP, assuming 0 ≤ x_k ≤ 1 for all k = 1, ..., M.
   (b) Fix all integer x*_k. Establish a set S = {x*_k : x*_k is fractional}.
2. Main. WHILE (S ≠ ∅) DO
   Get the largest x*_k ∈ S, and
   (i) fix x*_k = 1 and solve the LP;
   (ii) fix x*_k = 0 and solve the LP;
   Select the value of x*_k for which the LP objective is smaller;
   Remove x*_k from the set S;
   Remove any other x_k which becomes integer (in the restricted LP solution) from the set S;
   End WHILE
3. Write results. Output x*_k and the final ring loads t_l^+, t_l^- for all k and l. Exit.
End Heuristic 2
problem. Namely, for the RPR problem, any integer solution derived by rounding the fractional LP solution results in a maximum load always less than or equal to twice the LP lower bound. Let us introduce the following notation:

x*_LP   - the optimal LP solution (may have fractional assignment variables);
y*_LP   - the maximum load over all segments (optimal LP solution);
ML_LP+  - the maximum load in the "+" ring for the x*_LP solution;
ML_LP-  - the maximum load in the "-" ring for the x*_LP solution;
x*_IP   - the optimal IP solution (i.e., all x_k ∈ {0,1});
y*_IP   - the optimal IP objective (maximum load) for D1;
x*_Heu2 - the Heuristic 2 integer solution obtained by rounding x*_LP;
y*_Heu2 - the objective (maximum load) of the Heuristic 2 solution.

Since the objective is to minimize the maximum load, one must have y*_LP = max(ML_LP+, ML_LP-). To show the upper bound, the following lemma is first established.

LEMMA 5.1 In the LP solution x*_LP, we have y*_LP = ML_LP+ = ML_LP-.
Proof. By contradiction. Suppose, on the contrary, that the statement is not true; then |ML_LP+ − ML_LP-| = η > 0, and there are two cases.
Case 1. Assume that y*_LP = ML_LP+ > ML_LP-. Let l be the segment where this maximum load (i.e., y*_LP) is attained. There must exist at least one demand, say D_j, traversing this segment (i.e., 0 < x*_j ≤ 1). Set the "new" x_j^new = x*_j − ε, where ε < η/(2 D_j) and ε is small enough that x_j^new ≥ 0. However, now the "new" y_LP = y*_LP − ε D_j, which contradicts the assumption that x* is optimal.
Case 2. Similarly, assume that y*_LP = ML_LP- > ML_LP+. Here select a demand D_j which traverses the maximum-load segment (i.e., 0 ≤ x*_j < 1) and set x_j^new = x*_j + ε, with ε < η/(2 D_j). In this case, an ε-fraction of demand D_j is sent via the "+" ring, improving the maximum load, and this again contradicts the assumption that x* is optimal. ∎

In the above lemma, the x* solution could be integer or fractional. If x* is integer, then the solution to problem D1 has been found.

THEOREM 5.2 Suppose x*_Heu2 is found by rounding the (fractional) x*_LP solution. Then y*_LP ≤ y*_IP ≤ y*_Heu2 ≤ 2 y*_LP.
Proof. The first part, y*_LP ≤ y*_IP ≤ y*_Heu2, is obvious, so one only needs to show that y*_Heu2 ≤ 2 y*_LP. Suppose that the LP solution is fractional. From the above lemma, y*_LP = ML_LP+ = ML_LP-; so there exists at least one segment in each direction, l^+ and l^-, at which this maximum load is reached (the dual price there is positive). Select all fractional x*_j which lie on the maximum-load segments; they must appear in both segments, otherwise the solution could not be optimal. These fractional x*_j contribute Σ D_j x*_j (a quantity ≤ y*_LP) to the maximum load on the "+" ring and Σ D_j (1 − x*_j) (again ≤ y*_LP) to the "-" ring. So, in the worst-case scenario, if all these fractional x*_j are rounded up to 1, the load on the "+" ring increases by Σ D_j (1 − x*_j). Since Σ D_j (1 − x*_j) ≤ y*_LP as well, the maximum load of the rounded-up solution on the "+" ring must be ≤ 2 y*_LP. Similar arguments hold if one rounds down the fractional solutions. ∎

It is important to point out that this upper bound cannot be improved. To see this, consider the following example.
Example 1: D_1 = 5, from node 1 to node 2, ring of any size. The LP solution is x*_1 = 0.5 and y*_LP = ML_LP+ = ML_LP- = 2.5. Rounding up gives x_1 = 1, for which ML_IP+ = 5 and ML_IP- = 0, so y*_IP = y*_Heu2 = 5 = 2 y*_LP. ∎
6. Computational experience
It is reasonable to assume that, in practice, the number of nodes on one ring will be rather modest (4-10 nodes); however, there might be many point-to-point demands. Our experiments were therefore restricted to 5-, 8- and 10-node rings with varying numbers of demands. Demand "from-to" patterns were randomly generated from the node list, and the demand bandwidths were randomly selected from the 10-500 Mb/s range. Heuristics 1 and 2 were programmed in MATLAB; the IP solution was obtained by modifying the LP/MIP solver from the MATLAB function library. The results of the experimental runs and comparisons with the optimal (IP) solution are presented in Tables 5.1 and 5.2.

Table 5.1. Heuristic 1 (Greedy)

data set   N/M     Heu_1   % from z*_IP
1          5/12    2560     0.0
2          5/12    3060    10.5
3          5/15    1510     0.0
4          5/20    1810     2.3
5          5/25    2660     5.1
6          5/25    3740     0.0
7          8/12    2090    11.8
8          8/25    4230     4.4
9          10/20   3000    30.4
10         10/20   2650     6.0
11         10/25   4860     3.0

Table 5.2. Heuristic 2 (LP relaxation)

data set   N/M     Heu_2   % from z*_IP   IP B&B iterations   z*_IP   z*_LP
1          5/12    2590     1.2             49                2560    2207
2          5/12    3070    10.8             56                2770    2655
3          5/15    1790    18.5             41                1510    1435
4          5/20    1770     0.0             92                1770    1695
5          5/25    2930    15.8            371                2530    2530
6          5/25    4110     9.9           2047                3740    3740
7          8/12    2090    11.8             16                1870    1840
8          8/25    4520    11.6           4059                4050    4045
9          10/20   2300     0.0            179                2300    2030
10         10/20   2700     8.0            143                2500    2490
11         10/25   5000     5.9             58                4720    4440
Both heuristics are fast and will work with any number of demands; however, the number of Branch and Bound (B&B) iterations of the IP solver becomes excessive very quickly, as do the IP running times. Thus, for larger numbers of demands, comparison of the heuristics with the optimal IP solution becomes difficult with the current B&B implementation; one can, however, still compare the results with the LP solution (lower bound). Heuristic 1 seems to perform reasonably well: the average percentage gap from the optimal solution was 6.7% (range 0%-30.4%, 11 runs). The numbers for Heuristic 2 were similar: the average gap is 8.5% (range 0%-18.5%, 11 runs). Heuristic 1 can easily be modified to take into account multiplexing, over-subscription, and queuing priority (QoS) effects; research on this latter problem is in progress and will be reported in a subsequent article. Heuristic 2 would be harder to implement in a real-time system because it is more complex to program and the running time of the algorithm may not satisfy the real-time specification threshold.
7. Summary and conclusions
The new-generation Internet applications are expected to generate a substantial amount of traffic which has to be delivered to consumers with only minimal delays. This, in turn, will require modernization and redesign of entire packet transport networks. When managed properly, RPR-based networks are uniquely suited to deliver a large amount of bandwidth reliably and inexpensively. Optimal load balancing is of paramount importance, as it increases system capacity and improves overall RPR performance. We have formulated deterministic and stochastic models for optimal load balancing on Resilient Packet Rings, and have shown that under some simplifying conditions both models lead to the same Integer Programming formulation. Two heuristics have been proposed to solve the RPR load balancing problem and compared with the optimal solution. Both heuristics seem to perform fairly well. Even though the fully optimal IP solution is achievable for rings with a small number of nodes, its computation takes many iterations (and time) and may be considerably difficult to implement in an actual RPR operating system. Heuristic 1 is fast, reasonably accurate, easy to implement in the RPR management and provisioning software, and has an additional advantage: not all demands have to be reshuffled when a new demand arrives.
References
Aybay, G., O'Connor, M., Vasani, K., and Wu, T. (2001). An introduction to RPR technology. White Paper, Resilient Packet Ring Alliance. Available: http://www.rpralliance.com/articles/ACF16.pdf.
Bertsekas, D. and Gallager, R.G. (1992). Data Networks, 2nd Edition. Prentice Hall, Englewood Cliffs, NJ.
Carpenter, T., Cosares, S., and Saniee, I. (1997). Demand routing and slotting on ring networks. DIMACS Report 97-02, Rutgers University, NJ.
Cosares, S. and Saniee, I. (1994). An optimization problem related to balancing loads on SONET rings. Telecommunication Systems, 3:165-181.
Garey, M.R. and Johnson, D.S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
Girard, A. (1999). The common structure of packet- and circuit-switched network synthesis. In: B. Sanso and P. Soriano (eds.), Telecommunication Network Planning, pp. 101-119. Kluwer Academic.
Green, M. and Schlicht, L. (2002). Maximize the metro with resilient packet ring. Communication System Design, Sept.:38-42.
IEEE 802.17 Resilient Packet Ring Working Group. Available: http://www.ieee802.org/17/. April 2003.
Kubat, P. and Smith, J.M. (2001). A multi-period network design problem for cellular telecommunication systems. European Journal of Operational Research, 134:439-456.
Okamura, H. and Seymour, P.D. (1983). Multicommodity flows in planar graphs. Journal of Combinatorial Theory, Series B, 31:75-81.
Ramaswami, R. and Sivarajan, K.N. (1999). Optical Networks: A Practical Perspective. Morgan Kaufmann Publishers, Inc., San Francisco.
Sanso, B. (1999). Issues in ATM network planning: An operations research perspective. In: B. Sanso and P. Soriano (eds.), Telecommunication Network Planning, pp. 79-99. Kluwer Academic.
Schrijver, A., Seymour, P., and Winkler, P. (1998). The ring loading problem. SIAM Journal on Discrete Mathematics, 11(1):1-14.
Vachani, R., Schulman, A., Kubat, P., and Ward, J. (1996). Multicommodity flows in ring networks. INFORMS Journal on Computing, 8(3):235-242.
Wu, T.-H. (1992). Fiber Network Service Survivability. Artech House, Boston.
Chapter 6

GAME-THEORETIC RESOURCE PRICING FOR THE NEXT GENERATION INTERNET

Bobby M. Ninan
Michael Devetsikiotis

Abstract
We present a model for incorporating pricing in connection-oriented networks. The allocation of bandwidth among competing users takes the form of a noncooperative game mediated by a usage-based billing policy. The network, on the other hand, adjusts the bandwidth price to maximize the revenue generated from the users. We analyze the Nash equilibrium of the resource allocation game and present an algorithm for attaining the optimal network price. The users' rate algorithms are extended to include the case of non-Poisson arrival processes. Possible applications include circuit-switched optical networks, next-generation wireless networks, and label-switched networks.

1. Overview
The meteoric rise of Internet companies and their concomitant meltdown have brought renewed attention to the concept of network pricing. It has been touted as a possible panacea for all the ills plaguing the networking sector, ranging from revenue generation and congestion control to enabling Quality of Service criteria for network traffic. The success of microeconomic policies in controlling the global markets, a noisy distributed system quite akin to the Internet, bears testimony to this notion. Efforts are ongoing to hasten and render seamless the transition from a simple, flat pricing scheme to a socially efficient, usage-based regime. Economists have traditionally employed game theory to analyze the behavior of users in markets regulated by supply and demand. The users are modeled as rational agents striving to maximize their individual utility functions while competing/cooperating with their counterparts. Such
assumptions appear much more reasonable for a collection of computing machines interacting with each other through dedicated communication channels, as in the case of the Internet. Paralleling the ascendancy of resource pricing has been the growing awareness of the shortcomings of the current Internet in combating congestion and supporting differentiated services and novel traffic streams. This has led to the emergence of the field of traffic engineering, which looks into the design and adaptation of today's Internet to meet the demands of the future. Multi Protocol Label Switching (MPLS) and its optical analogue Generalized MPLS (GMPLS) (or "Multi Protocol Lambda Switching", MPλS) have been proposed as the foundation of the network of the future. Connection oriented networks are thus back in vogue with the realization that any next generation network would need to be a hybrid of packet and circuit switched networks. Integration of a feasible pricing strategy into such an architecture entails modeling the benefits accrued and the possible repercussions on user behavior and network stability. We introduce and analyze a model for utility differentiated users operating in a connection oriented setting. Resource pricing is treated as a bi-level optimization problem with
• the network and users behaving as Stackelberg leader and followers;
• users interacting among themselves under the aegis of noncooperative game theory.
We focus on a single bottleneck scenario, analyze the resulting Nash equilibria and provide results for their existence and uniqueness. Possible applications include Label Switched Paths in MPLS networks, circuit switched, optical, wireless and Virtual Private Networks.
2. The evolution of pricing
The Internet has come a long way from its humble beginnings as a military controlled research network. Today it spans the globe and has made a successful transition into a vibrant social and commercial infrastructure. The spectacular success of the "Net" can be attributed to a combination of open standards, interoperable architectures and the "end-to-end" principle. Thus the complexity of the network has been pushed to the edges, thereby promoting scalability by ensuring a simple network core. The connectionless datagram principle was developed mainly to ensure network reliability, a key concern for ARPANET's survivability in the face of a nuclear war. But it was also driven by the evolving economics of transmission costs and switching devices. Many years prior to
the advent of the Internet, when faced with the prospect of cheap transmission lines relative to switches, the telephone industry came up with connection oriented networks where a large number of lines interfaced with a few switches to create end-to-end circuits. However, as routers became inexpensive and bandwidth prices soared, it made more sense to increase utilization by means of statistical multiplexing. This led to the development of packet switched networks like the present Internet. Government funding of the Internet came to an end when, on December 23rd 1992, the National Science Foundation expressed its intention to stop supporting the ANS T3 backbone in the near future. Telecom companies like Sprint and AT&T soon jumped onto the infrastructure bandwagon hoping to garner a slice of the Internet backbone pie. Several business plans based on future earnings were proposed and lapped up in the exuberant investment climate predating the 'bubble'. The projected demand never materialized and when the bubble burst, the companies were saddled with large quantities of dark fibre. Surprisingly, as of 2003, the $80 billion in revenues from wireless far outstrips the $35 billion from the Internet (Odlyzko, 2003a,b). This illustrates one of the principal reasons for Internet pricing, namely the imperative of drawing up a feasible revenue generation plan to mitigate the cross subsidization of data traffic by voice traffic. While telecom companies were driven by the objectives of price discrimination and thereby return on investment, academia's interest was piqued by the relevance of pricing as a tool for network control. The fledgling Internet experienced its first severe congestion in 1987, prompting the development of a congestion control algorithm as part of the Transmission Control Protocol (TCP) suite. It was realized that such problems would be exacerbated by the rise of applications with an ever increasing appetite for bandwidth.
The severe delay problems in NSFNET during November 1992 due to some audio/video broadcasts served to illustrate these concerns. The need of the hour was a mechanism designed to encourage a socially optimal solution wherein high value bits (for example, telemedicine packets carrying life saving information) would be given preference over others. The field of providing Quality of Service (QoS) in the Internet was thus born. The Internet was designed as a best effort system with the network not providing any guarantees on the timeliness or even the arrival of packets. The QoS paradigm however required a network that could carry out service differentiation with packets serviced depending upon their value. But incentives were necessary to prevent users from inflating their packet values and requesting better services. Price discrimination of services was found to be ideal for encouraging service differentiation
with the associated revenues paying for any needed network expansions. The prevalence of inelastic applications like interactive audio/video necessitated the introduction of admission control schemes akin to those deployed in telephony. The need for extending the service model with users explicitly requesting service is detailed in Shenker (1995). It was also suggested that the basic best-effort architecture be left intact, with QoS schemes solely reserved for resource intensive, high quality real-time services. Since the notion of differentiated services demanded changes to the prevailing network architecture, a section of the networking community offered overprovisioning as a possible panacea for congestion. Bandwidth was becoming increasingly cheap due to economies of scale as more and more users joined the Internet. Further, the advent of novel optical technologies like DWDM could squeeze more and more bandwidth into the same fibre. Under the assumption of "almost free" bandwidth, it was believed that huge overprovisioning would be economically feasible. The startling implications of measurements from the BellCore network (Leland et al., 1993) pointed to the high variability and possible self-similarity of data traffic. This burstiness indicated that any overprovisioning of capacity based on peak characteristics would be far costlier than the usual average based allocation. As the Internet was a public good, the academic community tried to follow in the footsteps of economists by resolving to maximize the social welfare of its users. Congestion was seen as a playing out of the classic "tragedy of the commons", where individual users with unrestricted access overgrazed the system to the detriment of others. This could be alleviated by a usage based scheme with users charged for the amount of traffic they consume. To maintain social optimality, these charges would have to be set equal to the marginal cost of usage.
Since bandwidth scarcity occurs only during congestion, this marginal cost is essentially the same as the congestion cost. The notion of congestion pricing was developed to account for the social costs imposed by the user on the rest of the population during periods of congestion. Several usage based schemes (Falkner et al., 2000) were introduced to promote social optimality. The encouraging results from the INDEX study (Edell and Varaiya, 1999) further lent credence to the claim that users were willing to pay for better service. One of the schemes which caught attention was the Vickrey auction based "smart market" developed by Mackie-Mason and Varian (1994). This required that each user indicate the value of her packets by incorporating a bid in the packet header. The routers would then allow all packets whose bid exceeded the marginal cost to enter the network. This marginal cost would be equal
to the congestion cost imposed by the next arriving packet. The router would, however, charge all the admitted packets only the marginal cost, maintaining optimality. Users have no incentive to under report their bids, as admission to the network depends on an unknown and possibly higher price. Packets which were rejected could wait for transmission at a less congested period, thereby trading dollars for delay. On the other end of the spectrum was the idea of flat pricing, with users enjoying unlimited access after paying an access fee. This scheme, though suboptimal, was conceptually simpler than usage based pricing, as it was compatible with the existing architecture, obviating the need for extensive monitoring and accounting mechanisms. It has been argued that usage based schemes run counter to the risk aversion and need for predictability of consumers. Odlyzko cites examples involving several networks, like mail, telegraph and telephone services (Odlyzko, 2001) and railroads and highways (Odlyzko, 2004), to argue for an inexorable march towards simplicity. When prices are kept simple and low, more and more users migrate, leading to profits from increased revenues. The increase in user population increases the value of the network, as expounded by Metcalfe's law, thus leading to a positive spiral. A critique of the optimality paradigm pervading the pricing literature was provided by Shenker in Shenker et al. (1996). Most of the Internet backbone is owned by profit maximizing companies with little interest in socially optimal schemes. Since most of the costs of maintaining the infrastructure are fixed costs, it is not clear whether marginal cost charges would be able to recover the operating costs. The utility derived by users from individual packets depends on the delay they face, a variable inherently difficult to predict. If packets are part of a flow, their individual utilities are influenced by the delivery of the rest of the flow.
The inaccessibility of the marginal cost severely curtails the implementation of schemes like the "smart market", forcing researchers to look for alternatives. Any optimal pricing mechanism would need to be deployed globally, an idea which would require extensive standardization and runs counter to the notion of the Internet as a collection of heterogeneous networks. Edge pricing proposes to reduce complexity by shifting the mechanisms to the edge. Monitoring and billing policies are simplified by employing a scheme based on expected congestion (like time-of-day pricing) and expected path. It rejects the perceived dichotomy of usage and flat pricing by considering them as competing design choices for pricing at the edge. Research on connection-oriented networks, akin to loss networks, the staple model of the teletraffic community, has received increasing attention owing to its rising relevance in the field of computer networking.
The new incarnations span a diverse spectrum of technologies ranging from circuit switched wireless/PCS networks and WDM optical networks to virtual circuit switched networks such as (G)MPLS and ATM networks. Encouraged by the dotcom boom of the 90s, several telecom and Internet companies made colossal investments in these networks to enhance bandwidth availability. The subsequent meltdown has saddled them with exorbitant nonperforming assets, thereby severely restricting their room to maneuver in an increasingly sober investment climate. Resource pricing, often considered an afterthought during the boom times, now occupies center stage as the way out of this conundrum. Apart from its principal function of revenue generation, pricing also serves as a fairly low dimensional control parameter to optimize system properties and control network congestion. This has been coupled with the realization that most of the proposed techniques for service differentiation and QoS would need to be commercially viable for widespread deployment. Currently, Diff-Serv is considered to be the most feasible architecture for Quality of Service differentiation. MPLS and MPλS networks are ideally suited for the Diff-Serv model as they combine aspects of statistical multiplexing and provisioned paths. A pricing scheme tailored for such networks is imperative to accelerate the fusion of traffic management and sound economic principles. The INDEX study (Edell and Varaiya, 1999) gave a fillip to pricing research by providing empirical proof that users were willing to pay based on usage for genuinely better (and strictly guaranteed) qualities of service under certain circumstances. The billing policy allows certain users to obtain more or less bandwidth depending on both their willingness to pay and their need. The users compete with each other for the limited resources, relying only on local information.
The dynamics of such a system are typically studied in the setting of noncooperative games. Stackelberg games for communication networks were used in, e.g., Douligeris and Mazumdar (1989) and Economides and Silvester (1990). In Korilis et al. (1997), the authors consider a Stackelberg game in which users choose routes in a wired network after the leader has chosen routes for its own traffic; in choosing, the leader controls user behavior to optimize some network utility or to achieve some other "global" goal. In Saraydar et al. (2000), the authors formulate a CDMA power and data-rate control game for which the equilibrium point is studied. In the framework of Kelly (1997) and Kelly et al. (1998), TCP users in a wired Internet are studied in La and Anantharam (2000). Rate-based flow control is also studied in Altman et al. (1999).
3. Outline
Network congestion can be tackled at various time scales and protocol layers. In the short term, applications can modify their TCP packet transmission rates or switch to a lower bit-rate codec. Much of the above literature pertains to such a scenario, since user applications running above the TCP/IP layer are the first to notice the debilitating effects of congestion. In this paper, we study this problem at the connection level, involving much larger time scales. Our focus of attention will be connection-oriented networks such as ATM, optical and MPLS networks. Unlike human users who perceive increased delay and jitter in the presence of congestion, our users would most likely be computer agents endowed with machine learning. These would then be trained to capture a certain number of connections based on past observations and local policy. In this paper, we apply the game theory approach to the case of connection-oriented networks. We consider a noncooperative community of users that share a network and specify an example user utility function that corresponds to elastic bandwidth requirements. We study the equilibrium point reached by the users for a fixed charge per circuit per unit time imposed by the network for each successfully established connection. We start with a single link fed with Poisson traffic to elucidate our model. We provide mathematical results to characterize the Nash equilibria and investigate their numerical solution. We then study the user dynamics under varying price and network scenarios. The following notation is used throughout this paper. Vectors are represented in upper case. If Θ is a vector, Θ^l represents its lth iterate, θ_i its ith component and Θ* its optimal value. The rest of the paper is organized as follows: in the next section, we develop a utility function to represent the behavior of an elastic user. We then present our bi-level bandwidth pricing model in Section 5.
After detailing the noncooperative user game in Section 6, we propose novel rate control strategies for user optimization under both Poisson and non-Poisson regimes in Section 7. Numerical results for different scenarios are presented in Section 8. We conclude with a summary and a discussion of future work in Section 9.
4. A user utility function
Since users are portrayed as entities designed to maximize their individual utilities, we now proceed to develop a utility function to model user behavior. It is a generalization of the popular logarithmic utility function, tailored for a connection oriented setting. QoS and
Figure 6.1. Desired bandwidth (θ*(M) vs. market price M)
"bandwidth" in this setting are interpreted as the average number of circuits obtained from the network. Depending upon the quality of service requested, each user would require a minimum bandwidth 7 to satisfy its customers. Fewer circuits than 7 on average are of no utility to the user. The law of diminishing marginal utility ensures that the user derives the same amount of satisfaction from any bandwidth more than the maximum n. Let the network's current charge be $M per circuit per unit-time. The user under consideration is "willing to pay" a maximum of $m per circuit per unit time. When the network price equals the maximal price ra, the user will desire only the minimum acceptable 7 circuits. Any price beyond the maximal price reduces the user's desired number of circuits 0*(M) to zero. Over the interval 0 < M < ra, the desired bandwidth decreases linearly with price with 0*(O) = TT. The user's desired throughput can then be explicitly written as: 0*{M) =
0, if M > m min{rn7/M, IT} if 0 < M < m
(6.1)
In practice, the user will try to choose her throughput θ so as to maximize her net benefit (i.e., utility minus cost), U(θ) − Mθ. The maximizing value of θ is (U′)⁻¹(M). Since this is the desired bandwidth,
Figure 6.2. User utility function (U(θ) vs. bandwidth θ)

we obtain

θ*(M) = (U′)⁻¹(M).      (6.2)
When the user's optimal bandwidth obeys (6.1), her utility function can be shown to be

U(θ) = { mθ,                    if 0 ≤ θ ≤ γ,
       { mγ(ln(θ/γ) + 1),       if γ ≤ θ ≤ π,      (6.3)
       { mγ(ln(π/γ) + 1),       if π ≤ θ.
Note that this utility function is concave and nondecreasing. Also, U is strictly increasing on [0, ∞) only when π = ∞. The case of elastic users considered in Shenker (1995) and Kelly (1997) can be recovered by setting γ = 0 and π = ∞, thereby rendering U strictly concave on [0, ∞). Therefore, the utility function above encompasses a wider spectrum of user behavior by incorporating the range of bandwidth requested by the user.
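To make the piecewise utility (6.3) and the demand curve (6.1) concrete, the following sketch (a Python illustration with parameter values of our choosing; the function names are ours, not the chapter's) evaluates both and checks numerically that the demand curve inverts the marginal utility:

```python
import math

def utility(theta, m, gamma, pi):
    """Piecewise utility (6.3): linear up to gamma, logarithmic between
    gamma and pi, and saturated beyond pi."""
    if theta <= gamma:
        return m * theta
    if theta <= pi:
        return m * gamma * (math.log(theta / gamma) + 1.0)
    return m * gamma * (math.log(pi / gamma) + 1.0)

def desired_bandwidth(M, m, gamma, pi):
    """Demand curve (6.1): theta*(M) = (U')^{-1}(M)."""
    if M > m:
        return 0.0
    return float(pi) if M == 0 else min(m * gamma / M, float(pi))
```

For example, with m = 50, γ = 2, π = 4, the marginal utility at θ = 2.5 is mγ/θ = 40, and the desired bandwidth at price M = 40 is indeed min{mγ/M, π} = 2.5.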
5. Problem formulation
Consider a single resource link consisting of K circuits connecting an origin node to a destination node in a communication network. Users belong to the set 𝒩 = {1, ..., N}, each parameterized by the tuple (m_n, γ_n, π_n, μ_n, λ_n^max), where λ_n^max is the maximal call request rate of user n. QoS requirements prompt each user n to request a bandwidth of θ_n from the
network. The network employs a usage based pricing policy by charging $M per unit bandwidth consumed. Both the network and the users are rational, profit maximizing entities. Further they are assumed to be noncooperative and refuse to divulge their utility functions to one another in the fear of being exploited. We analyze the system behavior determined by network and user interactions as a bi-level optimization problem detailed below.
5.1 Level I: User optimization
Ideally, any resource allocation between competing users should ensure that the total user utility is maximized. The optimal bandwidth allocation is obtained by solving problem A:

max_Θ  Σ_{i=1}^N U_i(θ_i)
subject to  Σ_{i=1}^N θ_i ≤ C.

This is a constrained optimization problem with feasible vectors allocating bandwidth no greater than the available capacity C. Employing Lagrange multipliers, we convert it into the following unconstrained problem:

max_Θ  Σ_{i=1}^N U_i(θ_i) − L (Σ_{i=1}^N θ_i − C),

where L is the Lagrange multiplier for the constraint. The additive nature of the objective function indicates that this problem is separable. The linearity of the supply-demand constraint lends an economic interpretation to the Lagrangian, L being the shadow price for the available bandwidth. The corresponding Karush-Kuhn-Tucker (KKT) conditions for optimality are:

U_i′(θ_i) ≤ L, with equality when θ_i > 0,  i = 1, ..., N,      (6.4)
L (Σ_{i=1}^N θ_i − C) = 0,      (6.5)
L ≥ 0,  Σ_{i=1}^N θ_i ≤ C.      (6.6)
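As a quick sanity check on these conditions, assume (purely for illustration, not the chapter's utility (6.3)) strictly concave utilities U_i(θ) = w_i ln θ. Then U_i′(θ_i) = w_i/θ_i = L for every active user, and the capacity constraint yields a closed-form proportionally fair allocation:

```python
def kkt_allocation(weights, C):
    """Solve problem A in closed form for U_i(theta) = w_i * ln(theta):
    the KKT conditions give w_i / theta_i = L for all i and sum(theta) = C,
    hence L = sum(w) / C and theta_i = w_i * C / sum(w)."""
    total = sum(weights)
    L = total / C          # shadow price of bandwidth (Lagrange multiplier)
    return [w * C / total for w in weights], L
```

With weights (1, 2, 3) and C = 12, the allocation is (2, 4, 6) with shadow price L = 0.5; each user's marginal utility w_i/θ_i then equals L, as the first KKT condition requires.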
However, the noncooperative setting precludes any computation of this objective function, as it requires knowledge of the utilities of the individual users. We thus need to devise a distributed algorithm which can be used by each user to update her allocation without divulging utility information to other users or the network. Individual users can instead solve the simpler one dimensional problem B:

max_{θ_i}  U_i(θ_i) − M θ_i.

This is the classic utility maximization problem of the individual user elucidated in Section 4, with M being a usage based bandwidth price. Although the objective function depends only on the bandwidth consumed by the user, we need to ensure that the bandwidth vector Θ lies in the constraint set

S = {Θ ∈ ℝ₊^N : Σ_{i=1}^N θ_i ≤ C}.

Any solution of problem B also solves the original problem A when the market price M set by the network satisfies the KKT conditions (6.4)-(6.6). In fact, when the network-user system settles to an equilibrium, the demand for bandwidth matches its supply at the market clearing price M**. Problems A and B thus comprise a primal-dual pair. We seek to remove the coupling between the users by performing a transformation akin to the Jordan canonical form employed in the control of linear systems. Our aim is to shift the user problem from the bandwidth space Θ to the arrival rate space Λ. We define the feasible arrival rate set as

D = {Λ ∈ ℝ₊^N : λ_i ≤ λ_i^max, i = 1, ..., N},

and consider the inverse mapping T⁻¹ : D → S given componentwise by

θ_i = λ_i (1 − B_i(Λ)) / μ_i,

with B_i(·) being the blocking probability faced by the ith user. Problem B is guaranteed to have a unique maximum Θ* when the individual utilities are strictly concave. Furthermore, a unique solution Λ* exists in the transformed space if the transformation T : S → D is a one-to-one mapping. Here Λ* is the solution of the system of equations

λ_i* (1 − B_i(Λ*)) / μ_i = θ_i*,   i = 1, ..., N.
The bandwidth game of the next section is an example of such a transformation.
5.2 Level II: Network optimization
The network's utility T(M, Θ) depends on the total revenue generated and hence is a function of the market price and the bandwidth allocated to the various users. It is assumed to be monotonically increasing and strictly concave. The network chooses an appropriate market price by solving the optimization problem C:

max_{M ≥ 0}  T(M, Θ(M)).

Due to the monotonicity of the utility function, this problem is identical to the revenue maximization problem

max_{M ≥ 0}  M Σ_{i=1}^N θ_i(M).

The network cannot explicitly carry out this optimization as it is not cognizant of the dependence between the requested bandwidth and the market price. The noncooperative users may be loath to part with this information or could even be unaware of it themselves. The scenario thus reduces to a Stackelberg game with the network as the leader and the users as followers. The network initializes its algorithm by assigning an initial price M⁰ randomly or based on historical data. Note that while the initial conditions do not alter the final solution, they can significantly affect the number of iterations needed to reach the optimum. The users then treat the market price as given and compute the equilibrium bandwidth vector Θ by solving the user problem B. The network infers this allocation vector by observing the amount of bandwidth consumed, which is then employed to solve problem C with the updated Θ. Figure 6.3 illustrates this network-user interaction.
Figure 6.3. Network-User interaction
For the total user demand Q(M) = Σ_{i=1}^N θ_i(M), the price sensitivity can be computed using the implicit-function rule as

dQ/dM = Σ_{i=1}^N dθ_i/dM = Σ_{i=1}^N 1/U_i″(θ_i) < 0,      (6.7)

owing to the strict concavity of the utility functions. The demand curve thus has a downward slope. The network then adjusts its charge M according to the following feedback based dynamics:

M^{l+1} = [M^l + κ (Q(M^l) − C)]⁺,      (6.8)

where κ > 0 is the step length. The network's revenue M Q(M) equals zero when the market price is either zero or infinity, and its first order condition is satisfied when

d(M Q(M))/dM = Q(M) + M dQ/dM = 0,   i.e.,   M** = −Q(M**) / (dQ/dM) > 0.

By (6.7) the feasibility of M** is assured. Hence the complementary slackness conditions of problem C are satisfied only when all the bandwidth is utilized. Thus the optimal price is also the market clearing price. This ensures that the above tatonnement process (6.8) results in the classic microeconomic love story: supply meets demand. The system eventually settles to the market clearing price M**, with user consumption equalling the available capacity, as in Figure 6.4.
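A minimal sketch of this price adjustment, driven directly by the demand curve (6.1) (the parameter values, step length, and stopping rule below are illustrative assumptions, not taken from the chapter):

```python
def demand(M, users):
    """Total desired bandwidth Q(M) = sum of theta_i*(M) from (6.1);
    users is a list of (m, gamma, pi) tuples."""
    Q = 0.0
    for m, gamma, pi in users:
        if M > m:
            continue                      # price above willingness to pay
        Q += pi if M == 0 else min(m * gamma / M, pi)
    return Q

def tatonnement(users, C, M0, kappa, iters):
    """Feedback dynamics: raise the price under excess demand, lower it
    under excess supply, until Q(M) meets the capacity C."""
    M = M0
    for _ in range(iters):
        M = max(M + kappa * (demand(M, users) - C), 0.0)
    return M
```

For two users with (m, γ, π) = (50, 2, 4) and (100, 2, 4) and C = 3, the price settles near M** = 200/3 ≈ 66.7, where user 1 has dropped out and user 2's demand 200/M exactly fills the link.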
6. The non-cooperative user game
In this section we elucidate the distributed strategy employed for solving problem B. It is assumed that all the users are aware of each others'
Figure 6.4. Bandwidth Supply and Demand
arrival rates.¹ This assumption may not always hold, as in the context of imperfect information or arbitrary delays. In such scenarios the users may estimate this information as in Aresti et al. (2004). In addition, each user's arrival rate is bounded by a maximum arrival rate λ_i^max. The user initializes its "call connection" or bandwidth request arrival rate either randomly or based on historical data. Each user observes its blocking probability every time one of its arrivals enters the network. For the ith user with a blocking probability² of B_i(Λ), the net arrival rate is λ_i(1 − B_i(Λ)). By Little's formula, the mean number of occupied circuits for the ith user is

θ_i(Λ) = (λ_i/μ_i)(1 − B_i(Λ)).      (6.9)
We denote the current arrival rate vector by Λ^k = [λ_1^k, ..., λ_N^k]. This vector is considered common knowledge and may be measured by the network for dissemination to its consumers. The ith user will choose a new arrival rate λ_i^{k+1} so as to maximize her net benefit U_i(θ_i) − Mθ_i,

¹ This is the arrival rate for the connection setup requests and not the transport layer transmission rate. Our model looks at the resource allocation problem at a more abstract "connection" or application layer.
² In the case of a single link fed with Poisson traffic, the blocking observed by all users is the same.
under the assumption that the others do not vary their arrival rates. By (6.2), this means that the ith user will choose a λ_i^{k+1} that satisfies

θ_i(λ_i^{k+1}, Λ_{-i}^k) = (U_i′)⁻¹(M),

where (λ_i^{k+1}, Λ_{-i}^k) represents an N-vector equal to Λ^k except that the ith entry is λ_i^{k+1} instead of λ_i^k. If no such arrival rate exists in the interval [0, λ_i^max], the user will instead choose the maximal rate λ_i^max. So, a more compact expression is

θ_i(λ_i^{k+1}, Λ_{-i}^k) = min{ (U_i′)⁻¹(M), θ_i(λ_i^max, Λ_{-i}^k) }.
From (6.9) this reduces to the following fixed point iteration:

λ_i^{k+1} = min{ μ_i (U_i′)⁻¹(M) / (1 − B_i(Λ^k)), λ_i^max }.      (6.10)

This results in a noncooperative game where the users adjust their rates seeking to maximize their returns. The concavity of the utility functions and the perfect information scenario ensure that this game settles to a Nash equilibrium Λ* (Nash, 1950):
λ_i* = min{ μ_i (U_i′)⁻¹(M) / (1 − B_i(Λ*)), λ_i^max },   i = 1, ..., N.      (6.11)
At the Nash equilibrium, no user can improve his benefit by unilaterally modifying his arrival rate. When there are no caps on the user arrival rates (λ_i^max = ∞) and the user demand for bandwidth is less than the available capacity,

Σ_{i=1}^N (U_i′)⁻¹(M) < C,      (6.12)
then it can be shown that the game converges globally to a unique equilibrium. Furthermore, at this equilibrium, each user's demand is satisfied. However, the network may not be able to satisfy the demands when the bandwidth price M is set low enough. The Nash equilibrium will then move to the boundary of the box ×_i [0, λ_i^max]. Such boundary Nash equilibria may also occur in a bandwidth surplus scenario, when the unconstrained stationary arrival rate exceeds the cap λ_i^max.
7. Applications
The future Internet has been envisioned as a large number of edge networks connected to a transparent high-speed optical core. For scalability reasons, the handling of congestion and bandwidth allocation would fall to the networks at the periphery. The links connecting an edge network to the Internet core would be bandwidth constrained due to the inexorable rise of resource intensive applications and users. We thus apply our pricing scheme to administer the assignment of scarce resources among competing users. We consider two cases, where Poisson users compute their blocking probability using the Erlang formula or, alternatively, its upper bound.
7.1 Erlang system
We begin with a network element ubiquitous in the networking literature: a single link fed with Poisson traffic. The system of users and the network is modeled as a stationary M/GI/K/K queue with total traffic intensity

ρ = Σ_{i=1}^N λ_i/μ_i.

An interesting application of this model is the case of optical networks switching wavelengths using MPλS (Ninan et al., 2002). The aggregate and per-user connection blocking probability in steady state is then given by Erlang's formula (Wolff, 1989),

E(ρ, K) ≡ (ρ^K/K!) / (Σ_{j=0}^K ρ^j/j!).

Equation (6.11) thus reduces to

λ_i* = min{ μ_i (U_i′)⁻¹(M) / (1 − E(ρ*, K)), λ_i^max },   with ρ* = Σ_{j=1}^N λ_j*/μ_j.      (6.13)
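The fixed point is straightforward to compute. The sketch below (the parameters are illustrative assumptions of ours) evaluates Erlang's formula with the standard stable recursion rather than the factorials, and iterates (6.11) for users sharing one link:

```python
def erlang_b(rho, K):
    """Erlang blocking E(rho, K) via the stable recursion
    E(rho, k) = rho*E(rho, k-1) / (k + rho*E(rho, k-1))."""
    B = 1.0
    for k in range(1, K + 1):
        B = rho * B / (k + rho * B)
    return B

def nash_rates(users, K, M, iters=200):
    """Iterate the fixed point (6.11) on a single Erlang link.
    users: list of (m, gamma, pi, mu, lam_max) tuples; M: network price."""
    lam = [1.0] * len(users)
    for _ in range(iters):
        rho = sum(l / u[3] for l, u in zip(lam, users))
        B = erlang_b(rho, K)      # same blocking for every Poisson user
        lam = [min(mu * (0.0 if M > m else min(m * g / M, pi)) / (1.0 - B),
                   lam_max)
               for (m, g, pi, mu, lam_max) in users]
    return lam
```

At equilibrium each user's carried bandwidth λ_i(1 − B)/μ_i matches its desired throughput (U_i′)⁻¹(M) whenever the cap λ_i^max is not binding.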
7.2 Employing upper bounds
The Erlang-B formula computes the blocking probability through a recursive procedure. The total number of recursion calls is equal to the number of circuits present in the link. This could be prohibitive for large capacities thereby hampering the deployment of our rate control algorithm in real time scenarios. Instead, a nonrecursive formula is preferable, which could approximate and provide an upper bound for
the blocking probability. While the performance of such a bound is independent of the capacity C, thus leading to fewer computations, its nonrecursive nature also does away with the storage of previously computed values. We study the bound proposed by Farago (2000), derived using results from large deviations theory. Following Farago's convention, we define the time varying instantaneous bandwidth demand of a traffic flow r as ξ_t^r. The set of active flows is denoted by A_t = {ξ_t^r | ξ_t^r > 0}, with indices r ∈ ⟨A_t⟩ where ⟨A_t⟩ = {r | ξ_t^r > 0}. The offered load φ(t) to the link is a function of all the active calls through it,

φ(A_t) = Σ_{r ∈ ⟨A_t⟩} ξ_t^r.

Here we make the assumption that the offered load is the sum of the individual active flow bandwidth demands. Given the expected value of the offered load F_t = E[φ(A_t)] and a link capacity C satisfying the stability condition (6.12), the link blocking probability is bounded as

P(φ(A_t) > C) ≤ (F_t/C)^C e^{C − F_t}.

Note that the bound tends to unity as the demand approaches the resource capacity. Further, the bound is meaningful only when the traffic intensity is less than C. The fixed point relation thus reduces to

λ_i* = min{ μ_i (U_i′)⁻¹(M) / (1 − (F/C)^C e^{C − F}), λ_i^max },      (6.14)

where F = Σ_{j=1}^N λ_j*/μ_j is the mean offered load.
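This bound has the form of the Chernoff bound for a Poisson-distributed occupancy, so a quick numerical check against an exact Poisson tail is possible (the comparison below is our illustration; the chapter itself only states the bound):

```python
import math

def farago_bound(F, C):
    """Large-deviations upper bound (F/C)^C * e^{C-F} on the blocking
    probability; it falls below 1 only when the mean load F is below C."""
    return (F / C) ** C * math.exp(C - F)

def poisson_tail(F, C):
    """Exact P(X >= C) for X ~ Poisson(F), for comparison."""
    term, cdf = math.exp(-F), 0.0
    for k in range(C):
        cdf += term
        term *= F / (k + 1)
    return 1.0 - cdf
```

For F = 5 and C = 10 the bound gives about 0.145 against an exact Poisson tail of about 0.032; at F = C the bound equals 1, as noted above.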
8. Numerical results

8.1 Computing arrival rates
While performing numerical computations to determine the fixed point using (6.11), we suggest that the termination condition be based on ΔΘ rather than ΔΛ. Thus, for the optimal bandwidth vector Θ* = {θ_1*, ..., θ_N*} and a tolerance δ, the iterations are terminated when

‖Θ^{k+1} − Θ^k‖ < δ.

Denoting the excess bandwidth as ε = K − Σ_{n=1}^N (U_n′)⁻¹(M), we compare the number of iterations, arrival rates and mean bandwidth for a
system consisting of identical users. Figure 6.5 indicates that the mean bandwidth values consumed under the two termination criteria are identical. However, the arrival rates (Figure 6.7) and the number of iterations (Figure 6.6) increase linearly with decreasing ε for ΔΛ, while they remain practically constant for the ΔΘ condition. We present below the results of our investigation into the dynamics of a two-user game. Using the fixed point iteration of (6.11), we arrive at the Nash Equilibrium Point (NEP) starting from the arrival rate vector (1,1). The utility parameters are chosen as γ = (2,2), π = (4,4), μ = (1,1).
8.2 Convergence to Nash equilibrium
Figure 6.8 shows the convergence of the fixed point iteration (6.11) for various starting values of the user arrival rates λ. The λ state space was also exhaustively scanned for other prospective candidates. All iterations were observed to converge to the same NEP irrespective of their starting values. For a sequence {Θ_k} converging to Θ* in ℝⁿ, the Q-rate of convergence is linear if there exist q > 0 and β ∈ (0,1) such that for all k ≥ q

‖Θ_{k+1} − Θ*‖ ≤ β ‖Θ_k − Θ*‖.
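For intuition, Q-linear convergence can be checked numerically on a simple contraction (the map below is illustrative only, standing in for the chapter's fixed point relation):

```python
# Iterate the contraction T(x) = 0.5*x + 1, whose fixed point is x* = 2.
# Q-linear convergence means the error ratio ||x_{k+1} - x*|| / ||x_k - x*||
# is bounded by some beta in (0,1); for this map the ratio is exactly 0.5.
def T(x):
    return 0.5 * x + 1.0

x_star = 2.0
x = 50.0
errors = []
for _ in range(10):
    x = T(x)
    errors.append(abs(x - x_star))

ratios = [errors[k + 1] / errors[k] for k in range(len(errors) - 1)]
print(ratios)  # each ratio equals 0.5 up to floating point
```

Plotting such error sequences on a log scale produces a straight line, which is the behavior visible in Figure 6.9.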
Figure 6.5. Mean bandwidth consumed
Figure 6.6. Iterations (ΔΘ vs. Δλ termination)

Figure 6.7. Arrival rates (ΔΘ vs. Δλ termination)
The linear rate of convergence of the contraction mapping is evident from Figure 6.9.
Figure 6.8. Convergence to NEP (trajectories in the (λ₁, λ₂) plane for starting points (1,1), (25,75), and (75,25))
Figure 6.9. Speed of convergence to NEP (convergence vs. number of iterations k, for starting points (1,1), (25,75), and (75,25))

8.3 Effect of price on bandwidth allocation
We consider the m-asymmetric (m = (50,100)) Erlang system below, similar to Ninan et al. (2002). The network enjoys 100% utilization
Figure 6.10. Bandwidth utilization for Poisson arrivals (arrival rates of users 1 and 2 vs. network price M)
Figure 6.11. Bandwidth utilization under upper bound (arrival rates of classes 1 and 2 vs. network price M)
as long as the network price M is not greater than minᵢ(mᵢγᵢ/πᵢ). As M increases further in Figure 6.10, user 1 enters the price-sensitive (elastic) region and thus reduces its arrival rate to maximize its utility. This decreases the blocking probability of the second user, which decreases its
bandwidth allocation rate so as to keep the desired throughput constant. User 2 continues to request its maximum desired throughput π₂ until the network price exceeds m₂γ₂/π₂. It then enters the price-sensitive zone and encounters a jump at m₁, when user 1 drops out due to the excessive network price. For the upper bound scenario, the variation of bandwidth allocation rates with market price is shown in Figure 6.11. The equilibrium allocation rates are higher in this case because the users base their decisions on the upper bound rather than the true blocking probability.
9. Summary
In this paper, we introduced a pricing strategy specifically catered to connection-oriented networks. The network-user interaction was modeled as a bi-level optimization problem, with the inter-user behavior analyzed as a noncooperative game. Specifically, we concentrated on a group of elastic users sharing bandwidth under a fixed charge-per-bandwidth regime. We proposed rate control strategies for resource sharing under Poisson processes by employing both the Erlang formula and its upper bound. The user dynamics for various network scenarios were then demonstrated. User adaptation in the absence of perfect information about the system, and under arbitrary delays, can be modeled through robust control strategies. The full scope of resource pricing is realized when user-mediated bandwidth allocation is complemented by schemes for adapting the network price to optimize system properties. The interaction between users and the network can then be seen as a feedback signal steering the system towards optimality.
References

Altman, E., Basar, T., and Srikant, R. (1999). Nash equilibria for combined flow and routing in networks: Asymptotic behavior for large number of users. In: Proceedings of IEEE CDC, pp. 4002-4007.

Aresti, A., Ninan, B.M., and Devetsikiotis, M. (2004). Resource allocation games in connection-oriented networks under imperfect information. In: Proceedings of ICC 2001.

Douligeris, C. and Mazumdar, R. (1989). Multilevel flow control of queues. Johns Hopkins Conference on Information Sciences, Baltimore, MD.

Economides, A.A. and Silvester, J.A. (1990). Priority load sharing: An approach using Stackelberg games. In: Proceedings of the Allerton Conference on Communication, Control, and Computing.

Edell, R. and Varaiya, P. (1999). Providing Internet access: What we learn from INDEX. IEEE Network, 13(5).
Falkner, M., Devetsikiotis, M., and Lambadaris, I. (2000). An overview of pricing concepts for broadband IP networks. IEEE Communications Surveys, 3(2).

Farago, A. (2000). Blocking probability estimation for general traffic under incomplete information. In: Proceedings of ICC 2000, New Orleans.

Kelly, F.P. (1997). Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8:33-37.

Kelly, F.P., Maulloo, A.K., and Tan, D.K.H. (1998). Rate control for communication networks: Shadow prices, proportional fairness and stability. Journal of the Operational Research Society, 49:237-252.

Korilis, Y.A., Lazar, A.A., and Orda, A. (1997). Achieving network optima using Stackelberg routing strategies. IEEE/ACM Transactions on Networking, 5(1).

La, R. and Anantharam, V. (2000). Charge sensitive TCP and rate control in the Internet. In: Proceedings of IEEE INFOCOM.

Leland, W.E., Taqqu, M.S., Willinger, W., and Wilson, D.V. (1993). On the self-similar nature of Ethernet traffic. In: D.P. Sidhu (ed.), ACM SIGCOMM, pp. 183-193. San Francisco, CA.

Mackie-Mason, J.K. and Varian, H.R. (1994). Pricing the Internet. In: International Conference on Telecommunication Systems Modeling, pp. 378-393.

Nash, J.F. (1950). Equilibrium points in n-person games. In: Proceedings of the National Academy of Sciences.

Ninan, B.M., Kesidis, G., and Devetsikiotis, M. (2002). A simulation study of noncooperative pricing strategies for circuit-switched optical networks. In: Proceedings of ACM/IEEE MASCOTS.

Odlyzko, A.M. (2001). Internet pricing and the history of communications. Computer Networks, 36:493-517.

Odlyzko, A.M. (2003a). Internet traffic growth: Sources and implications. In: W. Weiershausen, A.K. Dutta, and K.I. Sato (eds.), Optical Transmission Systems and Equipment for WDM Networking II, volume 5247, pp. 1-15.

Odlyzko, A.M. (2003b). The many paradoxes of broadband. First Monday, 8(9):1-15.

Odlyzko, A.M. (2004).
Pricing and architecture of the Internet: Historical perspectives from telecommunications and transportation.

Saraydar, C.U., Mandayam, N.B., and Goodman, D.J. (2000). Power control in a multicell CDMA data system using pricing. In: IEEE VTC, pp. 484-491.

Shenker, S. (1995). Fundamental design issues for the future Internet. IEEE JSAC, 13:1176-1188.

Shenker, S., Clark, D., Estrin, D., and Herzog, S. (1996). Pricing in computer networks: Reshaping the research agenda. ACM Computer Communication Review, 26:19-43.

Wolff, R.W. (1989). Stochastic Modeling and the Theory of Queues. Prentice-Hall, Englewood Cliffs, NJ.
Chapter 7

A NEW APPROACH TO POLICY-BASED ROUTING IN THE INTERNET

Bradley R. Smith
Jose Joaquin Garcia-Luna-Aceves

Abstract
The current Internet routing architecture is based on a distributed, address-based, hop-by-hop routing model. In contrast, current proposals for policy-based routing are based on an on-demand, source-specified routing model. The benefits of the Internet model are that it is very robust, efficient, and responsive. However, it supports only a single forwarding class per destination. As a result, the Internet does not efficiently support quality-of-service (QoS) or traffic engineering, which are critical capabilities for the future of the Internet. In this chapter we review these previous solutions, discuss their limitations, and propose a new distributed policy routing model which we call distributed label-swap routing.
1. Introduction
The architecture of today's Internet is based on the catenet model of internetworking defined in Cerf (1978); Cerf and Cain (1983). In the catenet model, networks are built by the concatenation of disparate networks through the use of routers. The primary goals of the catenet model, and therefore of the Internet architecture, were to support packet-switched communication between computers over internets composed of networks based on diverse network technologies, and to encourage the development and integration of new networking technologies into these internets. To achieve these goals, a simple but powerful routing architecture was adopted.

The Internet routing architecture is based on a best-effort communication model in which traffic is forwarded through an internet along paths that minimize a single, typically delay-related metric (often simply hop count), and in which packets may be dropped or delivered out of order. These paths are constructed by a distributed routing computation in which destination address-based packet forwarding state is computed autonomously by each router for a single forwarding class that provides minimum-delay delivery.

The Internet routing architecture has a number of strengths. It is robust in the sense that it co-locates the routing process with the state it computes, manifesting a design principle called fate-sharing, first described by Clark (1988). This ensures that the failure of any single component of an internet does not invalidate state located elsewhere in the internet, effectively localizing the effects of any failures. The Internet routing architecture is efficient and responsive for two reasons. First, by implementing distributed control of forwarding state, it requires only simplex communication of topology change events: since the routing process is co-located with the forwarding state it controls, a router requires only one-way (simplex) notification of an event from a remote router local to the event that detects it. Second, by assuming a distributed, hop-by-hop routing model, the Internet routing architecture enables the use of more efficient and responsive routing algorithms that can operate with partial information about the topology of the network.

This best-effort, distributed, hop-by-hop routing architecture has proven surprisingly powerful. Indeed, much of the success of the Internet architecture can be attributed to its routing model. However, largely as a product of its own success, limitations of this model are being encountered as it is applied to more demanding applications (see Braden et al. (1994)). The primary limitation of this routing model is that it supports only a single path between any given source and destination. Specifically, Internet forwarding state is composed of a single entry for each destination.
Each entry is composed of the next-hop router on the chosen path to the destination. As a result, the Internet routing architecture supports only one path for any given destination, and that path is computed to optimize a single metric, typically delay or hop count. This model has been extended to multi-path, equal-cost routing, which improves robustness but retains the limitation of supporting only a single performance class. Assumptions of uniform network performance requirements and network usage policies have therefore been "hard-coded" into the Internet architecture. Specifically, the Internet routing architecture assumes that all applications using an internet require minimization of the same performance parameter, and that traffic from these applications may traverse any link in the internet to reach its destination. Clearly, such a model is not adequate for many of the demanding applications to which the Internet is currently being applied.
It is easy to find examples of diverse network performance requirements in the Internet today. While the minimum-delay paths used in the best-effort communications model are well suited to the data services (e.g. e-mail, telnet, http) prevalent in the early Internet, they are inadequate for new applications of Internet technologies. For example, the on-demand delivery of isochronous streams of data (i.e. data requiring delivery within specific time constraints, such as video and audio) requires low delay variance (called jitter), while the interactive delivery of isochronous data (e.g. Internet telephony) requires both low delay and low delay variance. Similarly, the delivery of streaming video requires high bandwidth and is relatively loss-tolerant, while streaming audio requires relatively low bandwidth but is loss-intolerant. Due to the single-class forwarding model used in the Internet architecture, only one of such a set of diverse service models can be effectively supported in an internet today. While some service models satisfy the requirements of others (e.g. a high-bandwidth, low-delay, low-delay-variance model can satisfy the requirements of both video and audio conferencing), this approach does not utilize network resources as effectively as a set of custom service models.

Similarly, it is easy to find examples of network resource usage policies. The inability to provide differentiated services has become a stumbling block to realizing the commercial potential of Internet technologies. Commercial Internet Service Providers (ISPs) would benefit from the ability to provide different levels of service (e.g., bronze, silver, or gold), and a suite of service options (e.g., on-demand video or audio, and interactive video or audio conferencing), that would allow them to extract additional revenue from existing infrastructure. Non-commercial application of similar capabilities would enable the management of network resources.
For example, portions of a network could be allocated along departmental (e.g., accounting, engineering, or sales), functional (e.g., instruction vs. research), and usage (e.g., video, audio, web, or e-mail) lines. Such service differentiation and resource management capabilities are not, in general, possible in the single forwarding class communications model used in the Internet today.

As a special case of service differentiation, the issues of security and trust have become critical for many modern applications of Internet technology. While security was important in the design of the Internet architecture, its implementation and deployment took lower priority than the implementation and deployment of the basic technology for what was still very much a proof-of-concept communications architecture. More recent work (e.g. SSL, Allen and Dierks (1999), and SSH) has focused on application-level, end-to-end security. This has left network-layer security and trust largely unresolved. In general, security and trust in the network layer revolve around questions of who can see traffic as it traverses an internet, and who can generate traffic load targeted at some point in an internet. The former represents a disclosure threat even for end-to-end protected traffic, where traffic analysis may result in significant disclosures. The latter represents a critical denial-of-service threat, as has been demonstrated by the many large-scale DDoS attacks perpetrated in the Internet. Given the single forwarding class communications model underlying the current Internet architecture, these vulnerabilities are fundamentally unresolvable: while end-to-end solutions like those mentioned above can help mitigate the problems, the basic vulnerabilities remain.

The fundamental challenge of policy-based routing is to enhance the Internet routing architecture to support diverse network performance requirements and usage policies without compromising the robustness, efficiency, and responsiveness of the existing distributed, hop-by-hop routing model. The remainder of this chapter reviews previously proposed solutions to this problem and identifies their limitations. It then presents an enhanced Internet routing architecture that supports these requirements for diverse policies without sacrificing the strengths of the original architecture. Lastly, it presents a new family of path-selection algorithms, required by the new architecture, that efficiently compute paths in the context of network performance and usage policies.
2. Previous work
We define policy-based routing as the routing of traffic over paths in an internet that honor policies defining the performance and resource-utilization requirements of the internet. Under this definition, quality-of-service (QoS) routing is the special case of routing in the context of performance policies, and traffic engineering is routing in the context of resource-utilization policies. This definition of traffic engineering is a generalization of the one in current use: the current definition of traffic engineering can be stated as the management of network resources to minimize or eliminate congestion without the use of per-flow resource reservations, whereas the generalized definition used here is the management of network resources to implement arbitrary policies without the use of per-flow resource reservations. Historically, QoS and traffic engineering have been addressed separately; the solution presented here is the first integrated solution to both. There are three main components to a policy-based routing solution: resource management, a routing architecture, and path-selection algorithms.
2.1 Policy-based resource management
Two QoS architectures have been developed, representing fundamentally different approaches to the problem of resource management in the context of performance requirements. The goal of the integrated services (intserv) architecture (Braden et al. (1994)) was to define an integrated Internet service model supporting best-effort, real-time, and controlled link-sharing requirements. Intserv assumes that network resources must be explicitly controlled, and defines an architecture in which applications reserve the network resources required to implement their functionality, together with an infrastructure of admission control, traffic classification, and traffic scheduling mechanisms which implement the reservations. In contrast, the differentiated services (diffserv) architecture provides resource management without explicit reservations. In diffserv, a small set of per-hop forwarding behaviors (PHBs) is defined within a diffserv domain, each providing resource management services appropriate to a class of application resource requirements. Traffic classifiers deployed at the edge of a diffserv domain classify traffic into one of these PHBs. Inside a diffserv domain, routing is performed using traditional hop-by-hop, single-forwarding-class mechanisms.

Resource management for traffic engineering involves the specification of traffic classification rules to identify the policy-significant traffic in an internet, and the definition of resource-utilization policies in terms of these traffic classes. The resource-utilization policies are used as constraints in the path-selection function to compute paths for different traffic classes. Current proposals (Awduche et al. (1999)) define resource-utilization policies by assigning network resources to resource classes, and then specifying which resource classes can be used for forwarding each traffic class.
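One way to read this style of resource-utilization policy is as a link-pruning rule applied before path selection: a link may carry a traffic class only if its resource classes are all permitted for that class. A minimal sketch (the class names and the subset rule are our illustrative assumptions, not a prescribed semantics):

```python
# Each link is tagged with a set of resource classes; each traffic class
# lists the resource classes it is permitted to use. A link is usable for
# a traffic class only if every resource class on the link is permitted.
def link_usable(link_classes: set, permitted: set) -> bool:
    return link_classes <= permitted  # subset test

links = {
    ("a", "b"): {"gold"},
    ("b", "c"): {"gold", "bronze"},
    ("a", "c"): {"bronze"},
}
gold_only = {"gold"}
usable = [edge for edge, cls in links.items() if link_usable(cls, gold_only)]
print(usable)  # only ("a", "b") survives for the gold traffic class
```

Path selection for the traffic class then runs on the pruned subgraph, so resource-utilization policies become constraints on the topology seen by the path-selection function.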
2.2 Policy-based routing architectures
A policy-based routing architecture defines path-selection mechanisms for computing paths through an internet that honor performance and resource-utilization policies, and forwarding mechanisms for forwarding traffic over these paths. While both the intserv and diffserv QoS resource management solutions support the use of single-forwarding-class routing models, the use of policy-based routing solutions results in significantly improved resource utilization. In contrast, traffic engineering resource management solutions require the use of a policy-based routing architecture.
Currently proposed policy-based routing architectures are based on an on-demand, virtual-circuit routing model in which routes are computed on demand (e.g. on receipt of the first packet in a flow, or on request by a network administrator), and forwarding is source-specified through the use of source routing or path setup (Davie and Rekhter (2000)) techniques. These solutions are less robust, efficient, and responsive than the original distributed, hop-by-hop Internet routing architecture. They are less robust due to their use of centralized control of state: for example, the forwarding paths in on-demand routing are brittle because the ingress router controls remote forwarding state in routers along the paths it has set up. They are less efficient and responsive both because of this centralized control of state and because they require overly complex mechanisms to implement some functions. Due to its centralized nature, on-demand routing requires duplex communication of topology change events: since the routing process controls remote forwarding state, a router requires two-way (duplex) communication to receive notification of an event and then send forwarding state updates back into the internet. On-demand routing also requires full-topology routing algorithms to ensure that every router can compute optimal paths to any destination in an internet. Lastly, on-demand routing requires more complex state management mechanisms, such as soft-state timers and repair mechanisms, to manage forwarding state.
2.3 Policy-based path selection
Policy-based path-selection supports traffic engineering by the computation of paths in the context of administrative constraints on the type of traffic allowed over links in an internet. Analogously, policy-based path-selection supports QoS by the computation of paths in the context of multi-component weights (Sobrinho (2002)) assigned to the links in an internet. The metrics used in these computations are assigned to individual links in the network. For a given routing application, a set of link metrics is identified for use in computing the path metrics used in the path-selection decision. Link metrics can be assigned to one of two classes based on how they are combined into path metrics. Concave (or minmax) metrics are link metrics for which the minimum (or maximum) value (called the bottleneck value) of a set of link metrics defines the path metric of a path composed of the given set of links. Examples of concave metrics include residual bandwidth, residual buffer
space, and administrative traffic constraints. Additive metrics are link metrics for which the sum (or product, which can be converted to a sum of logarithms) of a set of link metrics defines the path metric of the path composed of the given set of links. Examples of additive metrics include delay, delay jitter, cost, reliability, and packet loss.

The foundational work on the problem of computing paths in the context of more than one additive metric was done by Jaffe (1984), who defined the multiply-constrained path problem (MCP) as the computation of paths in the context of two additive metrics, a problem known to be NP-complete (Garey and Johnson (1979)). He presented an enhanced distributed Bellman-Ford algorithm that solves this problem with time complexity O(n^4 b log(nb)), where n is the number of nodes in a graph and b is the largest possible metric value. Since Jaffe, a number of solutions have been proposed for computing exact paths in the context of multiple metrics in special situations. Wang and Crowcroft (1996) were the first to present the solution, discussed above, to computing paths in the context of a concave and an additive metric. Ma and Steenkiste (1997) presented a modified Bellman-Ford algorithm that computes paths satisfying delay, delay-jitter, and buffer space constraints in the context of weighted-fair-queuing scheduling algorithms in polynomial time. Cavendish and Gerla (1998) presented a modified Bellman-Ford algorithm with complexity O(n^3) which computes multi-constrained paths if all metrics of paths in an internet are either non-decreasing or non-increasing as a function of the hop count. Recent work by Siachalou and Georgiadis (2003) on MCP has resulted in an algorithm with complexity O(nW log(n)), where W is the maximum link weight. This algorithm is a special case of the policy-based path-selection presented in Section 4 of this chapter.
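The two classes of link metrics above differ only in how they fold into a path metric, which can be stated compactly (a sketch; the function names are ours, the metric examples are from the text):

```python
import math

# Additive metrics (delay, jitter, cost) fold by summation along the path.
def additive_path_metric(link_values):
    return sum(link_values)

# Concave (min-max) metrics such as residual bandwidth fold by taking the
# bottleneck value along the path.
def concave_path_metric(link_values):
    return min(link_values)

# A multiplicative metric such as reliability converts to additive form
# via a sum of logarithms.
def multiplicative_as_additive(link_values):
    return sum(math.log(v) for v in link_values)

delays = [5.0, 3.0, 2.0]           # ms per link
residual_bw = [100.0, 10.0, 40.0]  # Mb/s per link
print(additive_path_metric(delays))      # 10.0 (total path delay)
print(concave_path_metric(residual_bw))  # 10.0 (the bottleneck link)
```

This distinction matters algorithmically: concave metrics can be handled by pruning links below the bottleneck constraint before a shortest-path computation, while combinations of additive metrics lead to the NP-complete MCP problem discussed above.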
As described in Section 4, we have developed special-case versions of the algorithm presented there for QoS and traffic engineering that provide significant improvements on existing solutions such as that presented in Siachalou and Georgiadis (2003). However, comparisons with these solutions are not presented here due to space constraints. While significant improvements in the performance of solutions to the general MCP problem have been obtained (as described above), these solutions still suffer from worst-case runtimes that are exponential in the size of the graph (specifically, worst-case runtimes are pseudo-polynomial, Garey and Johnson (1979)). To address this problem, several algorithms have been proposed that compute approximate solutions to the MCP problem. Both Jaffe (1984) and Chen and Nahrstedt (1998) propose algorithms which map a subset of the metrics comprising a link weight to a reduced range, and show that with such solutions the cost of a
policy-based path computation can be controlled at the expense of the accuracy of the selected paths. Similarly, a number of researchers (Jaffe (1984); Mieghem et al. (2001)) have presented algorithms which compute paths based on a function of the multiple metrics comprising a link weight. In summary, the drawbacks of the current policy-based path selection solutions are that they have poor average-case performance, they implement inflexible path-selection models, and those based on algorithms that compute approximate solutions incur a significant loss in fidelity of the path costs.

Overall, while existing proposals for QoS resource management will work with the traditional single-forwarding-class routing architecture, their effectiveness is severely limited, and traffic engineering resource management cannot work with such routing architectures at all. To realize the potential of policy-based resource management, policy-based path-selection mechanisms must be used. Existing policy-based path-selection architectures are based on on-demand, virtual-circuit routing models which are inherently less robust, efficient, and responsive than the distributed, hop-by-hop model adopted by the Internet architecture. Furthermore, the performance of proposed exact solutions will not support the requirements of a distributed architecture.
3. Distributed label-swap routing
Policy-based routing requires the ability to compute and forward traffic over multiple paths for a given destination. This is clearly the case for traffic engineering, where multiple paths may exist that satisfy disjoint network usage policies. It is also true for QoS, because there may not exist a universally "best" route to a given node in a graph. For example, which of two paths is best when one has a delay of 5 ms and jitter of 4 ms, and the other has a delay of 10 ms and jitter of 1 ms, depends on which metric is more critical for a given application. For FTP traffic, where delay is important and jitter is not, the former would be more desirable. Conversely, for video streaming, where jitter is very important and delay is relatively unimportant, the latter would be preferred. Such weights are said to be incomparable. In contrast, it is possible for one route to be clearly "better" than another in the context of multi-dimensional link weights. For example, a route with a delay of 5 ms and jitter of 1 ms is clearly better than a route with a delay of 10 ms and jitter of 5 ms for all possible application requirements. Such weights are said to be comparable. The goal of routing in the context of multi-dimensional link weights is to find the largest set of paths to each destination with weights that
are "better-than" all other routes in the graph with comparable weights (this definition will be made precise in Section 4.2). The weights in such a set are called the performance classes of a destination. (Formally, the "better-than" relation is a partial ordering on the set of path weights, and the goal of a routing computation is to find the paths with the maximal set of path weights). Such a set of routes is not supported by the current Internet routing architecture because, as described above, the Internet only supports a single path between any given source and destination. The solution proposed here is to use label-swap forwarding technology as a generalized forwarding mechanism to implement multiple paths per destination computed by policy-based path-selection algorithms. The path-selection algorithms compute a set of paths per destination that provide all combinations of performance and use policies existing in an internet. By replacing IP addresses with semantically neutral labels, routes can be assigned a local label, and label-swap forwarding can then be used to forward traffic for each class along an appropriate path. This combination of distributed, hop-by-hop routing with label-swap forwarding is called distributed label-swap routing (DLSR).
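The "better-than" partial order on weight vectors, and the maximal (non-dominated) set it induces, can be sketched as a Pareto-dominance test (smaller is assumed better in every component; the function names are ours):

```python
def better_than(w1, w2):
    """w1 dominates w2: no worse in every component, strictly better in one."""
    return all(a <= b for a, b in zip(w1, w2)) and any(a < b for a, b in zip(w1, w2))

def comparable(w1, w2):
    return w1 == w2 or better_than(w1, w2) or better_than(w2, w1)

def maximal_set(weights):
    """Weights dominated by no other weight: the performance classes."""
    return [w for w in weights if not any(better_than(v, w) for v in weights)]

# (delay ms, jitter ms): (5, 4) and (10, 1) are incomparable, so both
# survive as distinct performance classes; (10, 5) is dominated by (5, 4)
# and is discarded.
paths = [(5, 4), (10, 1), (10, 5)]
print(maximal_set(paths))  # [(5, 4), (10, 1)]
```

This mirrors the chapter's example: neither the (5 ms, 4 ms) nor the (10 ms, 1 ms) path dominates the other, so a policy-based routing computation must retain both.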
Traditionally, label-swap forwarding has been seen as an appropriate match only for an on-demand, source-driven routing model. Indeed, to a large extent, the virtual-circuit nature of these previous solutions has been attributed to their use of label-swap forwarding. Contrary to this view, we submit that host addresses and labels are largely equivalent alternatives for representing forwarding state, and that the virtual-circuit nature of prior architectures derives from their use of a source-driven forwarding model. The primary conceptual difference between address-based and label-swap forwarding is that label-swap forwarding provides a clean separation of the control and forwarding planes (Swallow (1999)) within the network layer, whereas address-based forwarding semantically ties the two planes together unnecessarily. The distinguishing characteristic of DLSR forwarding, as compared with MPLS (Davie and Rekhter (2000)), is that label-swap forwarding state is pre-computed in a distributed manner for all destinations, as compared with on-demand route computation in MPLS. This separation provides what might be called a topological anonymity of the forwarding plane that is critical to the implementation of policy-based routes. Chandranmenon and Varghese (1995) present a similar notion, which they call threaded indices, where neighboring routers share the indexes into their routing tables for specific routes, which are then included in forwarded packets to allow rapid forwarding-table lookups. In addition, they present a modified Bellman-Ford algorithm that exchanges these indices among neighbors. Distributed label-swap forwarding generalizes the threaded index concept to use generic labels (with no direct forwarding-table semantics), uses these labels to represent routing policies computed by the routing protocols, and defines a family of routing protocols to exchange local labels among neighbors.

As illustrated in Figure 7.1 for traffic engineering, distributed label-swap forwarding can be used in the context of traditional address-based forwarding. The figure shows a four-node network with the forwarding tables at each node. In this example the forwarding table is referenced both for traffic classification (through the "address prefix" field) and for label-swap forwarding (through the "local label" field). The difference between DLSR forwarding and virtual-circuit mechanisms (e.g. ATM and MPLS) is the use of topology-driven rather than on-demand routing computation: in DLSR forwarding, routes are pre-computed for all destination and traffic classes. The benefit of this mechanism for traffic forwarding is that it can be generalized to handle policy-based forwarding. Specifically, distributed label-swap forwarding can be used to implement traffic engineering via the assignment of traffic to administrative classes that are used to select different paths for traffic to the same destination, depending on the labeling of links in the network with administrative class sets. For example, Figure 7.2 shows a small network with four nodes, two administrative classes A and B, and the given forwarding state for reaching the other nodes.

The benefits of this architecture are, first, that it is based on forwarding state that is agnostic to the definition of forwarding classes, which allows the data forwarding plane to remain simple yet general. Second, it concentrates the path computation functions in the routing protocol, which is the least time-critical and most flexible component of the network layer. This concept can be generalized to handle QoS in a straightforward manner. The resulting routing architecture can be seen as analogous to the Reduced Instruction Set Computer (RISC) processor architecture, in which researchers shifted much of the intelligence for managing processor resources to the compilers, which were able to bring a higher-level perspective to the task, thus allowing much more efficient use of the physical resources, as well as freeing the hardware designers to focus on performance issues of much simpler processor architectures. Similarly, the communications architecture proposed here shifts the intelligence for customized (i.e. policy-based) path composition to the routing protocols and frees the network layer to focus solely on hop-by-hop forwarding issues, adding degrees of freedom to the network hardware engineering problem that allow for significant advances in the performance and effectiveness of network infrastructure.
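The per-hop forwarding operation itself reduces to a table lookup and a label swap. A minimal sketch (the topology, labels, and table contents are hypothetical; in DLSR these entries would be pre-computed by the routing protocol, one per destination and policy class):

```python
# Per-router label-swap table: incoming label -> (next hop, outgoing label).
# Labels are semantically neutral; distinct labels at the same router can
# select distinct policy-compliant paths to the same destination.
tables = {
    "W": {1: ("X", 7)},
    "X": {7: ("Y", 3)},
    "Y": {3: ("Z", 9)},
    "Z": {9: (None, None)},  # local delivery
}

def forward(router, label):
    """Follow label-swap entries hop by hop until local delivery."""
    hops = [router]
    while True:
        next_hop, next_label = tables[router][label]
        if next_hop is None:
            return hops  # delivered at this router
        router, label = next_hop, next_label
        hops.append(router)

print(forward("W", 1))  # ['W', 'X', 'Y', 'Z']
```

Adding a second entry per router (say, label 2 at W bound to a different outgoing pair) would steer traffic for the same destination over a different administrative-class path, which is exactly the multi-path capability the architecture requires.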
The resulting routing architecture can be seen as analogous to the Reduced Instruction Set Computer (RISC) processor architecture, in which researchers shifted much of the intelligence for managing processor resources to compilers that could bring a higher-level perspective to the task. This allowed much more efficient use of the physical resources, and freed hardware designers to focus on the performance of much simpler processor architectures. Similarly, the communications architecture proposed here shifts the intelligence for customized (i.e., policy-based) path composition to the routing protocols and frees the network layer to focus solely on hop-by-hop forwarding, adding degrees of freedom to the network hardware engineering problem that allow for significant advances in the performance and effectiveness of network infrastructure.
7 A New Approach to Policy-Based Routing

Figure 7.1. Labels with Address-Based Forwarding
Figure 7.2. Labels with Policy-Based Forwarding
The enhancement of traditional unicast routing systems with the policy-based routing technology presented above is straightforward. The routing protocol must be enhanced to carry the additional link metrics required to implement the desired policies. This requires the use of either a link-state or link-vector routing protocol (Garcia-Luna-Aceves and Behrens (1995)) that exchanges information describing link state.
Figure 7.3. Traffic Flow in Policy-Enabled Router
However, for a system depending on on-demand routing computations, a topology broadcast protocol is required to ensure an ingress router has the information it needs to compute an optimal route. In contrast, hop-by-hop based routing systems can work with partial-topology protocols as each routing process is ensured of learning a topology from its neighbors containing optimal routes for reaching all destinations in an internet. The forwarding state must be enhanced to include local and next hop label information in addition to the destination and next hop information existing in traditional forwarding tables. Traffic classifiers must be placed at the edge of an internet, where "edge" is defined to be any point from which traffic can be injected into the internet. Since each router represents a potential traffic source (for CLI and network management traffic), this effectively means a traffic classification component must be present in each router. As illustrated in Figure 7.3, the resulting traffic flow requirements are that all non-labeled traffic (sourced either from a router itself, or from a directly connected host or non-labeling router) must be passed through the traffic classifier first, and all labeled traffic (sourced either from the traffic classifier or a directly connected labeling router) must be passed to the label-swap forwarding process.
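The traffic-flow rule just described amounts to a two-way dispatch: unlabeled traffic passes through the classifier first, labeled traffic goes straight to label-swap forwarding. A minimal sketch, assuming a dictionary-based packet representation and invented helper names:

```python
# Sketch of the traffic-flow rule of Figure 7.3. The packet encoding,
# classifier, and label table below are illustrative assumptions.

def handle_packet(pkt, classify, label_table):
    if pkt.get("label") is None:
        # Sourced by this router itself, a directly connected host, or a
        # non-labeling router: run the traffic classifier first.
        pkt["label"] = classify(pkt["dst"])
    next_hop, next_label = label_table[pkt["label"]]
    pkt["label"] = next_label        # label-swap forwarding
    return next_hop, pkt

classify = lambda dst: 1 if dst.startswith("10.1.") else 2
label_table = {1: ("X", 7), 2: ("Y", 8)}
```

Both entry points converge on the same label-swap process, which is what keeps the forwarding plane simple.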
4. Policy-based path-selection algorithm
We model a network as a weighted undirected graph G = (N, E), where N and E are the node and edge sets, respectively. By convention, the sizes of these sets are given by n = |N| and m = |E|. Elements of E are unordered pairs of distinct nodes in N. A(i) is the set of edges adjacent to node i in the graph. Each link (i, j) ∈ E is assigned a weight, denoted by w_ij. A path is a sequence of nodes ⟨x_1, x_2, …, x_d⟩ such that (x_i, x_{i+1}) ∈ E for every i = 1, 2, …, d − 1, and all nodes in the path are distinct. The weight of a path is given by w_P = Σ_{i=1}^{d−1} w_{x_i x_{i+1}}. The nature of these weights, and the functions used to combine these link weights into path weights, are specified for each algorithm.
4.1 Administrative policies
We use a declarative model of administrative policies in which constraints on the traffic allowed in an internet are specified by expressions in a boolean traffic algebra. The traffic algebra is composed of the standard boolean operations on the set {0, 1}, where a set of p primitive propositions (variables) represents statements describing characteristics of network traffic or global state that are either true or false. The syntax for expressions in the algebra is specified by the BNF grammar:

φ ::= 0 | 1 | v_1 | … | v_p | (¬φ) | (φ ∧ φ) | (φ ∨ φ)
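A minimal evaluator for expressions in this grammar (constants, variables, ¬, ∧, ∨) can be sketched as follows; the nested-tuple expression encoding is an assumption for illustration, not the chapter's representation:

```python
# Evaluate a traffic-algebra expression against a truth assignment (env)
# for the primitive propositions. Expressions are: 0, 1, a variable name,
# or a tuple ('not', e), ('and', e1, e2), ('or', e1, e2).

def evaluate(expr, env):
    if expr == 0 or expr == 1:
        return bool(expr)
    if isinstance(expr, str):            # primitive proposition v_i
        return env[expr]
    op = expr[0]
    if op == "not":
        return not evaluate(expr[1], env)
    if op == "and":
        return evaluate(expr[1], env) and evaluate(expr[2], env)
    if op == "or":
        return evaluate(expr[1], env) or evaluate(expr[2], env)
    raise ValueError("unknown operator: %r" % (op,))
```

For instance, the link predicate bronze ∨ silver ∨ gold of the traffic-engineering example below would be encoded as `("or", "bronze", ("or", "silver", "gold"))`.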
The set of primitive propositions, indicated by v_i in the grammar, can be defined in terms of network traffic characteristics or global state. For example (referring to Figure 7.4), gold could be a variable that is true (has the value 1) if the source or destination IP address of the current packet is that of a "gold" customer who has subscribed to a premium level of service from their Internet service provider (ISP), with silver and bronze defined similarly. In this context, Figure 7.4 illustrates the portion of the ISP's network connecting customers and the ISP's server farm that provides the services offered by the ISP. By assigning the link predicates as shown, the service provider has defined a network resource usage policy that grants gold customers access to all client-server interconnections, silver customers access to two-thirds of this connectivity, and bronze customers to one-third. Consequently, premium customers get quantifiably improved service relative to lower-priority customers, both in terms of reliability and performance. In addition, a SAT(φ) primitive is required for expressions in the traffic algebra, which is the SATISFIABILITY problem of traditional propositional logic. Satisfiability must be tested in two situations by the algorithms presented below for the implementation of traffic-engineering computations. First, an extension to a known route should only be considered if classes of traffic exist that are authorized to use both the path represented by the known route and the link used to extend the path. This is true iff the conjunction of these expressions is satisfiable (i.e., SAT(ε_i ∧ ε_ij), where ε_i is the predicate for the path to i, and ε_ij is the predicate for the link from i to j). Second, given that classes of traffic exist that are authorized to use a path represented by a new route, the algorithms must determine whether all traffic supported by that route has also been satisfied by other, previously discovered shorter routes. This is true iff the new route's traffic expression implies the disjunction of the traffic expressions for all known better routes (i.e., (ε_i → ε_{i_1} ∨ ε_{i_2} ∨ …) is valid, which is denoted by (ε_i → S_i) in the algorithms). Determining if an expression is valid is equivalent to determining if the negation of the expression is unsatisfiable. Therefore, an expression of the form ε_1 → ε_2 is valid iff ¬SAT(¬(ε_1 → ε_2)) (or, equivalently, ¬SAT(ε_1 ∧ ¬ε_2)). SATISFIABILITY is the prototypical NP-complete problem (Garey and Johnson (1979)). As is typical with NP-complete problems, it has many restricted versions that are computable in polynomial time. An analysis of strategies for defining computationally tractable traffic algebras is beyond the scope of this paper; however, we have implemented an efficient, restricted solution to the SAT problem by implementing the traffic algebra as a set algebra, with the set operations of intersection, union, and complement on the set of all possible forwarding classes. In summary, administrative policies are specified for an internet by a set of link and global predicates. These predicates define a set of forwarding classes, and constrain the topology that traffic for each forwarding class is authorized to traverse, as required by the administrative policies.
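The restricted set-algebra solution mentioned above can be sketched in a few lines: each expression is kept in the solved form of the set of forwarding classes that satisfy it, so the boolean operations become set operations, SAT reduces to a non-emptiness test, and validity of an implication to a subset test. The three-class universe is an illustrative assumption.

```python
# Traffic expressions represented directly as sets of forwarding classes.
# The class universe below is assumed for illustration.

UNIVERSE = frozenset({"gold", "silver", "bronze"})

def conj(e1, e2):  return e1 & e2            # conjunction  -> intersection
def disj(e1, e2):  return e1 | e2            # disjunction  -> union
def neg(e):        return UNIVERSE - e       # negation     -> complement

def sat(e):
    """SAT(e): some forwarding class satisfies e."""
    return bool(e)

def valid_implication(e1, e2):
    """e1 -> e2 is valid iff not SAT(e1 AND (not e2))."""
    return not sat(conj(e1, neg(e2)))
```

This trades generality for tractability: every operation above is polynomial in the number of forwarding classes, whereas general SATISFIABILITY is NP-complete.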
4.2 Performance characteristics
As described in Section 2.3, path weights are composed of multi-component metrics that capture all important performance measures of a link, such as delay, delay variance ("jitter"), and available bandwidth. As discussed in Section 3, there is no universally "best" route between two nodes in a graph in the context of multi-component weights. To address this fact, the routing algorithm presented here is based on an enhanced version of the path algebra defined by Sobrinho (2002), which supports the computation of a set of routes containing the "best" routes for each destination.

Figure 7.4. Traffic Engineering Example (links labeled with the predicates gold, silver ∨ gold, and bronze ∨ silver ∨ gold)

Formally, the path algebra P = ⟨W, ⊕, ≺, ⊑, 0, ∞⟩ is defined as a set of weights W, with a binary operator ⊕ and two order relations, ≺ and ⊑, defined on W. There are two distinguished weights in W, 0 and ∞, representing the least and absorptive elements of W, respectively. Operator ⊕ is the original path composition operator, and relation ≺, called "lighter-than," is the original total ordering from Sobrinho (2002). Operator ⊕ is used to compute path weights from link weights. Relation ≺ is used by the routing algorithm to build the forwarding set, starting with the minimal element, and by the forwarding process to select the minimal element of the forwarding set whose parameters satisfy a given QoS request. A new relation on routes, ⊑, called "better-than," is added to the algebra and is used to define classes of comparable routes and to select maximal elements of these classes for inclusion in the set of forwarding entries for a given destination. Relation ⊑ is a partial ordering (reflexive, antisymmetric, and transitive) with the following additional property:

PROPERTY 7.1   (ω_x ⊑ ω_y) ⇒ (ω_x ≽ ω_y).

This relation is equivalent to the concept of dominated paths (Henig (1985)). A route r_m is a maximal element of a set R of routes in a graph if the only element r ∈ R where r_m ⊑ r is r_m itself. A set R_m of routes is a maximal subset of R if, for all r ∈ R, either r ∉ R_m, or r ∈ R_m and, for all s ∈ R − {r}, ¬(r ⊑ s). The maximum size of a maximal subset of routes is the smallest range of the components of the weights (for the two-component weights considered here). An example path algebra based on weights composed of delay and cost is as follows:
0 = (0, 0)
∞ = (∞, ∞)
ω_i ⊕ ω_j = (d_i + d_j, c_i + c_j)
ω_i ≺ ω_j = (d_i < d_j) ∨ ((d_i = d_j) ∧ (c_i < c_j))
ω_i ⊑ ω_j = (d_j ≤ d_i) ∧ (c_j ≤ c_i)
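The example algebra can be transcribed directly into code; the only assumption here is the tuple encoding (delay, cost) for a weight:

```python
# (delay, cost) path algebra: combine adds component-wise, "lighter"
# is the lexicographic total order, "better" is component-wise dominance.

ZERO = (0, 0)
INF = (float("inf"), float("inf"))

def combine(wi, wj):                     # wi (+) wj
    return (wi[0] + wj[0], wi[1] + wj[1])

def lighter(wi, wj):                     # wi lighter-than wj
    return wi[0] < wj[0] or (wi[0] == wj[0] and wi[1] < wj[1])

def better(wi, wj):                      # wi better-than wj: wj dominates wi
    return wj[0] <= wi[0] and wj[1] <= wi[1]
```

Note that Property 7.1 holds for this instance: whenever `better(x, y)` and x ≠ y, `lighter(y, x)` is true, since y is no worse in both components.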
Figure 7.5 is a graphical depiction of the relation ≺ on a set of weights for routes (labeled A through I) to a given destination in an internet, where x ≺ y is depicted as x → y. Figure 7.6 illustrates the relation ⊑, where each route is represented as a subset of the plane with upper left-hand corner at the coordinates for the route. The intuition communicated here is that a route satisfies any constraint pair contained in its sub-region of the plane. Building on this intuition, the relation ⊑ defines an ordering on routes in terms of the containment (subset) of one route's region within another's, i.e., if ω_i ⊑ ω_j, then the set of constraint pairs that route i can satisfy is a subset of those satisfiable by route j. The maximal subset of a set of such routes (the set of routes shown with solid lines in Figure 7.6) contains routes that satisfy all constraint pairs satisfiable by any route in the internet, and is the goal of the routing computation. Clearly, any pair of routes in the maximal subset of routes overlap, and can both satisfy some set of constraint pairs. The relation ≺ is used to select one of the set of satisfying routes for a given constraint. As defined in this example, the relation ≺ has the effect of truncating the extent of a route's region at the first overlapping route to the right in the maximal subset of routes (as shown in Figure 7.7). As a result, forwarding table lookups in this example involve choosing the lowest-delay route with acceptable cost.
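The maximal-subset computation and the forwarding lookup just described can be sketched as follows, again on (delay, cost) tuples; the naive quadratic scan is for illustration only:

```python
def maximal_subset(routes):
    """Routes not dominated (in both delay and cost) by another route."""
    return [r for r in routes
            if not any(s != r and s[0] <= r[0] and s[1] <= r[1]
                       for s in routes)]

def lookup(routes, max_cost):
    """Lowest-delay maximal route whose cost satisfies the constraint."""
    feasible = [r for r in maximal_subset(routes) if r[1] <= max_cost]
    return min(feasible) if feasible else None
```

Because the tuples compare lexicographically, `min(feasible)` returns exactly the lowest-delay route among those with acceptable cost, matching the forwarding-table behavior of Figure 7.7.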
Figure 7.5. ≺ relation

Figure 7.6. ⊑ relation

Figure 7.7. Forwarding table
The goal of routing in the context of multi-component link weights is to find the largest set of paths to each destination with weights that are "better-than" all other routes in the graph with comparable weights. (Formally, the "better-than" relation is a partial ordering on the set of path weights, and the goal of a routing computation is to find the paths with the maximal set of path weights.) In summary, the goal of policy-based routing is to compute the maximal set of routes to each destination in an internet for each traffic class for which a path to the destination exists. In terms of the DLSR architecture, this translates to computing the set of performance classes for each forwarding class in an internet. Realizing this goal requires the ability to efficiently compute paths in the context of link predicates and multi-component link weights, and to efficiently forward traffic over the multiple paths per destination resulting from such a computation.

4.3 Path selection
The notation used in the algorithms presented in the following is summarized in Table 7.1. In addition, the maximum number of unique truth assignments is denoted by A = 2^p, the maximum number of unique weights by W = min(range of weight components), and the maximum number of adjacent neighbors by a_max = max{|A(i)| : i ∈ N}. Table 7.2 defines the primitive operations for queues, heaps, and balanced trees used in the algorithms, and gives the time complexity of each operation used in the complexity analysis of the algorithms (where d is the degree of the tree implementing the heap structure).

Figure 7.8. Model of Data Structures for Basic Algorithms

The algorithm presented in Figure 7.9 implements an SPF-style search through the paths in a graph in order of increasing weight (in terms of ≺), adding the paths with maximal weights (in terms of ⊑) for each traffic class (defined by the traffic algebra). This algorithm is based on the data structure model shown in Figure 7.8. In this structure, a balanced tree (B_i) is maintained for each node in the graph to hold newly discovered, temporarily labeled routes for that node. The heap T contains the lightest-weight entry from each non-empty B_i (for a maximum of n entries). Lastly, a queue P_i is maintained for each node, containing the set of permanently labeled routes discovered by the algorithm, in the order in which they are discovered (which is in increasing weight). The general flow of this algorithm is to take the minimum entry from the heap T, compare it with the existing routes in the appropriate P_i, push it onto P_i if it is incomparable with those routes, and add "relaxed" routes for its neighbors to the appropriate B_j's. The correctness of this algorithm is based on the maintenance of the following three invariants: for all routes I ∈ P and J ∈ B_i, I ≺ J; all routes to a given destination i in P are incomparable for some set of satisfying truth assignments; and the maximal subset of routes to a given destination j in P_j ∪ B_j represents the maximal subset of all paths to j using nodes with routes in P. Furthermore, these invariants are maintained by the following two constraints on actions performed in each iteration of these algorithms: (1) only known-non-maximal routes
Table 7.1. Notation.
P    =  Queue of permanent routes to all nodes.
P_n  =  Queue of permanent routes to node n.
T    =  Heap of temporary routes.
T_n  =  Entry in T for node n.
B_n  =  Balanced tree of routes for node n.
S_n  =  Summary of traffic expressions for all routes in P_n.
Table 7.2. Operations on Data Structures (Ahuja et al. (1993)).

Queue
  Push(r, Q)           Insert record r at tail of queue Q (O(1))
  Head(Q)              Return record at head of queue Q (O(1))
  Pop(Q)               Delete record at head of queue Q (O(1))
  PopTail(Q)           Delete record at tail of queue Q (O(1))

d-Heap
  Insert(r, H)         Insert record r in heap H (O(log_d(n)))
  IncreaseKey(r, r_h)  Replace record r_h in heap with record r having greater key value (O(d·log_d(n)))
  DecreaseKey(r, r_h)  Replace record r_h in heap with record r having lesser key value (O(log_d(n)))
  Min(H)               Return record in heap H with smallest key value (O(1))
  DeleteMin(H)         Delete record in heap H with smallest key value (O(d·log_d(n)))
  Delete(r_h)          Delete record r_h from heap (O(d·log_d(n)))

Balanced Tree
  Insert(r, B)         Insert record r in tree B (O(log(n)))
  Min(B)               Return record in tree B with smallest key value (O(log(n)))
  DeleteMin(B)         Delete record in tree B with smallest key value (O(log(n)))
are deleted or discarded, and (2) only the smallest known-maximal route to a destination i is moved to P_i. In effect, this algorithm computes routes in the virtual graph induced by the link predicates existing in the internet. This virtual graph is composed of all nodes reachable by some path with a satisfiable path predicate, and all links composing these paths. The virtual graph is "discovered" as needed by the algorithm as the computation progresses. The time complexity of Policy-Based-Dijkstra is dominated by the loops at lines 4, 11, and 15. The loop at line 4 is executed nWA times, and the loop at line 15 mWA times. The loop at line 11 scans the entries in P_i to verify that a new route is best for some truth assignment. For a given destination, this loop is executed at most an incrementally increasing number of times, starting at 0 and growing to WA − 1 (the maximum number of unique routes to a given destination), for a total of Σ_{i=0}^{WA−1} i ≈ (WA)²/2 times. For completeness, the statements in lines 6 and 21 take time proportional to log(a_max·WA) for a total of nWA·log(a_max·WA) and mWA·log(a_max·WA), respectively; and those in lines 7-9 and 17-20 take time proportional to log_d(n) for a total of nWA·log_d(n) and mWA·log_d(n), respectively. Therefore, the worst-case time complexity of Policy-Based-Dijkstra, dominated by the loop in line 11, is O(nW²A²). In practice, the cost of this algorithm is limited by
algorithm Policy-Based-Dijkstra
begin
1    Push(⟨s, s, 0, 1⟩, P_s);
2    for each {(s,j) ∈ A(s)}
3        Insert(⟨j, s, ω_sj, ε_sj⟩, T);
4    while (|T| ≠ 0) begin
5        ⟨i, p_i, ω_i, ε_i⟩ ← Min(T);
6        DeleteMin(B_i);
7        if (|B_i| = 0)
8            then DeleteMin(T)
9            else IncreaseKey(Min(B_i), T_i);
10       ε_tmp ← ε_i; ptr ← Tail(P_i);
11       while ((ε_tmp ≠ 0) ∧ (ptr ≠ ∅))
12           ε_tmp ← ε_tmp ∧ ¬ptr.ε; ptr ← ptr.next;
13       if (ε_tmp ≠ 0)
14           then begin Push(⟨i, p_i, ω_i, ε_i⟩, P_i);
15               for each {(i,j) ∈ A(i) | SAT(ε_i ∧ ε_ij)} begin
16                   ω_j ← ω_i ⊕ ω_ij; ε_j ← ε_i ∧ ε_ij;
17                   if (T_j = ∅)
18                       then Insert(⟨j, i, ω_j, ε_j⟩, T)
19                       else if (ω_j ≺ T_j.ω)
20                           then DecreaseKey(⟨j, i, ω_j, ε_j⟩, T);
21                   Insert(⟨j, i, ω_j, ε_j⟩, B_j);
                 end
             end
     end
end

Figure 7.9. General Policy-Based Dijkstra.
the actual number of distinct-weight paths in the graph. The loop at line 11, which dominates the cost of Policy-Based-Dijkstra, is required because there is no way to summarize the permanent routes for a destination. However, for special-case traffic engineering and QoS variants of this algorithm, the permanent routes can be summarized by a summary traffic expression (formed by the disjunction of the permanent routes' path predicates) and by the weight of the last route, respectively. Using these shortcuts, the complexities of the traffic engineering and QoS algorithms are O(mA log(A)) and O(mW log(W)), respectively. Lastly, for these variants, refinements in the data structures result in O(mA log(n)) and O(mW log(n)) complexity. The details of these variants and their analysis can be found in Smith (2003).
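To convey the flavor of the search without the traffic-algebra machinery, here is a self-contained multi-criteria Dijkstra variant over (delay, cost) weights that retains, per node, every non-dominated route. It omits the predicates, per-node balanced trees, and label bookkeeping of Figure 7.9; the graph encoding and the assumption of strictly positive weights are ours.

```python
import heapq

def pareto_dijkstra(adj, s):
    """adj maps a node to a list of (neighbor, (delay, cost)) edges.
    Returns, for each node, its non-dominated route weights in lighter-than
    order. Assumes strictly positive weights so the search terminates."""
    perm = {u: [] for u in adj}             # permanent routes per node (P_i)
    heap = [((0, 0), s)]                    # temporary routes, lightest first
    while heap:
        w, u = heapq.heappop(heap)
        # Discard a route dominated by an already-permanent route to u
        # (the analogue of the traffic-expression scan at line 11).
        if any(p[0] <= w[0] and p[1] <= w[1] for p in perm[u]):
            continue
        perm[u].append(w)
        for v, (d, c) in adj[u]:            # relax all adjacent edges
            heapq.heappush(heap, ((w[0] + d, w[1] + c), v))
    return perm
```

Because the heap pops weights in lexicographic (lighter-than) order, each node's permanent list is produced already sorted, mirroring the invariant that routes enter P_i in increasing weight.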
Figure 7.10. Next Hop Problem with Policy-Based Routing
4.3.1 Routing protocol changes. Lastly, the routing protocol must be enhanced to exchange the information needed to compute the label-swap components of its forwarding tables. The output of the routing algorithm is forwarding information described in terms of a destination, traffic expression, and path weight for each computed route. To be used for forwarding, this information must be augmented with local and next hop labels. To determine the next hop label for a given route, the routing process requires the forwarding tables of its neighbors. Therefore, the final enhancement required of routing protocols is that they exchange local forwarding tables and use this information to compute the next hop labels for their routes. One challenge presented by this requirement is that the routes computed by the routing algorithm must be assured of matching an active route in the selected next hop neighbor. As illustrated in Figure 7.10, this is not guaranteed by the algorithms presented above. Specifically, in this internet there are a number of equally "good" routes from nodes s and i to node d. For example, it is possible that the routing process at node i selects the paths through its neighbors l and j to provide two-hop paths for traffic classes A, B, and C, while node s selects the paths that go through nodes k and m. In such a case there is no next hop label that can be chosen at s for routes to d that will satisfy the traffic policies. To address this problem, Figure 7.11 presents an enhanced version of the algorithm for use in the context of hop-by-hop forwarding. In this algorithm, routes are augmented with two additional fields: n_d is the next hop neighbor for a route to destination d, and l_d is the next hop label for d. As described above, a partial forwarding table is maintained
for each neighbor, specified by F_n[d], containing an array of routes for each destination in the internet. Each entry in this array, denoted ⟨d, ω'_d, ε'_d, l'_d⟩, gives the weight, traffic expression, and next hop label for each route in the neighbor's forwarding table. In this algorithm, new paths are only considered if they are extensions of paths chosen by the neighbor which is the next hop to the predecessor of the path's destination. For example, from Figure 7.10, node s will only consider paths to destination d that are extensions of node i's paths to d through nodes l and j. A fringe benefit of this enhancement is that the next hop label

algorithm HbyH-Policy-Based-Dijkstra
begin
1    Push(⟨s, s, 0, 1, s, 0⟩, P_s);
2    for each {(s,j) ∈ A(s)}
3        Insert(⟨j, s, ω_sj, ε_sj, j, 0⟩, T);
4    while (|T| ≠ 0) begin
5        ⟨i, p_i, ω_i, ε_i, n_i, l_i⟩ ← Min(T);
6        DeleteMin(B_i);
7        if (|B_i| = 0)
8            then DeleteMin(T)
9            else IncreaseKey(Min(B_i), T_i);
10       ε_tmp ← ε_i; ptr ← Tail(P_i);
11       while ((ε_tmp ≠ 0) ∧ (ptr ≠ ∅))
12           ε_tmp ← ε_tmp ∧ ¬ptr.ε; ptr ← ptr.next;
13       if (ε_tmp ≠ 0)
14           then begin Push(⟨i, p_i, ω_i, ε_i, n_i, l_i⟩, P_i);
15               for each {(i,j) ∈ A(i) |
                     (∃⟨j, ω'_j, ε'_j, l'_j⟩ ∈ F_{n_i}[j] :
                         (ε_{s n_i} ∧ ε'_j = ε_i ∧ ε_ij) ∧ (ω_{s n_i} + ω'_j = ω_i + ω_ij)) ∧
                     SAT(ε_i ∧ ε_ij)} begin
16                   ω_j ← ω_i ⊕ ω_ij; ε_j ← ε_i ∧ ε_ij;
17                   if (T_j = ∅)
18                       then Insert(⟨j, i, ω_j, ε_j, n_i, l'_j⟩, T)
19                       else if (ω_j ≺ T_j.ω)
20                           then DecreaseKey(⟨j, i, ω_j, ε_j, n_i, l'_j⟩, T);
21                   Insert(⟨j, i, ω_j, ε_j, n_i, l'_j⟩, B_j);
                 end
             end
     end
end

Figure 7.11. Hop-by-Hop Policy-Based Dijkstra.
computation can now be integrated with the routing computation (as shown by the inclusion of the next hop label in the routes computed by the algorithm).
5. Conclusions
In this chapter we have defined policy-based routing as the computation of paths, and the establishment of forwarding state to implement those paths, in the context of diverse performance requirements and network usage policies. We showed that a fundamental requirement of policy-based routing is support for multiple paths to a given destination, and that the address-based, single-forwarding-class Internet routing model cannot support such a requirement. Furthermore, while proposed QoS resource management solutions are defined to work in a single-forwarding-class environment, their effectiveness is significantly limited by such a constraint. We then showed that previously proposed policy-based routing solutions, which are based on an on-demand, virtual-circuit model, result in significant compromises in robustness, efficiency, and responsiveness in comparison to the Internet's distributed, hop-by-hop routing model. We then presented the DLSR routing architecture, in which the Internet's distributed, hop-by-hop routing model is combined with a label-swap forwarding plane. DLSR is the first distributed, hop-by-hop policy routing architecture, and the first policy routing solution that provides integrated support of QoS and traffic engineering. Lastly, we presented a new family of efficient path-selection algorithms for use in a DLSR-based routing architecture that compute paths in the context of diverse performance requirements and network resource usage policies.
References

Ahuja, R.K., Magnanti, T.L., and Orlin, J.B. (1993). Network Flows: Theory, Algorithms, and Applications. Prentice Hall.
Allen, C. and Dierks, T. (1999). The TLS Protocol Version 1.0. RFC 2246.
Awduche, D.O., Malcolm, J., Agogbua, J., O'Dell, M., and McManus, J. (1999). Requirements for Traffic Engineering Over MPLS. RFC 2702.
Braden, B., Clark, D., and Shenker, S. (1994). Integrated Services in the Internet Architecture: An Overview. RFC 1633.
Cavendish, D. and Gerla, M. (1998). Internet QoS routing using the Bellman-Ford algorithm. In: Proceedings IFIP Conference on High Performance Networking. IFIP.
Cerf, V.G. (1978). The Catenet Model for Internetworking. IEN 48.
Cerf, V.G. and Cain, E. (1983). The DoD Internet architecture model. Computer Networks, 7:307-318.
Cerf, V.G. and Kahn, R.E. (1974). A protocol for packet network intercommunication. IEEE Transactions on Communications, COM-22(5):637-648.
Chandranmenon, G.P. and Varghese, G. (1995). Trading packet headers for packet processing. IEEE/ACM Transactions on Networking, 4(2):141-152.
Chen, S. and Nahrstedt, K. (1998). An overview of quality of service routing for next-generation high-speed networks: Problems and solutions. IEEE Network, pp. 64-79.
Clark, D.D. (1988). The design philosophy of the DARPA internet protocols. Computer Communications Review, 18(4):106-114.
Davie, B. and Rekhter, Y. (2000). MPLS: Technology and Applications. Morgan Kaufmann.
Garcia-Luna-Aceves, J.J. and Behrens, J. (1995). Distributed, scalable routing based on vectors of link states. IEEE Journal on Selected Areas in Communications.
Garey, M.R. and Johnson, D.S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co.
Henig, M.I. (1985). The shortest path problem with two objective functions. European Journal of Operational Research, 25:281-291.
Jaffe, J.M. (1984). Algorithms for finding paths with multiple constraints. Networks.
Ma, Q. and Steenkiste, P. (1997). Quality-of-service routing for traffic with performance guarantees. In: Proceedings 5th International IFIP Workshop on QoS. IFIP.
Mieghem, P.V., Neve, H.D., and Kuipers, F. (2001). Hop-by-hop quality of service routing. Computer Networks, 37:407-423.
Siachalou, S. and Georgiadis, L. (2003). Efficient QoS routing. In: Proceedings of INFOCOM'03. IEEE.
Smith, B.R. (2003). Efficient Policy-Based Routing in the Internet. Ph.D. thesis, University of California at Santa Cruz.
Sobrinho, J.L. (2002). Algebra and algorithms for QoS path computation and hop-by-hop routing in the Internet. IEEE/ACM Transactions on Networking, 10(4):541-550.
SSH. SSH Communications Security. http://www.ssh.com/.
Swallow, G. (1999). MPLS advantages for traffic engineering. IEEE Communications Magazine, 37(12):54-57.
Wang, Z. and Crowcroft, J. (1996). Quality-of-service routing for supporting multimedia applications. IEEE Journal on Selected Areas in Communications, pp. 1228-1234.
Chapter 8

ADVANCED METHODS FOR THE ESTIMATION OF THE ORIGIN-DESTINATION TRAFFIC MATRIX

Sandrine Vaton
Jean-Sebastien Bedo
Annie Gravey

Abstract    For many traffic engineering tasks, telecommunications operators need good knowledge of the traffic that transits through their networks. This information is fully represented by the matrix of the volumes of data that go from any entry node to any exit node during a period of time. This matrix is called the origin-destination (OD) traffic matrix. However, such a matrix is not directly available. Only measures of the volumes of data that transit through a link between routers can be obtained easily with the help of the Simple Network Management Protocol (SNMP). These measures are called link counts. Many techniques have been proposed to estimate the traffic matrix from the link counts. Among those, statistical methods model the demand for each OD pair in order to choose a possible traffic matrix that fits the reality of networks. Nevertheless, the model is often arbitrary and does not take into account the temporal dimension of traffic. In this paper, we claim that a temporal model of the traffic for each OD pair can be trained from the link counts only. We prove the validity of our approach on a one-router network for which direct measurements of the OD counts were made available. Then, we compare our results to other methods and show that our approach is the most accurate.

1. Introduction
With the diversification of network applications, the volume of data carried on international backbones is increasingly sporadic and unpredictable. As a consequence, traffic engineering, dimensioning and routing are outstanding issues. A good knowledge of traffic behavior is necessary for an efficient deployment of traffic engineering tools.
To improve their quality of service (QoS) while reducing their costs, operators must find ways to measure and predict the traffic offered by their customers and peers. That is to say, they need to know the volume of traffic that transits from any edge node to any other edge node of their own network, where an edge node is a point where traffic enters and/or exits the operator's network (i.e., a router or a point of presence (POP)). This volume, expressed in a number of bytes or packets observed during a given period of time (typically 5 or 10 minutes), is called the origin-destination (OD) counts. These OD counts are usually represented in the form of a matrix where lines represent origin nodes and columns represent destination nodes. This N × N matrix, where N is the number of edge nodes connected to the network, is called the OD traffic matrix. Unfortunately, the OD traffic matrix cannot usually be measured directly on large commercial backbones. Indeed, commercial software, e.g., Cisco NetFlow, first samples the packets transiting through a given router and then infers their origin and destination by analyzing their headers. Using this method presents several drawbacks. Firstly, it requires deploying the same software on all core routers, which may not be feasible due to cost and the heterogeneity of networking equipment. Secondly, such software generates a significant CPU demand on the router that may negatively impact global router performance. Thirdly, synchronizing individual measures and storing measurement data is not simple. Finally, these methods provide no guarantee that the sampling procedure provides a good representation of the real OD traffic matrix, due to the heterogeneity of traffic demands. On the other hand, SNMP (Simple Network Management Protocol) counters routinely provide global counts of traffic aggregates observed on router interfaces.
These are naturally called link counts since they correspond to the amounts of traffic transiting on a link between two routers. Furthermore, the network operator is obviously aware of its routing policy and may easily infer the paths followed by packets between origins and destinations. The routing information can be summed up in another matrix, the routing matrix, where each line corresponds to a link in the network and each column to an OD pair. A typical routing matrix element is a weight between 0 and 1 that represents the probability that a packet sent between the origin and destination transits through this link. For example, a weight of 1 (respectively 0.5) means that all (respectively half of) the traffic of the OD pair transits through this particular link. In this paper, we address the issue of inferring the OD matrix from the above easily obtained information, i.e., the link volumes and the routing matrix.
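The routing-matrix mechanics can be illustrated with a toy example; the topology, the 50/50 traffic split, and the OD volumes below are all invented for illustration:

```python
# Illustrative routing matrix for a made-up three-node network (a, b, c):
# rows are links, columns are OD pairs. The 0.5 entries model OD pair
# (a,c) splitting its traffic evenly between the direct link a->c and the
# two-hop path a->b->c. All numbers are assumptions for illustration.

#          OD pair:  (a,b) (a,c) (b,c)
A = [
    [1.0, 0.5, 0.0],   # link a->b: all of (a,b), half of (a,c)
    [0.0, 0.5, 0.0],   # link a->c: half of (a,c)
    [0.0, 0.5, 1.0],   # link b->c: half of (a,c), all of (b,c)
]

def link_counts(A, x):
    """y = A x: the observable link counts for OD volumes x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]
```

Given hypothetical OD volumes x = (10, 8, 4), the operator would observe link counts y = (14, 4, 8); the estimation problem of this chapter runs this mapping in reverse.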
8 Statistical Methods for the OD Traffic Matrix Problem
The traffic matrix problem is typically an ill-posed problem. If y denotes the column vector of the link counts and x denotes the column vector of the OD counts, then

y = Ax    (8.1)
where A is the routing matrix. Note that A is not invertible and that the linear system is actually underdetermined. Indeed, the number r of link counts is usually much smaller than the number c of OD pairs; typically, if N is the number of edge nodes, then r is of the same order of magnitude as N, whereas c is of the order of N^2, where N may be as large as 200. This implies that the number of solutions to the above linear system is infinite. To be more precise, the dimension of the solution space is c − r. To illustrate this issue, consider the example of a very simple network shown in Figure 8.1 with two links and three OD pairs. Obviously y1 = x1 + x3 and y2 = x1 + x2. Let us suppose for example that y1 = 3 and y2 = 5. In this case, there are four non-negative integer solutions (and even more non-integer solutions): (x1, x2, x3) = (0, 5, 3), (1, 4, 2), (2, 3, 1) and (3, 2, 0). These solutions have nothing in common, except that they all satisfy the equations x1 + x3 = 3 and x1 + x2 = 5. Solving the problem of inferring the OD matrix x from the link volumes y and routing matrix A thus consists of selecting one "good" solution among the ones which satisfy Equation (8.1). The rest of the paper is organized as follows. In Section 2, we present a general taxonomy of existing methods for the traffic matrix problem. Then, we present two non-statistical methods: penalization methods in Section 3 and gravity models in Section 4. In Section 5, we begin our exploration of statistical methods for the traffic matrix problem with the Expectation Maximization algorithm. Section 6 is a detailed tutorial on Markov Chain Monte Carlo (MCMC) techniques applied to the estimation of the OD traffic matrix. Then comes the main Section 7, which describes our own algorithm for traffic matrix estimation; it is an improvement over the MCMC technique.
Last but not least, we show in Section 8 the numerical results obtained on datasets from a real network and compare the performance of our algorithm with others. Section 9 concludes.
2.
A taxonomy of OD matrix estimation methods
As we saw in the introduction, the OD matrix can be estimated from the link values as a solution of the linear system y = Ax. As this system is underdetermined, there is an infinite number of solutions and
NEXT GENERATION INTERNET
Figure 8.1. A simple example network with 2 monodirectional links and 3 OD pairs
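The ambiguity in this toy network is easy to verify numerically. The following sketch (our own illustration, not part of the original text) enumerates the non-negative integer solutions of y1 = x1 + x3 = 3 and y2 = x1 + x2 = 5:

```python
# Enumerate non-negative integer OD vectors (x1, x2, x3) satisfying the two
# link equations of the toy network: y1 = x1 + x3, y2 = x1 + x2.
y1, y2 = 3, 5

solutions = []
for x1 in range(min(y1, y2) + 1):
    x2 = y2 - x1  # forced by the second link equation
    x3 = y1 - x1  # forced by the first link equation
    if x2 >= 0 and x3 >= 0:
        solutions.append((x1, x2, x3))

print(solutions)
# [(0, 5, 3), (1, 4, 2), (2, 3, 1), (3, 2, 0)]
```

These are exactly the four solutions listed in the text; nothing in the link counts alone distinguishes among them.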
one solution, or a set of solutions, of that system must be selected as better than the others. To do so, a variety of methods have been considered in the literature. Most of the time, these methods have been validated on synthetic datasets produced with different modelling assumptions (Poisson, Gaussian, etc.). Some of these methods have also been compared on measured backbone traffic (Medina et al., 2002) or on LAN traffic (Cao et al., 2000a). In some cases, cross validation for some OD flows was made possible by running special software (for example Cisco NetFlow) on a few routers. In other cases, a validation was done on synthetic traffic. In this section, we orient the reader through the variety of existing and other possible techniques. Various criteria can be considered to classify existing OD matrix estimation methods, as well as some other methods that have not yet been published but whose study could be of interest.
2.1
Single OD matrix versus set of OD matrices
The OD traffic matrix is an old issue in the dimensioning of telephone networks. Classically, to dimension a circuit-switched telephone network, only the mean of each OD flow is considered. This results in the definition of one single OD traffic matrix. The rationale for doing so is probably the Poisson modelling of telephone networks, since the distribution of a Poisson random variable is completely specified by its mean λ. Since the 1990s it has nevertheless been recognized that data network traffic is not consistent with the Poisson model (Paxson and Floyd, 1995). Internet data traffic indeed has properties such as fractal or multi-fractal behaviour, self-similarity at different timescales, heavy tails, etc. It is recognized that the Poisson model would result in serious failures in modelling, predicting performance, and enabling relevant network design. In practice, on a data network, SNMP measurements routinely provide link volumes at a rate as high as one SNMP request to the router MIB (Management Information Base) every 5 minutes. This provides up to 12 OD matrices per hour, 288 per day, 2016 per week. Each of
these OD matrices xt, t = 1,...,T is a solution of the linear system yt = Axt, where t refers to a given period of time of typical duration 5 or 10 minutes. This set of traffic matrices can then be mapped into a set of constraints for traffic engineering. Taking into account this set of constraints rather than a single average OD matrix will result in more accurate traffic engineering than a simple overdimensioning based on the average OD matrix would do.
2.2
Statistical versus non-statistical methods
As mentioned previously, SNMP measurements routinely provide the link information yt, and the OD matrix xt must be estimated under the constraint that yt = Axt, or yt = Axt + nt if some measurement noise (due to desynchronization of SNMP and/or IGP measurements, for example) is incorporated into the model. As the routing matrix A is not invertible, some additional criteria are needed to select one or a set of solutions as better than the others. To do so, it is possible to consider the OD matrix xt as a random variable: this is the Bayesian approach. The distribution of that random variable is the so-called prior distribution; it incorporates various prior beliefs that the system modeller has about xt, such as its mean, variance, tails, etc. Various distributions can be considered (Poisson, Gaussian, Pareto, mixtures of distributions, ...). It is also possible to take into account some dependence (correlation, etc.) between the different OD pairs, or some dynamic model such as hidden Markov models, fractional Brownian motion, etc., although most of the time, independence between the different OD pairs, as well as independence between successive traffic matrices xt, is assumed. When a statistical model is specified (Bayesian approach), two problems are typically considered: • One problem is to estimate the parameters of the prior distribution (means, variances, etc.) from the constraints yt = Axt induced by successive link values yt. Those parameters are typically mean values λ in the Poisson model, or means and variances in the Gaussian model, but this can be extended to more complicated models such as dynamic models, in which case the parameters can be transition probabilities for example. Different algorithms can be used to estimate the model parameters; each algorithm corresponds to a different criterion and results in a different estimate. Classical criteria are first and second order moments and maximum likelihood.
In the case of first and second moments, the parameters are chosen so as to fit the observed means and variances/covariances of the link counts. In the case of maximum likelihood estimation, the parameters are chosen so that the link values yt, t = 1,...,T are the most likely, that is to say that their probability density function is maximized. • In some cases the parameters (means, variances, etc.) of the model are known quantities (or considered so). Nevertheless, SNMP link measurements yt provide additional information on the OD matrices xt (linear constraints). These additional constraints modify the estimated values of the OD flows xt. If the constraints yt = Axt are taken into account, the distribution of the OD matrix is modified. The resulting distribution is called the posterior distribution. In the Gaussian case, the posterior distribution is still Gaussian and its mean and variance/covariance are obtained analytically (regression). In the other cases, the posterior distribution is not a classical one, but it can nevertheless be sampled from by using specific algorithms, namely Markov Chain Monte Carlo (MCMC) algorithms. MCMC algorithms produce a set x_t^(i), i = 1, 2, 3, ... of traffic matrices, and this set is approximately distributed as xt given that yt = Axt. Various quantities such as means, variances or quantiles can then be calculated from that set as sample averages. Apart from Bayesian methods, non-statistical methods can be used to estimate the OD matrix. Bayesian methods produce good estimates of the OD matrix if the prior distribution is accurate. If not, serious bias can result. In that case, non-statistical methods such as gravity models or pseudo-inverse methods can be used to estimate the OD traffic matrix. The statistical methods based on the first and second moments only are also more robust to model misspecification than the Bayesian ones, which are based on a prior distribution that can be inaccurate. Moreover, the numerical complexity of non-statistical methods is usually lower than that of statistical methods.
On the other hand, if the prior distribution is accurate, Bayesian methods produce better estimates than non-statistical ones. Therefore, non-statistical methods can be used to produce a first estimate x̂t of the OD traffic matrix; then a statistical model can be calibrated on x̂t, t = 1,...,T, and Bayesian methods, taking as inputs the prior distribution and the SNMP measurements yt, can be used to improve the estimates x̂t.
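To make the posterior-sampling idea concrete, here is a minimal Metropolis sampler on the toy network of Figure 8.1. The prior means `lam` are made-up values for illustration, and the sampler walks over the one-dimensional family of feasible solutions (x1, 5 − x1, 3 − x1), x1 ∈ {0,...,3}; real MCMC schemes for this problem are the subject of Section 6:

```python
import math
import random

# Metropolis sampler over the feasible set of the toy network of Figure 8.1:
# every solution of Ax = y has the form x = (x1, 5 - x1, 3 - x1), x1 in {0..3}.
lam = (2.0, 3.0, 1.0)  # hypothetical independent Poisson prior means

def log_prior(x):
    # log of the product of independent Poisson pmfs at x
    return sum(xi * math.log(l) - l - math.lgamma(xi + 1)
               for xi, l in zip(x, lam))

def state(x1):
    return (x1, 5 - x1, 3 - x1)

random.seed(0)
x1, burn, n = 0, 1000, 20000
counts = [0, 0, 0, 0]
for it in range(burn + n):
    prop = x1 + random.choice((-1, 1))        # symmetric random-walk proposal
    if 0 <= prop <= 3 and math.log(random.random()) < \
            log_prior(state(prop)) - log_prior(state(x1)):
        x1 = prop
    if it >= burn:
        counts[x1] += 1

post = [c / n for c in counts]
print(post)   # approximate posterior weights of the four feasible solutions
```

Under these assumed priors, the chain concentrates on the solution (2, 3, 1): the prior picks out one "good" solution among the infinitely many consistent with the link counts.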
2.3
Off-line versus on-line estimation
Existing work has looked into the problem of off-line estimation of the OD matrix, that is to say that SNMP measurements yt are stored, in a database or in text files for example, for t = 1,...,T, and later
on, the OD matrix xt, t = 1,...,T is estimated. Using this information for traffic engineering therefore assumes some form of stationarity of the OD matrix. Nevertheless, on a backbone network, the OD demands are not stationary. There are, for example, day/night effects, as well as variations of the traffic demand over successive days (working versus non-working days, for example). It is nevertheless true that the OD demand values are similar if one considers the same hour of the same day in two different weeks, or, to a lesser extent, the same hour on two consecutive days, so that some information on the present OD demand can be obtained from past SNMP measurements. However, if dynamic traffic engineering is considered, the OD demands must be estimated on-line, that is to say that, each time new SNMP measurements yt are obtained, the estimate x̂t of the OD matrix must be brought up to date. This means that the variations of the OD matrix on the time scale of SNMP measurements are followed on-line, on the basis of the SNMP information. To do so, various estimation techniques could be used, such as the Kalman filter, the extended Kalman filter or particle filters, although the existing literature has focused on the off-line estimation of the OD matrix. The estimation technique essentially depends on the assumed dynamic model of the OD flows. In the case of a linear Gaussian model for example (typically a Gaussian AutoRegressive Moving Average, ARMA, model), the Kalman filter must be used, since it produces the optimal estimate of the OD matrix given the past measurements (minimum mean squared error estimate). In the Gaussian case, this optimal estimate is the expected value x̂t = E(xt | yt, yt−1, yt−2, ...) given all past SNMP measurements. This is not computed from scratch each time new SNMP measurements yt are obtained.
Rather, x̂t is obtained from the new SNMP measurements yt, the previous estimate x̂t−1, and auxiliary quantities (typically filter variances) that are computed on-line. When the dynamic model is Gaussian but not linear, the extended Kalman filter can be used; in that case, the model equations are linearized locally around x̂t and the Kalman equations are applied to the linearized model. In the general non-linear non-Gaussian case, particle filters can be used (Doucet et al., 2001). This last algorithm is based on the iterative simulation of N different trajectories (x_t^(i), t = 1, 2, 3, ...), i = 1,...,N, so-called "particles", and jointly the iterative estimation of their likelihood p(x_1^(i), ..., x_t^(i) | y1, ..., yt) given the SNMP measurements. Once more, the simulation is not restarted from scratch each time new SNMP information is obtained; on the contrary, the N particles x^(i) = (x_1^(i), x_2^(i), ..., x_t^(i)), i = 1,...,N are continued with a new sample x_t^(i) each time new SNMP measurements yt are obtained. Various quantities such as an estimate of the average OD matrix, or an estimate of its quantiles, at time t, can then be obtained from these particles and their likelihoods.
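The on-line Kalman recursion mentioned above can be sketched for the linear Gaussian case. The model below is an assumption of ours (random-walk dynamics x_t = x_{t−1} + w_t with observation y_t = A x_t + v_t, made-up noise levels); the text names the filter but does not fix a specific model:

```python
import numpy as np

# One predict/update step of a Kalman filter for on-line OD estimation,
# assuming random-walk dynamics x_t = x_{t-1} + w_t and y_t = A x_t + v_t.
def kalman_step(x_est, P, y, A, Q, R):
    x_pred = x_est                            # prediction under random walk
    P_pred = P + Q
    S = A @ P_pred @ A.T + R                  # innovation covariance
    K = P_pred @ A.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (y - A @ x_pred)     # update with new SNMP counts y
    P_new = (np.eye(len(x_est)) - K @ A) @ P_pred
    return x_new, P_new

# Toy network of Figure 8.1: 2 links, 3 OD pairs
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
Q = 0.1 * np.eye(3)    # process noise covariance (assumed)
R = 0.01 * np.eye(2)   # measurement noise covariance (assumed)

x_est, P = np.ones(3), np.eye(3)
for _ in range(50):                      # repeated measurements y = (3, 5)
    x_est, P = kalman_step(x_est, P, np.array([3.0, 5.0]), A, Q, R)
print(x_est, A @ x_est)                  # A @ x_est approaches (3, 5)
```

Note the recursive structure: each step uses only the new measurement, the previous estimate, and the filter covariance P, exactly the "not from scratch" property described in the text.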
3.
Penalization methods
As we have previously seen, the number of solutions of the system y = Ax is infinite since A is not invertible. It is therefore necessary to select one solution (or a set of solutions) of that linear system as "better" than the others. To do so, different criteria can be considered, among which are the minimum Euclidean norm (Section 3.1) and second order moments (Section 3.2).
3.1
Pseudo-inverse solution
A classical method (Golub and Van Loan, 1996) in the case of underdetermined linear systems is to select the solution with minimum Euclidean norm, that is to say, to solve the following problem:

Minimize ||x||^2 = Σ_{i=1}^{c} (x_i)^2 subject to y = Ax    (8.2)
The solution of that convex optimization problem with linear constraints is x = A*y, where A* is the pseudo-inverse of the routing matrix A. Note that this solution can be applied to obtain each OD matrix xt = A*yt for each time period t, but it can also be used to obtain an average OD matrix λ̂ = A*ȳ from the time-averaged link values ȳ, where ȳ = (1/T) Σ_{t=1}^{T} yt is the average link values vector and λ = E(xt) is the average OD demands vector. Unfortunately, minimizing the Euclidean norm ||x|| is not a good criterion in that case, since this criterion strongly penalizes the mice/elephants configurations, which are the practical case. Indeed it is common on a backbone network that some OD demands (so-called elephants) are 1000 times bigger than other OD demands (so-called mice). Moreover, positivity is not guaranteed, in the sense that some of the OD pairs might be assigned negative values. This possibly explains why this classical pseudo-inverse method has been set aside for estimating the OD demands matrix, and why no reference is made to it in the OD matrix estimation literature.
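On the toy network of Figure 8.1, the minimum-norm solution of Problem (8.2) can be computed directly with the Moore-Penrose pseudo-inverse; a small sketch:

```python
import numpy as np

# Minimum Euclidean norm solution x = A* y on the toy network of Figure 8.1,
# where A* is the Moore-Penrose pseudo-inverse of the routing matrix A.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
y = np.array([3.0, 5.0])

x = np.linalg.pinv(A) @ y
print(x)                       # minimum-norm point among all x with Ax = y
print(np.allclose(A @ x, y))   # it satisfies the link equations exactly
```

Here the result (8/3, 7/3, 1/3) happens to be positive, but as the text notes, positivity is not guaranteed in general, and the minimum-norm criterion spreads traffic evenly rather than reproducing a mice/elephants pattern.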
3.2
A method based on the first and second order moments
The vector of average OD demands λ is not identifiable from the average link values ȳ = Aλ since A is not invertible (too few linear relations between the average link and OD values). In order to increase the number of relations between link volumes and OD demands, it is possible to take into account not only the average link volumes, but also their variances and covariances. Let Σy be the variance/covariance matrix of the link volumes, Σy = E((yt − ȳ)(yt − ȳ)'), and let similarly Σx = E((xt − λ)(xt − λ)') be the variance/covariance matrix of the OD values. It can be deduced from yt = Axt that Σy = A Σx A'. If moreover the OD demands are supposed to be independent and if some relation between the mean and variance of each OD demand is supposed, then the problem is identifiable with respect to λ, that is to say that a unique value of λ can be obtained from the means, variances and covariances of the links. Usually a power-law relationship between the mean and the variance of the OD values is supposed (Cao et al., 2000a), in the form (σ_i)^2 = φ(λ_i)^a, with typically 1 ≤ a ≤ 2 (in the Poisson case, a is equal to 1, but statistical analyses of measured traffic have shown that in practice 1 < a < 2). With these assumptions, Σx = φ diag((λ_1)^a, ..., (λ_c)^a) and the mean OD values λ are the solution of the following (non-linear) system:

ȳ = Aλ
Σy = φ A diag((λ_1)^a, ..., (λ_c)^a) A'    (8.3)
λ is identifiable from that system. Indeed, if N is the number of network nodes, the number of OD pairs is N(N − 1). The number of network links is only of the order of N (say 5N if there are 5 links per node on average), so that in ȳ = Aλ the number of unknowns is much higher than the number of linear constraints, and therefore λ is not identifiable from the system ȳ = Aλ alone. On the contrary, the number of link variances/covariances is N(N + 1)/2 and, moreover, the network operator knows most of the time which OD demands are always null. Therefore, λ is identifiable from System (8.3), that is to say that a single value of λ can be obtained from (8.3), since the number of unknowns and the number of constraints are of the same order of magnitude N^2 in System (8.3). In conclusion, introducing a mean/variance relationship for the OD demands makes the average OD demands identifiable from the means and the variances/covariances of the link volumes. Nevertheless, this method has a number of limitations. The first limitation is the mean/variance relationship on which the method is based.
This is indeed a strong assumption and there is no guarantee that this relationship is always verified. If it is not, the method can produce very rough results. The second limitation is the assumption of second order stationarity (stationarity of the means and variances/covariances of OD demands). Indeed, this method requires that it makes sense to speak of means and variances/covariances of OD demands, or, in other words, that the OD demands are second order stationary. On the timescales considered (usually several days, with SNMP measurements every 5 or 10 minutes), backbone traffic is not stationary but cyclostationary (Soule et al., 2004), that is to say that means and variances/covariances are periodic in time, with a typical period of 24 hours. This problem cannot be solved by a local stationarity argument. Indeed, SNMP requests cannot be sent at a faster rate than approximately one every 5 minutes, which makes 12 link measurements yt per hour. A minimum of 10 to 20 successive samples is necessary to estimate the average link values and their covariances, that is to say approximately one or two hours of measurements. On a timescale of several hours, second order stationarity is a doubtful assumption. A third limitation of this method is routing instability. Indeed, in this method, the routing matrix A is not time dependent. This is a major restriction since the considered timescales go from a few hours to several days of traffic, and on these timescales routing is likely to change. In order to overcome these limitations, various improvements can be imagined, although none of them have been published in the literature. For example, one solution to the second order non-stationarity problem would be to consider the OD values as cyclostationary signals, and use standard estimation techniques for cyclostationary signals to estimate the OD matrix.
Another solution would be to consider the series of SNMP measurements at the same time (say, 2 pm) on successive days, because this time series is stationary (the same time of the day is always considered), and to produce the average OD demands vector λ by solving System (8.3) where ȳ and Σy are the means and variances/covariances of the link volumes at that time h (say, h = 2 pm). The same process can be repeated for each time h of the day, so that a time-dependent average OD matrix with periodicity 24 hours can be obtained. Different time granularities can be chosen, the finest granularity corresponding to the rate at which SNMP requests are sent. This method needs the SNMP and routing measurement campaign to be sufficiently long (at least a few weeks) for the means and variances/covariances of the link volumes to be computed with sufficient accuracy for each time of the day.
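To see concretely how second moments restore identifiability, take the toy network of Figure 8.1 with a = 1 and φ = 1, so that System (8.3) becomes linear in λ and can be checked directly; the true means below are made-up values for illustration:

```python
import numpy as np

# Toy check of System (8.3) with a = 1, phi = 1: then Sigma_y = A diag(lam) A'
# and both the mean and covariance equations are linear in lam.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
lam_true = np.array([2.0, 3.0, 1.0])          # hypothetical OD means

y_bar = A @ lam_true                          # average link counts
Sigma_y = A @ np.diag(lam_true) @ A.T         # link covariance matrix

# Stack the mean equations and the distinct covariance equations:
# entry (i, j) of Sigma_y equals sum_k A[i,k] A[j,k] lam_k.
rows, rhs = list(A), list(y_bar)
for i in range(2):
    for j in range(i, 2):
        rows.append(A[i] * A[j])
        rhs.append(Sigma_y[i, j])

M, b = np.array(rows), np.array(rhs)
lam_hat, *_ = np.linalg.lstsq(M, b, rcond=None)
print(lam_hat)   # recovers (2, 3, 1): lam is identifiable from (8.3)
```

With the means alone there were 2 equations for 3 unknowns; adding the 3 distinct covariance entries gives 5 consistent equations, and the least-squares solve recovers λ exactly.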
4.
Gravity models and their generalization
4.1
Gravity models
There is a long history of gravity models in telephone networks. Indeed, these models have been used for a long time to dimension circuit-switched telephone networks. Gravity models can be used to estimate Internet OD volumes as well. Gravity models are based on the assumption that the origin (source) and the destination (sink) of a traffic flow are independent. As a result, in these models, the volume x_OD of the traffic flow with origin O and destination D is proportional to x_O., the total volume of traffic with source O, and to x_.D, the total volume of traffic with destination D:

x_OD ∝ x_O. × x_.D    (8.4)
In the case of circuit-switched telephone networks, the total traffic with origin O (respectively with destination D) is proportional to the population of that zone. In the case of an IP backbone network, x_O. and x_.D can be obtained easily from SNMP measurements. Gravity models are therefore easy methods to estimate origin-destination volumes on the network. Apart from their simplicity, gravity models have a number of advantages for estimating the OD demands matrix. First of all, these models are not based on any assumption on the probability distribution of the traffic (for example, Poisson or Gaussian traffic). From that point of view no misspecification can occur. A second advantage is that routing information is not necessary to use gravity models. This is a major advantage, since the measurement of routing information can cause a series of difficulties, among which the synchronization between SNMP and routing measurements, heterogeneous databases, and routing instabilities. A possible limitation of gravity models is the origin-destination independence assumption. This assumption is probably inaccurate to some extent: for example, France and Quebec are French-speaking areas, and therefore the volume of traffic between France and Quebec is maybe greater than what gravity models would estimate. This is true, in particular, on telephone networks, but probably to a lesser extent on IP networks, where HTTP traffic is the majority. Gravity models are nevertheless able to easily produce rough estimates of the OD demand matrix, which can be used as starting points in more sophisticated estimation methods. As one can see from Equation (8.4), the OD matrix estimate x̂t produced by gravity models is not a solution of the linear system yt = Axt. Indeed, gravity models make use of the SNMP volumes measured on the edge nodes, but not of the SNMP volumes on the core nodes or of the routing information of origin-destination flows in the core. If the routing information and the SNMP information in the core are available, it is possible to improve x̂t by an orthogonal projection onto the set (hyperplane) of all matrices xt such that yt = Axt. Indeed, the projected value x̄t of the gravity estimate x̂t is closer to the real value xt than the gravity estimate itself, as one can see from Figure 8.2. The orthogonal projection of the gravity estimate x̂t onto the set {xt : yt = Axt} results in practical cases in an important reduction of the root mean squared error, as we will see in Section 8 (in that case, a 43% root mean squared error reduction is obtained).

Figure 8.2. Projection on the hyperplane yt = Axt
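The projection of Figure 8.2 has a closed form: x̂t + A*(yt − Ax̂t), where A* is the pseudo-inverse of the routing matrix. A small sketch on the toy network (the gravity estimate below is a made-up vector, not derived from real per-node volumes):

```python
import numpy as np

# Orthogonal projection of a gravity estimate onto the hyperplane {x : Ax = y},
# using the least-change correction x + A*(y - Ax) with the pseudo-inverse A*.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
y = np.array([3.0, 5.0])

x_grav = np.array([1.5, 4.0, 2.5])     # hypothetical gravity estimate (8.4)
x_proj = x_grav + np.linalg.pinv(A) @ (y - A @ x_grav)

print(A @ x_grav)    # the gravity estimate violates the link constraints
print(A @ x_proj)    # the projected estimate satisfies Ax = y exactly
```

The correction moves x_grav by the smallest possible amount (in Euclidean norm) needed to satisfy the link equations, which is why the projected point can only be closer to the true matrix than the raw gravity estimate.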
4.2
Gravity models generalization
Another method based on the origin-destination independence assumption has been proposed by Zhang et al. (2003). This method can be seen as a generalization of gravity models. Similarly to the methods of Section 3, it is also a penalization of the linear system y = Ax, but in this case the penalization criterion is the dependence between source and destination of traffic. To be more precise, the distribution of traffic over the different source-destination pairs is searched for, with the link volumes y = Ax and the source-destination independence as constraints. Source-destination dependence can be measured as the mutual information I(S, D) between source S and destination D. Let p(s, d) be the probability that one traffic unit (typically one byte) has source s and destination d. Then the source-destination mutual information is equal to

I(S, D) = Σ_{s,d} p(s, d) log [ p(s, d) / (p(s) p(d)) ]    (8.5)

where p(s) = Σ_d p(s, d) (respectively p(d) = Σ_s p(s, d)) is the probability that one traffic unit has source s (respectively destination d). The source-destination mutual information is also the Kullback-Leibler divergence between the joint source-destination distribution p(s, d) and the product distribution p(s)p(d). For example, when source and destination are completely independent, then I(S, D) = 0 since p(s, d) = p(s)p(d).
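The mutual information above is straightforward to compute for any joint distribution; the sketch below (toy probabilities of our own choosing) checks that I(S, D) = 0 in the independent case and is positive otherwise:

```python
import math

# Source-destination mutual information I(S, D), computed for a joint
# distribution p(s, d) given as a dict keyed by (source, destination).
def mutual_information(p):
    ps, pd = {}, {}
    for (s, d), v in p.items():              # marginals p(s) and p(d)
        ps[s] = ps.get(s, 0.0) + v
        pd[d] = pd.get(d, 0.0) + v
    return sum(v * math.log(v / (ps[s] * pd[d]))
               for (s, d), v in p.items() if v > 0)

# Independent case: p(s, d) = p(s) p(d), so I(S, D) = 0
p_indep = {("a", "c"): 0.3, ("a", "d"): 0.3, ("b", "c"): 0.2, ("b", "d"): 0.2}
# Fully dependent case: each source sends to a single destination
p_dep = {("a", "c"): 0.5, ("b", "d"): 0.5}

print(mutual_information(p_indep))   # ~0: independent source and destination
print(mutual_information(p_dep))     # log(2): maximal dependence here
```

Minimizing this quantity, as the method of Zhang et al. (2003) does, pushes the estimated joint distribution toward the product of its marginals, i.e., toward the gravity solution.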
Minimizing the dependence between source and destination therefore amounts to minimizing the source-destination mutual information I(S, D). Let now N be the total volume of traffic on the network (sum of all source-destination volumes); then the average traffic volume on link l is N Σ_{s,d} A(s, d; l) p(s, d), where A(s, d; l) is the probability that one traffic unit with source s and destination d is routed through link l. Divergence from the SNMP link volume constraints y = Ax can be measured as the quadratic distance between the measured SNMP link volumes y(l) and their average values N Σ_{s,d} A(s, d; l) p(s, d), that is to say that Σ_l [y(l) − N Σ_{s,d} A(s, d; l) p(s, d)]^2 must be minimized with respect to p(s, d). If now both the SNMP link volume constraints y(l) and the source-destination independence are balanced, the estimation of the source-destination probabilities p(s, d) comes down to minimizing a weighted sum of the two criteria:

Minimize Σ_l [y(l) − N Σ_{s,d} A(s, d; l) p(s, d)]^2 + λ^2 I(S, D)    (8.6)
where the minimization is performed with respect to p(s, d), and where λ^2 is a coefficient that can be tuned to give more importance to the first or to the second criterion. Contrary to gravity models, which are based on SNMP information from the edge nodes only, this method includes SNMP information from the core nodes as well. Therefore, the results of this method are more accurate than the results of the gravity method, and the gravity method can be seen as a particular case of this method. But this method needs routing information that is not always easy to obtain, as mentioned previously.
5.
Expectation Maximization (EM) algorithm
5.1
Missing data
The estimation of the OD demands matrix is a typical case of a problem with missing data, that is to say a problem in which some data, the so-called missing data, are not observed, and the observations are a function of the missing data (usually a random function). Typical cases of problems with missing data are mixture models (mixtures of distributions) and hidden Markov models (discrete or continuous time Markov chains in noise). In our case, the missing data are the OD volumes xt, and the observations are the link volumes yt = Axt. Indeed, the OD matrix xt is not observed directly but through the link volumes yt. In problems with missing data, the missing data are considered as random variables,
and the observations are usually a random function of the missing data, although deterministic functions like yt = Axt are also possible. Usually the distribution of the missing data is parametric (Poisson with mean λ, Gaussian with mean m and covariance matrix Σ, etc.), but the values of the parameters θ are unknown. One problem is then to estimate the parameters θ (means, variances, covariances, etc.) from the observed data only, as these data are the only data available for statistical analysis. Different estimation criteria can be considered, such as second order moments or maximum likelihood, in which case one obtains the value of the parameters for which the observations are the most likely.
5.2
EM tutorial
Maximum likelihood estimation of parameters in the case of missing data is usually obtained by carrying out an Expectation Maximization (EM) algorithm (McLachlan and Krishnan, 1997). Let x_{1:T} = (xt, t = 1,...,T) be the missing data, and let y_{1:T} = (yt, t = 1,...,T) be the observations. The likelihood p(y_{1:T}; θ) of the observations cannot be computed, since this would require considering all possible values x_{1:T} of the missing data and adding up the joint likelihood p(x_{1:T}, y_{1:T}; θ) of x_{1:T} and y_{1:T} over all possible values x_{1:T}: p(y_{1:T}; θ) = Σ_{x_{1:T}} p(x_{1:T}, y_{1:T}; θ). This cannot be done in practice since the number of values of x_{1:T} is exponential in T (indeed, if each xt takes K values then the number of values for x_{1:T} is K^T). Rather than maximizing the observed likelihood p(y_{1:T}; θ) directly, since this likelihood cannot be practically calculated, the EM algorithm is based on the maximization of the joint likelihood p(x_{1:T}, y_{1:T}; θ). Indeed, the joint likelihood p(x_{1:T}, y_{1:T}; θ) can be computed easily or, to be more precise, with a linear complexity with respect to T. In the hidden Markov chain case for example, yt depends on xt only, and xt depends on x_{t−1} only, so that the joint likelihood can be factorized as p(x_{1:T}, y_{1:T}; θ) = p(x_1; θ) Π_{t=2}^{T} p(xt | x_{t−1}; θ) Π_{t=1}^{T} p(yt | xt; θ) and, as one can see, the complexity is therefore linear in T. Therefore, if x_{1:T} were not missing, it would be possible to compute the joint log-likelihood log p(x_{1:T}, y_{1:T}; θ) and to maximize this log-likelihood with respect to θ with different optimization techniques (Newton's algorithm for example). But, as previously mentioned, the xt are missing data and therefore it is not possible to compute log p(x_{1:T}, y_{1:T}; θ) and to maximize it with respect to θ, even though the complexity is linear with respect to T. This is why an intermediate function Q is introduced. Since the data x_{1:T} are missing, their distribution given the observations y_{1:T}
serves as a substitute. The function Q is put in place of the joint log-likelihood log p(x_{1:T}, y_{1:T}; θ). Q is the expected value of the joint log-likelihood log p(x_{1:T}, y_{1:T}; θ), where the expected value is taken with respect to the distribution of the missing data x_{1:T} given the observations y_{1:T}:

Q(θ, θ') = E(log p(x_{1:T}, y_{1:T}; θ) | y_{1:T}; θ')    (8.7)
Q is a function of two variables θ and θ', where θ is the parameter of the joint log-likelihood log p(x_{1:T}, y_{1:T}; θ), whereas θ' is the parameter of the distribution of x_{1:T} given y_{1:T}. The EM algorithm then consists of iteratively maximizing Q(θ, θ') with respect to the first parameter θ. More precisely, the EM algorithm is an iterative algorithm, and if θ_k is the parameter estimate produced by the previous iteration, then one additional iteration of the algorithm produces a new estimate

θ_{k+1} = Arg max_θ Q(θ, θ_k)    (8.8)
where Arg max_θ means that θ_{k+1} is the value of θ for which Q(θ, θ_k) is maximum. Each iteration of the EM algorithm is usually decomposed into two steps, the E step (E for Expectation) and the M step (M for Maximization). The E step consists of computing the distribution of x_{1:T} given y_{1:T} and θ_k, since that distribution is required to calculate Q(θ, θ_k). This can be performed in different manners, depending on the statistical model considered. For example, in the case of hidden Markov models, the E step consists of a forward-backward algorithm (Rabiner, 1989), but as previously mentioned, the exact form of the E step depends on the statistical model considered. Once Q(θ, θ_k) has been computed (E step), it is maximized with respect to θ: θ_{k+1} = Arg max_θ Q(θ, θ_k). This second step is the maximization step (M step). In some cases, the maximization step is analytical, that is to say that an analytical formula exists for θ_{k+1}. In other cases, numerical methods must be used to obtain θ_{k+1}. Once the E step and the M step have been performed, the new value θ_{k+1} takes the place of the previous estimate θ_k, and the process of iterating between E step and M step starts over again.
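For a concrete instance with a closed-form M step, one can approximate the link counts as independent Poisson variables y_i ~ Poisson((Aλ)_i); the EM iteration then reduces to the classical multiplicative update below. This independence approximation is our simplification for illustration, not the exact OD estimation algorithm of the literature:

```python
import numpy as np

# EM-type multiplicative update for Poisson means on the toy network,
# under the simplifying assumption y_i ~ Poisson((A lam)_i) independently
# (the Richardson-Lucy / ML-EM style iteration for Poisson linear inverse
# problems). Each pass below is one full E step plus closed-form M step.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
y = np.array([3.0, 5.0])

lam = np.ones(3)                     # initial guess for the OD means
col = A.sum(axis=0)                  # column sums of the routing matrix
for _ in range(5000):
    ratio = y / (A @ lam)            # E step: expected-count ratios
    lam = lam * (A.T @ ratio) / col  # M step: multiplicative update

print(lam, A @ lam)                  # A @ lam converges toward the observed y
```

Note the hallmark EM behaviour: each iteration is cheap and keeps the iterate positive, and the likelihood of the observations never decreases from one iteration to the next.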
5.3
EM convergence and initialization
It can be proven from Equation (8.8) that the observations log-likelihood log p(y_{1:T}; θ) is increased by each new iteration of the algorithm, that is to say that log p(y_{1:T}; θ_{k+1}) ≥ log p(y_{1:T}; θ_k). The parameter
NEXT GENERATION INTERNET
estimate θ_k therefore converges to a maximum of the observations likelihood (or a saddle point) as the number of iterations of the algorithm increases. In practice, between 5 and 10 iterations of the algorithm are usually needed for convergence. The EM algorithm is therefore an algorithmic way to find the maximum likelihood estimate of parameters when some data are missing. One restriction of this algorithm is that it can converge to a local rather than global likelihood maximum if this likelihood is not unimodal in θ. This case occurs in practice, in particular, when real (measured) data are processed rather than synthetic ones. In that case, model misspecifications result in several local maxima of the observations likelihood p(y_{1:T}; θ), and the EM algorithm will in practice converge to one of these local maxima if the starting point θ_0 is far from the global maximum θ*. Therefore, the initialization of the EM algorithm deserves special care, especially when measured rather than synthetic data are processed. To initialize the EM algorithm, it is usually convenient to produce a first estimate θ_0 by a non-Bayesian method (for example, a second-order moments method or a non-statistical method), since non-Bayesian methods are usually less sensitive to model misspecifications. The first iteration of the EM algorithm then produces θ_1 = Arg max_θ Q(θ, θ_0) from θ_0 and the observations y_{1:T}, a second iteration produces θ_2 from θ_1 and y_{1:T}, and so on; after 7 or 8 iterations, convergence to a maximum θ* of the observations likelihood p(y_{1:T}; θ) is obtained.
5.4
EM algorithm to estimate the OD demand matrix
As previously mentioned, OD matrix estimation is a classical case of a missing data problem, and the EM algorithm is a natural solution for this kind of problem. Indeed, it has been proposed to estimate the OD demand matrix or, more precisely, its statistical parameters (OD flow means, variances, etc.) for various OD flow models. Cao et al. (2000a) have described the EM estimation in the Gaussian case: in that case, the mean λ and the variance Σ of the OD demand matrix are estimated. Vardi (1996) has proposed the EM algorithm to estimate the average OD flow values λ in the Poisson case. In the Gaussian case (Cao et al., 2000a), the OD demands are supposed to be independent and Gaussian. Moreover, a mean-variance power-law relationship is supposed, in the form var(x^i) = φ (λ_i)^c, where x^i is the volume of one OD pair. The distribution of x is therefore N(λ, Σ) with Σ = φ diag(λ^c). The value of c is supposed to be known
8 Statistical Methods for the OD Traffic Matrix Problem
(a typical value is c = 2) and the values of λ and φ are estimated from the SNMP values y_{1:T} = (y_1, ..., y_T) by an EM algorithm. In this case, the E step of the algorithm is analytical. Indeed, the E step consists of calculating the distribution of x_t given y_t and θ^(k) = (λ^(k), Σ^(k)), and this distribution is Gaussian, with mean m_t^(k) and covariance matrix R^(k):

m_t^(k) = E(x_t | y_t; θ^(k)) = λ^(k) + Σ^(k) A' (A Σ^(k) A')^{-1} (y_t − A λ^(k))
R^(k) = Var(x_t | y_t; θ^(k)) = Σ^(k) − Σ^(k) A' (A Σ^(k) A')^{-1} A Σ^(k)
    (8.9)
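The conditional mean above can be computed with a few lines of code. The sketch below is illustrative only: the matrix helpers are hand-rolled for a two-link, three-OD-pair network, and the routing matrix A (links y_1 = x_1 + x_3 and y_2 = x_1 + x_2, consistent with Table 8.1) and the parameter values are assumptions.

```python
# Matrix helpers (lists of lists), enough for a 2-link / 3-OD-pair example.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def e_step_gaussian(y, lam, Sigma, A):
    """Conditional mean m = lam + Sigma A' (A Sigma A')^{-1} (y - A lam)."""
    At = [list(col) for col in zip(*A)]
    S = inv2(matmul(matmul(A, Sigma), At))           # (A Sigma A')^{-1}
    resid = [[yi - sum(a * l for a, l in zip(row, lam))]
             for yi, row in zip(y, A)]               # y - A lam, as a column
    gain = matmul(matmul(Sigma, At), S)              # Sigma A' (A Sigma A')^{-1}
    corr = matmul(gain, resid)
    return [l + c[0] for l, c in zip(lam, corr)]
```

With Σ = 4·I (i.e., φ = 1, c = 2, λ = (2,2,2)) and y = (3, 5), the returned mean is (2, 3, 1); note that the conditional mean reproduces the link counts exactly (A m = y), which is a useful sanity check.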
where λ^(k) and Σ^(k) are the estimates of λ and Σ produced by the previous iteration of the EM. As in this case the successive values x_t, t = 1, ..., T, of the OD matrix are independent, and as y_t is a deterministic function of x_t since y_t = A x_t, the function Q(θ, θ_k) of Equation (8.7) is also equal to Q(θ, θ_k) = Σ_{t=1}^{T} E(log p(x_t; θ) | y_t; θ_k). If now the Gaussian log-likelihood is put in place of log p(x_t; θ) and the expectation E(· | y_t; θ_k) is taken in the sense of N(m_t^(k), R^(k)) (the distribution of x_t given y_t and θ_k), the function Q(θ, θ_k) can be written as:

Q(θ, θ_k) = −(T/2) [ log|Σ| + tr(Σ^{-1} R^(k)) ] − (1/2) Σ_{t=1}^{T} (m_t^(k) − λ)' Σ^{-1} (m_t^(k) − λ)    (8.10)

The maximization step (M step) of the algorithm then comes down to maximizing (8.10) with respect to λ and φ, with Σ = φ diag(λ^c), so as to produce λ^(k+1) and Σ^(k+1). This maximization is performed by a numerical method, since there is no analytical solution to this maximization problem. The EM algorithm in the case of a Gaussian model with a mean-variance power-law relationship has been proposed and validated by Cao et al. (2000a) on traffic measured on a router in a Local Area Network at Lucent. As the stationary Gaussian assumption does not hold on this dataset, Cao et al. have used a local stationarity argument: the EM algorithm described above is carried out on a sliding window, thus producing a time-varying estimate of (λ, φ). The major limitation of this approach is that the EM algorithm must be carried out for each window, so the computational load is very heavy: the EM algorithm is already iterative and, moreover, within each of its iterations the maximization step must be performed numerically. In some previous work, Vardi (1996) had described the EM estimation of the mean λ of the OD demands matrix in the Poisson case. We will not describe in all its details the EM algorithm to estimate the OD
demands matrix from the routing and SNMP information in the case of Poisson traffic, since it is very similar to the Gaussian case. It is nevertheless important to remark that in the Poisson case the E step is not analytical, that is to say that the distribution of x_t given y_t and λ^(k) is not easily obtained. To overcome that problem, Vardi approximates, in the E step of the EM algorithm, the Poisson distribution with mean λ by a Gaussian distribution with mean m = λ and covariance matrix Σ = diag(λ), so as to make the E step analytical. The rationale is that, when the average OD demands λ take large values, the Gaussian distribution is a valid approximation of the Poisson distribution.
6.
Markov Chain Monte Carlo (MCMC) algorithms
As mentioned in Section 2, a possible approach to infer the OD matrix x from the link volumes y and the routing matrix A is the Bayesian one (Tebaldi and West, 1998). Tebaldi and West (1998) use Markov Chain Monte Carlo (MCMC) algorithms to infer x given y and A. In this section, we describe and discuss MCMC algorithms, in particular the Gibbs algorithm and the Hastings-Metropolis algorithm, in full detail, and then we discuss their possible use to estimate the OD demands matrix (Tebaldi and West, 1998).
6.1
MCMC tutorial
In general, MCMC algorithms are computationally intensive simulation algorithms that can be used to solve a variety of problems, one of them being the inference of unobserved data (so-called hidden, missing, or latent data) when some observations are available and these observations are deterministic or random functions of the hidden data. Hidden data models are also the framework in which the EM algorithm can be used. The difference between the EM algorithm and MCMC algorithms is that the EM algorithm estimates the OD flows' statistical parameters (mean, variance, etc.), whereas MCMC algorithms produce, by simulation, a set of traffic matrices x that are approximately distributed as x given y and A. Another major difference is that EM estimation of the traffic matrix parameters has been studied in the Gaussian and Poisson cases, and other models would need additional work (the exact form of the E and M steps indeed differs from one model to another), whereas MCMC algorithms can be used for any OD flow distribution, without any additional work, provided that the probability density function (pdf) of the OD flows can be calculated.
The principle of any MCMC algorithm (Gibbs algorithm, Hastings-Metropolis algorithm, etc.) is to generate, by computer simulation, a discrete-time Markov chain that converges in distribution to a given target distribution (specified by the software developer). By doing so, one obtains a series of samples that are approximately distributed as the target distribution, provided the MCMC algorithm is run for a sufficiently long time and the first samples produced by the algorithm are left apart (warm-up period). The target distribution is specified by the software developer and can be changed at will, as is the case for any parameter of a function (for example, in C programming, pointers to functions can be used). In the case of hidden data, this target distribution is the so-called posterior distribution, that is to say the distribution of the hidden data x given the observed data y. This posterior distribution can be characterized by its probability density function (pdf) p(x | y), which is equal to the following ratio:

p(x | y) = p(x) p(y | x) / ∫_{x'} p(x') p(y | x') dx'    (8.11)

In Equation (8.11), p(x) is the pdf of the prior distribution, that is to say the assumed distribution of the hidden data x when no additional information is provided by the observations y. The prior distribution p(x) expresses the software developer's prior beliefs about the range of possible values for x and the likelihood of these different values. It can be based on some prior knowledge of the studied system (previous measurements, etc.) if such knowledge is available; if not, it can be more or less arbitrary (we will discuss that point later). In Equation (8.11), p(y | x) is the probability density function of y given x, since the observations y are, in general, a random function of the hidden data x.
However, in the particular case of traffic matrix inference, the observations y (link counts) are a deterministic function of the missing data x (OD counts), in the form y = Ax, so that in that case p(y | x) is a Dirac. Most of the time, the integral ∫_{x'} p(x') p(y | x') dx' in Equation (8.11) cannot be computed, either analytically or by numerical methods, since it is a non-classical integral over a multidimensional set. The posterior distribution is therefore specified only up to some multiplicative factor, and this factor is unknown. In practice, one can only write:

p(x | y) ∝ p(x) p(y | x)    (8.12)
where the ∝ sign means proportionality. MCMC algorithms are able to produce a Markov chain that converges in distribution to the posterior distribution p(x | y) even though the pdf of the posterior distribution is defined up to a multiplicative factor only. This is one of the very positive points of MCMC algorithms. Another positive point is that the prior distribution p(x) can be changed at will, so that different models for the hidden data (say Poisson, Gaussian, Pareto, etc., with or without dependence between the different components of x) can be tested by the software developer. It is thus possible to adapt to different situations, corresponding to different a priori knowledge of the studied system, or to compare these different situations with each other. As we have seen before, MCMC algorithms produce, by intensive computer simulation, a series of samples x_n, n = 1, 2, ..., that are approximately distributed as p(x | y). This means that a histogram of this series would fit well with the pdf of the hidden data given the observations, p(x | y). This histogram, or equivalently the series x_n, n = 1, 2, ..., itself, sums up all the information concerning the hidden data x that one is able to get from the observations y, and not only partial information such as the mean or the variance. Various statistics can be obtained from that histogram: for example the mean and variance, but also covariances or quantile-like quantities. For example, the probability that one component x_i exceeds a specific threshold, or the probability that, jointly, two components x_i and x_j are over two specific thresholds, can be estimated from the series x_n, n = 1, 2, .... When x is the OD matrix, quantiles for example could be very useful for capacity planning or load balancing since, from them, one can compute a "worst case" (biggest OD pair value, biggest pair of OD values, etc.) with a given reliability.
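Such exceedance probabilities and sample means are computed from the simulated series by simple counting. A minimal sketch, in which a hand-made list of samples stands in for real MCMC output:

```python
def sample_mean(samples):
    """Componentwise sample average of a series of simulated OD vectors."""
    n = len(samples)
    return [sum(x[i] for x in samples) / n for i in range(len(samples[0]))]

def exceedance(samples, i, threshold):
    """Estimate P(x_i > threshold) as the fraction of simulated
    vectors whose i-th component exceeds the threshold."""
    return sum(1 for x in samples if x[i] > threshold) / len(samples)
```

The same counting scheme gives joint exceedance probabilities by testing two components at once inside the comprehension.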
Derived quantities such as the mean, variance, and quantiles are usually computed from the series x_n, n = 1, 2, ..., as sample averages. For example, the mean is approximated as m̂ = (1/T) Σ_{n=1}^{T} x_n, the variance/covariance matrix as Σ̂ = (1/T) Σ_{n=1}^{T} (x_n − m̂)(x_n − m̂)', where ' denotes transposition, and the tail probabilities of component x_i as Ĝ_i(x) = (1/T) Σ_{n=1}^{T} 1_{x_n^i > x}, where 1_{x_n^i > x} is 1 if the value of x_n^i is greater than x and 0 otherwise (a count of the number of samples such that x_n^i > x). MCMC algorithms are a family of algorithms based on the same principle: the simulation of a discrete-time Markov chain that converges, in distribution, to a target distribution (for example the posterior distribution p(x | y) in the case of hidden data). Different algorithms belong to that family, among which the Metropolis algorithm and the Gibbs algorithm. Both algorithms are used by Tebaldi and West (1998) to estimate
the OD demands matrix from SNMP measurements. This is the reason why we describe the Gibbs algorithm (Section 6.1.1) and the Metropolis algorithm (Section 6.1.2). The combination of the Gibbs algorithm and the Metropolis algorithm (the so-called Metropolis within Gibbs algorithm) to estimate the OD demands matrix is then described in Section 6.2.

6.1.1 Gibbs algorithm. In a general setting, the Gibbs algorithm is used to simulate a multidimensional random variable z = (z_1, ..., z_N) under its joint distribution p(z). Usually it is impossible to sample directly from the joint distribution p(z). On the contrary, in many applications, the conditional distributions p_i(z_i | z_{−i}) are easy to simulate, where p_i(z_i | z_{−i}) is the conditional distribution of z_i when the other components (vector z_{−i}) have fixed values. The principle of the Gibbs algorithm is to simulate one component z_i at a time: each component z_i is simulated under its conditional distribution p_i(z_i | z_{−i}), where z_{−i} takes current values. The different components of z are swept iteratively as follows:

z_1^{n+1} ~ p_1(z_1 | z_2^n, ..., z_N^n)
z_2^{n+1} ~ p_2(z_2 | z_1^{n+1}, z_3^n, ..., z_N^n)
...
z_N^{n+1} ~ p_N(z_N | z_1^{n+1}, ..., z_{N−1}^{n+1})
    (8.13)
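As a concrete instance of the sweep in Equation (8.13), consider a zero-mean bivariate Gaussian target with correlation ρ, for which both conditionals are one-dimensional Gaussians: z_i | z_{−i} ~ N(ρ z_{other}, 1 − ρ²). This toy target is an assumption chosen for illustration; it is not the OD posterior.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=20000, burn_in=2000):
    """Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho:
    each component is redrawn from its Gaussian conditional in turn."""
    z1 = z2 = 0.0
    out = []
    s = math.sqrt(1.0 - rho ** 2)        # conditional standard deviation
    for n in range(n_iter):
        z1 = random.gauss(rho * z2, s)   # draw z1 | z2 (first line of 8.13)
        z2 = random.gauss(rho * z1, s)   # draw z2 | z1 (second line of 8.13)
        if n >= burn_in:                 # discard the warm-up samples
            out.append((z1, z2))
    return out
```

After the warm-up period, the retained pairs have (approximately) the target correlation ρ, which can be checked against the empirical correlation of the output.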
Equation (8.13) describes iteration n + 1 of the Gibbs algorithm. The upper indices n (respectively n + 1) refer to the quantities produced after n (respectively n + 1) iterations of the algorithm. When n is large, z^n = (z_1^n, ..., z_N^n) is approximately distributed under the joint distribution p(z). The Gibbs algorithm is therefore a very simple yet powerful algorithm to simulate multidimensional random variables.

6.1.2 Metropolis algorithm. The Metropolis algorithm makes possible the simulation of a random variable when its probability density function is defined up to a multiplicative factor, as is the case in Equation (8.12) for example. The principle of the Metropolis algorithm is to draw a random variable from a distribution that one can simulate, and to accept that random variable (the so-called candidate) with a probability equal to the Metropolis ratio. This ratio is a function of the likelihoods of the draw and of the previous sample under the distribution that one wants to simulate (the target distribution) and under the distribution that one can simulate (the instrumental distribution). The sequence of random variables produced by the Metropolis algorithm converges to the target distribution when the number of iterations of the algorithm is large. Let p(z) be the pdf of the target distribution (the one from which one would like to obtain samples), and let q(z) be the pdf of the instrumental distribution (the one from which one is able to produce samples easily). The Metropolis algorithm is an iterative algorithm, and if z_k is the sample produced by the previous iteration, one new iteration can be described as follows:

1. Draw z̃_{k+1} with distribution q(z); z̃_{k+1} is the candidate.
2. Accept z̃_{k+1} with probability
α = min( 1, p(z̃_{k+1}) q(z_k) / [ p(z_k) q(z̃_{k+1}) ] )    (8.14)

where the ratio inside the minimum is the Metropolis ratio.
3. If z̃_{k+1} is accepted, then z_{k+1} = z̃_{k+1}; if not, then z_{k+1} = z_k.

Here z_k is the sample produced by the previous iteration of the Metropolis algorithm and z_{k+1} is the sample produced by the new iteration: z_{k+1} is equal to the candidate z̃_{k+1} if the candidate has been accepted and, if not, to the output z_k of the previous iteration. When the number k of iterations is large, z_k is distributed as the target distribution p(z). The Metropolis algorithm is therefore an algorithm to produce samples distributed as a target distribution that cannot be sampled from directly, provided that the pdf p(z) of this distribution is specified (up to a multiplicative factor). It is based on a recycling principle, the recycling process being in this case the random acceptance or rejection of the candidates. The candidates are produced with an instrumental distribution from which values are drawn easily. As the pdf p(z) of the target distribution is used in the Metropolis ratio only, and as it appears in both the numerator and the denominator of that ratio, it is possible to use the Metropolis algorithm when p(z) is defined up to a multiplicative factor only.
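Steps 1-3 can be sketched as follows, assuming for illustration a uniform instrumental distribution on a bounded interval and a target specified only through an unnormalized log-density (both assumptions are ours, not the chapter's). With a uniform q, the q terms in the ratio (8.14) cancel.

```python
import math
import random

def metropolis(log_p, n_iter=30000, burn_in=3000, width=5.0):
    """Independence Metropolis sampler: candidates are drawn uniformly on
    [-width, width]; log_p is the log of the target density, known only
    up to an additive constant (i.e., p up to a multiplicative factor)."""
    z = 0.0
    out = []
    for n in range(n_iter):
        cand = random.uniform(-width, width)     # step 1: draw the candidate
        # step 2: accept with the Metropolis ratio; uniform q cancels
        if random.random() < math.exp(min(0.0, log_p(cand) - log_p(z))):
            z = cand                             # candidate accepted
        # step 3: otherwise recycle the previous sample z
        if n >= burn_in:
            out.append(z)
    return out
```

For instance, `metropolis(lambda z: -0.5 * (z - 1.0) ** 2)` targets an unnormalized N(1, 1) density; after the warm-up, the empirical mean and variance of the output are close to 1.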
6.2
Metropolis within Gibbs algorithm to estimate the OD traffic matrix
This section is a detailed description of the MCMC approach for computing the OD traffic matrix (Tebaldi and West, 1998). The goal is to simulate x under its posterior distribution p(x | y) with the constraint that y = Ax (Equation 8.1). The inputs are the vector of link counts y, the routing matrix A, as well as a prior distribution on the OD pairs
that are supposed to be independent:

p(x) = ∏_{i=1}^{c} p_i(x_i)    (8.15)
In this section we follow Tebaldi and West (1998). The routing matrix A has full row rank r. Then, up to some linear combinations of the rows of A and some permutations of the columns of A, one can write:
A = [ A_1 | A_2 ]    (8.16)
where A_1 is an invertible r × r matrix and A_2 is an r × (c − r) matrix. Naturally, the same linear combinations should be applied to the components of y, and a reordering of the OD pairs should also be performed, so that Equation (8.1) still holds. x is similarly split into an upper part of size r and a lower part of size c − r: x = (x_1', x_2')'. Then it results from Equations (8.1) and (8.16) that:

x_1 = A_1^{-1} (y − A_2 x_2)    (8.17)
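For the toy network of Table 8.1 (links y_1 = x_1 + x_3 and y_2 = x_1 + x_2, an assumption consistent with that table), the split gives A_1 = [[1,0],[1,1]] (so A_1^{-1} = [[1,0],[-1,1]]) and A_2 = [[1],[0]], with x_3 as the free component. A short sketch of Equation (8.17):

```python
def complete_od_vector(y, x2, A1_inv, A2):
    """Recover the dependent components x1 = A1^{-1} (y - A2 x2), so that
    the completed vector (x1, x2) satisfies y = A x exactly."""
    resid = [yi - sum(a * v for a, v in zip(row, x2))
             for yi, row in zip(y, A2)]                 # y - A2 x2
    return [sum(a * r for a, r in zip(row, resid)) for row in A1_inv]
```

Each choice of the free component yields one admissible traffic matrix: with y = (3, 5), x_3 = 2 gives x = (1, 4, 2) and x_3 = 3 gives x = (0, 5, 3), both rows of Table 8.1.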
where A_1^{-1} is the inverse of A_1. Therefore, x_2 is a set of free variables, and the simulation of p(x | y) reduces to simulating x_2 under the posterior distribution p(x_2 | y) and then getting x_1 from Equation (8.17). In Tebaldi and West (1998), the authors suggest using a Metropolis within Gibbs algorithm to simulate x_2 given y. The simulation of x_2 under the posterior distribution p(x_2 | y) can be performed by running a Gibbs algorithm. As the principle of the Gibbs algorithm is to update one component of x_2 at a time, each component x_2^i is drawn from the distribution p(x_2^i | x_2^{−i}, y) of that component given y and all the other components x_2^{−i} = (x_2^1, ..., x_2^{i−1}, x_2^{i+1}, ..., x_2^{c−r}). It is therefore necessary to compute the probability density functions p(x_2^i | y, x_2^{−i}). After a few straightforward computations it can be proved that:
p(x_2^i | y, x_2^{−i}) ∝ p_i(x_2^i) ∏_{j=1}^{r} p_j(x_1^j)    (8.18)

where ∝ means proportionality. Equation (8.18) is taken directly from Tebaldi and West (1998), Equation 6; see also Vaton and Gravey (2003) for a proof. Each iteration of the Gibbs algorithm amounts to sequentially updating the different components x_2^i of the vector x_2, and then calculating x_1 from Equation (8.17). As the posterior distribution p(x_2^i | y, x_2^{−i}) is not a classical distribution (Poisson, Gaussian, etc.), the samples x_2^i cannot be obtained
directly. As one can see from Equation (8.18), the probability density function p(x_2^i | y, x_2^{−i}) is defined up to a multiplicative factor. Therefore, it is necessary to use a Metropolis algorithm to draw each component x_2^i. The target distribution for the Metropolis algorithm is the distribution of x_2^i conditionally on y and x_2^{−i}, whose probability density function satisfies:

p(x_2^i | y, x_2^{−i}) ∝ p_i(x_2^i) ∏_{j=1}^{r} p_j(x_1^j)    (8.19)

In Equation (8.19), the quantities x_1^j implicitly depend on x_2^i, since x_1 = A_1^{-1} (y − A_2 x_2), where the x_2^j, j ≠ i, and y have fixed values and x_2^i varies. As described in Section 6.1.2, candidate samples are drawn from an instrumental distribution and are randomly accepted or rejected, the acceptance probability being the Metropolis ratio. The instrumental distribution can be, for example, the prior distribution p_i(x_2^i) of x_2^i. Indeed, with this choice of instrumental distribution, the expression of the Metropolis ratio is simpler than in Section 6.1.2:

α = min( 1, ∏_{j=1}^{r} p_j(x̃_1^j) / ∏_{j=1}^{r} p_j(x_1^j) )    (8.20)

where x̃_1 = A_1^{-1} (y − A_2 x̃_2), x̃_2 being the candidate, and x_1 = A_1^{-1} (y − A_2 x_2).
6.2.1 Pros and cons of Metropolis within Gibbs OD demands matrix estimation. In conclusion, it is possible to use a Metropolis within Gibbs algorithm to produce a set of traffic matrices x^(n), n = 1, 2, ..., given the vector of SNMP measurements y. For that, it is necessary to specify the prior distribution p(x) of the OD traffic matrix x, that is to say the distribution of the OD traffic matrix when no information is provided by SNMP measurements. This prior distribution can be a classical traffic model (for example Poisson or Gaussian), but it can also be a traffic model of any type, provided that its probability density function is specified. This traffic model flexibility is one positive point of the method, together with the ability to produce a set of OD demand matrices rather than one single estimate, since various quantities such as quantiles can be calculated from this set. On the other hand, the computational load of a Metropolis within Gibbs algorithm is very high, since an iterative algorithm (Metropolis) is carried out a large number of times within each iteration of another iterative algorithm (Gibbs). Computational load is the main drawback of this
method, and its practical application to backbone networks of 50 nodes or more is probably doubtful.
7.
Our method to estimate the OD traffic matrix
In Sections 3, 4, 5 and 6, various existing OD demand matrix estimation techniques have been presented, and their pros and cons have been discussed. In the present section, we present the algorithm that we have developed. This algorithm is an improvement over the Metropolis within Gibbs algorithm presented in Section 6. The improvement relies on the fact that, with our method, it is possible to estimate the prior distribution p(x_t) of the OD flows x_t from the SNMP measurements y_t, whereas in the classical Metropolis within Gibbs algorithm this prior distribution is arbitrary. In addition, a dynamical model is explicitly taken into account for each OD flow. Taking into account the dependence between the successive values taken by the same OD flow results in an important reduction of the OD matrix estimation error, as will be demonstrated in Section 8.
7.1
Importance of the prior distribution
First of all, we will show that in Bayesian estimation it is fundamental to base the estimation on a sufficiently accurate prior distribution. Let us come back to the toy example network of Figure 8.1 to develop an intuition of this fact. Suppose that y_1 = x_1 + x_3 = 3 and y_2 = x_1 + x_2 = 5, and let us compare the likelihoods of the candidate solutions (x_1, x_2, x_3) = (0,5,3), (1,4,2), (2,3,1), (3,2,0). Two prior distributions are considered: independent Poisson with means λ = (2,2,2), and independent Poisson with means λ = (1,3,2). The likelihoods of the different solutions are summed up in Table 8.1.

Table 8.1. Likelihood of the 4 solution traffic matrices for the network of Figure 8.1 with link values y_1 = 3 and y_2 = 5. Left column: x_1 ~ P(2), x_2 ~ P(2), x_3 ~ P(2). Right column: x_1 ~ P(1), x_2 ~ P(3), x_3 ~ P(2).

x = (x_1, x_2, x_3)   p(x; λ=(2,2,2))   p(x; λ=(1,3,2))
(0,5,3)               0.0009            0.0067
(1,4,2)               0.0066            0.0167
(2,3,1)               0.0132            0.0112
(3,2,0)               0.0066            0.0019

As one can see from Table 8.1, if for example x_1, x_2 and x_3 are Poisson with means λ_1 = λ_2 = λ_3 = 2 (left column), then the most likely solution is (x_1, x_2, x_3) = (2,3,1), while if λ_1 = 1, λ_2 = 3 and λ_3 = 2 (right column), the most likely solution is this time (x_1, x_2, x_3) = (1,4,2). Moreover, when λ_1 = λ_2 = λ_3 = 2, the most likely solution (2,3,1) has likelihood value p(x; λ) = 0.0132 and two direct competitors, (1,4,2) and (3,2,0), with the same likelihood value p(x; λ) = 0.0066, that is to say two times less likely than (2,3,1). When λ_1 = 1, λ_2 = 3 and λ_3 = 2, the most likely solution is (1,4,2), with likelihood value p(x; λ) = 0.0167, and this solution has one direct competitor, (2,3,1), with likelihood value p(x; λ) = 0.0112, that is to say 33% less likely than (1,4,2). As one can see from this toy example, the prior distribution has a strong influence on the likelihood of the different solutions. Two different prior distributions therefore induce a different selection of one solution x of y = Ax as the most likely one. The prior distribution should therefore not be an arbitrary choice; on the contrary, it should be an accurate picture of the real distribution. By real distribution, we mean the distribution that one would observe if it were possible to measure the OD demands directly. But obviously, as direct measurement of the OD values is not possible, designing an accurate prior distribution is a challenging task. This is a fundamental problem in OD matrix estimation by Bayesian methods that has not been sufficiently studied in the literature. Solving that problem is absolutely necessary to apply these methods to real traffic. In this section, we develop a new estimation method that one can see as an improvement over the MCMC method of Tebaldi and West (1998). Our method is original in that:

1. the prior distribution p(x) is estimated from the link measurements y_t, t = 1, 2, ..., T, rather than being an arbitrary choice, and
2. a dynamic model (hidden Markov model) can be introduced explicitly to take into account dependence between successive OD matrix values x_t, rather than assuming time independence.
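The entries of Table 8.1 can be reproduced directly from the Poisson probability mass function; a short sketch:

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def prior_likelihood(x, lam):
    """Joint prior likelihood p(x; lambda) of one candidate traffic matrix
    under independent Poisson OD demands, as in Table 8.1."""
    p = 1.0
    for xi, li in zip(x, lam):
        p *= poisson_pmf(xi, li)
    return p
```

For instance, the most likely candidate switches from (2,3,1) under λ = (2,2,2) to (1,4,2) under λ = (1,3,2), exactly as in Table 8.1.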
7.2
A divide and conquer method
The principle of our method is to deal with the influence of both dimensions (time t and OD pair index i) by a divide and conquer method. We divide the whole problem into two subproblems, each treated independently (one for time dependency and one for OD pair dependency). Global convergence is nevertheless obtained by exchanging information between the two subproblems. By doing so, we manage to keep the numerical complexity of solving the global problem to a reasonable degree. Our algorithm consists of a loop (feedback), as displayed in Figure 8.2. This loop can be divided into two boxes. The first box (Metropolis within Gibbs) deals with the OD pair dependency. The second box (EM algorithm) deals with the time dependency. The process of exchanging information between the two boxes is iterated until convergence to a fixed point is obtained. We will discuss some issues of the loop process in Section 7.4. The first box runs the MCMC methods proposed by Tebaldi and West (1998) to simulate the traffic matrix. Its inputs are the link counts and a prior distribution for each time measurement period. It produces as output the estimated traffic matrix for each time measurement period. We will not go further in the description of this box, since it relies on the Hastings-Metropolis within Gibbs algorithm, which has already been presented in Section 6.1.2. The second box consists of an Expectation Maximization algorithm with a BIC criterion. It fits a mixture of Gaussian distributions with an unknown number of components to the successive values taken by each OD pair. Its inputs are estimates of the successive values of each OD pair. It produces the parameters (weights, means and standard deviations) of the mixture that best fits that OD pair for each time period. The parameters of the priors are then provided as input to the first box (Metropolis within Gibbs) in a feedback loop (in the form of a new prior distribution for each OD pair and each time measurement period). Note that the weights of the Gaussians of the mixture change over time. This means that the supposed distribution of each OD pair varies throughout the successive measurement periods. By letting the weights vary over time, non-stationarities in the traffic (Cao et al., 2000a,b) can be taken into account. We will explain some details about the second box in Section 7.3. The algorithm has already been validated on simulated traffic in previous papers (Vaton and Gravey, 2002, 2003). In this paper, the algorithm is validated on real traffic data from a single-router network for which direct measurements of the OD counts were made available through specific software (Cao et al., 2000a,b). This is the purpose of Section 8.
7.3
Estimating the prior distribution
In this section, we detail the second box, which deals with the subproblem of time dependency. Our goal is to propose a suitable model for the prior distribution of each OD pair at each measurement time. As one can see from various applications, many distributions on real data can be modelled successfully by a mixture model (Titterington et al., 1985). Moreover, the mixture model makes it possible to adapt to different distributions at each time measurement period by playing on the weights of the components in the mixture, since these weights can change at each time period. Mixtures therefore provide a flexible, general and yet very simple framework for taking the traffic variability into account. Let us consider the successive values taken by only one OD pair in time. The successive values x_1, x_2, ..., x_T of that OD pair are distributed as a mixture of K Gaussians:

p(x_k) = Σ_{j=1}^{K} w_j(k) G_{μ_j, σ_j}(x_k)    (8.21)
In Equation (8.21), G_{μ,σ} is the probability density function of the Gaussian distribution with mean μ and standard deviation σ, and w_j(k) is the probability that x_k is a sample from component j of the mixture. The problem now is to estimate the number K of components, as well as their means μ_j, standard deviations σ_j and weights w_j(k), from the successive values taken by that OD pair. Of course, the real values of the OD pair are unknown; only estimates of these values are available. Therefore, in our global iterative algorithm we do not use the real values of that OD pair to compute the parameters of the mixture, but their estimates. At the first iteration of the algorithm, the estimates are provided by, for example, a generalized gravity model (Duffield et al., 2003); for the next iterations, the estimates are those provided by the MCMC method described in Section 6.1.2. Let us consider that K has a fixed value and that we want to determine the parameters of the model which best fit the distribution of one OD pair. To be more precise, we adjust the parameters (weights, means and standard deviations) in order to maximize the likelihood of the estimated OD pair x = (x_1, x_2, ..., x_T). An Expectation Maximization (EM) algorithm (Dempster et al., 1977) is used for this. We have described the principles of the EM algorithm in Section 5, so we will not insist on this point. The main difference here is that we use the estimates of the OD pair traffic as the observations, and the hidden parameters are the weights of the Gaussian components at each time.
8 Statistical Methods for the OD Traffic Matrix Problem
Various approaches can be used (reversible jump MCMC, BIC, etc.) to evaluate K. In order to keep the computational cost low, we used the BIC criterion. The principle is to try different values of K (from K = 1 to K = K_max) and to select the one which maximizes the BIC criterion:

BIC = 2 L(K) − v(K) log(T)    (8.22)

In Equation (8.22), L(K) = log p(x; θ_K) is the log-likelihood of the mixture with K components which best fits the OD pair (θ_K is given by the EM algorithm), and v(K) is the number of free parameters of the model; in the case of a mixture of Gaussians, v(K) = 3K − 1. T is the number of available measurements. The choice of K_max is somewhat arbitrary, but experiments in the field suggest that T^0.3 is sufficient, where T is the number of time periods. For example, the program MIXMOD, which is considered a reference in the field, uses this value. In the case of a mixture of Gaussians, the EM algorithm is very simple and extremely fast. The computational load that the EM algorithm adds to Tebaldi's method is about 10 percent of the total load, if one considers the overall cost of estimating K, the μ_j, the σ_j and the w_j(k).
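As an illustration, the BIC selection step can be sketched as follows (our own minimal code; the log-likelihoods would come from running EM once per candidate K, and K_max ≈ T^0.3 as suggested above):

```python
import math

def bic(loglik, k, t):
    """BIC = 2 L(K) - v(K) log(T), with v(K) = 3K - 1 free parameters for a
    K-component univariate Gaussian mixture (K-1 free weights, K means,
    K standard deviations)."""
    return 2.0 * loglik - (3 * k - 1) * math.log(t)

def select_k(logliks_by_k, t):
    """Pick the number of components K maximizing the BIC criterion.
    logliks_by_k maps each candidate K to the best log-likelihood
    found by EM for that K."""
    return max(logliks_by_k, key=lambda k: bic(logliks_by_k[k], k, t))
```

For T = 1440 five-minute periods, K_max = round(1440^0.3) = 9. With hypothetical log-likelihoods {1: −5200, 2: −4900, 3: −4890}, the small improvement from K = 2 to K = 3 does not pay for the three extra parameters, and K = 2 is selected.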
7.4 The global loop
In this section, we detail the mechanism of feedback between the two boxes of our algorithm. In the first box, only one time period is considered and we simulate the various OD pairs for that time period by MCMC methods. In the second box, only one OD pair is considered and a mixture model is fitted to the successive values taken by that OD pair. In this section, the full problem is dealt with, in the sense that we estimate the successive values of the various OD pairs jointly, but with a reasonable numerical complexity (1 hour of computation on a Pentium IV 2GHz for analyzing 5 days of traffic when the time period is 5 minutes).

7.4.1 Initialization of the global loop. We want our method to be as deterministic as possible, in the sense that we do not wish to fix anything arbitrarily. In particular, the prior distributions of the OD pairs during the first iteration of our global loop must not be arbitrary. In order to launch our iterative process (global loop), we would need either a first estimate of the traffic matrix or a prior distribution for the OD pairs. Unfortunately, these quantities are not available directly. The solution that we have found is to produce an initial estimate of the successive values of the traffic matrix by a very simple and non-informative method. This initial estimate is used to calibrate the first mixture model, and then the iterative process is launched. This first estimate of the traffic matrix should not rely on strong statistical assumptions on the OD pairs, since arbitrary assumptions may do more harm than good. For these reasons, we have considered two very simple methods: the gravity model (Roughan et al., 2002) and a method based on the second moments of the traffic flows (Van Zwet). These methods have been explained in Sections 4 and 3.2, respectively.

7.4.2 Smoothing the exchange of information between boxes. In the first iterations of the global loop, the estimates of the traffic matrix produced by the MCMC samplers, as well as the distributions produced by the EM algorithm, are not very reliable. These quantities are nevertheless exchanged between the two boxes. If the exchanges of information were too strong during the first iterations, the algorithm could converge to a fixed point that does not reflect the actual distribution of the OD counts, as this information is possibly unreliable. It is therefore convenient to "smooth" the information exchanged between the boxes during the first iterations. This is done as follows: (i) The prior distributions p(x_t^i) (OD pair i, time t) that are fed back into the first box are not reliable in the first iterations. To smooth these priors, we replace p(x_t^i) by [p(x_t^i)]^α, where α < 1.0. In practice, one only has to use Equation (8.23) instead of the original Metropolis within Gibbs conditional:
p(x_t^i | y, x_t^{j<i}) ∝ [p(x_t^i)]^α Π_l [b_l(x_t^i)]^α    (8.23)
where the scaling factor is α < 1.0. (ii) Similarly, the estimated OD pairs x_t^i produced by the Metropolis within Gibbs algorithm are not reliable in the first iterations. These estimates are nevertheless fed as inputs to the second box. It is therefore convenient to smooth this exchange as well. As the quantity x_t^i is used in the EM algorithm only through its likelihoods G_{μ_j,σ_j}(x_t^i) under the various Gaussian component distributions, smoothing is obtained by raising these likelihoods to the power α in the Expectation step of the EM algorithm:
G_{μ_j,σ_j}(x_t^i) ← [G_{μ_j,σ_j}(x_t^i)]^α    (8.24)
where α < 1.0.
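The effect of the exponent α can be seen directly on the E-step responsibilities: raising the component likelihoods to a power α < 1 flattens the posterior over components, so unreliable early estimates pull the mixture parameters less strongly. A minimal sketch (our code, with illustrative parameter values):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, sigmas, alpha=1.0):
    """E-step responsibilities with each component likelihood raised to
    the power alpha; alpha < 1 smooths (flattens) the posterior over
    components, damping the influence of unreliable OD estimates in the
    first iterations of the global loop."""
    dens = [w * gaussian_pdf(x, m, s) ** alpha
            for w, m, s in zip(weights, means, sigmas)]
    total = sum(dens)
    return [d / total for d in dens]
```

For a point at x = 0 with equally weighted components N(0, 1) and N(3, 1), the responsibility of the first component drops from about 0.99 at α = 1 to about 0.90 at α = 0.5, i.e., closer to the uninformative value 0.5.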
8. Simulation results
The algorithm that we propose has already been validated on simulated data (Vaton and Gravey, 2002, 2003). In the present paper, the algorithm is validated on real traffic data. The traffic measurements were performed on a single-router network for which direct measurements of the OD counts were made available through special software. They were used by Cao et al. as a real dataset to test the validity of their algorithm (Cao et al., 2000a,b), and they were made available to us by the authors. On this simple network, there are 30 Origin-Destination pairs and 10 independent "links". To check our simulations, only the Origin-Destination pairs that represent more than one percent of the total traffic were considered; the other OD pairs play the role of adjustment variables.

Figure 8.3. The global loop. The gravity model (iteration 1) and the link counts feed the Metropolis within Gibbs sampler (iteration > 1), which produces the traffic matrix estimate; for each OD pair i and each time t, the means, variances and weights of the component distributions are fed back.

We found by simulation that the exchange of information between the two boxes of Figure 8.3 improves the estimate of the traffic matrix along the successive iterations, as expected. This can be seen from Figure 8.4, which represents the correlation between the true OD pair and its estimated counterpart along the successive iterations of the global loop. We then compared the performance of our algorithm with that of other algorithms. Various algorithms were tested: the EM algorithm with a Poisson model of the OD flows (Vardi, 1996), a method based on the 1st and 2nd moments (Van Zwet) and the gravity model. Moreover, for the last two algorithms, we also tested the performance when the estimator is projected onto the space {x_t; y_t = A x_t}. We compared the estimated OD volumes x̂_t^i with the OD volumes x_t^i measured directly on the router by special software. The Root Mean Square Error

RMSE(i) = [ (1/T) Σ_{t=1..T} (x̂_t^i − x_t^i)^2 ]^{1/2}

was computed for
Figure 8.4. x-axis: iteration number; y-axis: correlation between true and estimated OD pairs.
each OD pair i and each algorithm. The results are given in Table 8.2. In this table, the OD flows are sorted in order of decreasing average volume, so that the first lines correspond to the largest flows. The RMSE values are expressed as a proportion of the average volume of each OD pair. Thus, one can read in line 1 that the RMSE is 15.2% for OD pair number 1. For the same OD pair, the RMSE is respectively equal to 15.4%, 21%, 57.3%, 33.5% and 246.8% for the method based on the 1st and 2nd order moments (with a projection), the gravity method (with a projection), the gravity method (without a projection), the method based on the 1st and 2nd order moments (without a projection) and the EM algorithm with the Poisson assumption. The results for the other OD pairs can be read in Table 8.2. An average performance over the whole set of OD pairs is given in the last line of Table 8.2; it was calculated by giving the performance on OD pair i a weight proportional to the average volume of this pair. As can be read in the last line of Table 8.2, the average RMSE of our method is 24.7%, whereas it is equal to 32.5% for its main "competitor" (the method based on the 1st and 2nd order moments with a projection). Thus, for
Table 8.2. RMSE of the estimated OD volumes, expressed as a proportion of the average volume of each OD pair (one line per OD pair, sorted by decreasing average volume; the last line gives the volume-weighted average). Columns: throughput (average and peak, kbits/sec), Vardi, Vardi + init., 2nd moment, Gravity, Gravity + proj., 2nd moment + proj., Vaton et al.
this dataset, our method improves the RMSE by 8% with respect to its main competitor. The other methods have an average RMSE of respectively 37.9%, 66.3%, 68.4% and 189% for the gravity method with a projection, the gravity method without a projection, the method based on the first and second moments without a projection, and the EM algorithm with the Poisson model. It is remarkable how considerably the projection improves the estimators, as noticed in Roughan et al. (2002). In Figure 8.5, the variations of one OD pair over time are displayed. The variations are displayed over 5 consecutive days with a time granularity of 5 minutes, that is to say that each dot represents the total volume of traffic for this OD pair during a period of 5 minutes. This makes an overall time series of 1440 successive values for that OD pair. The "true" values of the OD volume (obtained by measuring the OD volumes directly with special software) are represented as squares, whereas the values that we obtain with our algorithm are represented as stars. As one can see from this figure, the deviation of the estimated values from the true values is quite small, which is consistent with the low RMSE (15.2% for this OD pair).

Figure 8.5. OD pair number 1. 5 days of traffic with a granularity of 5 minutes. Comparison of the measured values of the OD volume with the values produced by our estimation algorithm.

A reader experienced in time series analysis will see from this figure that a hidden Markov model or a mixture model seems quite relevant for that kind of traffic. It is therefore not surprising that our estimate is better than the others for this dataset, since our algorithm explicitly takes a mixture model into account for each Origin-Destination pair. The same results are displayed in Figure 8.6 for another OD pair, namely OD pair number 5. Once again, for this OD pair, the estimation is quite accurate. Similar plots could be provided for the other OD pairs of that network.
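The error metrics used in the comparison above can be sketched as follows (our own helper code, not the authors'):

```python
import math

def rmse(estimates, truth):
    """RMSE(i) = sqrt((1/T) * sum_t (xhat_t^i - x_t^i)^2)."""
    t = len(truth)
    return math.sqrt(sum((e - x) ** 2 for e, x in zip(estimates, truth)) / t)

def weighted_average_rmse(rel_rmses, avg_volumes):
    """Average the per-OD-pair relative RMSEs, weighting each pair by
    its average volume, so that the large flows count more (as in the
    last line of Table 8.2)."""
    total = sum(avg_volumes)
    return sum(r * v for r, v in zip(rel_rmses, avg_volumes)) / total
```

In the table, the per-pair RMSE would first be divided by the pair's average volume to obtain the relative values quoted in the text.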
9. Conclusion
Markov Chain Monte Carlo methods are flexible and reliable techniques that can be used to estimate the traffic matrix of a network. One major drawback of these methods is that they require reliable knowledge of the prior distributions of the OD pairs and that, most of the time, reliable prior distributions are not available. In this paper, we have proposed a method for training these prior distributions from the only available data, that is to say, the link counts themselves. Furthermore, we improve our priors in a loop by taking both time and OD pair dependencies into account. All the steps of our method have been explained in detail and every arbitrary choice has been highlighted. We have presented our scientific path through the methods of estimation of the traffic matrix, and all the initial values and adjustments of the algorithms used have been given. As a result, one should easily be able to reproduce our results with one's own program by following our instructions. We have used our method on real traffic data from a simple network carrying very bursty traffic. The results are presented in this paper and compared to other methods of traffic matrix estimation. This confirms the positive results that we had obtained on simulated data in previous papers (Vaton and Gravey, 2002, 2003).

Figure 8.6. OD pair number 5. 5 days of traffic with a granularity of 5 minutes. Comparison of the measured values of the OD volume with the values produced by our estimation algorithm.

However, the field of investigation remains large. Indeed, we should go further in exploring the bounds on the precision one can reach in the problem of traffic matrix estimation. This could lead us to identify the information which would help most to learn more about the traffics of interest. As a consequence, operators could take full benefit of new technologies like BGP accounting, which allow more measurements on finer-grained aggregates than router-to-router links. Another interesting subject is the routing matrix. Our method, like many others, assumes that the routing matrix is constant during the whole experiment. But in real backbone networks, such stability is not always guaranteed. That is why we should try to adapt our approach to a more general framework, that is to say, one with a changing routing matrix.
In addition, we should explore new methods for moving faster and better than Metropolis within Gibbs through the space of traffic matrices which satisfy the routing constraints given the link counts. Indeed, a really important improvement of our algorithm would be a decrease in its complexity, since it is for now infeasible for big backbone networks with many nodes.
References

Cao, J., Davis, D., Vander Wiel, S., and Yu, B. (2000a). Time-Varying Network Tomography: Router Link Data. Journal of the American Statistical Association, 95(452).
Cao, J., Vander Wiel, S., Yu, B., and Zhu, Z. (2000b). A Scalable Method for Estimating Network Traffic Matrices from Link Counts. Technical report, Bell Labs.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B, 39:1-38.
Doucet, A., De Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer.
Duffield, N., Zhang, Y., Roughan, M., and Greenberg, A. (2003). Fast accurate computation of large-scale IP traffic matrices from link loads. In: ACM SIGMETRICS, San Diego.
Golub, G.H. and Van Loan, C.F. (1996). Matrix Computations, Third Edition. Johns Hopkins Series in Mathematical Sciences.
McLachlan, G.J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics.
Medina, A., Taft, N., Bhattacharyya, S., Diot, C., and Salamatian, K. (2002). Traffic matrix estimation: Existing techniques compared and new directions. In: ACM SIGCOMM, Pittsburgh.
Paxson, V. and Floyd, S. (1995). Wide-Area Traffic: The Failure of Poisson Modeling. IEEE/ACM Transactions on Networking.
Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2).
Roughan, M., Greenberg, A., Kalmanek, C., Rumsewicz, M., Yates, J., and Zhang, Y. (2002). Experience in measuring backbone traffic variability: Models, metrics, measurements and meaning. In: Internet Measurement Workshop, Marseille.
Soule, A., Nucci, A., Cruz, R., Leonardi, E., and Taft, N. (2004). How to identify and estimate the largest traffic matrix elements in a dynamic environment. In: ACM SIGMETRICS.
Tebaldi, C. and West, M. (1998). Bayesian Inference on Network Traffic Using Link Count Data. Journal of the American Statistical Association, 93(442).
Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons.
Van Zwet, E. Method of Moments Estimation for Origin-Destination Traffic on a Network. Department of Statistics, University of California at Berkeley.
Vardi, Y. (1996). Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data. Journal of the American Statistical Association, 91(433).
Vaton, S. and Gravey, A. (2002). Iterative Bayesian analysis of network traffic matrices in the case of bursty flows. In: Internet Measurement Workshop, Marseille.
Vaton, S. and Gravey, A. (2003). Network tomography: An iterative Bayesian analysis. In: 18th International Teletraffic Congress (ITC 18), Berlin.
Zhang, Y., Roughan, M., Lund, C., and Donoho, D. (2003). An Information-Theoretic Approach to Traffic Matrix Estimation. In: ACM SIGCOMM.
Chapter 9

ENERGY AND COST OPTIMIZATIONS IN WIRELESS SENSOR NETWORKS: A SURVEY

Vivek Mhatre
Catherine Rosenberg
Abstract
We present a survey of some of the recent work on energy and cost optimizations in wireless sensor networks. Sensor nodes are characterized by a severe energy budget due to limited battery life. We focus on two main problem areas, namely routing and design. In sensor networks in which the nodes use multi-hop communication, routing is a major issue. The routing problem in the context of sensor networks retains some of the features of the routing problem in ad-hoc networks, but also has some specific characteristics, in particular with respect to data aggregation, addressing, and the many-to-one paradigm (each sensor node wanting to send its collected data to a single base station). We first discuss the work done on energy efficient routing and the corresponding optimization problems for maximizing the lifetime of the network. We then discuss some of the optimization problems in the design and dimensioning of sensor networks. Since most potential applications envisioned for sensor networks require high node density, node heterogeneity and hierarchical clustering can be used for better scalability of the protocols. We discuss the results obtained on energy and cost minimization problems in the context of such clustered sensor networks.
1. Introduction
In the past few years, the field of wireless sensor networks has become a key area of research. Sensor networks find applications in several military as well as civilian domains. Sensor networks, together with the widespread Internet, enable a user to remotely monitor a phenomenon of interest (see Figure 9.1). See Akyildiz et al. (2002) for a detailed description of potential sensor network applications. Due to the ad-hoc nature of
Figure 9.1. A typical sensor network topology (a user connects through the Internet to the sensor field).
sensor networks and severe battery energy limitations, energy efficient protocols are required at all layers of the protocol stack. A nice overview of recent results on sensor-network-specific optimizations at the different layers of the protocol stack can be found in Akyildiz et al. (2002). In this paper, however, we survey in greater detail two important problems in sensor networks, namely energy efficient routing and cost efficient network design. The field of wireless sensor networks is receiving a lot of attention and is evolving very fast; it is difficult to provide a comprehensive survey of a field which is not fully mature yet. Hence this paper should be seen more as a snapshot of the state of the art for the above two issues. Since a sensor network is deployed with the objective of gathering information, for a given initial battery energy it is desired that the network continue to function and provide data updates for as long as possible. This is referred to as the maximum lifetime problem in sensor networks. During each data gathering phase, nodes spend a part of their battery energy on transmitting, receiving and relaying packets. Hence the routing algorithm should be designed to maximize the time until the first battery expires, or until a fraction of the nodes have exhausted their batteries. In certain low bandwidth sensor networks, besides the battery energy, the channel bandwidth presents itself as another constraint, and the routing problem has to take this into account. While it is easy to show that such an energy efficient routing problem reduces to a linear programming problem, the real challenge lies in devising lightweight and efficient distributed algorithms for solving it. The problem of cost efficient network design is mainly a problem in the context of clustered networks. In such networks, nodes are organized
into clusters with a single cluster head node per cluster. The sensor nodes send their measured data to their closest cluster head node. The cluster head nodes aggregate the received data, and then send it to the base station. The cluster head nodes could either be identical to the sensor nodes (a homogeneous network), or they could be equipped with better hardware and more battery energy than the sensor nodes (a heterogeneous network). In either case, a cost function can be associated with the hardware and the battery cost of each node. From a network designer's perspective, the issue is designing the network in such a way that the overall cost of the network is minimized while guaranteeing the desired network lifetime. This chapter is organized as follows. We first provide a brief overview of wireless sensor networks and some of their salient features in Section 2. In Section 3, we present a survey of some of the important papers on routing optimizations in sensor networks. Section 4 contains a survey of the work on design optimizations in sensor networks. Finally, we conclude in Section 5.
2. Wireless sensor networks: A brief overview
The purpose of deploying a sensor network is to monitor an area for an event of interest. The advent of affordable wireless technology has led to the vision of empowering small monitoring devices with a wireless network interface that can be used to communicate with other nodes. We discuss some of the salient features of wireless sensor networks in this section. One of the most important of these features is that the application for which the network is to be used has a big impact on the design and dimensioning of the network. This is unlike the current Internet, where the application has to be designed to work well over the given network: the Internet delivers packets using a best effort service, and so the applications cannot be given any bandwidth or delay guarantees (unless some sophisticated tools such as MPLS are used). Thus the designers of new applications have to work within the framework of the current Internet. For sensor network applications, however, it is possible to design and dimension a network in such a way that it caters to the specific requirements of the application. The range of sensor network applications is nevertheless vast. At one end of the spectrum there are applications that require periodic data updates from the network, e.g., temperature monitoring and control in buildings. At the other end of the spectrum are applications in which the network is idle for long periods of time, but bursts into activity as soon as the event of interest
occurs, e.g., forest fire detection. In the former case, the traffic is more or less uniform, and there is scope for in-network aggregation of data, while in the latter case the traffic is bursty and delay-sensitive, and there is no scope for in-network data aggregation. From a designer's viewpoint, the issues involved in designing and dimensioning these two types of networks are altogether different. Hence we classify sensor networks into the following two main categories: data gathering sensor networks and event detection sensor networks. Others have classified sensor network applications in a more granular way (see Tilak et al., 2002), but for the purpose of this survey this classification suffices. In data gathering sensor networks, the nodes send their measurements periodically to the base station, while in event detection sensor networks the nodes remain idle most of the time and spring to activity only when the phenomenon of interest occurs. Most of the work that we describe in this survey is about data gathering sensor networks, because routing is an important problem in such networks. In event detection sensor networks, on the other hand, MAC and sleep-wake synchronization are the key issues. The base station could either be located remotely, outside the region of interest, or it could be located within the region of interest. In the former case, either all or a few nodes have to perform long range transmission of data to the remote base station. In the latter case, nodes can use either multi-hop communication or direct transmission to reach the base station. The location of the base station is application dependent. For example, in the context of remote surveillance of a battlefield, the base station is located far away from the region of interest, while in the context of temperature monitoring and control in buildings, the base station is located in the region of interest.
Sensor networks are also characterized by a many-to-one communication paradigm, i.e., most of the nodes in the network send their data to a few sink nodes. This is unlike the ad-hoc network communication paradigm, where each node may wish to communicate with any other node in the network. The sensor networks that are to be deployed for environmental monitoring and surveillance are expected to be deployed over rough terrain, and are likely to have a high failure rate due to cheap hardware. Node failure, high node density and ad-hoc deployment are some other salient features of most sensor networks. When nodes are deployed randomly, there is likely to be correlation between the measurements of nearby nodes, and thus there is scope for data aggregation in the network. Data aggregation helps eliminate redundancy and reduces the amount of data that needs to be sent to the sink or the base station.
All these features need to be taken into account when studying routing and network design in wireless sensor networks.
3. Energy optimizations in routing and related problems
In this section we look at some of the important results on the problem of energy efficient routing in wireless sensor networks. Since sensor nodes are highly energy constrained, it is essential to choose the most energy efficient routes for transferring data from the source nodes to the sink nodes. With reference to the application classes discussed in Section 2, this problem is more relevant to data gathering sensor networks than to event detection sensor networks. A seminal work in this context was presented in Chang and Tassiulas (1999, 2000a), and was later extended in Zussman and Segall (2003). The optimization techniques used in these works rely on some well-known results on network flows from Ahuja et al. (1993). While the work in Bhardwaj and Chandrakasan (2002) is not strictly related to energy efficient routing, we discuss it in this section because it uses the same network flow optimization tools as the above-mentioned works to determine an upper bound on the lifetime of a sensor network. In the following subsections we present an overview of these papers and of some other related work on energy optimizations in routing.
3.1 Routing for maximum system lifetime by Chang and Tassiulas
In Chang and Tassiulas (1999), the authors consider the problem of choosing routes between a set of source nodes and a set of sink nodes of an ad-hoc network so that the time until the first battery expires is maximized. The authors note that choosing a route that results in minimum total energy expenditure, as in Baker and Ephremides (1981); Ephremides et al. (1987); Ettus (1998); Gallager et al. (1979); Meng and Rodoplu (1998); Rodoplu and Meng (1998); Shepard (1995); Singh et al. (1998), is not always desirable, because some of the nodes may carry an excessive relaying burden and may hence expire too soon. This in turn could lead to loss of connectivity. To overcome this problem, the authors suggest that the routes should be chosen with the ultimate objective of maximizing the time until the first battery expires. For achieving this objective, the minimum energy paths are not necessarily the best choices. Let E_i (in Joules) be the initial battery energy of node i, and let the node generate information at a rate of Q_i bits per second. Let S_i denote the set of nodes that can be reached by node i; if j ∈ S_i, let e_ij denote the energy required to transmit a packet from node i to node j. Let q_ij be the rate at which information flows from node i to node j along link (i, j). Thus the original network topology (set N) can be thought of as a flow network with a set of source nodes (set S) and a set of sink nodes (set D) connected by a set of intermediate or relay nodes. Such flow networks have been the focus of many studies when the objective is to maximize the overall flow from the sources to the sink nodes (see Cormen et al., 2001; Ahuja et al., 1993 for more details). The flow conservation requirement that needs to be satisfied at each node i is as follows:

Q_i + Σ_{j: i ∈ S_j} q_ji = Σ_{k ∈ S_i} q_ik,   ∀ i ∈ N − {D}    (9.1)

Σ_{j: i ∈ S_j} q_ji = Σ_{k ∈ S_i} q_ik,   for pure relay nodes i (Q_i = 0)    (9.2)
where the first constraint says that, for a node that is not a destination node, the sum of the rate at which information is received by the node and the rate at which information is generated by the node should be equal to the rate at which information is transmitted by the node. The second constraint is the special case of the first one applied to nodes that are pure relays; for such nodes, the information generation rate Q_i is zero. The objective function to be maximized subject to the above constraints is the system lifetime. Equivalently, we must determine the network flow components on all the links, i.e., q = {q_ij}, which maximize the following, subject to (9.1) and (9.2):

T = T_sys(q) = min_{i ∈ N − D} E_i / ( Σ_{j ∈ S_i} e_ij q_ij )    (9.3)
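Equation (9.3) is easy to evaluate for a given flow; here is a small sketch with our own variable names (node indices and units are illustrative):

```python
def system_lifetime(energies, tx_energy, flows):
    """T_sys(q) = min_i E_i / sum_{j in S_i} e_ij * q_ij  (Equation 9.3).
    energies[i]       : initial battery energy of node i (Joules)
    tx_energy[i][j]   : energy e_ij to send one unit of information i -> j
    flows[i][j]       : information rate q_ij on link (i, j)
    The min in (9.3) is over non-destination nodes; nodes with no
    outgoing flow (e.g., the sinks) drain nothing and are skipped."""
    lifetimes = []
    for i, e_i in enumerate(energies):
        drain = sum(tx_energy[i][j] * flows[i][j] for j in range(len(flows[i])))
        if drain > 0:
            lifetimes.append(e_i / drain)
    return min(lifetimes)
```

For two nodes with energies 100 J and 60 J, per-unit transmission energies e_01 = 2 and e_10 = 1, and flows q_01 = 5 and q_10 = 3, the nodes drain at 10 J/s and 3 J/s respectively, so T_sys = min(100/10, 60/3) = 10.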
The above problem can be re-formulated as the following optimization problem. Maximize T subject to:

q̂_ij ≥ 0,   ∀ j ∈ S_i, ∀ i ∈ N − D
Σ_{j ∈ S_i} e_ij q̂_ij ≤ E_i,   ∀ i ∈ N − D        (9.4)
T Q_i + Σ_{j: i ∈ S_j} q̂_ji = Σ_{k ∈ S_i} q̂_ik,   ∀ i ∈ N − D

where q̂_ij = T q_ij is the amount of information transferred from node i to node j along link (i, j) during time T. The above problem needs
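For intuition on why the optimum balances node lifetimes, here is a brute-force sketch (ours, not the authors' LP formulation or distributed algorithms) on a toy topology: a single source s with rate Q that can reach the sink d directly or through one relay r, searching over the fraction f of traffic routed via the relay:

```python
def toy_max_lifetime(e_src, e_relay, e_sd, e_sr, e_rd, q=1.0, steps=10000):
    """Grid search over the fraction f of the source's traffic routed
    through the relay; the rest goes directly to the sink.  Returns
    (best_fraction, best_lifetime), where the lifetime is
    min(e_src / ((e_sd*(1-f) + e_sr*f) * q),  e_relay / (e_rd*f*q))."""
    best_f, best_t = 0.0, 0.0
    for i in range(steps + 1):
        f = i / steps
        drain_src = (e_sd * (1 - f) + e_sr * f) * q     # source battery drain rate
        drain_relay = e_rd * f * q                       # relay battery drain rate
        t = min(e_src / drain_src,
                e_relay / drain_relay if drain_relay > 0 else float("inf"))
        if t > best_t:
            best_f, best_t = f, t
    return best_f, best_t
```

With E_s = E_r = 100, e_sd = 4, e_sr = 1 and e_rd = 2, the search returns f ≈ 0.8, at which both nodes drain at the same rate and the lifetime 62.5 is maximal; routing everything directly would give only 25.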
to be solved to determine {q̂_ij}, i.e., the flow components along each link, that maximize the system lifetime T. While this is a simple linear programming problem in the q̂_ij, the real challenge lies in designing a distributed algorithm to solve such a lifetime maximization routing problem, because using a centralized protocol to make the routing decisions is not a scalable approach, especially in large sensor networks. In addition, the control packet overhead of such an algorithm should be low, in order to make judicious use of the scarce battery resources of the nodes. An identical energy efficient lifetime maximization problem has been studied in Kalpakis et al. (2002); however, the authors of that work do not develop any distributed algorithms to solve it. In Chang and Tassiulas (1999), the authors propose two heuristic distributed algorithms to solve this problem. The first of the two algorithms is called the flow redirection algorithm. It makes use of the fact that a necessary condition for lifetime maximization is that, if the minimum lifetime over all the nodes is maximized, then the minimum node lifetime along each path from a source to the destination has the same value as along the other paths (see Theorem 1 in Chang and Tassiulas, 1999). Each path originating from a node is associated with the smallest lifetime of a node along that path, computed from the {q_ij} of the current iteration. Thus the lifetime of a path is the lifetime of the shortest-living node along that path, because when the shortest-living node along the path expires, the path breaks down. The intuition behind the above theorem is that, if the minimum node lifetime along two paths differs, then we can increase the lifetime of the shorter-lived node by redirecting some of its traffic to the other path.
Using this necessary condition, the authors propose a heuristic distributed algorithm that iteratively uses flow redirection along routes (by adjusting {qij}) to maximize the minimum node lifetime. During each iteration, each node i compares routes based on the current lifetime of the shortest living node along those routes. During the next iteration, node i redirects a part of the flow from a shorter living route to a longer living route by changing qij over its outgoing links. Thus the nodes attempt to "balance" the routing load over all the routes. In the second algorithm, the authors use the heuristic of preferring routes with higher residual energy. A distributed Bellman-Ford algorithm is used with the reciprocal of the residual energy as the routing metric. This algorithm performs better load balancing than the flow redirection algorithm because it takes into account the current status of the node energies by looking at the current residual energy of the nodes.
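The flow redirection idea can be made concrete with a toy sketch. Everything below (the two-path topology, the unit traffic rates, and all energy numbers) is our own illustrative assumption, not taken from Chang and Tassiulas (1999): node 1 can send its traffic directly to the sink over an expensive link, or redirect a fraction of it through node 2 over a cheap link; balancing the lifetimes of the two paths maximizes the minimum node lifetime.

```python
# Toy two-path example (assumed numbers): node 1 generates unit-rate traffic and
# can send it directly to the sink at energy cost 4 per unit, or through node 2
# at cost 1 per unit; node 2 also generates its own unit-rate traffic and spends
# 1 unit of energy per unit of traffic it transmits. alpha is the fraction of
# node 1's traffic redirected through node 2.

E = 100.0  # initial battery energy of each node (assumed)

def min_lifetime(alpha):
    p1 = 4.0 * (1 - alpha) + 1.0 * alpha  # node 1 power: direct + redirected part
    p2 = 1.0 * (1 + alpha)                # node 2 power: own traffic + relayed part
    return min(E / p1, E / p2)            # system lifetime = first node to expire

# Redirecting traffic away from the short-lived direct path raises the minimum
# lifetime until the lifetimes of the two paths are balanced (here at alpha = 3/4).
best = max((a / 100 for a in range(101)), key=min_lifetime)
print(best, min_lifetime(best))
```

At alpha = 3/4 both nodes drain at the same rate and the lifetime reaches 400/7 ≈ 57.1 cycles, illustrating the necessary condition quoted above: at the optimum, the minimum node lifetime along both paths is equal.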
NEXT GENERATION INTERNET
One of the limitations of this work is that since the algorithms are based on heuristics, they may not always converge to the global optimum. Another important point to note is that the work deals with pure routing and does not take into account the possibility of data aggregation at the intermediate nodes, a characteristic feature of several data gathering sensor networks. In Chang and Tassiulas (2000b), the authors have extended this work to obtain an approximate solution for this routing problem. The work in Chang and Tassiulas (1999) has also been extended to a multi-commodity flow problem by the same authors in Chang and Tassiulas (2000a). In this case, instead of a single commodity flowing from a set of source nodes to a set of destination nodes, multiple commodities are involved.
3.2
Energy efficient routing by Zussman and Segall
In Zussman and Segall (2003), the authors have formulated a lifetime maximization problem identical to that in Chang and Tassiulas (1999). However, the authors consider one more constraint: that of limited bandwidth. The authors study the problem of routing for maximum lifetime when the nodes have limited bandwidth in addition to limited battery energy. This is particularly true for disaster recovery ad-hoc networks consisting of smart badges, which are expected to have a bandwidth of a few kilobits per second. Other than the bandwidth constraint, the rest of the flow conservation constraints are identical to Chang and Tassiulas (1999). The authors in Michail and Ephremides (2000) have considered a similar routing problem in the context of connection-oriented networks. However, in that work, the authors have developed heuristic algorithms instead of optimal algorithms. The authors formulate a lifetime maximization problem identical to (9.4) along with the following capacity constraint:

Σk∈N q̄ki + Σj∈N q̄ij ≤ T,   ∀ i ∈ N   (9.5)
which means that the total flow through a node cannot exceed the maximum node capacity (normalized to 1). This is a linear programming problem in {q̄ij}, but as in Chang and Tassiulas (1999), the challenge lies in designing a distributed algorithm to solve the problem. Unlike Chang and Tassiulas (1999), where the authors use heuristics, the authors in Zussman and Segall (2003) provide optimal algorithms along with their distributed implementations. The authors however make a simplifying assumption about the communication model. They assume that the nodes do not use power control when communicating with their neighboring nodes, i.e., each node i uses a fixed power level ei when communicating with its neighbors. With this model, eij in Chang and Tassiulas (1999) is replaced by ei.¹ With this assumption, the authors then break down the lifetime maximization problem into two loops in the algorithm. In the inner loop, the authors consider the original maximization problem in (9.4) without taking into account (9.5). For a given T, a max flow algorithm can be used to determine if there exists a feasible flow, i.e., {q̄ij} which satisfies all the constraints. This is possible because of the assumption eij = ei. Any standard distributed implementation of a max flow algorithm (e.g., the preflow-push algorithm, Cormen et al., 2001) can be used for determining the feasibility of a given T. The outer loop of the algorithm begins by checking the feasibility of T = Tmax in the first iteration. If T = Tmax is not feasible, T = Tmax/2 is checked in the next run of the outer loop. Similarly, on every run of the outer loop the algorithm uses binary search to further refine the subsequent values of T. Use of binary search ensures O(log Tmax) iterations for determining the optimal T. Here Tmax represents the maximum possible value of the network lifetime, which is upper bounded by n times the lifetime of a single battery, where n is the total number of nodes in the network. In the above problem formulation, the authors associate a fixed amount of energy ei with node i for packet transmission. Since, other than the source nodes, all the other nodes act as relay nodes, instead of accounting for transmission energy and reception energy separately for each packet, the authors absorb the energy spent on receiving a packet into the transmission energy ei. Thus ei actually represents the energy spent on relaying a packet.
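The outer-loop/inner-loop structure can be sketched as follows. This is a centralized, simplified stand-in: the topology, the per-node packet budgets, and the feasibility model (a max flow check against energy budgets Ei/ei) are our own illustrative assumptions, and the distributed preflow-push machinery of the actual algorithm is replaced by a plain Edmonds-Karp max flow.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on a residual-capacity map cap[u][v]."""
    flow = 0.0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:              # BFS for an augmenting path
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-9 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)    # bottleneck capacity
        for u, v in path:                         # update the residual graph
            cap[u][v] -= push
            cap[v][u] += push
        flow += push

# Toy network (assumed): nodes 1 and 2 each generate T packets over lifetime T;
# node 2 can only reach the sink through node 1; each battery allows a node to
# transmit at most E = 100 packets in total (e_i = 1 per packet).
E = {1: 100.0, 2: 100.0}

def feasible(T):
    cap = defaultdict(lambda: defaultdict(float))
    for i in E:                                   # node splitting: energy budget
        cap[(i, "in")][(i, "out")] = E[i]
        cap["S"][(i, "in")] = T                   # source i must ship T packets
    cap[(1, "out")]["t"] = float("inf")           # link: node 1 -> sink
    cap[(2, "out")][(1, "in")] = float("inf")     # link: node 2 -> node 1
    return max_flow(cap, "S", "t") >= 2 * T - 1e-6

lo, hi = 0.0, sum(E.values())                     # Tmax: total packet budget
for _ in range(50):                               # binary search on T
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
print(round(lo, 3))  # node 1 must relay everything, so the search settles at 50.0
```

The binary search performs O(log Tmax) feasibility checks, each one a single max flow computation, mirroring the two-loop structure described above.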
However, one of the most important limitations of the transceivers used in sensor nodes is their idle mode energy consumption. Transceivers spend a considerable amount of energy when their radio is in idle mode, i.e., neither transmitting nor receiving, and sometimes this energy is as high as the energy spent on transmission or reception (see Shih et al., 2001). As a result, when the transmissions and receptions are not perfectly synchronized, the nodes continue to spend energy on idle listening. This is especially true for a multi-hop network, where a relay node does not know beforehand when it is going to receive the next packet. The simplistic model of associating a fixed amount of energy with each packet transmission without accounting for idle mode energy is an idealistic scenario. While it is true that taking into account the underlying MAC protocol makes the analysis difficult to handle, the fact that the MAC layer has a large impact on the energy expenditure of a sensor node cannot be ignored. A natural extension of Zussman and Segall (2003) would be to formulate an identical lifetime maximization problem by accounting for idle mode energy expenditure.

¹If this assumption is made for the problem in Chang and Tassiulas (1999), the necessary condition in Theorem 1 also becomes a sufficient condition, and then the algorithms proposed in Chang and Tassiulas (1999) converge to the global optimum.
3.3
Bounding the lifetime of a sensor network by Bhardwaj and Chandrakasan, a related problem
The authors in Bhardwaj and Chandrakasan (2002) study the problem of obtaining bounds on the lifetime of a sensor network. The authors use network flow tools similar to those in Zussman and Segall (2003). However, they also take into account the possibility of data aggregation at some of the nodes. With data aggregation, the flow conservation constraints have to be modified at the aggregating nodes. However, most of the other constraints are identical to (9.4). By formulating a lifetime maximization problem as in (9.4), we obtain a linear programming problem that can be solved for a given network topology. The solution of this problem, i.e., the optimum {q̄ij}, provides an upper bound on the lifetime of the network. However, as the authors themselves state, it is difficult to design a distributed routing protocol that achieves these bounds. As in Subsection 3.2, this work assumes perfect transmitter-receiver synchronization in the energy analysis. This work can be extended in two directions: a distributed routing protocol that achieves the flow rates corresponding to the optimum solution {q̄ij} can be developed, and the problem formulation can be modified to take into account the underlying MAC.
3.4
Other related work
The problem of network lifetime maximization has been addressed in several other works which are not related to routing, but which use network flow tools. In Srinivasan et al. (2002), the authors formulate an optimization problem by associating a utility function with every source node. The utility function is an increasing and concave function of the flow rate out of the source node. The objective is to determine {qij} that maximizes the sum of the utilities of all the sources while ensuring a certain minimum lifetime. There is also an upper bound on the allowable source rates. The authors use a penalty function based approach for the system utility maximization, and also propose a distributed algorithm called Optimal Rate Splitting and Allocation (ORSA) that can be implemented at the source nodes to determine the optimum source rates. In Shah and Rabaey (2002), the authors address the problem of lifetime maximization by picking the next hop nodes in a probabilistic fashion. This is a heuristic algorithm, and the probability of choosing a neighbor node as the next hop node is proportional to the inverse of the cost of the link to that node. The cost of a link equals the energy spent on transmitting a packet on that link. This form of randomness in choosing the next hop node ensures some level of load balancing, which is better than always choosing the minimum energy route, because the latter results in quick depletion of the energy resources along the minimum energy route.
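The cost-inverse probabilistic next-hop choice can be sketched as follows; this is a minimal illustration in which the neighbor set and link costs are our own assumptions, not the protocol's exact cost definition.

```python
import random

def pick_next_hop(neighbors, rng=random):
    """Choose a next hop with probability proportional to 1/cost.

    neighbors: dict mapping neighbor id -> cost of the link to that neighbor
    (here, the energy spent transmitting a packet on that link).
    """
    weights = {n: 1.0 / c for n, c in neighbors.items()}
    r = rng.random() * sum(weights.values())
    for n, w in weights.items():
        r -= w
        if r <= 0:
            return n
    return n  # guard against floating point round-off

random.seed(1)
counts = {"a": 0, "b": 0}
for _ in range(30000):
    counts[pick_next_hop({"a": 1.0, "b": 2.0})] += 1
print(counts)  # "a" (half the cost) is chosen about twice as often as "b"
```

The cheaper link attracts roughly 2/3 of the traffic, but the costlier link still carries the rest, spreading the energy drain instead of exhausting the minimum-energy route first.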
4.
Cost optimizations in network design
Wireless sensor networks are characterized by their high node density and the possibility of data aggregation. Since measurements from neighboring nodes are expected to be correlated, it is possible to perform in-network aggregation of the measured data so as to reduce the amount of data that needs to be sent to the base station. In addition, the high node density requires hierarchical management of the network for better scalability of protocols. Organizing nodes into clusters is one way to achieve these objectives. Several alternatives are available when designing clustered sensor networks. For example, the network could consist of multiple types of nodes, such that the cluster head responsibilities are assigned to one type of sophisticated node, while the rest of the simpler nodes perform sensing. Within each cluster, the nodes could use single hop or multi-hop communication to reach the cluster head nodes. The radius of communication for multi-hopping is another parameter at the designer's disposal. These and other structural characteristics of sensor networks need to be taken into account when designing such clustered sensor networks. In this section, we survey some of the work done on the energy and cost efficient design of wireless sensor networks. With reference to the application classes discussed in Section 2, this kind of network falls under the category of data gathering networks. For such networks, it is possible to model the data gathering process as a set of discrete cycles. During each cycle, the nodes send their measured data to the cluster head nodes, which perform some data aggregation and then send the aggregated data to the base station.
4.1
Design optimizations in homogeneous sensor networks
We begin by looking at the design of homogeneous sensor networks. In a homogeneous sensor network, all the nodes are identical in terms of their hardware and battery energy. 4.1.1 A single hop homogeneous clustered network, LEACH by Heinzelman et al. In Heinzelman et al. (2002), a distributed data gathering protocol called LEACH (Low Energy Adaptive Clustering Hierarchy) is proposed for a sensor network in which a fixed number of homogeneous nodes are distributed randomly over a region. There is a remote base station located outside the region. Nodes are organized into clusters, and the cluster head nodes are chosen from among the sensor nodes. During each data gathering phase, the nodes send their measured data to the closest cluster head node through a direct transmission. The cluster head node aggregates the received packets into a single packet, and transmits it to the remote base station. Since the cluster head nodes carry the burden of long range transmissions to the base station, they are likely to drain their batteries before the other nodes. Hence, in order to ensure some form of load balancing, the role of cluster head is rotated randomly and periodically over all the nodes in the network. Since the nodes are homogeneous, all the nodes have the hardware required for performing long range transmissions to the remote base station, and for performing data aggregation computations. The question that the authors address under these settings is: what is the optimum number of cluster head nodes required to minimize the average energy expenditure of each node during a single data gathering cycle? For this, the authors obtain an expression for the energy spent in the entire network during each data gathering cycle, and then minimize it with respect to the number of cluster heads.
Because of cluster head rotation, there is a more or less uniform drainage of energy over the entire network, and hence the authors seek to minimize the network wide energy expenditure. Note that this effectively means minimizing the required battery energy of each node for a given system lifetime. The larger the number of cluster heads, the smaller the distance over which the nodes have to transmit to reach the cluster head nodes, but the larger the number of energy intensive long range transmissions to the remote base station. Hence there is an inherent trade-off, which means that there is an optimum number of cluster head nodes. The optimum number of cluster heads is obtained by differentiating the average
network-wide energy expenditure with respect to the number of cluster heads, and equating the resulting expression to zero. One of the most important characteristics of LEACH is node homogeneity. In order to use cluster head rotation, it is necessary that every node be equipped with complex hardware for long range communication with the remote base station. This results in an increased hardware cost for the overall network. Thus, while the authors minimize the battery energy requirements of each node, the hardware cost requirements are not taken into account in the problem analysis. The data aggregation model that is used by the authors also leaves much to be desired. In general, for most applications, it is not reasonable to assume that irrespective of the size of a cluster (which is the variable over which optimization is performed), the data packets of all the nodes in that cluster can be aggregated into a single packet of fixed size. More elaborate data aggregation models, which take into account the extent of correlation in the measured data as discussed in Mhatre and Rosenberg (2003), should be considered. In a related paper by Lindsey and Raghavendra (2002), the authors propose a data gathering scheme called PEGASIS (Power-Efficient Gathering in Sensor Information Systems) for a homogeneous sensor network. In this scheme there is a single cluster head node, and the role of cluster head is rotated periodically over all the nodes as in LEACH. The difference between PEGASIS and LEACH is that the authors of PEGASIS assume an aggregation model in which nodes are allowed to aggregate data along each hop. Thus each node receives packets from the nodes which are farther from the cluster head, and aggregates these packets along with its own packet to produce a single packet which is then sent to the next hop node. The aggregation model used in PEGASIS is even more restrictive than the model used in LEACH.
PEGASIS also requires proper scheduling of transmissions among all the nodes so that hop-by-hop aggregation is possible. 4.1.2 Minimizing communication costs by Bandyopadhyay and Coyle. In Bandyopadhyay and Coyle (2003), the authors consider a sensor network in which the nodes are distributed over a circular region, and the base station is located at the center of the region. The nodes are homogeneous, and are organized in clusters. Each node has the same communication radius. The nodes use multi-hop communication to reach the cluster head node. The authors assume a data gathering network model in which the nodes send their measured data to their respective cluster heads during each data gathering cycle. The cluster head node aggregates the received packets into a single packet,
and then sends the aggregated packet to the central base station using multi-hopping. While in LEACH, the communication paradigm between the nodes and their cluster heads, and between the cluster heads and the base station is single hopping, in this case, the communication paradigm at both the levels of hierarchy is multi-hopping. The authors use tools from stochastic geometry to obtain an expression for the energy spent in the entire network during each data gathering cycle, and minimize this to obtain the optimum number of cluster head nodes. No cluster head rotation is used. The energy minimization problem is identical to the LEACH energy minimization problem in the sense that the authors minimize the network-wide energy expenditure with respect to the number of cluster heads. An important observation that can be made about this scheme is that since the nodes and the cluster heads use multi-hopping, the nodes around the cluster heads and the nodes around the base station have the highest energy drainage due to excessive relaying of packets. While in LEACH, role rotation ensures uniform energy drainage over all the nodes, this scheme suffers from the problem of hot spot formation around the cluster head nodes and the central base station. As a result, it is the energy expenditure of the nodes in these hot-spots that determines the lifetime of the system, and this observation needs to be taken into account in the minimization problem. The work also suffers from a restrictive data aggregation model like the work discussed in Subsection 4.1.1.
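The kind of minimization performed in both LEACH and Bandyopadhyay and Coyle (2003) can be sketched numerically. Everything below, including the d² radio model, the M²/(2πk) expected squared member-to-head distance, and all parameter values, is an illustrative assumption rather than either paper's exact expression.

```python
import math

n, M, d_bs = 100, 100.0, 100.0     # nodes, region side length, distance to base station
l, mu, e_rx = 50.0, 0.01, 50.0     # per-packet electronics, amplifier, and receive costs

def cycle_energy(k):
    """Total network energy spent in one data gathering cycle with k cluster heads."""
    e_member = l + mu * M**2 / (2 * math.pi * k)  # member -> head, expected d^2 model
    e_head = (n / k) * e_rx + l + mu * d_bs**2    # receive cluster traffic + tx to BS
    return k * e_head + (n - k) * e_member

k_opt = min(range(1, n + 1), key=cycle_energy)
print(k_opt)  # -> 4: few heads mean long member hops, many heads mean many BS links
```

The interior optimum captures the trade-off described above: as k grows, member transmissions shorten but the number of expensive base-station links grows linearly.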
4.2
Design optimizations in heterogeneous sensor networks
In the previous section, we looked at some of the design optimizations in homogeneous sensor networks. In such networks, the primary objective is minimizing the battery expenditure of each node for a given lifetime. However, there is another class of sensor networks which uses two or more types of nodes. For example, with two types of nodes the type 0 nodes act as sensor nodes, while type 1 nodes act as cluster head nodes. Most of the complex hardware and software functionality can be embedded in a few type 1 nodes, while the type 0 nodes can be designed to be simple. In this section, we look at the design of such heterogeneous networks. 4.2.1 A minimum cost heterogeneous network by Mhatre et al. In Mhatre et al. (2005), the authors consider a heterogeneous clustered sensor network and a periodic data gathering network model.
The base station is located outside the region of interest. There are two types of nodes: type 0 nodes, which are pure sensor nodes, and type 1 nodes, which are cluster head nodes. The sensor nodes use short range multi-hop communication to reach the closest cluster head node. The cluster head nodes receive packets from all the nodes in their respective clusters, aggregate the received packets into a single packet, and transmit the aggregated packet to the remote base station using a direct transmission. Since the cluster head nodes require the hardware to communicate over longer distances as compared to the sensor nodes, the hardware cost of a cluster head node, α1, is higher than the hardware cost of a sensor node, α0. Since type 0 nodes use multi-hopping to reach the closest type 1 node, hot spots are formed around the type 1 nodes. The type 0 nodes which are within these hot spots, i.e., within one hop of the type 1 nodes, are called critical nodes. The critical type 0 nodes expire before the other type 0 nodes because all the packets in their cluster have to be relayed by them over the last hop, and this results in a higher relaying burden. In order to ensure a lifetime of at least T data gathering cycles, it is necessary that the type 1 nodes and the critical type 0 nodes have sufficient battery energy to last for T cycles. The authors obtain expressions for the energy expenditure of both types of nodes, and then determine the corresponding battery requirements, Ei. They then formulate an optimization problem with the following cost function:

C(λ, E) = λ0(α0 + βE0) + λ1(α1 + βE1)   (9.6)
In the above cost function, λi is the intensity (number of nodes per unit area) of type i nodes, and β is a proportionality constant, so that βEi is the cost of the battery of a type i node. Thus we note that, unlike the homogeneous network, where the objective function to be minimized is simply the battery energy, in the case of a heterogeneous network the objective function to be minimized involves the battery energy as well as the hardware cost of the multiple types of nodes. As in Bandyopadhyay and Coyle (2003), the authors use tools from stochastic geometry to obtain expressions for Ei. The cluster head nodes spend energy on receiving packets from all the nodes in their respective clusters, aggregating the packets, and then making a long range transmission to the base station. The energy expenditure of a critical type 0 node is obtained by first determining the average relaying load on a critical node. The relaying load on a critical node is simply the ratio of the average number of nodes in the cluster minus the average number of critical nodes, to the average number of critical nodes. This ratio is the average number of packets that a critical node must relay. Once
the relaying load on a critical node is determined, the required battery energy of the type 0 nodes, E0, is known. The minimum lifetime requirement of T data gathering cycles results in the following inequality constraint:

T Pi ≤ Ei,   i = 0, 1   (9.7)

where P0 is the average energy expenditure of a critical type 0 node, and P1 is the average energy expenditure of a type 1 node during a single data gathering cycle. The authors also take into account the connectivity-coverage requirements in the form of an additional constraint, which requires that the total node intensity λ0 + λ1 be greater than a threshold, to ensure node connectivity and area coverage with a probability of at least 1 − ε. Node connectivity is required for multi-hop communication to be possible. Minimizing the cost function in (9.6) with respect to λ0 and λ1 yields the optimum cluster head and sensor node intensities. The authors use the Karush-Kuhn-Tucker theorem to minimize the cost function under the equality and inequality constraints. An important limitation of the results obtained in Mhatre et al. (2005) is that the authors assume an ideal MAC in analyzing the problem, i.e., they assume that there is no energy wasted by the nodes on idle listening, and that there are no packet collisions. While this is reasonable in the case of single hop clusters as in LEACH, it is difficult to ensure in the case of multi-hop clusters. This is because the transmissions and receptions of all the nodes over all the hops need to be synchronized in a multi-hop cluster.
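A crude numeric stand-in for this cost minimization is sketched below. All expressions and numbers are our own simplifying assumptions; in particular the relaying-load and battery-energy models replace the paper's stochastic-geometry expressions, and only the fraction of cluster heads is varied, with the total intensity held fixed by the coverage constraint.

```python
T = 1000.0                          # required lifetime in data gathering cycles
a0, a1, beta = 10.0, 100.0, 0.01    # hardware costs and battery cost factor (assumed)
e_tx, e_rx, e_bs = 1.0, 1.0, 500.0  # per-packet energies (assumed)
lam_total = 1.0                     # total intensity fixed by the coverage constraint
n_crit = 6.0                        # assumed number of critical nodes per cluster

def cost(f):
    """Cost per unit area when a fraction f of the nodes are cluster heads."""
    lam1, lam0 = f * lam_total, (1 - f) * lam_total
    n_c = lam0 / lam1                            # average cluster size
    relay = max(n_c / n_crit - 1.0, 0.0)         # packets relayed by a critical node
    E0 = T * (1 + relay) * e_tx                  # battery needed by a critical node
    E1 = T * (n_c * e_rx + e_bs)                 # battery needed by a cluster head
    return lam0 * (a0 + beta * E0) + lam1 * (a1 + beta * E1)

f_opt = min((i / 100 for i in range(1, 100)), key=cost)
print(f_opt)  # interior optimum: too few heads overload the critical nodes,
              # too many heads multiply the expensive head hardware and BS links
```

With these assumed numbers the grid search settles on a small cluster head fraction, mirroring the qualitative conclusion that only a few sophisticated nodes are needed.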
4.2.2 Optimum mode of communication in a heterogeneous network, Mhatre and Rosenberg. It is well known that, in general, multi-hop communication is preferable to single hop communication, since the signal strength over a distance d falls as 1/d^k, k ≥ 2. However, in practical transceivers, each packet transmission is also associated with constant overheads due to the energy spent in the digital circuitry. In Mhatre and Rosenberg (2003), the authors consider a heterogeneous network as in Mhatre et al. (2005). However, instead of assuming that the nodes communicate with a fixed radius of communication, the authors let the radius of communication be another variable in the optimization problem. The optimization problem is formulated along the same lines as (9.6), with the minor modification that in addition to the node intensities, the radius of communication is also a variable. There are two constraints on the communication radius. Firstly, it should be greater than or equal to the minimum radius required for node connectivity, in order that multi-hop communication be possible. Secondly, the
communication radius should be smaller than the average radius of each cluster. The second constraint is required because when the communication radius becomes equal to the average radius of a cluster, the nodes can communicate with the cluster head using a single hop transmission, and such a single hop clustered network can be analyzed separately, as is done in Mhatre and Rosenberg (2003). During each data gathering cycle, nodes send their measured data to their respective cluster head nodes, which aggregate the received packets into a single packet and transmit it directly to the remote base station. The Karush-Kuhn-Tucker theorem is used for cost minimization as in Mhatre et al. (2005). In the solution of the optimization problem, if it turns out that the optimum radius of communication is equal to the average radius of a cluster, then clearly single hopping is the optimum choice for in-cluster communication. If not, then the optimum mode of communication is multi-hopping with a radius of communication R given by

R = (l / (μ(k − 1)))^(1/k)   (9.8)

where we assume a radio model in which the energy required to transmit a packet over a distance d is l + μd^k. Here k is the propagation loss exponent, and l is the fixed amount of energy that is spent in the digital circuitry during the packet transmission. Thus, for heterogeneous networks, there is an optimum radius of communication R given by (9.8) which depends only on the radio parameters of the transceiver and the propagation loss exponent. The authors also propose a hybrid mode of communication in which the nodes periodically alternate between single hopping and multi-hopping for in-cluster communication. The intuition behind this idea is that when nodes use single hopping to reach the cluster head node, the nodes that are farthest from the cluster head node have the highest energy burden.
On the other hand, when the nodes use multi-hop mode, the nodes that are closest to the cluster head node (within one hop) have the highest energy burden due to excessive packet relaying. Hence a periodic mode rotation between single hopping and multi-hopping leads to a more uniform energy drainage pattern. The exact fraction of time for which each of the modes is to be sustained is determined so that the energy expenditure profile of the hybrid mode has the same value at both its end points (see Figure 9.2). This ensures that the nodes that are burdened by single hopping and the nodes that are burdened by multi-hopping expire at about the same time. The cost minimization problem is again solved using the Karush-Kuhn-Tucker theorem.
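The optimum radius follows from minimizing the energy spent per unit distance covered, (l + μR^k)/R, under the radio model described in the text (energy l + μd^k to send a packet over a distance d). The quick numeric check below uses parameter values we assume purely for illustration.

```python
l, mu, k = 50.0, 0.01, 2.0       # electronics energy, amplifier constant, loss exponent

def energy_per_meter(R):
    return (l + mu * R**k) / R   # energy cost per unit distance when hopping every R

R_closed = (l / (mu * (k - 1))) ** (1 / k)   # stationary point of energy_per_meter
R_grid = min((r / 100.0 for r in range(1, 20001)), key=energy_per_meter)
print(R_closed, R_grid)  # both about 70.7 for these parameters
```

Setting the derivative of (l + μR^k)/R to zero gives (k − 1)μR^k = l, i.e., the closed form above; the grid search confirms it. Shorter hops waste the fixed circuitry energy l on too many transmissions, while longer hops pay the d^k amplifier penalty.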
243
244
NEXT GENERATION INTERNET
Figure 9.2. Hybrid Communication Mode (energy expenditure of a node versus its distance from the cluster head, for the single hop and multi-hop modes).
4.3
Homogeneous versus heterogeneous networks by Mhatre and Rosenberg
In Heinzelman et al. (2002), Bandyopadhyay and Coyle (2003), Mhatre et al. (2005), and Mhatre and Rosenberg (2003), the authors begin by studying either a homogeneous or a heterogeneous sensor network, and then optimize the corresponding network cost (which is just the battery cost for homogeneous networks, and the battery plus the hardware cost for heterogeneous networks). However, they do not provide any guidelines as to which of the two kinds of networks, homogeneous or heterogeneous, is the better choice. This problem is addressed in Mhatre and Rosenberg (2004), where the authors compare homogeneous and heterogeneous networks based on the overall cost of the network. The authors use the cost metric given by (9.6) for the purpose of comparison. With a homogeneous network, for example LEACH, the uniform energy drainage due to role rotation ensures that the required battery energy in each node is minimized. However, it also requires each node to have complex hardware to act as a cluster head. Thus, in the case of LEACH, the overall cost of the network, f1(α0, α1, β), is as follows:

f1(α0, α1, β) = n0(α1 + βE)   (9.9)
where n0 is the number of nodes in the network, and α1 is the hardware cost of a cluster head node. Due to role rotation, each node has to be capable of transmitting directly to the remote base station and of performing the other duties of a cluster head, and therefore the hardware cost of each node is α1. On the other hand, the cost of the corresponding heterogeneous network, f2(α0, α1, β), is as follows:

f2(α0, α1, β) = n0(α0 + βE0) + n1(α1 + βE1)   (9.10)
Note that in the above equation the complex hardware functionalities are embedded in only a few nodes (the n1 cluster head nodes), and therefore the overall hardware cost of the system is low. However, since there is no role rotation, the non-uniform energy drainage results in a higher required battery energy in each node. Thus there is a trade-off between homogeneous and heterogeneous networks in terms of the cost of the battery and the hardware. In Mhatre and Rosenberg (2004), the authors first determine the minimized costs of both homogeneous and heterogeneous networks for given settings. Then they determine the difference between these minimized costs, i.e., (9.9) − (9.10), and this serves as a guideline for the designers to choose between a homogeneous and a heterogeneous network. The authors also propose a multi-hop generalization of LEACH called M-LEACH. They note that in the original LEACH scheme, the nodes use single hopping to reach their respective cluster head nodes. However, the nodes could use multi-hopping to reach the cluster head nodes to save on battery energy by avoiding distant transmissions to the cluster heads. M-LEACH is still a scheme for homogeneous networks, but it allows for multi-hopping within the cluster.
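The comparison reduces to evaluating the two cost expressions (9.9) and (9.10) under the same settings. In the toy instance below, all numbers are assumed for illustration, including the required battery energies, which in the papers come out of the energy analysis rather than being given.

```python
a0, a1, beta = 10.0, 100.0, 0.01   # sensor hardware, head hardware, battery cost factor
n0, n1 = 100, 5                    # node counts (assumed)
E, E0, E1 = 2000.0, 5000.0, 8000.0 # required batteries: homogeneous, type 0, type 1

f1 = n0 * (a1 + beta * E)                           # homogeneous cost, as in (9.9)
f2 = n0 * (a0 + beta * E0) + n1 * (a1 + beta * E1)  # heterogeneous cost, as in (9.10)
print(f1, f2)  # -> 12000.0 6900.0: here the hardware savings outweigh the larger
               # batteries, so the heterogeneous design is cheaper
```

The sign of the difference f1 − f2 flips with the parameters, which is exactly why it serves as the design guideline discussed above.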
4.4
Other related work
In Chiasserini et al. (2002), the authors consider a single hop clustered sensor network in which the lifetime of the network is defined as the time until the first cluster head node expires. The numbers of sensor nodes and cluster heads are fixed and given. The authors address the problem of the optimal assignment of nodes to the cluster heads so as to maximize the lifetime of the network. There is no role rotation, and the topology is static. Nodes are assigned to the cluster heads so as to balance the load on all the cluster head nodes. The mode of communication within the clusters is single hopping. It is assumed that the locations of all the nodes are known, and this information is used to determine the node assignment policy for each cluster head node. However, the authors have not developed a distributed protocol for solving this problem.
5.
Conclusions
In this survey paper, we provided an overview of some of the recent work on energy and cost optimizations in wireless sensor networks. Sensor nodes are highly energy constrained, and energy efficiency is of prime importance at all the layers of the protocol stack. Different network design issues surface depending on the kind of application involved. In this survey, we restricted ourselves mainly to applications of the data gathering type. We focused our attention on two important aspects of sensor networks, namely routing and design optimizations. In the context of routing optimizations, we looked at some of the important papers on energy efficient routing for maximizing the system lifetime. Several tools from the theory of network flows were used to tackle these optimization problems. We then looked at some of the important works on design related optimization problems in sensor networks. We focused our attention on clustered sensor networks, and on the problem of cost (battery plus hardware) minimization. We noted that several optimization tools and techniques are useful in the design and dimensioning of wireless sensor networks.

Acknowledgments. This work was supported in part by an NSF grant (contract no. 0087266).
References

Ahuja, R.K., Magnanti, T.L., and Orlin, J.B. (1993). Network Flows. Prentice Hall.
Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., and Cayirci, E. (2002). Wireless sensor networks: a survey. Computer Networks, 38:393-422.
Baker, D.J. and Ephremides, A. (1981). The architectural organization of a mobile radio network via a distributed algorithm. IEEE Transactions on Communications, COM-29(11):56-73.
Bandyopadhyay, S. and Coyle, E. (2003). An energy efficient hierarchical clustering algorithm for wireless sensor networks. In: Proceedings of IEEE INFOCOM'03, San Francisco, CA.
Bhardwaj, M. and Chandrakasan, A. (2002). Bounding the lifetime of sensor networks via optimal role assignments. In: IEEE INFOCOM'02, New York, NY.
Chang, J. and Tassiulas, L. (1999). Routing for maximum lifetime in wireless ad-hoc networks. In: 37th Annual Allerton Conference on Communication, Control and Computation.
Chang, J. and Tassiulas, L. (2000a). Energy conserving routing in wireless ad-hoc networks. In: IEEE INFOCOM'00, Tel Aviv, Israel.
Chang, J. and Tassiulas, L. (2000b). Fast approximate algorithms for maximum lifetime routing in wireless ad-hoc networks. In: IFIP-TC6 Networking 2000, LNCS, Vol. 1815. Springer.
Chiasserini, C.F., Chlamtac, I., Monti, P., and Nucci, A. (2002). Energy efficient design of wireless ad hoc networks. In: Proceedings of IFIP-TC6 Networking 2002, LNCS, Vol. 2345. Springer.
Cormen, T.H., Leiserson, C.E., and Rivest, R.L. (2001). Introduction to Algorithms. Prentice Hall.
Ephremides, A., Wieselthier, E.J., and Baker, D.J. (1987). A design concept for reliable mobile radio networks with frequency hopping signaling. Proceedings of the IEEE, 75(1):56-73.
9 Energy and Cost Optimizations in Wireless Sensor Networks
247
Ettus, M. (1998). System capacity, latency and power consumption in multihop-routed SS-CDMA wireless networks. In: Proceedings of IEEE Radio and Wireless Conference (RAWCON) 98, pp. 55-58, Colorado Springs, CO.
Gallager, R.G., Humblet, P.A., and Spira, P.M. (1979). A distributed algorithm for minimum weight spanning trees. Technical Report LIDS-P-906-A, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA.
Heinzelman, W., Chandrakasan, A., and Balakrishnan, H. (2002). An application-specific protocol architecture for wireless microsensor networks. IEEE Transactions on Wireless Communications, 1(4).
Kalpakis, K., Dasgupta, K., and Namjoshi, P. (2002). Maximum lifetime data gathering and aggregation in wireless sensor networks. In: Proceedings of IEEE Networks'02 Conference.
Lindsey, S. and Raghavendra, C.S. (2002). PEGASIS: Power-efficient gathering in sensor information systems. In: IEEE Aerospace Conference, 3:3-1130, Big Sky, MT.
Meng, T.H. and Rodoplu, V. (1998). Distributed network protocols for wireless communication. In: Proceedings of 1998 IEEE Symposium on Circuits and Systems, ISCAS'98, 4:600-603, Monterey, CA.
Mhatre, V. and Rosenberg, C. (2003). Design guidelines for wireless sensor networks: Communication, clustering and aggregation. Ad Hoc Networks Journal, Elsevier Science, 2(1):45-63.
Mhatre, V. and Rosenberg, C. (2004). Homogeneous vs heterogeneous sensor networks: A comparative study. In: Proceedings of International Conference on Communications (ICC 2004), Paris, France.
Mhatre, V., Rosenberg, C.P., Kofman, D., Mazumdar, R.R., and Shroff, N.B. (2005). A minimum cost surveillance sensor network with a lifetime constraint. IEEE Transactions on Mobile Computing, 4(1):4-15.
Michail, A. and Ephremides, A. (2000). Energy efficient routing for connection-oriented traffic in ad-hoc wireless networks. In: Proceedings of IEEE PIMRC'00.
Rodoplu, V. and Meng, T.H. (1998). Minimum energy mobile wireless networks. In: Proceedings of 1998 IEEE International Conference on Communications, ICC'98, 3:1633-1639, Atlanta, GA.
Shah, R.C. and Rabaey, J.M. (2002). Energy aware routing for low energy ad hoc sensor networks. In: Proceedings of IEEE WCNC'02.
Shepard, T. (1995). Decentralized channel management in scalable multihop spread spectrum packet radio networks. Technical Report MIT/LCS/TR-670, Massachusetts Institute of Technology Laboratory for Computer Science.
Shih, E., Cho, S., Ickes, N., Min, R., Sinha, A., Wang, A., and Chandrakasan, A. (2001). Physical layer driven protocol and algorithm design for energy-efficient wireless sensor networks. In: Proceedings of ACM MobiCom'01, pp. 272-286, Rome, Italy.
Singh, S., Woo, M., and Raghavendra, C.S. (1998). Power-aware routing in mobile ad hoc networks. In: Proceedings of Fourth Annual ACM/IEEE International Conference on Mobile Computing and Networking, pp. 181-190, Dallas, TX.
Srinivasan, V., Chiasserini, C.F., Nuggehalli, P., and Rao, R. (2002). Rate allocation and traffic splits for energy efficient routing in ad hoc networks. In: IEEE INFOCOM'02, New York, NY.
Tilak, S., Abu-Ghazaleh, N.B., and Heinzelman, W. (2002). A taxonomy of wireless sensor network communication models. Mobile Computing and Communication Review, 6(2).
Zussman, G. and Segall, A. (2003). Energy efficient routing in ad hoc disaster recovery networks. In: IEEE INFOCOM'03, San Francisco, CA.
Chapter 10
DUALITY-BASED TCP CONGESTION CONTROL WITH ERROR ANALYSIS
Mortada Mehyar
Demetri Spanos
Steven H. Low

Abstract
We review optimization models for congestion control, focusing on the dual of a widely studied utility maximization problem. We show that many congestion control algorithms can be modeled within this framework, and conversely, that standard techniques for iterative optimization can inspire practical congestion control algorithms. This previous work assumes precise feedback of congestion information from the network to traffic sources. We study the effect of error in the feedback information within this duality model. Using standard inexact optimization techniques, we show that, provided the relative error is bounded, the network behavior is attracted to a region containing the point that would be optimal if exact information were available.

1.
Introduction
The primary resource of the Internet, bandwidth, is a scarce commodity that must be allocated to individual users according to pre-defined algorithms. The current architecture of the Internet addresses this allocation problem through the interaction of two types of algorithms: source algorithms, which run on a user's computer, and link algorithms, which run on the physical hardware of the network (links and routers). The function of a source algorithm is to update the transmission rate in response to perceived congestion of the network. Link algorithms, in contrast, regulate the flow of data by updating (possibly implicitly) some measure of network congestion. This source/link division reflects the inherently distributed design of the Internet, in which no centralized coordination is available.
Transmission Control Protocol (TCP), in its various forms (such as TCP-Reno Jacobson (1988); Stevens (1999) or TCP-Vegas Brakmo and Peterson (1995)), is the most common source algorithm on the current Internet. Similarly, many link algorithms exist, such as DropTail, Random Early Detection (RED) Floyd and Jacobson (1993), and Random Exponential Marking (REM) Athuraliya et al. (2001). Collectively, all these algorithms fall under the general headings of Congestion Control (for source algorithms) and Active Queue Management (for link algorithms). The interaction of these algorithms is generally very complex, and many models have been developed for their study. In particular it has been shown that several important features of this interaction can be understood as the execution of a distributed network optimization procedure. The underlying optimization problem, first presented as a model for Internet congestion control in Kelly et al. (1998), seeks an optimal allocation of transmission rates subject to bandwidth (capacity) constraints on each link. In this article, we first review results pertaining to the dual of the aforementioned network optimization problem Low and Lapsley (1999); Low (2003); Low et al. (2002). We then address an issue that arises in the application of these theoretical results in real networks (see Mehyar et al. (2004) for preliminary results). Whereas the primal problem is concerned with optimal selection of transmission rates (primal variables), the dual problem addresses selection of the congestion measures on each link (this is also called the "price"). Section 2 shows that a very general class of source and link algorithms can be viewed as a (primal-dual) algorithm for simultaneous solution of the transmission rate and link price optimization. The main result of this section is that any equilibrium of the network (which is determined by the specific choice of source and link algorithms) is a primal-dual optimum. 
This allows one to understand equilibrium properties of quite general networks in terms of optima of a single optimization problem. The work described in Section 3 reverses the approach of the previous section: rather than taking the source and link algorithms as given, it begins with the optimization problem and its dual. We show that the dual has a very interesting distributed structure, and that standard iterative techniques for solving optimization problems naturally inspire source and link algorithms (in particular, dual gradient and Newton-like algorithms). We thus see that this optimization framework for network modeling can be used either as a tool for understanding given algorithms, or as a guide for new source and link algorithms.
All the models reviewed above assume that precise link prices are fed back to sources for their rate adjustment. Section 4 addresses the fact that sources on real networks do not have direct access to congestion prices, and can only estimate them based on locally observable quantities, such as packet losses or queuing delay. We show that this non-ideal behavior at the sources can be viewed as an inexact gradient calculation. We then apply standard techniques for analysis of inexact gradient methods to characterize the behavior of the theoretically-inspired algorithms in the presence of this more realistic price-estimation scheme. We prove that as long as the relative error is bounded, the optimization flow control scheme will still "converge" in the sense that it will drive the link utilization to an attraction region. The core of the argument is that reduction of the dual function can still be achieved in the presence of inexact gradient calculation. The analysis also suggests an optimal choice of stepsize that guarantees the greatest decrease in the dual function at each iteration. Finally, we apply these results to typical types of error in price feedback and illustrate the attraction region and stepsize optimality with numerical examples.
2.
Duality model
A network is modeled as a set L of links with finite capacities c = (c_l, l ∈ L). They are shared by a set S of sources indexed by s. Each source s uses a set L_s ⊆ L of links. The sets L_s define an L × S routing matrix¹

    R_{ls} = 1 if l ∈ L_s,  and  R_{ls} = 0 otherwise

Associated with each source s is its transmission rate x_s(t) at time t, in packets/sec. Associated with each link l is a scalar congestion measure p_l(t) ≥ 0 at time t. Following the notation of Paganini et al. (2001), let y_l(t) = Σ_{s∈S(l)} x_s(t) be the aggregate source rate at link l and let q_s(t) = Σ_{l∈L_s} p_l(t) be the end-to-end congestion measure for source s. In vector notation, we have (^T denotes transpose)

    y(t) = R x(t),    q(t) = R^T p(t)

Here, x(t) = (x_s(t), s ∈ S) and q(t) = (q_s(t), s ∈ S) are in ℝ_+^S, and y(t) = (y_l(t), l ∈ L) and p(t) = (p_l(t), l ∈ L) are in ℝ_+^L (ℝ_+ denotes the non-negative reals).

¹ We abuse notation to use L and S to denote sets and their cardinality.

Source s can observe its own rate x_s(t) and the end-to-end congestion measure q_s(t) of its path, but not the vector x(t) or p(t), nor other components of q(t). Similarly, link l can observe just the local congestion p_l(t) and flow rate y_l(t). The source rate x_s(t) is adjusted in each period according to a function F_s based only on x_s(t) and q_s(t): for all s,

    x_s(t+1) = F_s(x_s(t), q_s(t))                                      (10.1)

On each link l, the congestion measure p_l(t) is adjusted in each period based only on p_l(t) and y_l(t), and possibly some internal (vector) variable v_l(t), such as the queue length at link l. This can be modeled by some functions (G_l, H_l): for all l,

    p_l(t+1) = G_l(y_l(t), p_l(t), v_l(t))                              (10.2)
    v_l(t+1) = H_l(y_l(t), p_l(t), v_l(t))                              (10.3)
where G_l is non-negative so that p_l(t) ≥ 0. Here, F_s models TCP algorithms (e.g., Reno or Vegas) and (G_l, H_l) model AQMs (e.g., RED, REM); see the next section. We will often refer to AQMs by G_l, without explicit reference to the internal variable v_l(t) or its adaptation H_l. We assume that (10.1)-(10.3) has a set of equilibria (x, p). The fixed point of (10.1) defines an implicit relation between the equilibrium rate x_s and the end-to-end congestion measure q_s:

    x_s = F_s(x_s, q_s)

Assume F_s is continuously differentiable and ∂F_s/∂q_s ≠ 0 in the open set A := {(x_s, q_s) | x_s > 0, q_s > 0}. Then, by the implicit function theorem, there exists a unique continuously differentiable function f_s from {x_s > 0} to {q_s > 0} such that

    q_s = f_s(x_s) > 0                                                  (10.4)

To extend the mapping between x_s and q_s to the closure of A, define

    f_s(0) = inf {q_s ≥ 0 | F_s(0, q_s) = 0}                            (10.5)

possibly ∞. If (x_s, 0) is an equilibrium point, F_s(x_s, 0) = x_s, then define

    f_s(x_s) = 0                                                        (10.6)
Define the utility function of each source s as

    U_s(x_s) = ∫ f_s(x_s) dx_s,    x_s ≥ 0                              (10.7)
It is unique up to a constant. Being an integral, U_s is a continuous function. Since f_s(x_s) = q_s ≥ 0 for all x_s, U_s is nondecreasing. It is reasonable to assume that f_s is a nonincreasing function - the more severe the congestion, the smaller the rate. This implies that U_s is concave. If f_s is strictly decreasing, then U_s is strictly concave since U_s''(x_s) < 0. An increasing utility function implies a greedy source - a larger rate yields a higher utility - and concavity implies diminishing returns. Now consider the problem of maximizing aggregate utility formulated in Kelly et al. (1998):

    max_{x ≥ 0}  Σ_s U_s(x_s)
    subject to   Rx ≤ c                                                 (10.8)
The constraint says that, at each link l, the flow rate y_l does not exceed the capacity c_l. An optimal rate vector x* exists since the objective function in (10.8) is continuous and the feasible solution set is compact. It is unique if the U_s are strictly concave. As the sources are coupled through the shared links (the capacity constraint), solving for x* directly may require coordination among possibly all sources, and hence is infeasible in a large network. The key to understanding the equilibrium of (10.1)-(10.3) is to regard x(t) as primal variables, p(t) as dual variables, and (F, G) = (F_s, G_l, s ∈ S, l ∈ L) as a distributed primal-dual algorithm to solve the primal problem (10.8) and its Lagrangian dual (see Low and Lapsley (1999)):

    min_{p ≥ 0}  Σ_s max_{x_s ≥ 0} (U_s(x_s) - x_s q_s) + Σ_l p_l c_l   (10.9)
Hence, the dual variable is a precise measure of congestion in the network. The dual problem has an optimal solution since the primal problem is feasible. We will interpret the equilibria (x*, p*) of (10.1)-(10.3) as solutions of the primal and dual problems, with (F, G) iterating on both the primal and dual variables together in an attempt to solve both problems. We summarize the assumptions on (F, G, H):

C1: For all s ∈ S and l ∈ L, F_s and G_l are non-negative functions. Moreover, equilibrium points of (10.1)-(10.3) exist.
C2: For all s ∈ S, F_s is continuously differentiable and ∂F_s/∂q_s ≠ 0 in {(x_s, q_s) | x_s > 0, q_s > 0}; moreover, f_s in (10.4) is nonincreasing.
C3: If p_l = G_l(y_l, p_l, v_l) and v_l = H_l(y_l, p_l, v_l), then y_l ≤ c_l, with equality if p_l > 0.
C4: For all s ∈ S, f_s is strictly decreasing.

Condition C1 guarantees that (x(t), p(t)) ≥ 0 and (x*, p*) ≥ 0. C2 guarantees the existence and concavity of the utility function U_s. C3 guarantees primal feasibility and complementary slackness of (x*, p*). Finally, condition C4 guarantees the uniqueness of the optimal x*. The following theorem is proved in Low (2003).

THEOREM 10.1 Suppose assumptions C1 and C2 hold. Let (x*, p*) be an equilibrium of (10.1)-(10.3). Then (x*, p*) solves the primal problem (10.8) and the dual problem (10.9) with utility function given by (10.7) if and only if C3 holds. Moreover, if assumption C4 holds as well, then the U_s are strictly concave and the optimal rate vector x* is unique.
Proof. The discussion after the definition (10.7) of U_s proves the second claim when C4 holds, so we only prove the first claim. By duality theory (e.g., Bertsekas (1995)), (x*, p*) is primal-dual optimal if and only if x* is primal feasible, p* is dual feasible, complementary slackness holds, and

    x* = arg max_{x ≥ 0} L(x, p*)                                       (10.10)

where L is the Lagrangian of (10.8), defined as

    L(x, p) = Σ_s U_s(x_s) + Σ_l p_l (c_l - y_l) = Σ_s (U_s(x_s) - x_s q_s) + Σ_l p_l c_l

Hence, to prove the first claim, we only need to establish (10.10). Now

    max_{x ≥ 0} L(x, p*) = Σ_s max_{x_s ≥ 0} (U_s(x_s) - x_s q_s*) + Σ_l p_l* c_l

By construction of U_s, we have from (10.7) and (10.4) that, for any equilibrium (x*, p*) at which x_s* > 0,

    U_s'(x_s*) = f_s(x_s*) = q_s*                                       (10.11)
Note that if q_s* = 0, then (10.11) holds by (10.6). If x_s* = 0, we have from (10.5)

    U_s'(0) = f_s(0) ≤ q_s*                                             (10.12)

But (10.11)-(10.12) imply that

    U_s'(x_s*) ≤ q_s*,  with equality if x_s* > 0

Since L(x, p*) is concave in x, this is the necessary and sufficient Karush-Kuhn-Tucker condition for x* to maximize L(x, p*) over x ≥ 0. Hence the proof is complete. •

In Table 10.1, we give some examples of specific source and link algorithms. The parameter D_s represents the round-trip time of a packet. In Vegas, α_s is a scalar parameter and d_s is the propagation delay. Finally, the various quantities appearing in the AQM table are internal variables and system parameters, which we do not discuss here; see Low (2003) for details.

It is prudent to summarize what we have done thus far. We have examined a general class of update algorithms for the rates, prices, and internal variables. We then showed how to construct an optimization problem of the form presented in Kelly et al. (1998), and that equilibria of the source and link algorithms correspond to optima. Note that we had to assume the existence of equilibria in order to obtain this formalism. Certainly, there is no guarantee that the algorithms we have modeled have equilibria. On the other hand, the assumptions we have made about the algorithms are so general that one cannot really expect a guarantee of an equilibrium. Nonetheless, we would like to understand at least some algorithms which have a provable (and stable) equilibrium at the optimum. This is the motivation of the next section, in which we begin with the formal optimization problem and demonstrate source and link algorithms (based on a gradient-projection scheme) which provably converge to the primal-dual optimal point. Perhaps more importantly, the gradient-projection scheme gives rise to distributed source and link algorithms, and in this regard can serve as an inspiration for practical algorithms on a real network.
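In code, the general class of updates examined in this section amounts to iterating the three maps (10.1)-(10.3) over a routing matrix. The following minimal sketch is illustrative only: the concrete F, G, H below (a log-utility-style source relaxation, a queue-like price update, and a smoothed load estimate) are stand-ins invented for this demo, not any actual TCP/AQM pair.

```python
import numpy as np

# Abstract shape of the dynamics (10.1)-(10.3): each source iterates
# x_s(t+1) = F_s(x_s(t), q_s(t)); each link iterates
# p_l(t+1) = G_l(y_l(t), p_l(t), v_l(t)), v_l(t+1) = H_l(...).
# Toy topology: source 0 uses both links, sources 1 and 2 use one each.
R = np.array([[1, 1, 0],
              [1, 0, 1]], dtype=float)   # L x S routing matrix
c = np.array([1.0, 2.0])                 # link capacities

def F(x, q):
    # Source update: relax toward x_s = 1/q_s (a log-utility responder).
    return np.maximum(x + 0.1 * (1.0 - q * x), 0.01)

def G(y, p, v):
    # Price update: grow when the link is overloaded, shrink otherwise.
    return np.maximum(p + 0.1 * (y - c), 0.0)

def H(y, p, v):
    # Internal variable: a smoothed load estimate (unused by F and G here).
    return 0.9 * v + 0.1 * y

x, p, v = np.ones(3), np.zeros(2), np.zeros(2)
for t in range(5000):
    q = R.T @ p          # end-to-end price seen by each source
    y = R @ x            # aggregate rate on each link
    x, p, v = F(x, q), G(y, p, v), H(y, p, v)

print("equilibrium rates:", x, "prices:", p)
```

At the fixed point, x_s = 1/q_s and both link loads settle at capacity with positive prices, in line with condition C3 (primal feasibility and complementary slackness).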
3.
Optimization flow control
We will now discuss the construction of flow control algorithms based directly on the network optimization problem. That is, we view the optimization formalism as representing some kind of design goal, and seek source and link algorithms which provably converge to the optimum. We will demonstrate a dual gradient algorithm for solving this optimization, which results in corresponding source and link algorithms that are inherently distributed. We will show global convergence of this algorithm in a synchronous implementation. Analogous asynchronous results can also be obtained, but we will omit them as they are quite technical.

Table 10.1. Some common TCP/AQM algorithms and their associated functions: for the TCP algorithms (Reno, in two models, and Vegas), the source update F_s(x_s(t), q_s(t)) and the corresponding utility function (for Vegas, U_s(x_s) = α_s d_s log x_s); for the AQM algorithms (RED, REM, and Delay), the price update G_l(y_l(t), p_l(t), v_l(t)) and the internal-variable update H_l(y_l(t), p_l(t), v_l(t)). The notation [z]⁺ signifies max{z, 0}.

We now introduce some additional notation and assumptions. Recall that for each source s, q_s (the s-th component of p^T R) is the path bandwidth price that s faces. Let x_s(p) be the unique maximizer in (10.10). We will abuse notation and use x_s(·) both as a function of the scalar price q_s ∈ ℝ_+ and of the vector price p ∈ ℝ_+^L. When the argument of x_s(·) is a scalar, by the Karush-Kuhn-Tucker theorem, x_s(q_s) is given by

    x_s(q_s) = [U_s'^{-1}(q_s)]_{m_s}^{M_s}                             (10.13)
where [z]_a^b = min{max{z, a}, b}. Here U_s'^{-1} is the inverse of U_s', which exists over the range [U_s'(M_s), U_s'(m_s)] since U_s' is continuous and U_s strictly concave (condition A1 below). When the argument of x_s(·) is a vector, x_s(p) = x_s(q_s); the meaning should be clear from the context. Also, x(p) = (x_s(q_s), s ∈ S). We make the following assumptions regarding the utility functions:

A1: On the interval I_s = [m_s, M_s], the utility functions U_s are increasing, strictly concave, and twice continuously differentiable. For feasibility, assume Σ_{s∈S(l)} m_s ≤ c_l for all l.
A2: The curvatures of U_s are bounded away from zero on I_s: -U_s''(x_s) ≥ 1/α_s > 0 for all x_s ∈ I_s.

We will make use of the notation L̄ := max_s |L(s)|, S̄ := max_l |S(l)|, and ᾱ := max_s α_s.
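For instance, with the common choice U_s(x_s) = w_s log x_s (so U_s'(x_s) = w_s/x_s), the projection (10.13) reduces to clipping w_s/q_s to [m_s, M_s]. A short sketch, where the weight and rate bounds are illustrative values, not taken from the chapter:

```python
def source_rate(q_s, w_s=1.0, m_s=0.1, M_s=5.0):
    """x_s(q_s) = [U_s'^{-1}(q_s)]_{m_s}^{M_s} for U_s(x) = w_s * log(x)."""
    if q_s <= 0:
        return M_s                  # zero price: the cap M_s binds
    unconstrained = w_s / q_s       # U_s'^{-1}(q) = w_s / q
    return min(max(unconstrained, m_s), M_s)
```

A higher path price q_s yields a lower rate until the floor m_s binds; as q_s → 0 the cap M_s binds, so the response is well defined for every non-negative price.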
3.1
Synchronous distributed algorithm
In this section we present the basic source and link algorithms and prove their convergence under conditions A1 and A2. From now on we assume that the algorithms are synchronous (i.e., all sources update at the same time, and so do the links). The asynchronous version of this algorithm can be shown to converge to the optimum as well; the proof can be found in Low and Lapsley (1999). Recalling that x_s(p) is given by (10.13), the dual function, which we will denote by D(p), is thus

    D(p) = Σ_s (U_s(x_s(p)) - q_s x_s(p)) + Σ_l p_l c_l

We will solve the dual problem using a gradient projection method (e.g., Luenberger (1984); Bertsekas and Tsitsiklis (1989)) where link prices are adjusted in the opposite direction to the gradient ∇D(p):

    p_l(t+1) = [p_l(t) - γ (∂D/∂p_l)(p(t))]⁺                            (10.14)

Here γ > 0 is a stepsize, and [z]⁺ = max{z, 0}. Since the U_s are strictly concave, D(p) is continuously differentiable (Bertsekas and Tsitsiklis (1989)) with derivatives given by

    (∂D/∂p_l)(p) = c_l - y_l(p)                                         (10.15)

where y_l(p) = Σ_{s∈S(l)} x_s(p). Substituting (10.15) into (10.14), we obtain the following price adjustment rule for link l ∈ L:

    p_l(t+1) = [p_l(t) + γ (y_l(t) - c_l)]⁺                             (10.16)
The decentralized nature of (10.16) is striking: though the dual problem is not separable in p, given the aggregate source rate y_l that goes through link l, the adjustment algorithm (10.16) is completely distributed and can be implemented by individual links using only local information. We summarize:
Algorithm: Synchronous gradient projection

Link l's algorithm: At times t = 1, 2, ..., link l:
1. Receives rates x_s(t) from all sources s ∈ S(l) that go through link l.
2. Computes a new price

       p_l(t+1) = [p_l(t) + γ (y_l(t) - c_l)]⁺

3. Communicates the new price p_l(t+1) to all sources s ∈ S(l) that use link l.

Source s's algorithm: At times t = 1, 2, ..., source s:
1. Receives from the network the sum q_s(t) of the link prices in its path.
2. Chooses a new transmission rate for the next period:

       x_s(t+1) = x_s(q_s(t))

3. Communicates the new rate x_s(t+1) to the links l ∈ L(s) in its path.

The convergence of this algorithm is shown in Low and Lapsley (1999).

THEOREM 10.2 Suppose assumptions A1-A2 hold and the stepsize satisfies 0 < γ < 2/(ᾱ L̄ S̄). Then, starting from any initial rates m ≤ x(0) ≤ M and prices p(0) ≥ 0, every accumulation point (x*, p*) of the sequence (x(t), p(t)) generated by the algorithm is primal-dual optimal.
Proof. (Sketch) Assumptions A1 and A2 can be shown to imply that the following Lipschitz condition holds:

    ||∇D(p) - ∇D(p')||₂ ≤ ᾱ L̄ S̄ ||p - p'||₂    for all p, p' ≥ 0

(Low and Lapsley, 1999). This is a sufficient condition for convergence of the dual gradient algorithm (Bertsekas, 1995), and convergence to a primal-dual optimal point follows from the concavity assumptions. •

We have thus accomplished our goal of constructing source and link algorithms which provably converge to the optimum. The fact that these algorithms are also naturally distributed makes the result even more interesting, as it suggests applications in realistic algorithms on real networks.
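As a concrete illustration of the synchronous algorithm, the following sketch simulates a toy two-link, three-source network with log utilities U_s(x) = w_s log x. All parameter values are assumptions made for this demo, and the stepsize is simply chosen small enough for stable behavior on this example (the bound of Theorem 10.2 is conservative):

```python
import numpy as np

# Synchronous gradient projection on a toy network: source 0 uses both
# links, sources 1 and 2 use one link each.  Illustrative parameters.
R = np.array([[1, 1, 0],
              [1, 0, 1]], dtype=float)           # L x S routing matrix
c = np.array([1.0, 2.0])                         # capacities c_l
w = np.array([1.0, 1.0, 1.0])                    # U_s(x) = w_s log x
m, M = 1e-3, 10.0                                # rate bounds [m_s, M_s]
gamma = 0.05                                     # stepsize

p = np.zeros(2)                                  # link prices (dual variables)
for t in range(5000):
    q = R.T @ p                                  # path price q_s = sum of link prices
    x = np.clip(w / np.maximum(q, 1e-9), m, M)   # source response (10.13)
    y = R @ x                                    # aggregate link rates
    p = np.maximum(p + gamma * (y - c), 0.0)     # link update (10.16)

print("rates:", np.round(x, 3), "prices:", np.round(p, 3))
```

At the fixed point, both capacity constraints are tight and both prices are positive, i.e., primal feasibility and complementary slackness hold, as Theorem 10.2 predicts for an accumulation point.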
4.
Optimization flow control with estimation error
One practical drawback of the proposed dual gradient method is its reliance on explicit communication of price information. Schemes such as RED Floyd and Jacobson (1993) and REM Athuraliya et al. (2001) use a congestion-based queue management protocol which, in the above context, amounts to an implicit price notification scheme. Although this mechanism is more practical than the explicit transmission of price information, it suffers from various errors inherent in the implicit price notification. One particular source of error inherent to any physical implementation is the limited information available to individual sources, i.e., the receipt of acknowledgments and the round trip time (RTT) for each packet transmitted. The prices (congestion measures) in these two cases are, respectively, loss probability and queueing delay. The exact price is either very hard to estimate (loss probability) or very noisy (queueing delay). Our aim in this section is to understand the effects of such errors on the performance of the dual gradient algorithm; see Mehyar et al. (2004).
4.1
Price estimation error as inexact gradient
It is evident that the price update in the algorithm of Section 3 is dependent on the rate update. Thus, an erroneous rate update will result in a corresponding error in the price update. In particular, the direction of the price update will, in general, not be along the gradient direction, but along some perturbed direction. Thus, the effect of inexact price estimation at the sources amounts to an inexact calculation of the gradient at the links. The advantage of the inexact gradient viewpoint is that it allows us to embed the phenomenon of price estimation error in the optimization flow control framework. We will show that in the presence of error, the above algorithm will still drive the system to a region around the optimum, under a slight modification of the stepsize bound.
4.2
Attraction under inexact gradient
In this section we will characterize the steady-state dynamics of link utilization in terms of an attraction region, which we define as follows:

Definition: A set A_l ⊆ ℝ_+ is called an attraction region for link l if there exists an integer N such that, for all initial conditions (source rates and link prices), y_l(n) ∈ A_l for some n less than N.
We remark on two important subtleties. First, this definition does not require that the trajectory remain within the attraction region after entering; it is thus not required to be an invariant set. Second, the definition does imply that if the trajectory ever leaves the attraction region, it will return to it within N steps.

We will show that as long as the relative error is bounded, the optimization flow control scheme will still "converge" in the sense that it will drive the link utilization to an attraction region. The core of the following argument is that reduction of the dual function can still be achieved in the presence of inexact gradient calculation. At each time t, the l-th component of the exact gradient is given by

    g_l(t) = c_l - Σ_{s∈S(l)} U_s'^{-1}(q_s(t))

Let v_s(t) be the estimation error at each source, and define the estimated price at each source by q̂_s(t) = q_s(t) + v_s(t). Hence, the rate update is x_s(t) = U_s'^{-1}(q̂_s(t)). Thus, the inexact gradient that link l actually uses is

    g̃_l(t) = c_l - Σ_{s∈S(l)} U_s'^{-1}(q̂_s(t))

The error in the l-th component of the gradient is therefore bounded by

    |g̃_l(t) - g_l(t)| ≤ Σ_{s∈S(l)} |U_s'^{-1}(q̂_s(t)) - U_s'^{-1}(q_s(t))| ≤ Σ_{s∈S(l)} α_s |v_s(t)|

where 1/α_s is the lower bound on the curvature of U_s(x), so that α_s is a global Lipschitz constant for U_s'^{-1} by the Mean Value Theorem. The following is a sufficient condition which guarantees that the inexact gradient will still be in a descent direction:

    Σ_{s∈S(l)} α_s |v_s(t)| ≤ η |c_l - y_l(t)|,    for all l            (10.17)

where 0 < η < 1 can be thought of as the relative error. This condition simply ensures that the error is not large enough to completely negate
the gradient, and so the dual function can still be reduced in the direction of the inexact gradient. When inequality (10.17) is not satisfied, no conclusion can be drawn, as the above condition is merely sufficient for convergence. Nonetheless, we can show that the region where (10.17) fails, i.e., where

    Σ_{s∈S(l)} α_s |v_s(t)| > η |c_l - y_l(t)|,    for some l           (10.18)

holds, contains an attraction region.

THEOREM 10.3 The solution set of (10.18) is an attraction region, provided

    0 < γ < 2(1 - η)/(ᾱ L̄ S̄)                                           (10.19)
Proof. The condition in the definition of an attraction region will be verified in two steps:

a) Choice of stepsize. Since ∇D = g (the exact gradient) is Lipschitz with Lipschitz constant ᾱ L̄ S̄ (Low and Lapsley, 1999), the Descent Lemma (Bertsekas (1995), Proposition A.24) implies

    D(p - γ g̃) ≤ D(p) - γ ⟨g, g̃⟩ + (ᾱ L̄ S̄ / 2) γ² ||g̃||²              (10.20)

where ⟨·,·⟩ is the Euclidean inner product and ||·|| is the Euclidean norm. Then we see that

    0 < γ < 2 ⟨g, g̃⟩ / (ᾱ L̄ S̄ ||g̃||²)

guarantees that the change in the dual function ΔD := D(p - γ g̃) - D(p) is strictly less than 0. Therefore, when (10.17) holds, the best bound on γ that guarantees descent is the solution of the following optimization problem:

    min_{g̃}  2 ⟨g, g̃⟩ / (ᾱ L̄ S̄ ||g̃||²)
    subject to  |g_l - g̃_l| ≤ η |g_l|,  for all l
Since ⟨g, g̃⟩ = Σ_l g_l g̃_l, it is easy to see that the minimum occurs at the point where

    ⟨g, g̃⟩ = (1 - η) ||g̃||²                                            (10.21)

and therefore the minimum bound for γ that guarantees descent is 2(1 - η)/(ᾱ L̄ S̄).

b) Entry in finite steps. From (10.20) we see that the minimal decrease of the dual function in each step is

    ΔD ≤ -γ ⟨g, g̃⟩ + (ᾱ L̄ S̄ / 2) γ² ||g̃||² ≤ (-γ (1 - η) + (ᾱ L̄ S̄ / 2) γ²) ||g̃||²      (10.22)

where ||g̃||² is strictly positive since (10.17) holds and |v_s(t)| is in general not identically zero (else there would be no estimation error). Therefore, as long as 0 < γ < 2(1 - η)/(ᾱ L̄ S̄), the dual function is decreased by a finite amount in each iteration. Now, since the primal problem is, by hypothesis, feasible, the dual function is lower bounded (Bertsekas, 1995). Therefore the inequality (10.17) must fail after a finite number of steps, or it would contradict the fact that the dual function is lower bounded. In other words, (10.18) must hold after a finite number of steps, i.e., the trajectory of y_l(t) enters the solution set of (10.18) after a finite number of steps. •

Some comments on the relationship between η and γ are now in order. First, note that larger values of η will result in smaller solution sets for (10.18). Of course, in order to guarantee that this is an attraction region, (10.19) must be satisfied. So, a larger η corresponds to a tighter attraction region, but demands a smaller γ. Conversely, given a choice of γ, (10.19) constrains the maximal η consistent with the above analysis, and hence the smallest obtainable attraction region. The above proof demonstrates that convergence can be guaranteed with any γ satisfying (10.19), but like the convergence proof in Low and Lapsley (1999), it does not suggest a criterion for selecting γ within this range. As is usual in iterative optimization algorithms, there is a tradeoff between taking larger steps at each iteration (i.e., selecting a large γ) and ensuring that the (inexact) gradient remains a good predictor of local function behavior (i.e., selecting a small γ).
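To see Theorem 10.3 at work numerically, one can rerun the dual gradient iteration of Section 3 with a bounded error injected into each source's price estimate. Everything below (the network, log utilities, the uniform noise model, and all parameter values) is an illustrative assumption; the point is only that link utilization is driven into, and stays near, a band around capacity rather than converging exactly:

```python
import numpy as np

# Dual gradient flow control with noisy price feedback: each source
# sees q_s(t) + v_s(t) instead of the true path price q_s(t).
rng = np.random.default_rng(0)
R = np.array([[1, 1, 0],
              [1, 0, 1]], dtype=float)
c = np.array([1.0, 2.0])
w = np.ones(3)                                    # U_s(x) = w_s log x
m, M = 1e-3, 10.0
gamma = 0.02

p = np.zeros(2)
loads = []
for t in range(8000):
    q = R.T @ p
    v = rng.uniform(-0.05, 0.05, size=3)          # bounded error v_s(t)
    q_hat = np.maximum(q + v, 0.0)                # estimated price at each source
    x = np.clip(w / np.maximum(q_hat, 1e-9), m, M)
    y = R @ x                                     # link l's inexact gradient is c_l - y_l
    p = np.maximum(p + gamma * (y - c), 0.0)
    loads.append(y)

tail = np.array(loads[-2000:])                    # steady-state behavior
print("mean load:", tail.mean(axis=0), "max |load - c|:", np.abs(tail - c).max(axis=0))
```

The trailing iterates hover in a neighborhood of the capacities: the error keeps the trajectory from settling at the exact optimum, but it remains attracted to a region around it.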
It turns out that we can obtain a satisfactory optimality result which strongly suggests choosing

    γ_opt = (1 - η)/(ᾱ L̄ S̄)

which happens to be half of the bound imposed by the convergence criterion.
10.4 The choice of step size ^opt has the following properties:
a) Worst-case optimality. It is the worst-case optimal γ, in the sense that it maximizes the Lipschitz-bounded progress in the dual function at each iteration.

b) Superiority to smaller γ. At each iteration, γ_opt generates a new price with a smaller value of the dual function than any smaller choice of γ.

Proof. a) From (10.22), the worst-case progress of the dual function is

\[
-\gamma\Big((1-\eta) - \frac{\gamma\,\alpha LS}{2}\Big)\|g\|^2.
\]

A simple calculation shows that the minimum of this quadratic as a function of γ occurs when γ = γ_opt.

b) Consider the directional derivative

\[
\frac{d}{d\gamma}D(p - \gamma g) = -\langle \nabla D(p - \gamma g),\, g\rangle.
\]

The magnitude of the difference between the directional derivatives at p and p − γg is

\[
\big|\langle \nabla D(p - \gamma g) - \nabla D(p),\, g\rangle\big| \;\le\; \alpha LS\,\gamma\,\|g\|^2.
\]

Here we have used the Cauchy-Schwarz inequality and the Lipschitz bound. This in turn implies:

\[
\frac{d}{d\gamma}D(p-\gamma g) \;\le\; -\langle \nabla D(p),\, g\rangle + \alpha LS\,\gamma\,\|g\|^2
\;\le\; -(1-\eta)\|g\|^2 + \alpha LS\,\gamma\,\|g\|^2
\;<\; 0 \quad \text{for } \gamma < \gamma_{\mathrm{opt}}.
\]
In the second last line we have applied (10.21), since we are only interested in points where the algorithm has not driven the system to the attraction region and can hence provably decrease the dual function at each iteration. Thus, whenever γ is chosen to be smaller than γ_opt, the derivative of the dual function with respect to γ is strictly negative. This implies that γ_opt achieves a greater decrease in the dual function at each iteration than any smaller γ. □
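As a quick numeric illustration of part a) (the values of η, αLS, and ‖g‖² below are hypothetical), the worst-case per-iteration progress γ((1−η) − γαLS/2)‖g‖² is a concave quadratic in γ whose maximizer is γ_opt = (1−η)/αLS, half of the descent bound:

```python
# Illustrative check of Theorem 10.4(a); eta, aLS and g2 are hypothetical values.
eta, aLS, g2 = 0.2, 5.0, 1.0

def progress(gamma):
    # Worst-case decrease of the dual function per iteration, from (10.22).
    return gamma * ((1.0 - eta) - gamma * aLS / 2.0) * g2

gamma_opt = (1.0 - eta) / aLS          # half of the descent bound 2(1-eta)/aLS
grid = [k * 1e-4 for k in range(int(2 * (1.0 - eta) / aLS / 1e-4) + 1)]
best = max(grid, key=progress)          # numeric maximizer over [0, 2(1-eta)/aLS]
```

At the endpoints of the admissible interval the guaranteed progress vanishes, which is why the midpoint γ_opt is the worst-case optimal choice.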
4.3 Quantization error of marking
Theorems 10.3 and 10.4 give us some general understanding of link utilization and stepsize selection in the presence of price estimation error. Note that we have not made any assumption on what the error process v_s(t) is. We will now try to model v_s(t) by looking at some typical situations where errors occur. When the congestion measure is loss probability, e.g., when routers implement RED or REM, during each RTT the source s sends out one window of w_s packets and has to estimate q_s by observing how many packets are dropped or marked. The fraction of packets lost is an instantaneous estimator of q_s and is subject to two kinds of errors: quantization and probabilistic fluctuation. For example, if w_s = 4, then q_s ∈ {0, 0.25, 0.5, 0.75, 1}. So, if the actual price takes some intermediate value, say q_s = 0.3, the closest one could estimate would be q_s = 0.25. We call this the quantization error. Further, due to the probabilistic nature of the dropping scheme, we could get (albeit with lower probability), say, q_s = 1 as the estimate of q_s and incur a larger error in the effective gradient. We call this the fluctuation error. It can be seen that, if only the quantization error is present, |v_s(t)| will be bounded by 1/(2w_s(t)) = 1/(2d_s x_s(t)), where d_s is the RTT of source s, which is assumed to be constant. Using this specific error model, condition (10.18) becomes
\[
\sum_{s\in S(l)} \frac{a_s}{2\,d_s x_s(t)} \;\ge\; \eta\,\Big|\,c_l - \sum_{s\in S(l)} x_s(t)\Big|.
\]

In the single-source-single-link case this reduces to a ≥ 2ηd x(t)|c − x(t)|, and we can solve the inequality and see that the attraction region is given by

\[
\frac{c}{2} + \sqrt{\frac{c^2}{4} - \frac{a}{2\eta d}} \;\le\; x(t) \;\le\; \frac{c}{2} + \sqrt{\frac{c^2}{4} + \frac{a}{2\eta d}}.
\]
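The quantization effect is easy to reproduce numerically; the following sketch (window size w = 4 as in the example above) checks that the observable loss fractions form the grid {0, 0.25, 0.5, 0.75, 1} and that the quantization error never exceeds 1/(2w):

```python
# Quantization of the per-RTT loss-fraction estimate: with a window of w packets
# the observable fractions are multiples of 1/w, so the estimation error due to
# quantization alone is at most 1/(2w).
def quantize(q, w):
    return round(q * w) / w            # nearest observable fraction

w = 4
levels = sorted({quantize(k / 100, w) for k in range(101)})
worst = max(abs(quantize(q / 1000, w) - q / 1000) for q in range(1001))
```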
Similarly, we can find the corresponding attraction region for a general network (as in Mehyar et al. (2004)), where m_l = min_{s∈S(l)} m_s, M_l = max_{s∈S(l)} M_s, and |S(l)| is the number of sources sharing link l.

To illustrate the effects of quantization error, we present some simple simulation results. We use quadratic utility functions of the form

\[
U(x) = \frac{Mx}{M-m} - \frac{x^2}{2(M-m)}.
\]
Note that this function is strictly concave, and satisfies 0 < U′(x) < 1 for m < x < M. Further, the constant a is given by M − m. Our first example is a two-source one-link network. The capacity is set to 70, and the utility functions are such that source 1 has M₁ = 50 and m₁ = 1, while source 2 uses a different pair (M₂, m₂). Figure 10.1 shows the predicted bound of the attraction region as well as the aggregate rate x₁ + x₂ on the link. We observe that the aggregate rate dynamics are nicely characterized by the attraction region.

Another common type of error that we will look at is the aforementioned fluctuation error. As in RED or REM, we assume that packets are dropped (or marked) independently, according to the current price (drop probability) on each link. The observed fraction of lost packets at each source is therefore binomially distributed, with standard deviation σ equal to √(q_s(1 − q_s)/(d_s x_s)). We make a deterministic 3σ approximation, so that the attraction region is characterized by

\[
3a\,\sqrt{\frac{q(1-q)}{d\,x(t)}} \;\ge\; \eta\,|c - x(t)|.
\]

This implies

\[
\frac{3a}{2\sqrt{d\,x(t)}} \;\ge\; \eta\,|c - x(t)|,
\]

since √(q_s(1 − q_s)) ≤ 1/2. By squaring both sides we obtain a cubic function of x_s, and the solution to the inequality can be readily computed. Figure 10.2 shows a simulation result in a single-source-single-link network where fluctuation error is present.
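A small Monte Carlo check of the fluctuation-error model (all numbers are illustrative; w stands for the per-RTT window w_s = d_s x_s):

```python
import random

# Check that the per-RTT loss fraction is binomial with std sqrt(q(1-q)/w).
random.seed(1)
q, w, trials = 0.3, 100, 20000
fracs = []
for _ in range(trials):
    lost = sum(1 for _ in range(w) if random.random() < q)   # marked packets
    fracs.append(lost / w)
mean = sum(fracs) / trials
std = (sum((f - mean) ** 2 for f in fracs) / trials) ** 0.5
sigma_pred = (q * (1 - q) / w) ** 0.5                        # predicted std
```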
Figure 10.1. Simulation with a two-source one-link network. Note that the sources use different utility functions.
Figure 10.2. Aggregate rate dynamics with probabilistic packet dropping and the 3σ attraction region.
Figure 10.3. Convergence behavior with different choices of γ when quantization error is present.

Finally, we present a single-source-single-link simulation illustrating the stepsize dependence discussed in Theorem 10.4. Here we again use the quantization error from our first simulation, and show the rate dynamics for various choices of the stepsize. We indeed observe that all choices below γ_opt are outperformed. We do not see any clear superiority of larger values, but we again remark that the optimality is based on a worst-case analysis. Finally, we note that increasingly large oscillations ensue as we increase the stepsize. Although this behavior is typical of fixed-stepsize iterative schemes, it is not guaranteed by our analysis.
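The single-source-single-link dynamics behind these figures can be sketched in a few lines of Python. The parameter values below are hypothetical: the source implements x = U′⁻¹(q̂) for the quadratic utility of Section 4.3, the link performs the price (gradient) update, and the price estimate q̂ is quantized to the 1/w grid as above:

```python
# Dual (price) iteration with quantized price feedback; all parameters are
# illustrative choices, not taken from the chapter's simulations.
M, m, c, d = 50.0, 1.0, 30.0, 1.0     # utility parameters, capacity, RTT
a = M - m                             # slope of the rate response U'^{-1}

def run(gamma, T=1000):
    p, x, xs = 0.0, M, []
    for _ in range(T):
        w = max(1, round(d * x))              # packets observed in one RTT
        q_hat = min(1.0, round(p * w) / w)    # quantized price estimate
        x = min(max(M - a * q_hat, m), M)     # source rate x = U'^{-1}(q_hat)
        p = max(0.0, p + gamma * (x - c))     # gradient step on the price
        xs.append(x)
    return xs

tail = run(0.005)[-200:]                      # rates after transients
```

With a small stepsize the rate settles into a narrow band around the capacity c, whose width is set by the quantization step a/w, mirroring the attraction-region picture above.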
5. Summary
In this article, we review dual-based modeling and algorithm design in congestion control applications, and study the effects of error in price estimation, which inevitably arises in real networks. A wide class of source and link algorithms naturally give rise to a network optimization problem through their equilibrium properties. This provides a useful mathematical formalism for interpreting the behavior of various congestion control algorithms, but relies on the assumption that some equilibrium exists. The dual problem, through its distributed structure, naturally inspires source and link algorithms which provably converge to an equilibrium at the optimal point. These algorithms, however, assume that sources are explicitly notified of the precise link prices. We have examined the effects of imperfect price communication in the dual gradient algorithm. We utilized the fact that price estimation error is equivalent to an inexact gradient calculation, and hence were able to characterize the dynamics of the network in terms of a region containing the optimum.

Acknowledgments. We acknowledge the financial support of NSF, ARO, and AFOSR.
References

Athuraliya, S., Li, V.H., Low, S.H., and Yin, Q. (2001). REM: Active queue management. IEEE Network, 15(3):48-53. Extended version in Proceedings of ITC-17, Salvador, Brazil, September 2001. http://netlab.caltech.edu.

Bertsekas, D.P. (1995). Nonlinear Programming. Athena Scientific.

Bertsekas, D.P. and Tsitsiklis, J.N. (1989). Parallel and Distributed Computation. Prentice-Hall.

Brakmo, L.S. and Peterson, L.L. (1995). TCP Vegas: End-to-end congestion avoidance on a global Internet. IEEE Journal on Selected Areas in Communications, 13(8):1465-1480. http://cs.princeton.edu/nsg/papers/jsac-vegas.ps.

Floyd, S. and Jacobson, V. (1993). Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1(4):397-413. ftp://ftp.ee.lbl.gov/papers/early.ps.gz.

Jacobson, V. (1988). Congestion avoidance and control. In: Proceedings of SIGCOMM'88, ACM. An updated version is available via ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

Kelly, F.P., Maulloo, A., and Tan, D. (1998). Rate control for communication networks: Shadow prices, proportional fairness and stability. Journal of the Operational Research Society, 49(3):237-252.

Low, S.H. and Lapsley, D.E. (1999). Optimization flow control, I: Basic algorithm and convergence. IEEE/ACM Transactions on Networking, 7(6):861-874. http://netlab.caltech.edu.

Low, S.H. (2003). A duality model of TCP and queue management algorithms. IEEE/ACM Transactions on Networking, 11(4):525-536.

Low, S.H., Peterson, L., and Wang, L. (2002). Understanding Vegas: A duality model. Journal of the ACM, 49(2):207-235. http://netlab.caltech.edu.

Luenberger, D.G. (1984). Linear and Nonlinear Programming, 2nd Edition. Addison-Wesley Publishing Company.

Mehyar, M., Spanos, D., and Low, S.H. (2004). Optimization flow control with estimation error. In: Proceedings of IEEE Infocom'04. http://netlab.caltech.edu/pub/papers/estimation-infocom04.pdf.

Paganini, F., Doyle, J.C., and Low, S.H. (2001). Scalable laws for stable network congestion control. In: Proceedings of the Conference on Decision and Control. http://www.ee.ucla.edu/~paganini.

Stevens, W. (1999). TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, 15th printing.
Chapter 11 FAST ALGORITHMIC SOLUTIONS TO MULTI-DIMENSIONAL BIRTH-DEATH PROCESSES WITH APPLICATIONS TO TELECOMMUNICATION SYSTEMS L. D. Servi Dedicated to the memory of Professor Julian Keilson (1924-1999)
Abstract    Performance analysis of many future telecommunication systems necessitates numerically solving large multi-dimensional birth-death equations when analytical approaches fail. Motivated by an optical IP network access problem, this paper presents a new class of algorithms, faster than all known alternatives, which can be specialized to two-dimensional skip-free systems or applied to systems of arbitrary dimension which are possibly non-skip-free.

1. Introduction
This paper begins with a telecommunication example which precludes exact analytical analysis due to its non-classical structure. This motivates the development of exact numerical algorithms, faster than all known alternatives, to first find the steady state solution to skip-free two-dimensional birth-death equations and then to solve systems of arbitrary dimension which are possibly non-skip-free.

An example: Consider an Internet Protocol (IP) access network, where N access routers are connected via an optical network to the global IP network through a gateway router, cf. Narula-Tam et al. (2001). Access Router i has p_i ports that connect to an optical network infrastructure. The gateway router has a sufficient number of ports to the optical network, P, as well as to the backbone, so that one can assume there is no output queueing at the gateway router.

Figure 11.1. An optical access network

Each port has a tunable optical transmitter and receiver and can transmit data over a single wavelength. The reconfiguration problem consists of determining which access router ports should be connected to the gateway router ports using which wavelengths. We assume that the optical network infrastructure is equipped with at least P wavelengths; thus the feasibility of the electronic layer topologies depends only upon the port restrictions rather than on wavelength or physical topology restrictions. If we assume that the wavelengths associated with the transceivers on the gateway router ports are fixed, then reconfiguring the electronic layer topology requires re-tuning the transmitter on one edge router port and stopping transmission on another edge router port. The process of reconfiguring the electronic layer topology results in a period of time during which the wavelength being re-tuned cannot be used. The design issue is to determine which access router ports should be tuned to which wavelengths. If the wavelength assignments are static, then the system is not responsive to dynamic changes in traffic demand. On the other hand, if the wavelength assignments are excessively responsive to traffic demand changes, then the time required to re-tune ports, during which the wavelength cannot be used by any port, will excessively waste precious resources. Servi and Finn (2002) proposed assigning some permanent wavelengths to each access router port but setting aside some roving wavelengths to be tuned to one router until its queue is empty, then
cyclically tuned to the next router until its queue is empty, and so on, for each router. Given the parameters of the system (the arrival rates and service rates), we seek to optimize the number of roving wavelengths with respect to the weighted average of the queue lengths at the various ports. The critical step in this optimization is to characterize the steady state queue length as a function of the parameters of the system and the number of roving wavelengths. An approximate analytical approach, based on a working vacation model, was used by Servi and Finn (2002). However, as described in more detail in Section 3, an exact numerical approach is possible.

Motivated by this and other applications, in Section 1.1 skip-free Markovian processes are formally introduced, first in two dimensions and then in an arbitrary dimensional setting, along with the defining birth-death equations which characterize their steady state solution. Next, in Section 1.2, previous algorithmic approaches are briefly reviewed. Section 2 introduces a new class of algorithmic solutions, which are specialized to the two dimensional setting in Section 2.1 and an arbitrary dimensional setting in Section 2.2. Finally, Section 2.3 demonstrates that the restriction of being a skip-free process can be removed without loss of generality. Next, in Section 3 the motivating example, the optical IP access network, is discussed in more precise terms, as well as two wireless communication examples. The focus of this paper is methodological and hence no empirical results will be presented.

Appendix A introduces a scaling variant of the algorithm in Section 2.2 which improves its numerical stability. This scaling idea can be applied to all algorithms described in this paper. The algorithm in Section 2.2 implicitly assumes that A_{j,j+1} is nonsingular for all j. This is not always the case. Appendix B presents a variant of this algorithm which was found to be useful in practice to circumvent this obstacle. Finally, Appendix C presents a new algorithm to solve AX = B where A is tridiagonal, which is both fast and, as described in Appendix C, will not fail where a corresponding algorithm in Section 2.6 of Press et al. (1988) will. This algorithm is a cornerstone subroutine of all other algorithms presented and may have interest in its own right.
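To give the flavor of such a kernel, here is the textbook Thomas-type elimination for one tridiagonal system T x = d (a sketch only; the point of Appendix C's variant is precisely to avoid failures of this naive version, e.g., when a pivot vanishes):

```python
# Standard Thomas-type elimination for T x = d with T tridiagonal.
# Conventions: sub[i] multiplies x[i-1] in row i (sub[0] unused) and
# sup[i] multiplies x[i+1] (sup[n-1] unused).  No pivoting is performed,
# so this sketch can fail on systems a more careful variant would handle.
def tridiag_solve(sub, diag, sup, d):
    n = len(diag)
    c, y = [0.0] * n, [0.0] * n
    c[0] = sup[0] / diag[0]
    y[0] = d[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - sub[i] * c[i - 1]
        c[i] = (sup[i] / denom) if i < n - 1 else 0.0
        y[i] = (d[i] - sub[i] * y[i - 1]) / denom
    x = [0.0] * n
    x[-1] = y[-1]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - c[i] * x[i + 1]
    return x

# Small worked example: T has diagonals (1,1), (2,3,2), (1,1) and
# the right-hand side is chosen so the exact solution is (1, 2, 3).
x = tridiag_solve([0.0, 1.0, 1.0], [2.0, 3.0, 2.0], [1.0, 1.0, 0.0],
                  [4.0, 10.0, 8.0])
```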
1.1 Skip-free birth-death processes
Consider a two dimensional Markovian process Z(t) on a state space {0,...,n₁} × {0,...,n₀}. The process Z(t) is defined to be skip-free if all transitions of Z(t) occur with each component increasing or decreasing by at most 1, i.e., transitions occur from (j,k) to a different state (j′,k′) where |k − k′| ≤ 1 and |j − j′| ≤ 1. Such processes are governed by a set of (n₀+1) × (n₀+1) tridiagonal matrices of infinitesimal rates, A_{j,j′}, such that
\[
\Pr\big[Z(t+h) = (j',k') \mid Z(t) = (j,k)\big] = [A_{j,j'}]_{k,k'}\,h + o(h), \qquad \text{for } (j,k)\ne(j',k'). \qquad (11.1)
\]
Two dimensional skip-free birth-death processes have been investigated extensively by Keilson et al. (1981, 1987), and have ties to the matrix geometric work of Neuts (1981). Let [A_{j,j}]_{k,k} be the negative of the total probability flow rate out of state (j,k), so that

\[
\sum_{k'} \big[A_{j,j-1} + A_{j,j} + A_{j,j+1}\big]_{k,k'} = 0 \qquad \text{for all } k \text{ and } j. \qquad (11.2)
\]
Since Z(t) is a conservative finite state space Markovian process, it has a stationary distribution, which will be unique if the dynamics is irreducible. The stationary distribution can be expressed in terms of the (n₀+1) dimensional vectors

\[
e_j^T = \big(\Pr[Z(\infty) = (j,0)],\,\ldots,\,\Pr[Z(\infty) = (j,n_0)]\big) \qquad \text{for all } j.
\]
The balance of flow at each of the states (j,0),...,(j,n₀) implies that e_j satisfies

\[
\chi_{\{j>0\}}\,e_{j-1}^T A_{j-1,j} + e_j^T A_{j,j} + \chi_{\{j<n_1\}}\,e_{j+1}^T A_{j+1,j} = 0^T \qquad \text{for } j = 0,\ldots,n_1, \qquad (11.3)
\]

where 0^T = (0,...,0) and χ(·) is the indicator function. Finally, note that

\[
\sum_{j=0}^{n_1} e_j^T \mathbf{1} = 1, \qquad (11.4)
\]

with 1^T = (1,...,1), is the normalization condition for the system. Before presenting a fast algorithm to solve (11.3) and (11.4), we generalize (11.3) to higher dimensions using similar notation, which we will also later solve.
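Equations (11.2)-(11.3) can be checked concretely on a toy chain. The example below (all rates illustrative) takes two independent M/M/1/K queues, for which the stationary distribution is a known truncated-geometric product form; the code builds the blocks A_{j,j′} and verifies that the product form satisfies the balance equations:

```python
# Toy two-dimensional skip-free chain: two independent M/M/1/K queues.
lam1, mu1 = 1.0, 2.0          # birth/death rates of the level index j
lam0, mu0 = 1.0, 3.0          # birth/death rates of the phase index k
n1, n0 = 2, 2                 # state space {0..n1} x {0..n0}

def block(j, jp):
    """Rate block A_{j,jp}; tridiagonal for jp == j, diagonal otherwise."""
    A = [[0.0] * (n0 + 1) for _ in range(n0 + 1)]
    for k in range(n0 + 1):
        if jp == j + 1 and j < n1:
            A[k][k] = lam1
        elif jp == j - 1 and j > 0:
            A[k][k] = mu1
        elif jp == j:
            if k < n0:
                A[k][k + 1] = lam0
            if k > 0:
                A[k][k - 1] = mu0
            A[k][k] = -((lam1 if j < n1 else 0.0) + (mu1 if j > 0 else 0.0)
                        + (lam0 if k < n0 else 0.0) + (mu0 if k > 0 else 0.0))
    return A

def vecmat(v, A):
    return [sum(v[i] * A[i][k] for i in range(len(v))) for k in range(len(A[0]))]

# Product-form stationary vector (unnormalized): e_{j,k} = rho1^j * rho0^k.
rho1, rho0 = lam1 / mu1, lam0 / mu0
e = [[rho1 ** j * rho0 ** k for k in range(n0 + 1)] for j in range(n1 + 1)]

# Balance (11.3): e_{j-1} A_{j-1,j} + e_j A_{j,j} + e_{j+1} A_{j+1,j} = 0^T.
residuals = []
for j in range(n1 + 1):
    r = [0.0] * (n0 + 1)
    for jp in (j - 1, j, j + 1):
        if 0 <= jp <= n1:
            r = [a + b for a, b in zip(r, vecmat(e[jp], block(jp, j)))]
    residuals.append(r)
```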
In particular, suppose j = (j₁,...,j_H) and j′ = (j′₁,...,j′_H), and define the matrix A_{j,j′} such that the infinitesimal generator for the probability flow from (j,k) to (j′,k′) is [A_{j,j′}]_{k,k′}. If the system is skip-free in the last component, then the matrices A_{j,j′} are all tridiagonal. Define the space

J = {j : j_h ∈ {0,1,...,n_h} for all h ∈ {1,2,...,H}}

and a neighborhood of j as

N(j) = {j′ : j′ ∈ J and [j − j′]_h ∈ {−1,0,1} for all h ∈ {1,2,...,H}}.
In three dimensions, i.e., H = 2, equation (11.3) generalizes to equating the steady state probability flow into j to that out of j, i.e., for j₁ = 0,...,n₁ and j₂ = 0,...,n₂,

\[
\sum_{j_1'=j_1-1}^{\,j_1+1}\;\sum_{j_2'=j_2-1}^{\,j_2+1} \chi_{\{(j_1',j_2')\in J\}}\; e_{(j_1',j_2')}^T\, A_{(j_1',j_2'),(j_1,j_2)} \;=\; 0^T. \qquad (11.5)
\]
For an arbitrary dimension, equation (11.3) and (11.5) can be concisely expressed as ejf AfJ = 0T,
for all j G J
(11.6)
with the normalization constraint, generalizing (11.4),
Rather than solve (11.6) and (11.7) directly, we will instead first solve (11.6) with the additional artificial constraint

\[
\sum_{(j,k)\in\mathcal{M}} e_{j,k} = 1, \qquad (11.8)
\]

where 𝓜 is a pre-specified set, and then renormalize the resulting solution to ensure that (11.7) is ultimately satisfied.

Equation (11.3) is equivalent to

\[
e^T A = 0^T, \qquad (11.9)
\]

where e^T = (e₀^T, e₁^T, ..., e_{n₁}^T) is a 1 × (n₀+1)(n₁+1) vector, and

\[
A = \begin{pmatrix}
A_{0,0} & A_{0,1} & 0 & \cdots & 0 & 0\\
A_{1,0} & A_{1,1} & A_{1,2} & \cdots & 0 & 0\\
0 & A_{2,1} & A_{2,2} & \cdots & 0 & 0\\
\vdots & & \ddots & \ddots & \ddots & \vdots\\
0 & 0 & \cdots & A_{n_1-1,n_1-2} & A_{n_1-1,n_1-1} & A_{n_1-1,n_1}\\
0 & 0 & \cdots & 0 & A_{n_1,n_1-1} & A_{n_1,n_1}
\end{pmatrix}
\]

is a (n₀+1)(n₁+1) × (n₀+1)(n₁+1) matrix which is block tridiagonal, and each block is itself a tridiagonal matrix. Equations (11.5) and (11.6) can also be put in the form of (11.9), where now A will be a recursively tridiagonal matrix, i.e., a block tridiagonal matrix where each block is block tridiagonal, and in turn each of these blocks is block tridiagonal, with this recursive form repeated one or more times.
1.2 Previous two-dimensional algorithms
Solving eᵀA = b can be performed using direct approaches and iterative approaches. Below is a brief review; the literature on existing algorithms is too vast to cover in much detail, cf. Engeln-Müllges and Uhlig (1996), Press et al. (1988), and Stewart (1994).

Direct methods, such as Gaussian elimination, the Gauss-Jordan method, or LU decomposition methods using, for example, Crout's algorithm, involve different variations of the following steps. The key idea is to exploit the observation that eᵀA = b and AP = LU imply xᵀU = bP, where eᵀL = xᵀ. The algorithm is as follows:
1. Determine a suitable permutation matrix P.
2. Find a lower triangular matrix L and an upper triangular matrix U such that AP = LU.
3. Use a simple forward substitution to find x such that xᵀU = bP.
4. Use a simple backward substitution to find e such that eᵀL = xᵀ.
If e has (n₀+1)(n₁+1) components, then the computation time of these algorithms is O(n₀³n₁³).

Block Gaussian elimination algorithms apply the above approach to blocks, cf. Engeln-Müllges and Uhlig (1996), p. 121, and Keilson et al. (1981, 1987). Gaver et al. (1984) found regenerative interpretations of such algorithms, and Grassmann and Heyman (1990) and Grassmann et al. (1985) also extended the methods to infinite state systems with repeated rows. As applied to solving (11.3) and (11.4), one variation of the block Gaussian elimination algorithm is as follows. The basic approach is to transform A to a block lower triangular matrix and then solve the backward equations. More precisely, one can use (11.3) to inductively prove that step 2 below results in e_{j+1}ᵀA_{j+1,j} + e_jᵀĀ_{j,j} = 0ᵀ for j = 0,...,n₁−1. Analogous to (11.8), one initially assumes a normalization for e_{n₁} and then renormalizes; here this amounts to initially assuming e_{n₁}ᵀĀ_{n₁,n₁} = (0,0,...,0,1). The algorithm is therefore as follows:
1. Set Ā₀,₀ = A₀,₀.
2. For j = 1,...,n₁: Set Ā_{j,j} = A_{j,j} − A_{j,j−1}(Ā_{j−1,j−1})⁻¹A_{j−1,j}.
3. Set e_{n₁}ᵀ = (0,...,0,1)(Ā_{n₁,n₁})⁻¹.
4. For j = n₁−1,...,0: Set e_jᵀ = −e_{j+1}ᵀA_{j+1,j}(Ā_{j,j})⁻¹.
5. Reset e_j = e_j / Σ_{j′=0}^{n₁} e_{j′}ᵀ1, i.e., normalize the probabilities.
This is an O(n₁n₀³) algorithm, as the critical step requires inverting n₁ different n₀ × n₀ matrices. An essential characteristic of this algorithm is that although the A_{j,j}'s are tridiagonal matrices, the Ā_{j,j}'s typically lose this structure. Section 2 describes algorithms which maintain the tridiagonal structure throughout their execution and hence require fewer multiplications.
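The block Gaussian elimination steps above can be exercised end to end on a toy chain (two independent M/M/1/K queues with illustrative rates, so the answer is a known product form). One liberty is taken with step 3: the final block Ā_{n₁,n₁} has zero row sums and is singular in exact arithmetic, so instead of inverting it we pin the scale by replacing its redundant last column with a unit column; steps 1, 2, 4, and 5 follow the listing verbatim:

```python
# Toy chain: two independent M/M/1/K queues (illustrative rates), so the
# stationary distribution is a known truncated-geometric product form.
lam1, mu1, lam0, mu0 = 1.0, 2.0, 1.0, 3.0
n1, n0 = 2, 2
N = n0 + 1

def block(j, jp):
    """Rate block A_{j,jp} of the generator (tridiagonal when jp == j)."""
    A = [[0.0] * N for _ in range(N)]
    for k in range(N):
        if jp == j + 1 and j < n1:
            A[k][k] = lam1
        elif jp == j - 1 and j > 0:
            A[k][k] = mu1
        elif jp == j:
            if k < n0:
                A[k][k + 1] = lam0
            if k > 0:
                A[k][k - 1] = mu0
            A[k][k] = -((lam1 if j < n1 else 0.0) + (mu1 if j > 0 else 0.0)
                        + (lam0 if k < n0 else 0.0) + (mu0 if k > 0 else 0.0))
    return A

def solve(M, rhs):
    """Gaussian elimination with partial pivoting for M x = rhs."""
    n = len(M)
    a = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(c + 1, n):
            fac = a[r][c] / a[c][c]
            for cc in range(c, n + 1):
                a[r][cc] -= fac * a[c][cc]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][k] * x[k] for k in range(i + 1, n))) / a[i][i]
    return x

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Steps 1-2: Schur-complement recursion for the Abar_{j,j}.
Abar = [block(0, 0)]
for j in range(1, n1 + 1):
    U = block(j - 1, j)
    X = transpose([solve(Abar[j - 1], [U[r][c] for r in range(N)]) for c in range(N)])
    C = matmul(block(j, j - 1), X)
    D = block(j, j)
    Abar.append([[D[r][c] - C[r][c] for c in range(N)] for r in range(N)])

# Step 3: e_{n1} Abar_{n1,n1} = 0; replace the redundant last column by a
# unit column to pin the scale (e_{n1,n0} = 1) instead of inverting it.
Afix = [row[:] for row in Abar[n1]]
for r in range(N):
    Afix[r][N - 1] = 1.0 if r == N - 1 else 0.0
e = [None] * (n1 + 1)
e[n1] = solve(transpose(Afix), [0.0] * (N - 1) + [1.0])

# Step 4: e_j = -e_{j+1} A_{j+1,j} Abar_{j,j}^{-1}, backward in j.
for j in range(n1 - 1, -1, -1):
    Adown = block(j + 1, j)
    v = [sum(e[j + 1][i] * Adown[i][c] for i in range(N)) for c in range(N)]
    e[j] = [-t for t in solve(transpose(Abar[j]), v)]

# Step 5: renormalize so the probabilities sum to one.
tot = sum(sum(row) for row in e)
e = [[v / tot for v in row] for row in e]
```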
Iterative methods, such as the Gauss-Seidel method with relaxation (setting ω = 1 gives no relaxation), entail iteratively evaluating e_i(t) for t = 1, 2, ..., using the equation

\[
e_i(t) = \frac{\omega}{[A]_{i,i}}\Big(b_i - \sum_{k=1}^{i-1} e_k(t)\,[A]_{k,i} - \sum_{k=i+1}^{n} e_k(t-1)\,[A]_{k,i}\Big) + (1-\omega)\,e_i(t-1).
\]

This algorithm, as well as the related Jacobi method, requires O(n₀²n₁²) operations per iteration and will approach the solution of eᵀA = b at a speed related to the eigenvalues of A and the value of ω chosen, cf. Stewart (1994), Section 3.2.3, and Engeln-Müllges and Uhlig (1996). To be effective, it is important to preprocess the matrix A to reduce its condition number.
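A minimal sketch of the Gauss-Seidel sweep for a left-hand system eᵀA = b (here a well-conditioned 3 × 3 example with ω = 1; within a sweep, updated components are used immediately, as in the displayed iteration):

```python
# Gauss-Seidel sweeps for x^T A = b; the 3x3 system below is illustrative,
# with b chosen so that the exact solution is x = (1, 2, 3).
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]
x = [0.0, 0.0, 0.0]
for sweep in range(50):
    for i in range(3):
        # Column i of x^T A = b, solved for x[i] with the other components fixed.
        s = sum(x[k] * A[k][i] for k in range(3) if k != i)
        x[i] = (b[i] - s) / A[i][i]
```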
2. A new faster class of algorithms
This paper introduces a new class of fast algorithms to solve (11.6) and (11.7), and hence (11.3) and (11.4). Although the implementation details for each algorithm in this class differ, the fundamental approach for each is as follows. Alternative implementation details of this approach will be described in Sections 2.1 and 2.2 and Appendix B.

STEP 0. Preliminaries: Specify a carefully chosen set, 𝓜, such that the probability of being in an arbitrary state can be expressed as a linear combination of the probabilities of being in each of the states in 𝓜. Specifically, define fᵀ as the 1 × |𝓜| vector of the steady state probabilities of being in each of the states in 𝓜 (listed in lexicographic ordering¹), and then assume the existence of the |𝓜| × (n₀+1) matrix R_j such that

\[
e_j^T = f^T R_j \qquad \text{for } j \in J. \qquad (11.10)
\]

For example, in two dimensions, where the state index reduces to the scalar j, if 𝓜 = {(0,0),...,(0,n₀)}, so that fᵀ = (e_{0,0}, e_{0,1},...,e_{0,n₀}), then (11.10) specializes to the matrix geometric form e_j = e₀ᵀR_j explored in Neuts (1981) and the vast literature which followed. More generally, the matrices R_j can also be interpreted as a Green's function, as motivated by Keilson (1965), in that they relate probabilities on a boundary, in this case 𝓜, to interior state probabilities.

¹By lexicographic ordering we mean that (j₁,j₂,...,j_H,k) is listed before (j′₁,j′₂,...,j′_H,k′) if for some ℓ, j_ℓ < j′_ℓ and j_i = j′_i for i = 1,...,ℓ−1, or if k < k′ and j_i = j′_i for all i.
From (11.6) and (11.10),

\[
f^T\Big(\sum_{j'\in N(j)} R_{j'}\,A_{j',j}\Big) = 0^T \qquad \text{for all } j \in J. \qquad (11.11)
\]

A sufficient condition for (11.11) is

\[
\sum_{j'\in N(j)} R_{j'}\,A_{j',j} = 0 \qquad \text{for all } j \in J. \qquad (11.12)
\]
STEP 1. Initial conditions: Evaluate (11.10) for (j,k) ∈ 𝓜, i.e.,

\[
[e_j]_k = [f^T R_j]_k \qquad \text{for } (j,k)\in \mathcal{M}. \qquad (11.13)
\]

Since f contains only the steady state probabilities of being in 𝓜, this is a self-consistency equation which can be used to identify some of the R_j.

STEP 2. Recursively solve for all R_j: Equation (11.12) forms the basis of a recursive algorithm for efficiently solving for the R_j's if no singularities are encountered. (If singularities are found, an alternative choice of 𝓜 must be used.) To compute the R_j's, not all of (11.12) is required: instead, it typically suffices to use only

\[
\Big[\sum_{j'\in N(j)} R_{j'}\,A_{j',j}\Big]_{\cdot,k} = 0 \qquad \text{for } (j,k)\notin n-\mathcal{M}, \qquad (11.14)
\]

where n = (n₁, n₂, ..., n_{H−1}, n_H, n₀), and n − 𝓜 denotes the set obtained by subtracting the elements of 𝓜 componentwise from n.

STEP 3. Find fᵀ: Find fᵀ using (11.11) and

\[
f^T\mathbf{1} = 1 \qquad (11.15)
\]

(which is equivalent to (11.8)). Since (11.12) implies (11.11), only the subset of (11.11) not used in Step 2 offers linearly independent information about f. Specifically, the subset of value is

\[
f^T\Big[\sum_{j'\in N(j)} R_{j'}\,A_{j',j}\Big]_{\cdot,k} = 0 \qquad \text{for } (j,k)\in n-\mathcal{M}. \qquad (11.16)
\]

STEP 4. Find e_j: Recursively solve for e_j using (11.6). (Alternatively, one could solve for e_j using (11.10), but this will typically require more memory, as the R_j's would have to be saved.)
STEP 5. Renormalize: Reset e_j = e_j / Σ_{j′∈J} e_{j′}ᵀ1.

2.1 A basic two dimensional algorithm
For the two dimensional problem (with H = 1), if A_{j+1,j} is nonsingular for j = 0,1,...,n₁−1, then, using the above steps of Section 2, a simple, very fast, and memory efficient algorithm can be developed for solving (11.3) and (11.4).

Algorithmic development: Let J = {0,1,...,n₁} and 𝓜 = {(0,0),...,(0,n₀)}, so that fᵀ = (e_{0,0}, e_{0,1},...,e_{0,n₀}) = e₀ᵀ. In this case (11.13) implies e_{0,k} = Σ_i e_{0,i}[R₀]_{i,k}, or e₀ = e₀R₀, so

\[
R_0 = I, \qquad (11.17)
\]

equation (11.14) implies

\[
R_{j+1} = -\big(R_{j-1}A_{j-1,j} + R_j A_{j,j}\big)\big(A_{j+1,j}\big)^{-1} \qquad \text{for } j = 0,\ldots,n_1-1, \qquad (11.18)
\]

with R_{−1} ≡ 0, (11.15) and (11.16) imply

\[
f^T\mathbf{1} = 1 \quad\text{and}\quad f^T\big(R_{n_1-1}A_{n_1-1,n_1} + R_{n_1}A_{n_1,n_1}\big) = 0^T, \qquad (11.19)
\]

and (11.6) implies

\[
e_{j+1}^T = -\big(e_{j-1}^T A_{j-1,j} + e_j^T A_{j,j}\big)\big(A_{j+1,j}\big)^{-1} \qquad \text{for } j = 0,\ldots,n_1-1, \qquad (11.20)
\]

with e_{−1} ≡ 0. The resulting algorithm is

ALGORITHM 1:
1. Initialize R₀ using (11.17).
2. For j = 0,...,n₁−1: Solve for R_{j+1} using (11.18).
3. Solve for fᵀ using (11.19).
4. For j = 0,...,n₁−1: Solve for e_{j+1} using (11.20) and e₀ᵀ = fᵀ.
5. Reset e_jᵀ = e_jᵀ / Σ_{j′=0}^{n₁} e_{j′}ᵀ1.
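ALGORITHM 1 can be exercised end to end on a toy chain: two independent M/M/1/K queues (illustrative rates), for which A_{j−1,j} = λ₁I and A_{j+1,j} = μ₁I, so the inverses in (11.18) and (11.20) are trivial, and the stationary distribution is a known product form against which the result can be checked. One liberty is taken in step 3: the singular system of (11.19) is resolved by replacing its redundant last column with the normalization fᵀ1 = 1:

```python
lam1, mu1, lam0, mu0 = 1.0, 2.0, 1.0, 3.0   # illustrative rates
n1, n0 = 3, 2
N = n0 + 1

def diag_block(j):
    """A_{j,j}: tridiagonal in k, total outflow rate on the diagonal."""
    A = [[0.0] * N for _ in range(N)]
    for k in range(N):
        if k < n0:
            A[k][k + 1] = lam0
        if k > 0:
            A[k][k - 1] = mu0
        A[k][k] = -((lam1 if j < n1 else 0.0) + (mu1 if j > 0 else 0.0)
                    + (lam0 if k < n0 else 0.0) + (mu0 if k > 0 else 0.0))
    return A

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def solve(M, rhs):
    """Gaussian elimination with partial pivoting for M x = rhs."""
    n = len(M)
    a = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(c + 1, n):
            fac = a[r][c] / a[c][c]
            for cc in range(c, n + 1):
                a[r][cc] -= fac * a[c][cc]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][k] * x[k] for k in range(i + 1, n))) / a[i][i]
    return x

# Steps 1-2: R_0 = I and, since A_{j-1,j} = lam1*I and A_{j+1,j} = mu1*I here,
# (11.18) reduces to R_{j+1} = -(lam1*R_{j-1} + R_j A_{j,j}) / mu1.
R = [[[1.0 if i == k else 0.0 for k in range(N)] for i in range(N)]]
for j in range(n1):
    S = matmul(R[j], diag_block(j))
    if j > 0:
        S = [[S[r][c] + lam1 * R[j - 1][r][c] for c in range(N)] for r in range(N)]
    R.append([[-S[r][c] / mu1 for c in range(N)] for r in range(N)])

# Step 3: f (R_{n1-1} A_{n1-1,n1} + R_{n1} A_{n1,n1}) = 0^T with f 1 = 1;
# the redundant last column of the singular system becomes a column of ones.
B = matmul(R[n1], diag_block(n1))
B = [[B[r][c] + lam1 * R[n1 - 1][r][c] for c in range(N)] for r in range(N)]
for r in range(N):
    B[r][N - 1] = 1.0
f = solve([list(col) for col in zip(*B)], [0.0] * (N - 1) + [1.0])

# Step 4: e_0 = f and e_{j+1} = -(lam1*e_{j-1} + e_j A_{j,j}) / mu1 from (11.20).
e = [f]
for j in range(n1):
    D = diag_block(j)
    v = [sum(e[j][i] * D[i][c] for i in range(N)) for c in range(N)]
    if j > 0:
        v = [v[c] + lam1 * e[j - 1][c] for c in range(N)]
    e.append([-t / mu1 for t in v])

# Step 5: renormalize.
tot = sum(sum(row) for row in e)
e = [[x / tot for x in row] for row in e]
```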
Computation time: This algorithm requires O(Max[n₁n₀², n₀³]) multiplications: step 2 requires computing LU factors for n₁ different n₀ × n₀ tridiagonal matrices and applying them (taking O(n₁n₀²), using the algorithm in Appendix C); step 3 inverts one non-tridiagonal n₀ × n₀ matrix (taking O(n₀³)); and step 4 multiplies an n₀ × 1 vector by an n₀ × n₀ tridiagonal matrix 3n₁ times (taking O(n₁n₀)). Without loss of generality we can assume n₀ ≤ n₁, making ALGORITHM 1 an O(n₁n₀²) algorithm. This is fewer multiplications than required by a Gaussian algorithm, O(n₁³n₀³), a block Gaussian algorithm, O(n₁n₀³), or the GTH algorithm, O(n₁³n₀³). In addition to storing the A's, the limiting memory requirement for this algorithm is the storage of 3 different (n₀+1) × (n₀+1) matrices for steps 2 and 3.

This algorithm, as with the other new algorithms presented in this paper, performs some subtractions which, unlike the GTH algorithm (Grassmann et al., 1985), may raise numerical hazards. This hazard is overcome in Servi et al. (2004) using very high precision arithmetic. Numerical stability is also improved using the scaling approach described in Appendix A.

As noted at the beginning of this section, this algorithm implicitly assumes that A_{j+1,j} is non-singular. If this is not the case, the variant of this algorithm in Appendix B has been found to be useful in practice to circumvent this problem. More generally, this problem can be circumvented using an alternative definition of 𝓜.

Finally, note that if (11.3) for the case of j = n₁ were generalized from e_{n₁−1}ᵀA_{n₁−1,n₁} + e_{n₁}ᵀA_{n₁,n₁} = 0ᵀ to Σ_{j′=0}^{n₁} e_{j′}ᵀA_{j′,n₁} = 0ᵀ, then ALGORITHM 1 may be altered by replacing the second equation of (11.19) with fᵀ(Σ_{j′=0}^{n₁} R_{j′}A_{j′,n₁}) = 0ᵀ while keeping the rest of the algorithm unchanged.
2.2 Generalization to three dimensions and beyond

For three dimensions and beyond, a similar algorithm can be developed.

Algorithmic development: Let 𝓜 = {(m,k) : m ∈ M and k ∈ {0,1,...,n₀}}, where M = {m : m ∈ J with m_h = 0 for at least one h}. With a slight abuse of notation, it is useful to define the |M|(n₀+1) × (n₀+1) matrix R_j as a concatenation of (n₀+1) × (n₀+1) matrices, one for each value of m ∈ M.
Let fᵀ be a lexicographic ordering of the vector of steady state probabilities of being in each of the states in 𝓜, and let R_j be the corresponding concatenation of the matrices R_{m,j}. Then (11.10) is equivalent to

\[
e_j^T = \sum_{m\in M} f_m^T\, R_{m,j}. \qquad (11.21)
\]

Equation (11.11) (or equations (11.21) and (11.6)) implies

\[
\sum_{j'\in N(j)}\;\sum_{m\in M} f_m^T\, R_{m,j'}\, A_{j',j} = 0^T \qquad \text{for all } j \in J, \qquad (11.22)
\]

and (11.12) becomes

\[
\sum_{j'\in N(j)} R_{m,j'}\, A_{j',j} = 0 \qquad \text{for all } j \in J \text{ and } m \in M. \qquad (11.23)
\]
A solution to (11.13), also derived by evaluating (11.21) for j′ ∈ M, is

\[
R_{m,j'} = \chi_{\{j'=m\}}\, I \qquad \text{for all } j' \in M \text{ and } m \in M. \qquad (11.24)
\]
Using (11.23), equation (11.14) becomes

\[
R_{m,\,j+\mathbf{1}} = -\Big(\sum_{j'\in N(j),\; j'\ne j+\mathbf{1}} R_{m,j'}\,A_{j',j}\Big)\big(A_{j+\mathbf{1},\,j}\big)^{-1} \qquad \text{for all } j \notin n-M \text{ and all } m \in M, \qquad (11.25)
\]

where n = (n₁,...,n_H) and 1 = (1,...,1). Equations (11.15) and (11.16) become

\[
\sum_{m\in M} f_m^T\,\mathbf{1} = 1
\]

and

\[
\sum_{j'\in N(j)}\;\sum_{m\in M} f_m^T\, R_{m,j'}\, A_{j',j} = 0 \qquad \text{for all } j \in n-M. \qquad (11.26)
\]
Finally, (11.6) becomes

\[
e_{j+\mathbf{1}}^T = -\Big(\sum_{j'\in N(j),\; j'\ne j+\mathbf{1}} e_{j'}^T\,A_{j',j}\Big)\big(A_{j+\mathbf{1},\,j}\big)^{-1} \qquad \text{for all } j \notin n-M. \qquad (11.27)
\]

The resulting algorithm therefore follows:

ALGORITHM 2:
1. For all m ∈ M and j ∈ M: Initialize R_{m,j} (and hence R_j) using (11.24).
2. For j_H = 0,...,n_H−1: For j_{H−1} = 0,...,n_{H−1}−1: ... For j₁ = 0,...,n₁−1: For all m ∈ M: Solve for R_{m,j+1} using (11.25).
3. Solve for e_m for m ∈ 𝓜 (and hence f) using (11.26).
4. For j_H = 0,...,n_H−1: For j_{H−1} = 0,...,n_{H−1}−1: ... For j₁ = 0,...,n₁−1: Find e_{j+1} using (11.27).
5. For all j ∈ J: Reset e_j = e_j / Σ_{j′∈J} e_{j′}ᵀ1.

Computation time: This is an O(Max[(|J| − |M|)|M|(1+n₀)², (1+n₀)³|M|³]) algorithm, as it requires inverting (or computing LU factors for) (|J| − |M|)|M| different (1+n₀) × (1+n₀) tridiagonal matrices and one non-tridiagonal (1+n₀)|M| × (1+n₀)|M| matrix. Note that this requires that A_{j+1,j} be non-singular for all j. If this is not the case, then the problem must be reformulated, as is done in Appendix B. It can be shown that

\[
|M| = \prod_{h=1}^{H}(n_h+1) - \prod_{h=1}^{H} n_h, \qquad |\mathcal{M}| = |M|\,(1+n_0), \qquad |J| = \prod_{h=1}^{H}(n_h+1).
\]

This implies, for H = 1, |M| = 1, |𝓜| = 1 + n₀, and |J| = 1 + n₁. For H = 2, |M| = n₁ + n₂ + 1, |𝓜| = (n₁ + n₂ + 1)(1 + n₀), and |J| = (1 + n₁)(1 + n₂). Hence, for H = 1, ALGORITHM 2 requires O(Max[n₁n₀², n₀³]) multiplications, as previously computed, and for H = 2 the algorithm requires O(Max[n₁n₂(n₁ + n₂ + 1)(1 + n₀)², (n₁ + n₂ + 1)³(1 + n₀)³]). This is fewer multiplications than required by a Gaussian algorithm, O(n₀³n₁³n₂³), or a GTH algorithm, O(n₀(n₁n₂)³) (if put into a two dimensional framework with 1 + n₀ columns and (1 + n₁)(1 + n₂) rows, skip-free in the column direction only).

If (11.6) were generalized to Σ_{j′∈J} e_{j′}ᵀA_{j′,j} = 0ᵀ for j ∈ n − M, then ALGORITHM 2 can be modified to accommodate this generalization with a modification of (11.26) analogous to that discussed in the last paragraph of Section 2.1.
2.3 Non-skip-free Markov processes
This paper assumes that the Markov processes of interest are all skip-free. However, the following theorem and corollary demonstrate that this restriction can be removed without loss of generality.

Theorem 11.1 If X(t) is a finite single-dimension Markov process, then there exists a multi-dimension Markov process Z(t) such that (i) Z(t) is skip-free; (ii) there is a one-to-one mapping from Z(t) to X(t).
Proof. Let H be the smallest integer greater than the base 2 logarithm of the largest value of X(t). Define g{m) as the base two representation of m, i.e., g(m) = (go(m),... ,gH(m)) where Ylk=o9k{m)2k = m and each pfc(m) equals 0 or 1 for all k. For all values of t and t! set
Pr Z(i') = g(m') | Z_[t) - g(m)
= Pr\ X{t') = m' \ X(t) = m .
The one-to-one mapping between X(t) and Z(t) is reflected in the one-to-one mapping of the state spaces between m and (g_0(m), g_1(m), ..., g_H(m)). The process Z(t) inherits its Markov properties from X(t). By virtue of the base two representation, each component of Z(t) equals 0 or 1 and hence every transition changes each component by no more than 1. Hence, Z(t) is skip-free. ∎

There is a simple geometric interpretation of this theorem: the state space H, consisting of the vertices of an (H+1)-dimension hypercube, has the property that each state is within distance 1 of every other state in each coordinate direction. Hence, all Markov processes on H are skip-free. If each vertex of the hypercube is associated with a different state, then every Markov process is equivalent to one on H, and the theorem follows. The following corollary demonstrates that the assumption that X(t) is one-dimensional can also be relaxed.

COROLLARY 11.1 If X(t) is a finite multi-dimension Markov process then there exists a multi-dimension Markov process Z(t) such that (i) Z(t) is skip-free; (ii) there is a one-to-one mapping from Z(t) to X(t).
Proof. Since all finite multi-dimension Markov processes are equivalent to a renumbered single-dimension Markov process, the corollary follows directly from the theorem. ∎
11 Fast Solutions to Telecommunication Motivated Birth-Death Systems 283
Using this theorem and corollary one can always construct a skip-free multi-dimension Markov process equivalent to an arbitrary multi-dimension Markov process. Hence the skip-free assumption imposed throughout this paper can be removed without loss of generality. Of course, the construction of a skip-free Markov process in the theorem and corollary need not be the best construction to use.
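The embedding in the proof of Theorem 11.1 is easy to make concrete. The sketch below is our own illustration, not the chapter's code: it maps each state m of a finite single-dimension chain to its base-two vector g(m). Because every component is 0 or 1, any transition changes each component by at most 1, which is exactly the skip-free property.

```python
# Sketch of the Theorem 11.1 embedding: states 0..S-1 of a single-dimension
# chain are mapped to distinct hypercube vertices via the base-two
# representation g(m) = (g_0(m), ..., g_H(m)). All names here are ours.

def binary_state(m, H):
    """g(m): bits g_k(m) with sum_k g_k(m) * 2**k == m."""
    return tuple((m >> k) & 1 for k in range(H + 1))

def embed(num_states):
    """Map states 0..num_states-1 to hypercube vertices. bit_length plays
    the role of the H chosen in the proof (enough bits for every state)."""
    H = max(1, (num_states - 1).bit_length())
    return {m: binary_state(m, H) for m in range(num_states)}
```

Transition probabilities carry over verbatim: Pr[Z(t') = g(m') | Z(t) = g(m)] is defined to equal Pr[X(t') = m' | X(t) = m].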
3. Examples
This paper began with an optical IP access network example to motivate this analysis. In Section 3.1 more details are provided for its analysis. Sections 3.2 and 3.3 present two wireless performance analysis examples which cannot be addressed using a pure analytic approach due to their non-classical structure. Other examples can be found in Servi (2002).
3.1 The optical IP network access problem
The optical IP network access problem can be modeled as N queues, corresponding to the N access routers. The system approach taken to address the reconfiguration problem in Servi and Finn (2002) is to assign router i a nominal permanent bandwidth giving it the capacity to route traffic at an average rate of μ_i. In addition, roving wavelengths, with the capacity to increase the routing rate of any single router by an average of μ*, will be initially assigned to router 1, will be assigned to router 2 after router 1's queues empty, and will be cyclically allocated to all of the routers. It is assumed that a random time Δ is required for the reconfiguration to take place as the roving wavelengths are retuned from one router to the next. This approach can be modeled using a token that cyclically rotates among the queues, switching only when a queue is empty. Queue i is assumed to have a Poisson arrival rate of traffic with a mean rate of λ_i and a nominal exponentially distributed service time² with a mean rate of μ_i if it does not have the token and a mean rate of μ_i + μ* if it does. In addition, the reconfiguration period is assumed to be exponentially distributed with a mean rate of δ, and during this period no router can use the capacity of the roving wavelengths. We denote j_i for i = 1, ..., N to be the number of messages in queue at router i. We define the state variables j_0 and j_{N+1} such that j_{N+1}

² The exponential assumption is used for simplicity. A more complicated finite moment service distribution could be better approximated by having a set of states corresponding to different phases of a distribution. For example, an Erlang-K distribution could be trivially modeled.
equals 0 if the roving wavelengths are assigned to router j_0 and j_{N+1} equals 1 if the roving wavelengths were last assigned to router j_0 but are currently in the process of being reconfigured to be assigned to router j_0 + 1. Hence, n_0 = N, n_{N+1} = 1, and the other n_i are set to be large relative to the expected value of the queue length at router i. The state of the system is the (N + 2)-dimensional vector j = (j_0, j_1, ..., j_{N+1}). The steady state equilibrium equations are given by (11.6) where e_j is the steady state probability of being in state j and, for i = 1, ..., N, with j_ℓ = k_ℓ for ℓ ≠ i except where noted,

A_{j,k} =
  λ_i        if k_i = j_i + 1 ≤ n_i,
  μ_i        if k_i = j_i − 1, j_i ≥ 1, and j_0 ≠ i or j_{N+1} = 1,
  μ_i + μ*   if k_i = j_i − 1, j_i ≥ 2, j_0 = i, j_{N+1} = 0,
  μ_i + μ*   if k_i = 0, j_i = 1, j_0 = i, j_{N+1} = 0, k_{N+1} = 1,
  δ          if j_{N+1} = 1, k_{N+1} = 0, k_0 = (j_0 + 1) mod N,
and is otherwise 0. The first case of the above expression corresponds to an arrival at router i. The second corresponds to a service at router i when the roving wavelengths are not at router i. The third corresponds to a service at router i when the roving wavelengths are at router i and the queue length is at least 2. The fourth corresponds to a service at router i when the roving wavelengths are at router i that results in the queue emptying, so that the roving wavelengths enter the process of being reconfigured to be assigned to router (i + 1) mod N. The fifth corresponds to the roving wavelengths completing their reconfiguration to the next router. The steady state equations associated with this example can be solved using ALGORITHM 2 and then quality of service performance measures can be computed to find the best allocation of the roving wavelengths subject to the constraint that Σ_i μ_i + μ* is fixed. One useful measure of performance is a weighted average of the mean queue lengths, i.e., Σ_j Σ_i w_i j_i e_j where w_i is the weight of queue i. Since the variable to optimize is both robust and one-dimensional, the optimization can be performed by plotting the performance measure as a function of μ* and quite accurately estimating a near optimal μ* by inspection.
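For concreteness, the five transition types described in the text can be encoded directly as a rate function on the state vector (token position, queue lengths, retuning flag). The sketch below is our own illustration with hypothetical names, not the chapter's code; the matrix blocks fed to ALGORITHM 2 would be assembled from such a function.

```python
# Transition-rate sketch for the roving-wavelength (token) model.
# state = (token, q_0, ..., q_{N-1}, retuning): token = router holding the
# roving wavelengths, retuning = 1 while they are being reconfigured.
# lam, mu are per-router rate lists; mu_star and delta as in the text.

def rate(state, new, lam, mu, mu_star, delta, cap):
    """Rate of jumping from `state` to `new`; 0 if not a neighbor."""
    N = len(lam)
    token, q, ret = state[0], state[1:N + 1], state[N + 1]
    ktoken, kq, kret = new[0], new[1:N + 1], new[N + 1]
    # retuning completes: wavelengths move to the next router cyclically
    if ret == 1 and kret == 0 and ktoken == (token + 1) % N and kq == q:
        return delta
    if ktoken != token:
        return 0.0
    for i in range(N):
        if any(q[t] != kq[t] for t in range(N) if t != i):
            continue  # a single-router event must leave other queues alone
        if kq[i] == q[i] + 1 and kq[i] <= cap[i] and kret == ret:
            return lam[i]                           # arrival at router i
        if kq[i] == q[i] - 1 and q[i] >= 1 and kret == ret and (token != i or ret == 1):
            return mu[i]                            # service, wavelengths elsewhere
        if kq[i] == q[i] - 1 and q[i] >= 2 and token == i and ret == 0 and kret == 0:
            return mu[i] + mu_star                  # service with the wavelengths
        if q[i] == 1 and kq[i] == 0 and token == i and ret == 0 and kret == 1:
            return mu[i] + mu_star                  # queue empties; retuning begins
    return 0.0
```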
3.2 The retry cellular voice problem with limited queuing
Consider an N channel wireless voice base station which has an arrival rate of λ and a service rate μ. Suppose that a call arriving and finding
all channels busy will enter a queue and either (i) a channel frees up and the call (immediately) begins service, (ii) the call's constant hazard rate ν of abandoning results in it leaving the system unserved, or (iii) the queueing period, having an a priori exponential distribution with mean τ⁻¹, ends with the customer unserved but still in the system. In the third case, with probability 1 − p the call will immediately exit the system and with probability p the call will redial after an exponentially distributed amount of time with mean η⁻¹. The redialed call will enter service immediately if a channel is available, or will repeat the three options facing a fresh call that finds all channels busy. This model corresponds to a cellular system in which there is a limited time period (of average length τ⁻¹) during which blocked calls are rapidly and automatically redialed, the customer might abandon the wait for a channel with hazard rate ν, and a customer will redial with probability p. Let i be the number of calls in the system, i.e., in service, in queue or in the redial pool, and j the number of calls in the pool which will redial. Finally, assume that the maximum number in the pool is n_1 and the maximum number in the system is truncated at i = n_0.³ The balance equations at (i, j) for this model can be put in the form of (11.2) and (11.3) as follows: For j = 0, 1, ..., n_1,
[A_j]_{i,k} =
  min(i − j, N) μ + max(i − j − N, 0)(ν + (1 − p)τ)   if k = i − 1,
  λ                                                    if k = i + 1 and i < n_0,
  0                                                    otherwise,

and

[A_{j,j+1}]_{i,k} =
  max(i − j − N, 0) p τ   if k = i and j < n_1,
  0                        otherwise.
Note that, to be non-trivial, the maximum number in the system must exceed the maximum number in the redial pool plus the number of channels, i.e., n_0 > n_1 + N.
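When setting up balance equations of the form (11.2)-(11.3), it is useful to have a brute-force cross-check: assemble the full generator Q and solve πQ = 0 with Σπ = 1 directly. The direct solve costs cubic time, which is precisely what the algorithms in this chapter avoid, but it validates a model at small sizes. A minimal self-contained sketch (our code, not the chapter's):

```python
# Brute-force steady-state solver: Q is a generator (off-diagonal entries are
# transition rates, each row sums to 0). Solve pi Q = 0, sum(pi) = 1 by
# replacing one balance equation with the normalization constraint.

def steady_state(Q):
    n = len(Q)
    # Columns of Q give the balance equations for pi; transpose into A x = b
    # and overwrite the last equation with sum(pi) = 1.
    A = [[Q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi
```

For example, for an M/M/1/2 queue with λ = 1 and μ = 2 this returns the truncated geometric distribution (4/7, 2/7, 1/7).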
The steady state equations associated with this example can be solved using ALGORITHM 1 and then the following quality of service performance measures can be computed:

i) The probability that a fresh arrival finds all channels busy is Σ_{j=0}^{n_1} Σ_{i=j+N}^{n_0} e_{j,i}.

ii) The probability of a random retry call being blocked is the rate at which retry calls are blocked divided by the rate at which they are generated, i.e., Σ_{j=1}^{min(n_0−N, n_1)} Σ_{i=j+N}^{n_0} j e_{j,i} / Σ_{j=1}^{n_1} Σ_{i=j}^{n_0} j e_{j,i}.

iii) The probability of getting service (after being served from the queue, immediately upon arrival, or after redialing) equals the total rate of service initiations divided by the rate of external calls, λ.

iv) The average number of retry attempts per call that is not served immediately equals the rate of retry attempts divided by the rate at which fresh calls arrive and find all channels busy, i.e., N_s = Σ_{j=1}^{min(n_1, n_0−N)} Σ_{i=j+N}^{n_0} j η e_{j,i} / (λ Σ_{j=0}^{n_1} Σ_{i=j+N}^{n_0} e_{j,i}).

v) The average time waiting due to retrying before exiting is W = N_s η⁻¹, the number of retry attempts multiplied by the mean time between attempts.
3.3 The retry cellular voice problem with numbered reserve channels
Consider a base station with n_1 universal channels and n_0 reserve channels, an arrival rate λ_n of new calls, an arrival rate λ_h of hand-in calls, a service rate μ, and a hazard rate of redialing η. Assume that a blocked hand-in call will not redial, that a hand-in call will prefer a universal channel to a reserve channel (if both are available) but can use a reserve channel, and that a new call cannot be admitted to a reserve channel.
Let i be the number of hand-in calls in a reserve channel currently in service, j_1 the number of hand-in or new calls in a universal channel currently in service, j_2 the number of new calls in the pool which will redial (the pool having a maximum capacity of n_2), p the probability that a blocked fresh new call will redial, and p̄ the probability that a blocked redialed new call will redial again. This model represents a more typical implementation of a reserve channel access protocol found in practice and is modeled as follows: For all j, i, and k,
A_{(j_1,j_2),(k_1,k_2)} =
  λ_n + λ_h       if k_1 = j_1 + 1, k_2 = j_2, j_1 = 0, ..., n_1 − 1,
  λ_n p           if k_1 = j_1, k_2 = j_2 + 1, j_1 = n_1,
  j_1 μ           if k_1 = j_1 − 1, k_2 = j_2, j_1 = 1, ..., n_1,
  j_2 η           if k_1 = j_1 + 1, k_2 = j_2 − 1, j_1 = 0, ..., n_1 − 1,
  j_2 η (1 − p̄)   if k_1 = j_1, k_2 = j_2 − 1, j_1 = n_1,

together with, for the reserve channels,
  λ_h   if k = i + 1, j_1 = n_1, i ≠ n_0,
  i μ   if k = i − 1,

and all other components of A_{(j_1,j_2),(k_1,k_2)} equal 0.
Setting E_{j_1} = Σ_{i=0}^{n_0} Σ_{j_2=0}^{n_2} e_{(j_1,j_2),i}, performance measures of interest can be computed. The probability that a new call receives service on its first attempt is 1 − E_{n_1}. The probability that a hand-in call receives service on its first attempt is Σ_{{j_1 ≠ n_1 or i ≠ n_0}} e_{(j_1,j_2),i}. The probability that a new call receives service, possibly after one or more redialings, is the total rate of successful new calls divided by the total rate of external new calls, λ_n. The average number of attempts of each successful new call, N_s, is the total rate of attempts by new callers that stay in the system divided by the rate of calls that are successful,
and the average time that a new call that is successful waits is W = N_s η⁻¹.

For a fixed number of total channels the design problem is that of determining the number of reserve channels. For reasonable parameters the typical desirable number of reserve channels was 0, 1, or 2, and hence heuristic optimization methods were used.

Acknowledgments. This work was conducted while the author worked at GTE Laboratories (now Verizon Laboratories).
Appendix: A. Numerical considerations

Equation (11.3) can be solved recursively as described in ALGORITHM 1. However, there is a practical numerical problem that must be addressed: since the probabilities for high values of j may be many orders of magnitude smaller than those for low values of j, numerical instability may set in. A useful remedy follows. The basic approach is to rescale the probabilities to eliminate the numerical instability, solve the modified equations, and then return to the original scaling. Specifically, let

e_j = z_j Π_{k=1}^{j} r_k;   (11.A.1)

then, after substituting (11.A.1) into equation (11.3) and dividing by Π_{k=1}^{j} r_k, we get

χ_{j≠0} r_j^{−1} z_{j−1} A_{j−1,j} + z_j A_{j,j} + χ_{j≠m} r_{j+1} z_{j+1} A_{j+1,j} = 0.   (11.A.2)

If, for example,

r_j = (1^T A_{j−1,j} 1) / (1^T A_{j,j−1} 1),   (11.A.3)

where 1 is a vector of 1's, then the z_j in (11.A.2) are of comparable magnitude. Hence, it is this equation, and not (11.3), which should be solved numerically, and (11.A.1) should then be used to obtain the final solution. Note that if the A's (and hence the A_{j,j+1}'s and A_{j,j−1}'s) are scalars, this choice reduces (11.A.2) to χ_{j≠0} z_{j−1} A_{j,j−1} + z_j A_{j,j} + χ_{j≠m} z_{j+1} A_{j,j+1} = 0 which, from (11.1), has the solution z_j = 1 for all j. An alternative to this scaling is to replace (11.A.3) with r_j = ||w^T A_{j−1,j}|| / ||w^T A_{j,j−1}||, where w might be 1 or (1, 0, ..., 0)^T or some other vector. This rescaling approach can be generalized to apply to the other algorithms in this paper.
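The effect of the rescaling is easy to see on a scalar birth-death chain, where the unnormalized probabilities e_j = e_0 Π_{k≤j} λ_{k−1}/μ_k can span many orders of magnitude: choosing r_j as the detailed-balance ratio λ_{j−1}/μ_j makes every rescaled z_j equal. The sketch below is our own illustration of that idea, not the chapter's algorithm.

```python
# Rescaling check on a scalar birth-death chain. lam[j] is the birth rate out
# of state j; mu[j] is the death rate out of state j (mu[0] unused). With
# r_j = lam[j-1]/mu[j], the rescaled z_j = e_j / prod_{k<=j} r_k are constant.

def birth_death_probs(lam, mu):
    """Unnormalized steady-state e_j, j = 0..len(lam)."""
    e = [1.0]
    for j in range(1, len(lam) + 1):
        e.append(e[-1] * lam[j - 1] / mu[j])
    return e

def rescaled(e, lam, mu):
    """z_j = e_j / prod_{k=1..j} r_k with r_k = lam[k-1]/mu[k]."""
    z, prod = [e[0]], 1.0
    for j in range(1, len(e)):
        prod *= lam[j - 1] / mu[j]
        z.append(e[j] / prod)
    return z
```

Even with strongly unbalanced rates (tiny λ, large μ), the z_j stay at order one while the raw e_j collapse toward zero, which is the point of the rescaling.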
Appendix: B. A two dimension variant of ALGORITHM 1

For many practical applications,

[A_{j+1,j}]_{k',k} = 0   for k' = 0, ..., n*, j = 0, ..., m − 1, and all k,   (11.B.1)

for some value of n* ≥ 1. In this case, A_{j+1,j} is singular so ALGORITHM 1 is not viable. Instead one must identify an equivalent formulation of (11.3) or (11.6) for which the corresponding matrix is non-singular. One useful approach is to define Ē = (ē_0, ..., ē_m) as

ē_{j,k} = e_{j−1,k}   for k = 0, ..., n*,
ē_{j,k} = e_{j,k}     for k = n* + 1, ..., n_0,   j = 0, ..., m,   (11.B.2)

where ē_j is 1 × (n_0 + 1) if j ∈ J and, again, all indices are modulus n_1 + 1 so, by definition,

ē_0 = (e_{n_1,0}, e_{n_1,1}, ..., e_{n_1,n*}, e_{0,n*+1}, ..., e_{0,n_0}).
If

[Ā_{j',j}]_{i',i} =
  χ_{j≠0} [A_{j−1,j}]_{i',i}   for j' = j − 1, i' = n* + 1, ..., n_0,
  [A_{j−1,j}]_{i',i}            for j' = j, i' = 0, ..., n*,
  [A_{j,j}]_{i',i}              for j' = j, i' = n* + 1, ..., n_0,
  [A_{j,j}]_{i',i}              for j' = j + 1, i' = 0, ..., n*,
  [A_{j+1,j}]_{i',i}            for j' = j + 1, i' = n* + 1, ..., n_0,
  0                             otherwise,

then using (11.B.1) and (11.B.2) one can verify that (11.3) is equivalent to

χ_{j≠0} ē_{j−1} Ā_{j−1,j} + ē_j Ā_{j,j} + ē_{j+1} Ā_{j+1,j} = 0^T,   j = 0, ..., m.   (11.B.3)
With these definitions an algorithm can be developed to solve (11.B.3) (and hence (11.3)) which, ironically, is faster than ALGORITHM 1.

Algorithmic Development:
Let M = {(0, n* + 1), (0, n* + 2), ..., (0, n_0)} so

ē_M = (e_{0,n*+1}, e_{0,n*+2}, ..., e_{0,n_0})   (11.B.4)

and, from (11.10),

ē_j = f̄ R̄_j   for j ∈ J,   (11.B.5)

where R̄_j is (n_0 − n*) × (n_0 + 1). Equation (11.13) implies

e_{0,i} = Σ_{p=0}^{n_0−n*−1} e_{0,p+n*+1} [R̄_0]_{p,i}   for i = n* + 1, ..., n_0,

[R̄_0]_{p,i} = χ_{i=p+n*+1}   for i = n* + 1, ..., n_0 and p = 0, ..., n_0 − n* − 1.   (11.B.6)
The algorithm below uses R̄_0 only by premultiplying by Ā_{0,n_1}, which, from (11.B.1), contains only zeros in the first n* + 1 rows. Hence, it is not necessary for equation (11.B.6) to define the first n* + 1 columns of R̄_0. Equation (11.14) implies

R̄_{j+1} = ( −χ_{j≠0} R̄_{j−1} Ā_{j−1,j} − R̄_j Ā_{j,j} ) Ā_{j+1,j}^{−1},   j = 0, ..., m − 1,   (11.B.7)

and

R̄_{n_1−1} Ā_{n_1−1,n_1} + R̄_{n_1} Ā_{n_1,n_1} + R̄_0 Ā_{0,n_1} = 0   componentwise for p = 0, ..., n_0 − n* − 1 and k = 0, ..., n*.   (11.B.8)

Equations (11.15) and (11.16) carry over with R_j replaced by R̄_j, yielding the equations determining f̄.   (11.B.9)

Finally, equation (11.B.3) implies

ē_{j+1} = ( −χ_{j≠0} ē_{j−1} Ā_{j−1,j} − ē_j Ā_{j,j} ) Ā_{j+1,j}^{−1}   for j = 0, ..., m − 1.   (11.B.10)
The resulting algorithm is:

ALGORITHM 1':
1. For i = 0, ..., n_0 and p = 0, 1, ..., n_0 − n* − 1: Initialize some components of R̄_0 using (11.B.6).
2. For j = 1, ..., m: Solve for R̄_j using (11.B.1), (11.B.6), and (11.B.7).
3. Find f̄ using (11.19).
4. For j = m, ..., 0: Solve for ē_j using (11.B.4) and (11.B.10).
5. For j = m, ..., 0: Solve for e_j using (11.B.2).
6. For j = 0, ..., m: Reset e_j = e_j / Σ_{j'=0}^{m} Σ_i e_{j',i}, i.e., normalize the probabilities.

This algorithm requires O(max[n_1 n_0, (n_0 − n* + 1)^3]) multiplications, as step 2 requires inverting (or computing LU factors for) m different n_0 × n_0 tridiagonal matrices (taking O(n_1 n_0)), step 3 inverts one non-tridiagonal (n_0 − n* + 1) × (n_0 − n* + 1) matrix (taking O((n_0 − n* + 1)^3)), and step 4 multiplies an n_0 × 1 vector by an n_0 × n_0 matrix 3n_1 times (taking O(n_1 n_0)). This is fewer multiplications than is required for a Gaussian algorithm, O((n_1 n_0)^3), a Block Gaussian algorithm, O(n_1 n_0^3), the GTH algorithm, O(n_1 n_0^3), or even ALGORITHM 1, O(max[n_1 n_0^2, n_0^3]). In addition to storing the A's, the limiting memory requirement for this algorithm is the storage of 3 different (n_0 − n*) × (n_0 + 1) R̄ matrices for steps 2 and 3.
Appendix: C. A fast algorithm for solving AX = B if A is tridiagonal and non-singular

The algorithms in this paper require repetitively solving

AX = B   (11.C.1)

with O(mn) multiplications, where A is an (n + 1) × (n + 1) non-singular tridiagonal matrix whose (i,k)th component is a_{i,k}, X is an (n + 1) × (m + 1) matrix whose (i,p)th component is x_{i,p}, and B is an (n + 1) × (m + 1) matrix whose (i,p)th component is b_{i,p}. To be consistent with the rest of this paper it will be assumed that the matrix indices start at 0 rather than 1. To implement the algorithm there is a substantial memory saving if the structure of A is exploited by representing A using three vectors, a⁺, a⁰, and a⁻, whose ith components are a_i⁺ = χ_{i≠n} a_{i,i+1}, a_i⁰ = a_{i,i}, and a_i⁻ = χ_{i≠0} a_{i,i−1}. First an O(mn) algorithm will be described that is applicable only to the case of a_i⁺ ≠ 0 for all i. This algorithm will then form the basis for an equally fast algorithm that does not rely on this assumption. Proofs of correctness will follow the second algorithm.
CASE 1: a_i⁺ ≠ 0 for all i. Define c_{j,p} and d_j such that

x_{j,p} = c_{j,p} + d_j x_{0,p},   j = 0, ..., n.   (11.C.2)
The algorithm below will first recursively solve for the c_{j,p}'s and d_j's and then find a self-consistency equation to solve for x_{0,p}. This is done as follows:
ALGORITHM C.1:

STEP 0. Let d_0 = 1 and for p = 0, ..., m: Let c_{0,p} = 0.

STEP 1. For i = 0, ..., n − 1 (with c_{−1,p} = d_{−1} = 0): SET

c_{i+1,p} = ( b_{i,p} − a_i⁻ c_{i−1,p} − a_i⁰ c_{i,p} ) / a_i⁺   for p = 0, ..., m,   (11.C.3)

and SET

d_{i+1} = ( −a_i⁻ d_{i−1} − a_i⁰ d_i ) / a_i⁺.   (11.C.4)

STEP 2. For p = 0, ..., m: SET

x_{0,p} = ( b_{n,p} − a_n⁻ c_{n−1,p} − a_n⁰ c_{n,p} ) / ( a_n⁻ d_{n−1} + a_n⁰ d_n ).   (11.C.5)

STEP 3. For p = 0, ..., m and for j = 1, ..., n: Solve for x_{j,p} using (11.C.2) and (11.C.5).

Before proving the correctness of ALGORITHM C.1 we will proceed to demonstrate how to generalize it to the case of a_i⁺ = 0 for some i.
CASE 2: a_{i_ℓ}⁺ = 0 for ℓ = 1, ..., L and some L ≥ 1, and a_i⁺ ≠ 0 for all other values of i.

ALGORITHM C.2:

STEP 0. Let i_0 = −1 and i_{L+1} = n.

STEP 1. For ℓ = 0, ..., L: Apply ALGORITHM C.1 to solve

A^{(ℓ)} X^{(ℓ)} = B^{(ℓ)},   (11.C.6)

where

[A^{(ℓ)}]_{i,j} = a_{i+i_ℓ+1, j+i_ℓ+1}   if i = 0, ..., i_{ℓ+1} − i_ℓ − 1 and j = i − 1, i, or i + 1, and 0 otherwise,   (11.C.7)

and

[B^{(ℓ)}]_{i,p} = b_{i+i_ℓ+1,p} − χ_{ℓ≠0} a_{i_ℓ+1, i_ℓ} x_{i_ℓ,p}   if i = 0 and p = 0, ..., m,
[B^{(ℓ)}]_{i,p} = b_{i+i_ℓ+1,p}   if i = 1, ..., i_{ℓ+1} − i_ℓ − 1 and p = 0, ..., m.   (11.C.8)

STEP 2. The solution to (11.C.1) is given by

x_{i+i_ℓ+1,p} = [X^{(ℓ)}]_{i,p}   for i = 0, ..., i_{ℓ+1} − i_ℓ − 1, p = 0, ..., m, and ℓ = 0, ..., L.   (11.C.9)
REMARK 11.1 That ALGORITHM C.1 is O(nm) implies that ALGORITHM C.2 is O(nm). Note that if m = 0 then B, and hence X, will be vectors. Note also that both algorithms naturally generalize to the case where A is not tridiagonal but has only zeros above the superdiagonal. One alternative to ALGORITHM C.2 is to use an LU decomposition algorithm specialized to tridiagonal matrices, cf. Section 2.6 of Press et al. (1988). However, that approach can fail on a non-singular tridiagonal matrix if a zero pivot is encountered.

Two issues must be addressed to establish the correctness of ALGORITHM C.1. One must demonstrate that the algorithm never divides by zero. As this is a more technical issue, its proof is deferred to Lemma 11.4. The more critical correctness is demonstrated in Lemma 11.1 and Lemma 11.2.

LEMMA 11.1 Equations (11.C.2) and (11.C.5) uniquely solve (11.C.1) if A is nonsingular, tridiagonal, and a_i⁺ ≠ 0 for all i.

Proof. From (11.C.1), for i = 0, ..., n,

b_{i,p} = a_i⁻ x_{i−1,p} + a_i⁰ x_{i,p} + a_i⁺ x_{i+1,p},   (11.C.10)
which, from (11.C.2), implies

a_i⁻ ( c_{i−1,p} + d_{i−1} x_{0,p} ) + a_i⁰ ( c_{i,p} + d_i x_{0,p} ) + a_i⁺ ( c_{i+1,p} + d_{i+1} x_{0,p} ) = b_{i,p}.   (11.C.11)

Equations (11.C.3) and (11.C.4) follow from (11.C.11) by matching the constants and the coefficients of x_{0,p} for i = 0, ..., n − 1. Equation (11.C.5) follows from evaluating (11.C.11) for i = n and solving for x_{0,p}. ∎

Before demonstrating the correctness of ALGORITHM C.2, observe by construction that each matrix A^{(ℓ)} in (11.C.6) is tridiagonal with exclusively non-zero elements just above the diagonal so, indeed, ALGORITHM C.1 is applicable. The correctness of ALGORITHM C.2 follows from Lemma 11.2:

LEMMA 11.2 Equations (11.C.1) and (11.C.6) are equivalent.

Proof. Equations (11.C.6)-(11.C.9) imply that for i = 0, ..., i_{ℓ+1} − i_ℓ − 1,

a_{i+i_ℓ+1, i+i_ℓ} [X^{(ℓ)}]_{i−1,p} + a_{i+i_ℓ+1, i+i_ℓ+1} [X^{(ℓ)}]_{i,p} + a_{i+i_ℓ+1, i+i_ℓ+2} [X^{(ℓ)}]_{i+1,p} = b_{i+i_ℓ+1,p} − χ_{ℓ≠0, i=0} a_{i_ℓ+1, i_ℓ} x_{i_ℓ,p}.

After the substitution of variable j = i + i_ℓ + 1, this becomes the jth equation of (11.C.1) for j = i_ℓ + 1, ..., i_{ℓ+1}. Moreover, since a_{i_ℓ}⁺ = 0, it follows that [X^{(ℓ)}]_{j−i_ℓ−1,p} = x_{j,p} and [X^{(ℓ−1)}]_{i_ℓ−i_{ℓ−1}−1,p} = x_{i_ℓ,p}, so this implies: for j = i_ℓ + 1, ..., i_{ℓ+1},

a_j⁻ x_{j−1,p} + a_j⁰ x_{j,p} + a_j⁺ x_{j+1,p} = b_{j,p},

which is equivalent to (11.C.10). But, as shown in Lemma 11.1, this is equivalent to (11.C.1). ∎

For completeness, one must establish that ALGORITHM C.1 never incurs division by zero. As a means to address this we first prove the following lemma.

LEMMA 11.3 d_0 = 1 and equation (11.C.4) imply

Det(A^{[i−1]}) = (−1)^i d_i Π_{k=0}^{i−1} a_k⁺,   i = 1, ..., n,   (11.C.12)

where A^{[i]} is defined as the (i + 1) × (i + 1) matrix consisting of the 0, ..., i rows and columns of the matrix A.

Proof. By induction. Clearly, (11.C.12) is true for i = 1. Suppose it is true for i = 1, ..., t. The matrix A^{[t]} has two non-zero elements in its tth row. Hence, expanding Det(A^{[t]}) along this row (cf. Sections 2.4.8 and 2.2 of Marcus and Minc, 1964),

Det(A^{[t]}) = −a_t⁻ Det(Ā^{[t−1]}) + a_t⁰ Det(A^{[t−1]}),
where Ā^{[t−1]} is the matrix A^{[t]} with the tth row and (t − 1)th column deleted. Since Ā^{[t−1]} has only one non-zero element in its last column, expanding Det(Ā^{[t−1]}) along that column gives us Det(Ā^{[t−1]}) = a_{t−1}⁺ Det(A^{[t−2]}). Hence,

Det(A^{[t]}) = −a_t⁻ a_{t−1}⁺ Det(A^{[t−2]}) + a_t⁰ Det(A^{[t−1]}).

From the inductive hypothesis, (11.C.12) is valid for i = t and i = t − 1, so applying (11.C.4) implies (11.C.12) for i = t + 1. ∎
Finally, for completeness, we demonstrate below that ALGORITHM C.1 cannot incur division by zero if A is non-singular.

LEMMA 11.4 If A is non-singular then the denominator of (11.C.5) is non-zero.

Proof. Using nearly identical reasoning as in Lemma 11.3,

Det A = Det(A^{[n]}) = (−1)^n ( a_n⁻ d_{n−1} + a_n⁰ d_n ) Π_{k=0}^{n−1} a_k⁺.

Since, by assumption, Det A ≠ 0 and the product is non-zero, the lemma follows. ∎
References

Engeln-Müllges, G. and Uhlig, F. (1996). Numerical Algorithms with Fortran. Springer.
Gaver, D.P., Jacobs, P.A., and Latouche, G. (1984). Finite birth-and-death models in randomly changing environments. Advances in Applied Probability, 16:715-731.
Grassmann, W.K. and Heyman, D.P. (1990). Equilibrium distribution of block-structured Markov chains with repeating rows. Journal of Applied Probability, 27:557-576.
Grassmann, W.K., Taksar, M.I., and Heyman, D.P. (1985). Regenerative analysis and steady state distributions for Markov chains. Operations Research, 33:1107-1116.
Keilson, J. (1965). Green's Function Methods in Probability Theory. Griffin's Statistical Monographs & Courses, London.
Keilson, J., Sumita, U., and Zachmann, M. (1981). Row-continuous finite Markov chains: structures and algorithms. Graduate School of Management, University of Rochester, Working Paper No. 8115.
Keilson, J., Sumita, U., and Zachmann, M. (1987). Row-continuous finite Markov chains: structures and algorithms. Journal of the Operations Research Society of Japan, 3:291-314.
Marcus, M. and Minc, H. (1964). A Survey of Matrix Theory and Matrix Inequalities. Dover.
Narula-Tam, S., Finn, G., and Medard, M. (2001). Analysis of reconfiguration in IP over WDM access networks. In: Proceedings of the Optical Fiber Communication Conference (OFC), pp. MN4.1-MN4.3.
Neuts, M. (1981). Matrix Geometric Solutions in Stochastic Models: An Algorithmic Approach. Johns Hopkins University Press, Baltimore, MD.
Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. (1988). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.
Servi, L.D. (2002). Algorithmic solutions to two-dimensional birth-death processes with applications to capacity planning. Telecommunication Systems, 21(2-4):205-212.
Servi, L.D. and Finn, S.G. (2002). M/M/1 queues with working vacations (M/M/1/WV). Performance Evaluation, 50:41-52.
Servi, L.D., Gerhardt, T., and Humair, S. (2004). Fast, accurate solutions to large birth-death problems. In preparation.
Stewart, W.J. (1994). Introduction to the Numerical Solution of Markov Chains. Princeton University Press.
Chapter 12

A NEW PARADIGM FOR ON-LINE MANAGEMENT OF COMMUNICATION NETWORKS WITH MULTIPLICATIVE FEEDBACK CONTROL

Haining Yu
Christos G. Cassandras

Abstract
We describe the use of Stochastic Flow Models (SFMs) for control and optimization (rather than performance analysis) of computer networks. After reviewing earlier work applying Infinitesimal Perturbation Analysis (IPA) to SFMs without feedback or with additive feedback, we consider systems operating with a multiplicative feedback control mechanism. Using IPA, we derive gradient estimators for loss and throughput related performance metrics with respect to a feedback gain parameter and show their unbiasedness. We also illustrate the use of these estimators in network control by combining them with standard gradient-based stochastic approximation schemes and providing several simulation-based examples.
1. Introduction
A natural modelling framework for computer networks is provided by Discrete Event Systems (DES), most notably through queueing theory, e.g., Kleinrock (1975). However, it has become increasingly difficult for traditional queueing theory to handle the complexity of today's computer networks. First of all, the enormous traffic volume in today's Internet (which is still growing) makes packet-by-packet queueing analysis infeasible. Moreover, the discovery of self-similar patterns in the Internet traffic distribution (see Leland et al., 1993) and the resulting inadequacies of Poisson traffic models (see Paxson and Floyd, 1995) further complicate queueing analysis. Consequently, performance analysis techniques that do not depend on detailed traffic distributional information are highly
desirable. Fluid models have thus become increasingly attractive. The argument leading to the popularity of fluid models is that random phenomena may play different roles at different time scales. When the variations on the faster time scale have less impact than those on the slower time scale, the use of fluid models is justified. The efficiency of a fluid model rests on its ability to aggregate multiple events. By ignoring the micro-dynamics of each discrete entity and focusing on the change of the aggregated flow rate instead, a fluid model allows the aggregation of events associated with the movement of multiple packets within a time period of a constant flow rate into a single rate change event. Introduced by Anick et al. (1982) and later proposed by Kobayashi and Ren (1992) for the analysis of multiplexed data streams and by Cruz (1991) for network performance analysis, fluid models have been shown to be especially useful for simulating various kinds of high speed networks (see Kesidis et al., 1996; Kumaran and Mitra, 1998; Liu et al., 1999; Yan and Gong, 1999). Stochastic Flow Models (SFMs) have the extra feature that the flow rates are treated as general stochastic processes, which distinguishes them from the approach adopted in Akella and Kumar (1986) and other work, e.g., Perkins and Srikant (1999, 2001). While the aggregation property of SFMs brings efficiency to performance analysis, the resulting accuracy depends on traffic conditions, the structure of the underlying network, and the nature of the performance metrics of interest. On the other hand, SFMs often capture the critical features of the underlying "real" network, which is useful in solving control and optimization problems. In control and optimization, e.g., Kelly et al. (1998) and Low (2000), estimating the gradients of given performance metrics with respect to key parameters becomes an essential task.
Perturbation Analysis (PA) methods (see Ho and Cao, 1991; Cassandras and Lafortune, 1999) are therefore suitable, if appropriately applied to a SFM as an abstraction of an underlying network component or a network, as in recent work by Wardi et al. (2002); Liu and Gong (1999); Cassandras et al. (2002), and Cassandras et al. (2003). In a single node with threshold-based buffer control, Infinitesimal Perturbation Analysis (IPA) has been shown to yield simple sensitivity estimators for loss and workload metrics with respect to threshold parameters; see Cassandras et al. (2002). In the multiclass case studied in Cassandras et al. (2003), the estimators generally depend on traffic rate information, but not on the stochastic characteristics of the arrival and service processes involved. In addition, the estimators obtained are unbiased under very weak structural assumptions on the defining traffic processes. As a result, they can be evaluated based on data observed on a sample path of the actual (discrete-event) network and combined with gradient-based opti-
12 Multiplicative Feedback Control in Networks
299
mization schemes as shown in Cassandras et al. (2002) and Cassandras et al. (2003). This makes it possible to adjust parameters on line in order to adapt to rapidly changing network situations. On-line management is appealing in today's computer networks and will become even more important as high speed network technologies, such as Gigabit Ethernet and optical networks, become popular. In such cases, huge amounts of resources may suddenly become available or unavailable. Since manually managing network resources has become unrealistic, it is critical for network components, i.e., routers and end hosts, to automatically adapt to rapidly changing conditions. An important feature in today's Internet management is the presence of feedback mechanisms. For example, in Random Early Detection (see Floyd and Jacobson, 1993), an IP router may send congestion signals to TCP flows by dropping packets, and a TCP flow should adjust its window size (and therefore its sending rate) according to feedback signals (for example, acknowledgement packets sent back from a destination node; see Jacobson, 1988). However, queueing networks have been studied largely based on the assumption that the system state, typically queue length information, has no effect on arrival and service processes, i.e., in the absence of feedback. Unfortunately, the presence of feedback significantly complicates analysis, and makes it extremely difficult to derive closed-form expressions of performance metrics such as average queue length or mean waiting time (unless stringent assumptions are made; see Takacs, 1963; Foley and Disney, 1983; Avignon and Disney, 1977/78; Wortman et al., 1991), let alone developing analytical schemes for performance optimization. It is equally difficult to extend the theory of PA for discrete-event queueing systems in the presence of feedback.
The importance of incorporating feedback into networks as well as their SFM counterparts, and the effectiveness of IPA methods applied to SFMs to date, motivate the study of SFMs with multiplicative feedback. We define α(t) as the maximal external incoming flow rate for a node in the network and introduce a feedback mechanism by setting the inflow rate to c · α(t) when the buffer content x(t) is greater than a certain intermediate threshold θ*. It is worth noticing that this form of feedback has been widely adopted in today's Internet, e.g., in Random Early Detection (RED) (see Floyd and Jacobson, 1993) and other algorithms. The rest of the chapter is organized as follows. Section 2 briefly reviews earlier work applying IPA to SFMs. Section 3 presents the SFM framework with multiplicative feedback. In Section 4 we carry out IPA and derive explicit sensitivity estimators for loss and throughput related metrics. We also prove unbiasedness of these estimators. In Section 5 we present some numerical examples to illustrate the use of IPA estimation in on-line queueing system control. Conclusions and future research directions are given in Section 6.
2. Review of IPA in SFMs
In this section we briefly review earlier results incorporating IPA into SFMs. Consider a network node where buffer control at the packet level takes place using a simple threshold-based policy: when a packet arrives and the queue length x(t) is below a given level θ, it is accepted; otherwise, it is rejected. This can be modeled as a queueing system. Next, we adopt a simple SFM for the system, treating packets as "fluid." The buffer content at time t is again denoted by x(t) and it is limited to θ, which may be viewed as the capacity or as a threshold parameter used for buffer control. When the buffer level reaches θ the system starts to overflow. In the underlying DES, both x(t) and θ are integers; in the SFM, x(t) and θ are treated as real numbers. As we will explain later, analyzing the SFM provides useful information for solving control and optimization problems defined on the underlying network node. Figure 12.1 shows a typical network node on the left with buffer control and its SFM counterpart on the right. In the SFM, the maximal processing rate of the server is generally time-varying and denoted by β(t). The incoming rate, also generally time-varying, is denoted by α(t). We also use δ(t) and γ(t) to denote the outflow rate and the overflow rate due to excessive incoming fluid at a full buffer, respectively. Over a time interval [0, T], the buffer content x(t; θ) is determined by the following one-sided differential equation,
( 0, 0) | ' = I 0, { a(t)-(3(t),
if x(t; 6) = 0 and a(t) - 0(t) < 0, if x(t;9) = 6 and a(t)-f3{t)>0, (12.1) otherwise
The overflow rate j(t; 9) is given by 7
^
j
_ J max{a(t)-/?(*),0}, ifz(t;0) = 0, ~ \ 0, iix(t]9)<8.
A typical state trajectory over [0, T] can be decomposed into two kinds of intervals: empty periods and buffering periods. Empty Periods (EP) are intervals during which the buffer is empty, while Buffering Periods (BP) are intervals during which the buffer is nonempty. We define EPs to always be closed intervals, whereas BPs are open intervals unless they contain one of the boundary points 0 or T. We consider two performance metrics, the Loss Volume L_T(θ) and the Cumulative Workload (or just Work) Q_T(θ), both defined on the
interval [0, T] as:

    L_T(θ) = ∫_0^T γ(t; θ) dt    (12.3)

and

    Q_T(θ) = ∫_0^T x(t; θ) dt.   (12.4)

12 Multiplicative Feedback Control in Networks

Figure 12.1. A network node with threshold-based buffer control and its SFM counterpart
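As an illustration, the dynamics (12.1)–(12.2) and the metrics (12.3)–(12.4) are easy to simulate by forward Euler integration. The sketch below is not from the chapter; the function name, the fixed step dt and the deterministic rate inputs are illustrative assumptions:

```python
# Sketch (not from the chapter): forward-Euler integration of the single-node
# SFM dynamics (12.1)-(12.2), accumulating the Loss Volume L_T(theta) of
# (12.3) and the Cumulative Workload Q_T(theta) of (12.4).

def simulate_sfm(alpha, beta, theta, T, dt=0.001):
    """alpha, beta: functions of t giving the inflow and service rates."""
    x = 0.0        # buffer content x(t; theta)
    loss = 0.0     # L_T(theta): integral of the overflow rate gamma(t; theta)
    work = 0.0     # Q_T(theta): integral of x(t; theta)
    for k in range(round(T / dt)):
        t = k * dt
        rate = alpha(t) - beta(t)
        if x >= theta and rate > 0:
            loss += rate * dt          # full buffer: excess fluid is lost (12.2)
        elif x <= 0.0 and rate < 0:
            x = 0.0                    # empty buffer: content stays at zero
        else:
            x = min(max(x + rate * dt, 0.0), theta)
        work += x * dt
    return loss, work
```

With constant rates α = 2, β = 1 and θ = 1 over [0, 2], the buffer fills during [0, 1] and overflows at unit rate afterwards, so L_T ≈ 1 and Q_T ≈ 1.5.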
Let us denote the kth BP by B_k. Define Φ(θ) as the index set of all BPs with at least one overflow period. For every k ∈ Φ(θ), there is a (random) number M_k ≥ 1 of overflow periods in B_k, i.e., intervals during which the buffer is full and α(t) − β(t) > 0. Let us denote these overflow periods by F_{k,m}, m = 1, ..., M_k, in increasing order, and express them as F_{k,m} = [u_{k,m}(θ), v_{k,m}), k = 1, ..., K. Observe that the starting time u_{k,m}(θ) generally depends on θ, whereas the ending time v_{k,m} is locally independent of θ, since it corresponds to a change of sign in α(t) − β(t) in (12.1), which has been assumed independent of θ. Through Infinitesimal Perturbation Analysis (IPA), the following sample derivatives can be obtained, as shown in Cassandras et al. (2002):

PROPOSITION 12.1 For every θ ∈ Θ,

    L'_T(θ) = −|Φ(θ)|    (12.5)

and

    Q'_T(θ) = Σ_{k∈Φ(θ)} [η_k(θ) − u_{k,1}(θ)],    (12.6)

where η_k(θ) denotes the time at which the BP B_k ends.

Under certain technical conditions (see Cassandras et al., 2002), it is also proved that

PROPOSITION 12.2 The IPA estimators L'_T(θ) and Q'_T(θ) are unbiased, i.e.,

    dE[L_T(θ)]/dθ = E[L'_T(θ)]    and    dE[Q_T(θ)]/dθ = E[Q'_T(θ)].
These IPA estimators are extremely simple to implement, not only in a SFM but in the actual network component as well: (12.5) is merely a counter of all BPs observed in [0, T] in which at least one overflow event takes place. The estimator is nonparametric in the sense that no knowledge of the traffic or processing rates is required, nor does (12.5) depend on the nature of the random processes involved. In (12.6), the contribution of a BP, B_k, to the sample derivative Q'_T(θ) is the length of the interval defined by the first point at which the buffer becomes full and the end of the BP. Once again, as in (12.5), the IPA derivative Q'_T(θ) is nonparametric, since it requires only the recording of the times at which the buffer becomes full (i.e., u_{k,1}(θ)) and empty (i.e., η_k(θ)) for any BP which has at least one overflow period. In other words, (12.5) and (12.6) may be directly obtained from a single sample path of the network node, and the final values of the estimators are independent of the SFM. The analysis above can be extended to a network with multiple nodes. In Sun et al. (2003), a tandem network is studied, where the output of a node becomes the input to a downstream node, and the dynamics of each node follow those of the single-node system described above. Figure 12.2 shows such a tandem network. By techniques similar to those in Cassandras et al. (2002), IPA analysis can be carried out and the unbiasedness of the IPA estimators can be proved for loss- and work-related metrics with respect to the parameters b_1, b_2, ..., b_m. Figure 12.3 illustrates another extension by introducing feedback. In this case, α(t) is not the inflow rate. The actual incoming flow is the output of a traffic shaper. A traffic shaper modifies some incoming process α(t), according to system information (i.e., queue content information x(t)), and creates the actual incoming flow to the node. We denote the actual inflow rate (which is also the output of the traffic shaper) by u(t).
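The nonparametric character of (12.5)–(12.6) can be made concrete in a short sketch, under the assumption that, for every overflowing BP, the first full time u_{k,1} and the BP end time η_k have been recorded (the function and variable names below are hypothetical, not from the chapter):

```python
# Sketch of the nonparametric estimators (12.5)-(12.6). The only data needed
# are, for every BP in Phi(theta), the first time u_k1 at which the buffer
# became full and the time eta_k at which the BP ended.

def ipa_estimates(overflow_bps):
    """overflow_bps: list of (u_k1, eta_k) pairs, one per BP with overflow."""
    dL_dtheta = -len(overflow_bps)                        # (12.5): a counter
    dQ_dtheta = sum(eta - u for u, eta in overflow_bps)   # (12.6): interval lengths
    return dL_dtheta, dQ_dtheta
```

Note that no rate information appears anywhere: only event times enter the computation, which is why the same bookkeeping can be done on a real node.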
If we regard the queue content as state information, x(t) being an input to the above traffic shaper implies a feedback controller of the form u(t) = u(α(t), x(t)). Because u(t) depends on x(t), and x(t) depends on the buffer capacity θ, the inflow rate is not independent of θ. However, in what follows we will simply use u(t) for notational simplicity unless the dependence needs to be stressed. The system dynamics are:

    dx(t; θ)/dt⁺ = 0,             if x(t; θ) = 0 and u(t) − β(t) < 0,
                 = 0,             if x(t; θ) = θ and u(t) − β(t) > 0,
                 = u(t) − β(t),   otherwise.                              (12.7)
Figure 12.2. SFM of a tandem network

Figure 12.3. A SFM with feedback

The outflow rate δ(t) is given by

    δ(t) = β(t),              if x(t) > 0,
         = min{u(t), β(t)},   if x(t) = 0.        (12.8)

The overflow rate γ(t) is given by

    γ(t) = max{u(t) − β(t), 0},   if x(t) = θ,
         = 0,                     if x(t) < θ.    (12.9)
Similarly, δ(t) and γ(t) are functions of α(t), x(t) and θ, but we do not explicitly indicate the dependence unless it is necessary to do so. In Yu and Cassandras (2003), an additive feedback mechanism is studied by setting the inflow rate u(t) to be

    u(t) = α(t) − p(x(t)),

where p(x) is a feedback function. Using techniques similar to those in Cassandras et al. (2002), we can derive the IPA sample derivatives L'_T(θ) and Q'_T(θ) and prove their unbiasedness under modest technical assumptions. Moreover, in the case of linear feedback, i.e., p(x) = cx, the estimators again turn out to be nonparametric; for details, see Yu and Cassandras (2003). The feedback mechanism in Yu and Cassandras (2003) implies that state information, i.e., buffer content, is instantaneously available to the controller. This is reasonable for situations such as manufacturing systems, but unlikely to hold in high-speed distributed environments such
as communication networks. This stringent requirement, together with a natural interest in feedback policies that are readily applicable to real-world networks, leads to the problem of deriving IPA gradient estimators for SFMs with multiplicative feedback mechanisms. Consider a single-node SFM with threshold-based buffer control as in Cassandras et al. (2002). Once again, we define α(t) as the maximal external incoming flow rate and introduce a feedback mechanism by setting the inflow rate to c·α(t) when the buffer content x(t) is greater than a certain intermediate threshold φ. Compared with Yu and Cassandras (2003), the current mechanism has two main advantages: (i) system information is needed only when the buffer content reaches or leaves the threshold φ, while in Yu and Cassandras (2003) it has to be continuously available; as a result, the cost of communicating state information is greatly reduced; and (ii) the multiplicative feedback mechanism can be easily implemented in an actual network, for example via probabilistic dropping. As in our previous work, our primary interest is still to apply IPA and derive SFM-based sensitivity information of certain performance metrics with respect to key parameters. However, the following differences make the problem more challenging: first, our interest switches from φ, which decides the feedback range, to c, which decides the feedback gain; secondly, the feedback only applies in part of the range, i.e., there is no feedback when x < φ. While these features make the implementation of this feedback mechanism simple, they also complicate the IPA estimation.
3.
A SFM with multiplicative feedback
The SFM for a typical network node that we consider consists of a server with a finite buffer, as shown in Figure 12.3. In the remainder of this chapter, we study the following traffic shaper with multiplicative feedback:

    u(t) = cα(t),   if φ < x ≤ θ,
         = α(t),    if 0 ≤ x < φ,        (12.10)

where α(t) is the maximal inflow rate, c is the feedback gain parameter and φ < θ is an intermediate threshold. We assume 0 < c < 1, thus ensuring that the effect of feedback is more pronounced when x > φ. When the buffer level is below φ, the whole flow is accepted into the system; when the buffer level is above φ, part of the flow may be rejected before entering the system. Thus, φ decides the feedback range and c decides the feedback gain. As discussed before, the inflow rate u(t) is a function of x(t) and c. Since the queue length x(t) is a function of φ and θ, u(t) depends on φ and θ as well. We simply denote it by u(t) for notational simplicity unless the dependence needs to be stressed. Note that, from an implementation standpoint, (12.10) is a policy in which packets arriving at a node after its queue content exceeds a level φ are dropped with probability 1 − c. The policy can be readily extended to one with multiple thresholds φ_1, ..., φ_n and corresponding gains c_1, ..., c_n to resemble the RED algorithm (Floyd and Jacobson, 1993) adopted as part of congestion control in the Internet. Thus, a byproduct of the analysis that follows is to develop means for determining state-dependent packet dropping probabilities that optimize a performance metric of choice. The only requirement imposed by the feedback mechanism in (12.10) is that the source be notified of the events "x(t) reaches φ" and "x(t) leaves φ". It is also assumed that the stochastic processes {α(t)} and {β(t)} are independent of the buffer level x(t) and of the parameters c, φ or θ. Further, it is assumed that the rate processes are bounded, in the sense that there exist α_max and β_max such that, w.p.1, α(t) ≤ α_max < ∞ and β(t) ≤ β_max < ∞. Finally, we assume that the real-valued parameter c is confined to a closed and bounded (compact) interval C and that c > 0 for all c ∈ C. Now we can see that the dynamics of the buffer content are given by

    dx(t)/dt⁺ = max{u(t) − δ(t), 0},   when x(t) = 0,
              = u(t) − δ(t),           when 0 < x(t) < θ,
              = min{u(t) − δ(t), 0},   when x(t) = θ,      (12.11)
where δ(t) is the outflow rate defined in (12.8). Note that the above dynamics are not yet complete, because the case x(t) = φ in (12.10) is not specified. In order to fully specify it, let us take a closer look at all possible cases when x(t) = φ:

Case 1. β(t) < cα(t): The buffer level at t⁺ becomes x(t⁺) > φ;

Case 2. α(t) < β(t): The buffer level at t⁺ becomes x(t⁺) < φ;

Case 3. cα(τ) < β(τ) < α(τ) for all τ in an interval [t, t + ε) for some ε > 0: There are two further cases to consider. (i) If we set u(τ) = cα(τ), it follows that

    dx/dt⁺ |_{t=τ} = cα(τ) − β(τ) < 0

and the buffer content immediately starts decreasing. Therefore x(τ⁺) < φ and the actual incoming rate becomes u(τ⁺) = α(τ⁺). Thus,

    dx/dt⁺ |_{t=τ⁺} = α(τ⁺) − β(τ⁺) > 0

and the buffer content starts increasing again. This process repeats, resulting in a "chattering" behavior. (ii) If, on the other hand, we set u(τ) = α(τ), it follows that dx/dt⁺ |_{t=τ} = α(τ) − β(τ) > 0. Then, upon crossing φ, the actual input rate must switch to cα(τ⁺), which gives cα(τ⁺) − β(τ⁺) < 0. This implies that the buffer content immediately decreases below φ and a similar chattering phenomenon occurs.

The chattering behavior above is due to the nature of the SFM and does not occur in the actual DES, where buffer levels are maintained for finite periods of time; in the present SFM, it is readily prevented by setting u(τ) = β(τ), so that dx/dt⁺ = 0 for all τ ≥ t, and the buffer content is maintained at φ. Note that u(t) is a function of x(t) and c, i.e., u(t) = u(t, x(t); c). Now we can complete the dynamics by modifying (12.10) as follows:

    u(t, x(t); c) = α(t),    when 0 ≤ x < φ,
                  = cα(t),   when x(t) = φ and β(t) < cα(t),
                  = β(t),    when x(t) = φ and cα(t) ≤ β(t) ≤ α(t),
                  = α(t),    when x(t) = φ and α(t) < β(t),
                  = cα(t),   when φ < x ≤ θ,        (12.12)

with the initial condition x(0; c) = 0. Our objective is to obtain sensitivity information of some performance metrics with respect to key parameters. We limit ourselves to considering c as the controllable parameter of interest. For a finite time horizon [0, T] during which c is fixed, we define the throughput as:

    H_T = (1/T) ∫_0^T δ(t) dt        (12.13)
and the loss rate as:

    L_T = (1/T) ∫_0^T 1[x(t) = θ] (u(t) − δ(t)) dt
        = (1/T) ∫_0^T 1[x(t) = θ] (cα(t) − β(t)) dt,        (12.14)
where 1[·] is the usual indicator function. A typical optimization problem is to determine c* that maximizes a cost function of the form

    J_T(c) = E[H_T(c)] − λ · E[L_T(c)],        (12.15)

where λ generally reflects the trade-off between maintaining proper throughput and incurring high loss. Care must also be taken in defining the previous expectations over a finite time horizon, since they generally
depend on initial conditions; we shall assume that the queue is empty at time 0. Note that we do not make any stationarity assumption here, since the performance metrics are defined over a finite time interval. Moreover, the finite-horizon formulation is suitable for the "moving finite horizon" type of network performance problems, where one trades off short-term quasi-stationary behavior against long-term changes possibly caused by user behavior. In order to accomplish this optimization task, we rely on estimates of dE[H_T(c)]/dc and dE[L_T(c)]/dc provided by the sample derivatives dH_T(c)/dc and dL_T(c)/dc. Accordingly, the main objective of the following sections is the derivation of dH_T(c)/dc and dL_T(c)/dc, which we will pursue through IPA techniques. For any sample performance metric C(θ) and a generic parameter θ, the IPA gradient estimation technique computes dC(θ)/dθ along an observed sample path. If the IPA-based estimate dC(θ)/dθ satisfies dE[C(θ)]/dθ = E[dC(θ)/dθ], it is unbiased. Unbiasedness is the principal condition for making the application of IPA practical, since it enables the use of the IPA sample derivative in stochastic gradient-based algorithms. A comprehensive discussion of IPA and its applications can be found in Ho and Cao (1991); Glasserman (1991); and Cassandras and Lafortune (1999).
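The way an unbiased sample derivative enters a stochastic gradient-based algorithm can be sketched as follows. This is an illustrative Robbins-Monro-style recursion, not the chapter's algorithm; the IPA estimator is faked here by a noisy gradient of the concave test function −(c − 0.7)², whose maximizer plays the role of c*:

```python
# Illustrative sketch (assumed, not from the chapter): gradient ascent on a
# cost such as (12.15), driven by one noisy gradient sample per iteration,
# as an unbiased IPA estimator would provide.

import random

def optimize_gain(ipa_sample_derivative, c0=0.5, iters=2000):
    c = c0
    for n in range(1, iters + 1):
        g = ipa_sample_derivative(c)   # one sample path -> one gradient sample
        c += (1.0 / n) * g             # diminishing step sizes
        c = min(max(c, 0.01), 1.0)     # keep c inside the compact interval C
    return c

# Stand-in for an IPA estimator; the true maximizer is c* = 0.7:
noisy_grad = lambda c: -2.0 * (c - 0.7) + random.gauss(0.0, 0.1)
```

Unbiasedness is exactly what justifies feeding such single-sample-path derivatives, rather than averaged finite differences, into the recursion.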
4.
Infinitesimal perturbation analysis
In this section we tackle the performance optimization problem raised in the last section. After introducing the notion of sample path decomposition in Section 4.1, we present our main results, namely the IPA gradient estimates and their unbiasedness, in Section 4.2. As we will see, the IPA gradient estimates rely on event time sample derivatives, which will be derived in Section 4.3. Some properties critical for the proof of unbiasedness will be established in Section 4.4.
4.1
Sample path decomposition and event definition
As already mentioned, our objective is to estimate the derivatives dE[H_T(c)]/dc and dE[L_T(c)]/dc through the sample derivatives dH_T(c)/dc and dL_T(c)/dc, which are commonly referred to as IPA estimators. In the process, however, it will first be necessary to identify events of interest and decompose the sample path. For a fixed c, the interval [0, T] is divided into alternating boundary periods and non-boundary periods. A Boundary Period (BP) is defined as a time interval during which x(t) = θ or x(t) = 0, and a
Non-Boundary Period (NBP) is defined as a time interval during which 0 < x(t) < θ. BPs are further classified into Empty Periods (EP) and Full Periods (FP). An EP is an interval such that x(t) = 0; a FP is an interval such that x(t) = θ. We assume that there are N NBPs in the interval [0, T], where N is a random number. We index these NBPs by n = 1, ..., N and express them as [η_n, ζ_n). Figure 12.4 shows a typical sample path of the SFM. We define the following random index sets:

    Ψ_F = {n : x(t) = θ for all t ∈ [ζ_{n−1}, η_n), n = 1, ..., N}    (12.16)
    Ψ_E = {n : x(t) = 0 for all t ∈ [ζ_{n−1}, η_n), n = 1, ..., N}    (12.17)

so that if n ∈ Ψ_F, the nth BP (which immediately precedes the nth NBP) is a FP; if n ∉ Ψ_F, the nth BP (which immediately precedes the nth NBP) is an EP. Next, we identify the events of interest in the SFM: (i) a jump in α(t) or β(t) is termed an exogenous event, reflecting the fact that its occurrence time is independent of the controllable parameter c; and (ii) the buffer content x(t) reaching any one of the critical values 0, φ or θ is termed an endogenous event, to reflect the fact that its occurrence time generally depends on c. Note that the combination of these events and the continuous dynamics in (12.11) gives rise to a stochastic hybrid system model of the underlying discrete event system of Figure 12.3.

Finally, we further decompose the sample path according to the events defined above. Let us consider a typical NBP [η_n, ζ_n), as shown in Figure 12.5. Let π_{n,i} denote the times when x(t) reaches or leaves 0, φ or θ in this NBP, where i = 0, 1, ..., I_n − 1, in which I_n is the number of such events in [η_n, ζ_n). Note that the starting point of the NBP is η_n = π_{n,0}. To maintain notational consistency we also set ζ_n = π_{n,I_n}, even though this point is not included in [η_n, ζ_n). We can now see that a sample path is decomposed into five sets of intervals that we shall refer to as the modes of the SFM: (i) Mode 0 is the set M_0 of all EPs contained in the sample path; (ii) Mode 1 is the set M_1 of intervals [π_{n,i}, π_{n,i+1}) such that x(π_{n,i}) = 0 or φ and 0 < x(t) < φ for all t ∈ (π_{n,i}, π_{n,i+1}), n = 1, ..., N; (iii) Mode 2 is the set M_2 of intervals [π_{n,i}, π_{n,i+1}) such that x(t) = φ for all t ∈ [π_{n,i}, π_{n,i+1}), n = 1, ..., N; (iv) Mode 3 is the set M_3 of intervals [π_{n,i}, π_{n,i+1}) such that x(π_{n,i}) = φ or θ and φ < x(t) < θ for all t ∈ (π_{n,i}, π_{n,i+1}), n = 1, ..., N; and (v) Mode 4 is the set M_4 of all FPs contained in the sample path. Note that the events occurring at times π_{n,i} are all endogenous for i = 1, ..., I_n, and we should express them as π_{n,i}(c) to stress this fact; for notational economy, however, we will only write π_{n,i}. Finally, recall that for i = 0 we have π_{n,0} = η_n, corresponding to an exogenous event starting the nth NBP.

Figure 12.4. A typical sample path

Figure 12.5. A typical NBP

As shown in Figure 12.5, the NBP [η_n, ζ_n) is decomposed into I_n = 7 intervals: three M_1 intervals [π_{n,0}, π_{n,1}), [π_{n,2}, π_{n,3}), [π_{n,4}, π_{n,5}); three M_3 intervals [π_{n,1}, π_{n,2}), [π_{n,3}, π_{n,4}), [π_{n,6}, π_{n,7}); and one M_2 interval [π_{n,5}, π_{n,6}).
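The five modes can be identified from a sampled buffer level with a small hypothetical helper (not from the chapter); a numerical tolerance eps stands in for the exact level comparisons of the fluid model:

```python
# Hypothetical helper classifying a buffer level into the five SFM modes
# (Mode 0: x = 0, Mode 1: 0 < x < phi, Mode 2: x = phi,
#  Mode 3: phi < x < theta, Mode 4: x = theta).

def mode(x, phi, theta, eps=1e-9):
    if x <= eps:
        return 0                     # EP
    if abs(x - phi) <= eps:
        return 2
    if x >= theta - eps:
        return 4                     # FP
    return 1 if x < phi else 3
```

Scanning a trajectory with this function and recording the times at which the returned mode changes yields exactly the event times π_{n,i} used in the decomposition above.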
4.2
IPA gradient estimates for performance metrics and their unbiasedness
In this section we present the IPA gradient estimates for the performance metrics and prove their unbiasedness.

THEOREM 12.1 The IPA estimator of dE[H_T(c)]/dc is:

    dH_T/dc = ⋯    (12.18)

in which dζ_n/dc is given by Lemma 12.6.

Proof. See Appendix. □
THEOREM 12.2 The IPA estimator of dE[L_T(c)]/dc is:

    dL_T/dc = ⋯    (12.19)

in which dζ_n/dc is given by Lemma 12.6.

Proof. See Appendix. □
In addition, we define the Suppression Traffic Volume to be the average volume which is denied admission before entering the system, and denote it by R_T:

    R_T = (1/T) ∫_0^T 1[x(t) ≥ φ] [α(t) − u(t)] dt.

THEOREM 12.3 The IPA estimator of dE[R_T(c)]/dc is:

    dR_T/dc = ⋯    (12.20)

in which dπ_i/dc is given by Lemmas 12.1-12.5 in the following section.

Proof. See Appendix. □
Note that in our previous work on SFM-based IPA (see Cassandras et al., 2002; Sun et al., 2003; Yu and Cassandras, 2003), only raw data from a network node are required for IPA estimation, such as detecting a FP or an EP. However, the IPA estimators (12.18), (12.19) and (12.20) rely on flow rates and mode identification, which are all defined in a SFM context, making it less obvious to find their analogs in an actual node. Specifically, when the buffer level reaches φ and cα(t) ≤ β(t) ≤ α(t), the SFM enters Mode 2 and the buffer level should stay at φ until this condition no longer applies. However, in the underlying DES, the buffer level will oscillate around φ instead, and we must carefully define such a chattering interval so that it corresponds to Mode 2 of the SFM. As a result, errors in the recursive calculation of the event time sample derivatives dπ_i/dc may be introduced. To minimize the effect of such errors, we make use of the estimators whose form involves dπ_i/dc the
least. For example, the IPA estimator for the throughput sensitivity, dH_T(c)/dc, can either be directly evaluated by (12.18) or derived indirectly through the estimators of dL_T(c)/dc and dR_T(c)/dc, as follows. Recall the flow balance equation

    H_T(c) + L_T(c) + R_T(c) = ᾱ,

where ᾱ, the time average of the defining process α(t) over [0, T], is independent of c. The above equation then gives

    dH_T/dc = − dL_T/dc − dR_T/dc.    (12.21)
The two evaluations (12.18) and (12.21) are equivalent in the SFM. But because of the discrepancy between the DES and the SFM, they may yield different results when applied to an actual network node. We select the latter estimation option for the following reason. The direct estimation of dH_T/dc in (12.18) relies entirely on the evaluation of the event time sample derivatives dπ_i/dc. As mentioned above, evaluating these from actual network data may introduce errors. On the other hand, the second and last terms in (12.19) can be directly observed from a DES sample path by counting the number of departures and the number of packets dropped when the buffer is full; only the first term still involves event time derivatives. Similarly, the last term of dR_T/dc in (12.20) can be evaluated directly from actual network data. Recall that the inflow rate is u(t) = cα(t) when x > φ. Thus, ∫_{π_i}^{π_{i+1}} cα(t) dt is the inflow volume when x ≥ φ, and the last term in (12.20) can be obtained from the incoming packet volume divided by c when the buffer level is above or equal to φ.

In order to proceed, we make the following assumption:

ASSUMPTION 12.1 W.p.1, no two events occur at the same time.

This assumption precludes a situation where the queue content reaches one of the critical threshold values 0, φ or θ at the same time π_{n,i} as an exogenous event which might cause it to leave the threshold; this would prevent the existence of the event time sample derivative dπ_i/dc, which will be derived in the sequel (however, one could still carry out perturbation analysis with one-sided derivatives, as in Cassandras et al., 2002). Moreover, by Assumption 12.1, N, the number of NBPs in the sample path, is locally independent of c (since no two events may occur simultaneously, and the occurrence of exogenous events does not depend
on c, there exists a neighborhood of c within which, w.p.1, the number of NBPs in [0, T] is constant). Hence, the random index set Ψ_F is also locally independent of c. Similarly, the decomposition of the sample path into modes is also locally independent of c. Normally, the unbiasedness of an IPA derivative dC(θ)/dθ for some performance metric C(θ) is ensured by the following two conditions (see Rubinstein and Shapiro, 1993, Lemma A2, p. 70): (i) for every θ ∈ Θ, the sample derivative exists w.p.1; and (ii) w.p.1, the random function C(θ) is Lipschitz continuous throughout Θ, and the (generally random) Lipschitz constant has a finite first moment. Based on the monotonicity properties that will be established in Section 4.4, we can readily verify the two conditions and consequently establish the unbiasedness of the IPA estimators in the following theorem.

THEOREM 12.4 Under Assumption 12.1, the IPA estimators (12.18), (12.19) and (12.20) are unbiased, i.e.,

    dE[L_T(c)]/dc = E[dL_T(c)/dc],
    dE[H_T(c)]/dc = E[dH_T(c)/dc],
    dE[R_T(c)]/dc = E[dR_T(c)/dc].

Proof. See Appendix. □
4.3
Event time sample derivatives
In this section we derive the sample derivatives of the event times, which are necessary in the process of obtaining IPA estimators for the performance metrics. First, we make the following additional assumption:

ASSUMPTION 12.2 α(t) and β(t) are piecewise constant functions that can take a finite number of values.

This assumption can be regarded as an approximation of general time-varying processes. As we will see later, we do not set any upper bound on the number of values that α(t) and β(t) can possibly take, essentially allowing the piecewise constant process to approximate a general time-varying process as closely as possible. The assumption is introduced mostly for ease of analysis. Due to this assumption, and recalling the dynamics in (12.11), x(t) has to be a piecewise linear function of time t, as shown in Figure 12.4.
ASSUMPTION 12.3 W.p.1, there exists an arbitrarily small positive constant ε such that, for all t, |α(t) − β(t)| > ε > 0 and, for a fixed c,

    |cα(t) − β(t)| > ε > 0.

Combining the above two assumptions, we obtain, for every pair of possible values of α(t) indexed by i and of β(t) indexed by j:

    |cα_i − β_j| > ε,

which is equivalent to

    cα_i − β_j > ε    or    cα_i − β_j < −ε.

Therefore we obtain

    c > (β_j + ε)/α_i    or    c < (β_j − ε)/α_i,

which implies an "invalid interval" ((β_j − ε)/α_i, (β_j + ε)/α_i) for c. According to Assumption 12.2, there are a finite number of such invalid intervals. We shall also refer to a valid interval as the maximal interval between two adjacent invalid intervals.

In what follows, we shall concentrate on a typical NBP [η_n, ζ_n(c)) and drop the index n from the event times π_{n,i} in order to simplify notation. Assumptions 12.2-12.3 are needed to ensure the existence of the sample derivatives dπ_i/dc, but they can be significantly weakened by simply assuming that, w.p.1, an event such that cα(t) − β(t) changes sign cannot coincide with any endogenous event (e.g., x(t) reaching the level θ). This weaker condition introduces some technical complications in the derivations that follow, which we choose to avoid here by restricting ourselves to piecewise constant rate processes satisfying the last two assumptions. In the rest of this section, we derive the sample derivative dπ_i/dc through a series of lemmas which cover all possible values that x(π_i; c) can take in an interval [π_i, π_{i+1}).

LEMMA 12.1 Under Assumptions 12.1-12.3, if a FP ends at time η_n, i.e., x(η_n; c) = θ, then

    dη_n/dc = 0.

Proof. See Appendix. □
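The finitely many invalid intervals for c implied by Assumptions 12.2-12.3 can be enumerated directly over the possible rate values; a minimal sketch (the function name and list representation are illustrative assumptions, not from the chapter):

```python
# Sketch: one invalid interval ((beta_j - eps)/alpha_i, (beta_j + eps)/alpha_i)
# for c per pair of possible rate values, per Assumptions 12.2-12.3.

def invalid_intervals(alphas, betas, eps):
    """alphas, betas: the possible values of alpha(t) and beta(t)."""
    return [((b - eps) / a, (b + eps) / a) for a in alphas for b in betas]
```

A chosen feedback gain c is admissible for the analysis whenever it lies outside every interval returned by this enumeration.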
LEMMA 12.2 Under Assumptions 12.1-12.3, if an EP ends at time η_n, i.e., x(η_n; c) = 0, then

    dη_n/dc = 0.

Proof. See Appendix. □

The above two lemmas show that an event time perturbation is always eliminated after a NBP ends. The following lemma further asserts that the same is true after a finite interval during which x(t; c) = φ.

LEMMA 12.3 Under Assumptions 12.1-12.3, if an M_2 interval ends at time π_i, i.e., x(π_i; c) = φ, then

    dπ_i/dc = 0.

Proof. See Appendix. □
Next, we define the following shorthand notation:

    A(t) = cα(t) − β(t)    and    B(t) = α(t) − β(t).

LEMMA 12.4 Under Assumptions 12.1-12.3, if [π_i, π_{i+1}) ∈ M_3, then

    dπ_{i+1}/dc = (A(π_i)/A(π_{i+1})) · dπ_i/dc − (1/A(π_{i+1})) ∫_{π_i}^{π_{i+1}} α(τ) dτ.    (12.22)

Proof. See Appendix. □

Define

    H(π_i, t) = ∫_{π_i}^{t} δ(τ) dτ    (12.23)

as the node throughput during the time interval [π_i, t). In addition, we have the following flow balance equation:

    ∫_{π_i}^{π_{i+1}} cα(t) dt − H(π_i, π_{i+1}) = x(π_{i+1}; c) − x(π_i; c),

which gives

    ∫_{π_i}^{π_{i+1}} α(t) dt = (1/c) [H(π_i, π_{i+1}) + x(π_{i+1}; c) − x(π_i; c)].    (12.24)

Combining Lemma 12.4 and (12.24) gives

    dπ_{i+1}/dc = (A(π_i)/A(π_{i+1})) · dπ_i/dc − [H(π_i, π_{i+1}) + x(π_{i+1}; c) − x(π_i; c)] / (c · A(π_{i+1})),    (12.25)
' '
12
315
Multiplicative Feedback Control in Networks
au n
mmm mmmm mmmm
.;
7+1
Figure 12.6. The decomposition of an M3 interval
where [ x ( 7 r i + i ; c ) - x(m\c)]
e{>-0,0,0->}.
According to Assumption 12.2, a(i) and /?(£) are piecewise constant functions. The interval [TT^, TT^+I) can be then decomposed by exogenous events occurring when a(t) jumps from one value to another. As shown in Figure 12.6, we use a^fc, k = 1 , . . . Si to denote the feth such exogenous event and let TTJ = a^o and TT^+I = ^ 5 ^ 1 in order to maintain notational consistency. Moreover we define the value of a(t) in interval [<Jiik)&i,k+i) a s ai,k- It follows that /
a (t)dt =
If we use the following shorthand
i, fe,
(12.26)
to define the length of an interval between two exogenous a(t) jump events, we get / ^ i + 1 a(t)dt = Y^ki=oahkbi,k- Then, (12.22) becomes
dc
(12.27)
Similar to the work in Yu and Cassandras (2003), our ultimate purpose is to apply the IPA estimators (which we derive based on event time sample derivatives) to an actual underlying DES. The
three expressions (12.22), (12.25) and (12.27) provide alternative ways to evaluate the event time sample derivative, which are equivalent in the SFM context. In the discrete-event setting, however, some information required for IPA estimation may be more difficult to obtain than other. For example, (12.27) depends on the evaluation of α_{i,k}, the maximal incoming rate, and of b_{i,k}, the length of the intervals between two α(t) jump events. This information may be difficult to acquire or measure if the source is remote. On the other hand, (12.25) requires a throughput evaluation during the time interval [π_i, π_{i+1}), which may be much easier to obtain; in an actual network node, it can be done by simply counting processed packets. In summary, we want to remind readers that different forms of the IPA estimators exist, and that one should select the appropriate one based on implementation considerations. We also point out that (12.22) can be further simplified when the service rate β(t) = β is constant:

    dπ_{i+1}/dc = [(cα(π_i) − β)/(cα(π_{i+1}) − β)] · dπ_i/dc − [x(π_{i+1}; c) − x(π_i; c) + β(π_{i+1} − π_i)] / (c²α(π_{i+1}) − cβ).

In this case, only π_{i+1} − π_i, the length of the Mode 3 interval, has to be evaluated.

LEMMA
12.5 If [π_i, π_{i+1}) ∈ M_1, then

    dπ_{i+1}/dc = (B(π_i)/B(π_{i+1})) · dπ_i/dc.    (12.28)

Proof. See Appendix. □
The combination of Lemmas 12.1 through 12.5 provides a linear recursive relationship for obtaining the event time sample derivatives dπ_i/dc, and the coefficients involved are based on information directly available from a sample path of the SFM and the throughput given in (12.23). Moreover, dζ_n/dc, the event time sample derivative for the end of a NBP [η_n, ζ_n), can also be derived by combining the above lemmas. Recall that ζ_n = π_{n,I_n}. Using the previous lemmas, we can obtain a recursive expression for dζ_n/dc as follows:

LEMMA 12.6 For a NBP [η_n, ζ_n),

    dζ_n/dc = (A(π_{n,I_n−1})/A(π_{n,I_n})) · dπ_{n,I_n−1}/dc
              − [H(π_{n,I_n−1}, π_{n,I_n}) + x(π_{n,I_n}; c) − x(π_{n,I_n−1}; c)] / (c · A(π_{n,I_n})),
                   if x(ζ_n; c) = θ,
            = (B(π_{n,I_n−1})/B(π_{n,I_n})) · dπ_{n,I_n−1}/dc,
                   if x(ζ_n; c) = 0.    (12.29)

Proof. See Appendix. □
To summarize, the event time sensitivities dπ_i/dc are triggered only in Mode 3, through the second term in (12.22). The sensitivities are subsequently reset to zero after a Mode 0 (EP), Mode 2, or Mode 4 (FP) interval. With the help of the five lemmas derived above, we are now able to derive the IPA estimators for the various performance metrics.
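The linear recursion just summarized can be sketched as follows, under the assumption that each interval of a NBP has been labeled with its mode and with the endpoint values of A(t) or B(t) (the dictionary keys below are invented for illustration): the derivative starts at zero by Lemmas 12.1-12.2, is scaled by (12.28) in Mode 1, updated by (12.22) in Mode 3, and reset to zero by a Mode 2 interval (Lemma 12.3).

```python
# Sketch of the event time derivative recursion across one NBP, for
# piecewise-constant rates; interval descriptions are assumed inputs.

def event_time_derivative(intervals):
    d = 0.0                                    # d eta_n / dc = 0 at the NBP start
    for iv in intervals:
        if iv["mode"] == "M2":
            d = 0.0                            # Lemma 12.3: reset after Mode 2
        elif iv["mode"] == "M1":
            d = iv["B_start"] / iv["B_end"] * d              # (12.28)
        elif iv["mode"] == "M3":
            d = (iv["A_start"] / iv["A_end"] * d
                 - iv["int_alpha"] / iv["A_end"])            # (12.22)
    return d
```

Only the Mode 3 update injects a nonzero term, which matches the observation that the sensitivities are triggered exclusively in Mode 3.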
4.4
Monotonicity and Lipschitz continuity of x(t; c) with respect to c

In this section we establish some monotonicity properties that are critical in the proof of unbiasedness (Theorem 12.4). As mentioned before, the buffer content is a function of c, i.e., x(t) = x(t; c). We establish the monotonicity and Lipschitz continuity of the function x(t; c) with respect to the parameter c; as we will show later, this property is critical in proving the unbiasedness of the IPA estimators. We first establish this result for SFMs with the general feedback scheme introduced in Section 2, and then verify its applicability to the specific multiplicative feedback mechanism of Section 3. Consider a SFM with feedback as defined in Section 2, where c is a generic controllable parameter of the traffic shaper. We assume that c and θ are independent of each other. As mentioned before, u(t), the actual inflow rate, δ(t), the outflow rate, and γ(t) are all functions of some defining process α(t), the queue content x(t), the thresholds θ and φ, and c, the parameter of interest; for notational simplicity, we will suppress these dependencies unless it is necessary. We make the following assumptions on the dependence of u(t) on x(t; c) and c. Based on these assumptions, we can establish the monotonicity and Lipschitz continuity of x(t; c) for SFMs with general negative feedback. We will subsequently verify them for the multiplicative feedback mechanism introduced in Section 3.
318
NEXT GENERATION INTERNET
ASSUMPTION 12.4 For any fixed $t$ and $c$, $u(t, x; c)$ is a monotonically nonincreasing function of $x$, i.e., when $x_1 > x_2$, $u(t, x_1; c) \le u(t, x_2; c)$ for all $t$ and $c$.

ASSUMPTION 12.5 For any fixed $t$ and $x$, $u(t, x; c)$ is a monotonically nondecreasing function of $c$.

ASSUMPTION 12.6 $u(t, x; c)$ is Lipschitz continuous with respect to $c$, i.e., $|u(t, x; c + \Delta c) - u(t, x; c)| \le K \Delta c$, in which $K$ is the Lipschitz constant.

We assume that, for all $c$, $x(t; c)$ is a continuous function of $t$ with $x(0; c) = 0$. For notational simplicity, we use SFM$_N$ to denote the state trajectory of the nominal system under parameter $c$, and SFM$_P$ to denote its perturbed counterpart under parameter $c + \Delta c$. Throughout this section, for a function $f(\cdot)$ we use $f'(\cdot)$ to represent $f(c + \Delta c)$, the corresponding function in SFM$_P$, while $f(\cdot)$ represents $f(c)$, the corresponding function in SFM$_N$. Thus, the buffer level is denoted by $x(t)$ in SFM$_N$ and by $x'(t)$ in SFM$_P$. Define the buffer level perturbation with respect to a perturbation $\Delta c$ as $\Delta x(t) = x(t; c + \Delta c) - x(t; c) = x'(t) - x(t)$. The following lemma establishes monotonicity:

LEMMA 12.7 Under Assumption 12.5, for any $\Delta c > 0$,
$$\Delta x(t) \ge 0 \quad \text{for all } t \ge 0 \qquad (12.30)$$
Proof. See Appendix. □
In order to establish the Lipschitz continuity, we first present the following lemma leading to Theorem 12.5:

LEMMA 12.8 Define
$$\Delta\delta(t) = \delta(t, x'(t); c + \Delta c) - \delta(t, x(t); c) = \delta'(t) - \delta(t)$$
and
$$\Delta\gamma(t) = \gamma(t, x'(t); c + \Delta c) - \gamma(t, x(t); c) = \gamma'(t) - \gamma(t).$$
Then, under Assumption 12.5, for any $\Delta c > 0$, $\Delta\delta(t) \ge 0$ and $\Delta\gamma(t) \ge 0$.
12 Multiplicative Feedback Control in Networks
Proof. See Appendix. □

THEOREM 12.5 Under Assumptions 12.4-12.6,
$$\Delta x(t) \le KT\Delta c$$

Proof. See Appendix. □
It is also easy to verify the above results for $\Delta c < 0$. In order to establish that the general results in this section cover the feedback mechanism defined in (12.12), i.e., to prove that Lemma 12.7 and Theorem 12.5 hold for $u(t)$ in (12.12), we need to verify Assumptions 12.4-12.6. For Assumption 12.4, assume $x_1 < x_2$: (i) if $x_1 < \phi$, $u(t, x_1; c) = \alpha(t) \ge u(t, x_2; c)$; (ii) if $x_1 = \phi < x_2$, $u(t, x_2; c) = c\,\alpha(t) \le u(t, x_1; c)$; (iii) if $\phi < x_1 < x_2$, $u(t, x_1; c) = u(t, x_2; c) = c\,\alpha(t)$. Therefore, for any $x_1$, $x_2$ with $x_1 < x_2$ we have $u(t, x_1; c) \ge u(t, x_2; c)$, and the assumption is verified. For Assumptions 12.5 and 12.6, there are again three possible cases:

Case 1. If $x < \phi$, then $u(t, x; c) = u(t, x; c + \Delta c) = \alpha(t)$, which gives $u(t, x; c + \Delta c) - u(t, x; c) = 0$.

Case 2. If $x > \phi$, then $u(t, x; c) = c\,\alpha(t)$ and $u(t, x; c + \Delta c) = (c + \Delta c)\,\alpha(t)$, which gives $u(t, x; c + \Delta c) - u(t, x; c) = \alpha(t) \cdot \Delta c \ge 0$.

Case 3. If $x = \phi$, then, since $c\,\alpha(t) < (c + \Delta c)\,\alpha(t) \le \alpha(t)$, there are four cases to consider regarding the relative values of $c\,\alpha(t)$, $(c + \Delta c)\,\alpha(t)$, $\alpha(t)$ and $\beta(t)$:

Case 3.1. $c\,\alpha(t) < (c + \Delta c)\,\alpha(t) \le \alpha(t) \le \beta(t)$: $u(t, x; c) = u(t, x; c + \Delta c) = \alpha(t)$. It follows that $u(t, x; c + \Delta c) - u(t, x; c) = 0$.

Case 3.2. $c\,\alpha(t) < (c + \Delta c)\,\alpha(t) \le \beta(t) \le \alpha(t)$: $u(t, x; c) = u(t, x; c + \Delta c) = \beta(t)$. It follows that $u(t, x; c + \Delta c) - u(t, x; c) = 0$.

Case 3.3. $c\,\alpha(t) \le \beta(t) < (c + \Delta c)\,\alpha(t) \le \alpha(t)$: $u(t, x; c) = \beta(t)$ and $u(t, x; c + \Delta c) = (c + \Delta c)\,\alpha(t)$, so that $u(t, x; c + \Delta c) > u(t, x; c)$. Moreover, it follows that
$$u(t, x; c + \Delta c) - u(t, x; c) = (c + \Delta c)\,\alpha(t) - \beta(t) \le (c + \Delta c)\,\alpha(t) - c\,\alpha(t) = \alpha(t) \cdot \Delta c$$

Case 3.4. $\beta(t) \le c\,\alpha(t) < (c + \Delta c)\,\alpha(t) \le \alpha(t)$: $u(t, x; c) = u(t, x; c + \Delta c) = c\,\alpha(t)$. It follows that $u(t, x; c + \Delta c) - u(t, x; c) = 0$.
Combining all of the above cases verifies Assumption 12.5. In order to verify Assumption 12.6, note from the above cases that $u(t, x; c + \Delta c) - u(t, x; c) \le \alpha(t) \cdot \Delta c$. Recalling that we have assumed the process $\{\alpha(t)\}$ to be such that, w.p. 1, $\alpha(t) \le \alpha_{\max} < \infty$, it follows that
$$u(t, x; c + \Delta c) - u(t, x; c) \le \alpha(t) \cdot \Delta c \le \alpha_{\max} \cdot \Delta c \qquad (12.31)$$
Hence Assumption 12.6 is also verified, with Lipschitz constant $K = \alpha_{\max}$. Therefore, Lemma 12.7 and Theorem 12.5 hold for the feedback mechanism (12.12).
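The case analysis above lends itself to a quick numerical sanity check. The sketch below (Python; the rate value, grids and step $\Delta c$ are illustrative choices, not taken from the chapter) implements the multiplicative rule away from the threshold $x = \phi$ and verifies Assumptions 12.4-12.6, with Lipschitz constant $K = \alpha_{\max}$ as in (12.31), on a test grid.

```python
# Numerical sanity check of Assumptions 12.4-12.6 for the multiplicative
# feedback rule of Section 3, evaluated away from the threshold x = phi:
#   u(t, x; c) = alpha      if x < phi
#              = c * alpha  if x > phi
# alpha, phi, the grids and the step dc are hypothetical illustrative values.

def u(x, c, alpha=10.0, phi=2.0):
    """Inflow rate under multiplicative feedback (x != phi)."""
    return alpha if x < phi else c * alpha

alpha_max = 10.0
xs = [0.0, 1.0, 1.9, 2.1, 3.0, 5.0]          # increasing buffer levels
cs = [0.1 * k for k in range(1, 10)]         # feedback gains in (0, 1)

for c in cs:
    # Assumption 12.4: u is nonincreasing in x for fixed c (requires c <= 1).
    vals = [u(x, c) for x in xs]
    assert all(a >= b for a, b in zip(vals, vals[1:]))

for x in xs:
    # Assumption 12.5: u is nondecreasing in c for fixed x.
    vals = [u(x, c) for c in cs]
    assert all(a <= b for a, b in zip(vals, vals[1:]))
    # Assumption 12.6: Lipschitz in c with constant K = alpha_max, cf. (12.31).
    dc = 0.05
    for c in cs:
        assert abs(u(x, c + dc) - u(x, c)) <= alpha_max * dc + 1e-12

print("Assumptions 12.4-12.6 hold on the test grid")
```

The check deliberately skips the boundary point $x = \phi$, where the four sub-cases of Case 3 above depend on the relative values of $c\alpha(t)$, $(c+\Delta c)\alpha(t)$, $\alpha(t)$ and $\beta(t)$.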
5. Optimization examples
In this section we present some numerical examples to illustrate how the IPA estimators we have developed are used in optimization problems. As suggested before, the solution to an optimization problem defined for an actual queueing system may be approximated by the solution to the same problem based on an SFM of the system. Let us now consider the feedback-based buffer control problem defined in Section 3 with cost function (12.15):
$$J^{DES}(c) = E[H^{DES}(c)] - \lambda \cdot E[L^{DES}(c)]$$
The optimal value of $c$ which maximizes $J^{DES}(c)$ above may be determined through a standard stochastic approximation algorithm (details on such algorithms, including conditions required for convergence to an optimum, may be found, for instance, in Kushner and Yin, 1997):
$$c_{n+1} = c_n + \nu_n H_n(c_n, \omega_n^{DES}), \qquad n = 0, 1, \ldots \qquad (12.32)$$
where $H_n(c_n, \omega_n^{DES})$ is an estimate of $dJ_T/dc$ evaluated at $c = c_n$ and $\{\nu_n\}$ is a step size sequence. The form of the estimator $H_n(\cdot)$ comes from SFM-based IPA analysis, i.e., from (12.18) and (12.19), but the data input to the estimator are based on a DES sample path denoted by $\omega_n^{DES}$. Obviously, the resulting gradient estimator $H_n(c_n, \omega_n^{DES})$ is now an approximation leading to a sub-optimal solution of the above optimization problem. Note that, after a control update, the state must be reset to zero, in accordance with our convention that all performance metrics are defined over an interval with an initially empty buffer. In the case of off-line control (as in the numerical examples we present), this simply amounts to simulating the system after resetting its state to 0. In the more interesting case of on-line control, we proceed as follows. Suppose that the $n$th iteration ends at time $\tau_n$ and the state is $x(c_n; \tau_n)$ [in general, $x(c_n; \tau_n) > 0$]. At this point, the parameter is updated and its new
Table 12.1. Summary of parameter settings for two scenarios

Scenario   α(t) value set                    β    θ    φ    q      λ    c0     c*
1          100, 28, 27, 24, 21, 20, 14, 9    15   5    2    0.125  30   0.95   0.62
2          150, 60, 30, 8                    15   2    2    0.25   15   0.95   0.21
value is $c_{n+1}$. Let $\tau_n^0$ be the next time that the buffer is empty, i.e., $x(c_{n+1}; \tau_n^0) = 0$. At this point, the $(n+1)$th iteration starts and the next gradient estimate is obtained over the interval $[\tau_n^0, \tau_n^0 + T]$, so that $\tau_{n+1} = \tau_n^0 + T$ and the process repeats. The implication is that over the interval $[\tau_n, \tau_n^0]$ no estimation is carried out while the controller waits for the system to be reset to its proper initial state; therefore, sample path information available over $[\tau_n, \tau_n^0]$ is effectively wasted as far as gradient estimation is concerned. Figure 12.7 shows examples of the application of (12.32) to a network node modeled as in Figure 12.3 under two different parameter settings (scenarios). The service rate $\beta(t)$ remains constant throughout the simulation, but $\alpha(t)$ is piecewise constant: it remains constant for an exponentially distributed period of time, and when it switches, the next value of $\alpha(t)$ is generated according to a transition probability matrix. For simplicity, we assume that all elements of the transition probability matrix are equal, so that each equals $q = 1/m$, in which $m$ is the number of values $\alpha(t)$ can take. Across the two scenarios, the $\alpha(t)$ value set, the value of $\beta$, the initial value of the feedback gain $c_0$ and the overflow penalty $\lambda$ vary. Table 12.1 summarizes the settings for both scenarios; also shown in the table are $c_0$, the initial feedback gain value, and $c^*$, the value obtained through (12.32). In Figure 12.7, the curve "DES" denotes the cost function $J_T(c)$ obtained through exhaustive simulation for different (discrete) values of $c$ with $T = 100000$; the curve "IPA Algo." represents the optimization process (12.32) with the simulation time horizon for each step of (12.32) set to $T' = 10000$, and with constant step size $\nu = 0.01$. As shown in Figure 12.7, the gradient-based algorithm (12.32) converges to a neighborhood of the optimal feedback gain.
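As a rough illustration of the experimental loop just described, the following Python sketch discretizes a single-node fluid model with multiplicative feedback, modulates $\alpha(t)$ by a Markov chain with uniform transition probabilities $q = 1/m$, and iterates an update of the form (12.32). Note that the gradient here is a hypothetical finite-difference surrogate, not the IPA estimators (12.18)-(12.19), and the numerical values are merely illustrative; this is a sketch of the mechanics, not a reproduction of the chapter's experiments.

```python
import random

# Discretized single-node fluid model with multiplicative feedback and a
# Markov-modulated inflow rate alpha(t).  Hypothetical stand-in for the
# setup of Figure 12.7: the gradient in the update (12.32) is estimated by
# a two-sided finite difference with common random numbers, NOT by the IPA
# estimators (12.18)-(12.19); all parameter values are illustrative only.

BETA, THETA, PHI, LAM = 15.0, 2.0, 1.0, 15.0  # service rate, buffer size, threshold, penalty
ALPHAS = [150.0, 60.0, 30.0, 8.0]             # alpha(t) value set; q = 1/m with m = 4
DT, HORIZON = 0.01, 100.0

def simulate(c, seed):
    """One sample path; return J(c) = throughput - LAM * loss (time averages)."""
    rng = random.Random(seed)
    x, loss, thr, t = 0.0, 0.0, 0.0, 0.0
    alpha, switch = rng.choice(ALPHAS), rng.expovariate(1.0)
    while t < HORIZON:
        if t >= switch:                        # alpha(t) jumps; uniform matrix q = 1/m
            alpha, switch = rng.choice(ALPHAS), switch + rng.expovariate(1.0)
        u = alpha if x < PHI else c * alpha    # multiplicative feedback, cf. (12.12)
        thr += (BETA if x > 0 else min(u, BETA)) * DT
        x += (u - BETA) * DT
        if x > THETA:                          # overflow: accumulate lost volume
            loss += x - THETA
            x = THETA
        x = max(x, 0.0)
        t += DT
    return thr / HORIZON - LAM * loss / HORIZON

def sa_step(c, n, nu=0.01, dc=0.02):
    """One iteration of (12.32) with a finite-difference gradient surrogate."""
    grad = (simulate(c + dc, n) - simulate(c - dc, n)) / (2 * dc)
    return min(max(c + nu * grad, 0.05), 1.0)  # keep the gain in (0, 1]

c = 0.95                                       # initial gain c0
for n in range(30):
    c = sa_step(c, n)
print("final feedback gain:", round(c, 3))
```

The finite-difference surrogate requires two simulations per step; the point of the chapter's IPA estimators is precisely to avoid this by extracting the gradient from a single observed sample path.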
6. Conclusions and future directions
SFMs have recently been used to capture the dynamics of complex stochastic discrete event systems, such as computer networks, and to implement control and optimization methods based on gradient estimates of performance metrics obtained through IPA. In Yu and Cassandras (2003) we showed that IPA can be used in SFMs with additive feedback and here we have further explored the effect of feedback by considering a
[Figure: cost $J_T(c)$ versus feedback gain $c$ for Scenario 1 and Scenario 2, comparing the exhaustive-simulation curve "DES" with the optimization trajectory "IPA Algo."]
Figure 12.7. Numerical results for SFM-based gradient optimization of an actual network node
single-node SFM with a controllable inflow rate that is a multiplicative function of the state (i.e., queue level) feedback, parameterized by a feedback gain $c$ and a threshold $\phi$ (capturing a quantization in the state feedback). We have developed IPA estimators for the loss volume and average workload with respect to the feedback gain parameter $c$ and shown their unbiasedness, despite the complications brought about by the presence of feedback. This scheme bypasses the need for the continuous state information seen in additive mechanisms and involves only knowledge of a single event representing the queue level crossing the threshold $\phi$. Moreover, even if this state information is not instantaneously supplied, the delays involved are naturally built into the IPA estimator, based on which appropriate control parameters can be selected. This work opens up a variety of possible extensions. First, looking at the feedback mechanism (12.12), note that $c$ represents the feedback gain and $\phi$ represents the range. Instead of controlling $c$ or $\phi$ separately (along the lines of previous work in Yu and Cassandras, 2004), it may be more effective to control the $(c, \phi)$ pair jointly. Next, noticing that probabilistic dropping/marking mechanisms are widely adopted in computer networks (e.g., in Random Early Detection or Random Early Marking), it is appealing to apply IPA specifically to these algorithms. Finally, of obvious interest is the application of our SFM-based IPA estimators to an actual underlying DES such as the Internet, i.e., to determine the value of $c$ that minimizes a weighted sum of loss volume and average workload, as we have done in Cassandras et al. (2002) and Yu and Cassandras (2003). As mentioned earlier, one advantage of IPA is that the estimators depend only on data directly observable along a sample path of the actual DES (not just the SFM, which is an abstraction of the system); see, for example, Cassandras et al. (2002) and Yu and Cassandras (2003).
Here, however, we have seen that this direct connection to the DES no longer holds, because the estimators rely on the identification of "modes" whose definition does not always have a direct correspondence to a DES. As a result, in order to successfully apply the SFM-based IPA estimators to an actual DES, we need to carefully select and interpret an appropriate abstraction of the underlying DES.

Acknowledgments. The authors' work is supported in part by the National Science Foundation under Grants EEC-0088073 and DMI-0330171, by AFOSR under contract F49620-01-0056, and by ARO under Grant DAAD19-01-0610.
Appendix

Proof of Theorem 12.1. Recall the definition of throughput:
$$H_T = \frac{1}{T}\int_0^T \delta(t)\,dt.$$
Using (12.8), we can decompose the integral over the mode intervals $[\pi_i, \pi_{i+1})$: during empty periods ($i \in M_0$) the outflow rate is $\alpha(t)$, and elsewhere it is $\beta(t)$, so that
$$H_T = \frac{1}{T}\left[\sum_{i \in M_0}\int_{\pi_i}^{\pi_{i+1}}[\alpha(t) - \beta(t)]\,dt + \int_0^T \beta(t)\,dt\right].$$
Differentiating with respect to $c$, and noting that $\alpha(t)$ and $\beta(t)$ are independent of $c$, only the event times contribute:
$$\frac{dH_T}{dc} = \frac{1}{T}\sum_{i \in M_0}\left\{[\alpha(\pi_{i+1}) - \beta(\pi_{i+1})]\frac{d\pi_{i+1}}{dc} - [\alpha(\pi_i) - \beta(\pi_i)]\frac{d\pi_i}{dc}\right\}.$$
Note that, if $[\pi_i, \pi_{i+1}) \in M_0$, then $\pi_i$ is the start of an EP. Hence it is also the end of some NBP, i.e., $\pi_i = \zeta_{n-1}$ and $\pi_{i+1} = \eta_n$ for some $n \in \Phi_E$. Combining this with Lemma 12.2, we get $d\pi_{i+1}/dc = 0$. The result follows by substituting these identities into the above equation. □
Proof of Theorem 12.2. If $[\zeta_{n-1}, \eta_n)$ is an FP, we have $x(t) = \theta$ for all $t \in [\zeta_{n-1}, \eta_n)$. Recalling that $\eta_n$, $n \in \Phi_F$, is locally independent of $c$, it follows from (12.14) that
$$\frac{dL_T(c)}{dc} = \frac{1}{T}\sum_{n \in \Phi_F}\frac{d}{dc}\int_{\zeta_{n-1}}^{\eta_n}\gamma(t)\,dt.$$
Combining the above equation with Lemma 12.1, the derivatives of the interval endpoints reduce to the terms in $d\zeta_{n-1}/dc$. Moreover, we have the flow balance equation
$$\int_{\zeta_{n-1}}^{\eta_n} u(t)\,dt = \int_{\zeta_{n-1}}^{\eta_n} \beta(t)\,dt + L_n,$$
in which $L_n$ is the lost volume due to overflow in the FP $[\zeta_{n-1}, \eta_n)$. Noticing that $x(\eta_n; c) = x(\zeta_{n-1}; c) = \theta$, we obtain, after differentiating with respect to $c$ and collecting terms, the stated expression for $dL_T(c)/dc$. □
Proof of Theorem 12.3. According to (12.11) and (12.12), when the system is in Mode 2, the suppressed flow rate is $\alpha(t) - \beta(t)$; when the system is in Mode 3 or 4, the suppressed flow rate is $(1 - c)\,\alpha(t)$. Therefore
$$R_T = \frac{1}{T}\left[\sum_{i \in M_2}\int_{\pi_i}^{\pi_{i+1}}[\alpha(t) - \beta(t)]\,dt + \sum_{i \in M_3 \cup M_4}\int_{\pi_i}^{\pi_{i+1}}(1 - c)\,\alpha(t)\,dt\right].$$
Differentiating with respect to $c$ we obtain:
$$\frac{dR_T}{dc} = \frac{1}{T}\left[\sum_{i \in M_2}\left\{[\alpha(\pi_{i+1}) - \beta(\pi_{i+1})]\frac{d\pi_{i+1}}{dc} - [\alpha(\pi_i) - \beta(\pi_i)]\frac{d\pi_i}{dc}\right\} + \sum_{i \in M_3 \cup M_4}\left\{-\int_{\pi_i}^{\pi_{i+1}}\alpha(t)\,dt + (1 - c)\left[\alpha(\pi_{i+1})\frac{d\pi_{i+1}}{dc} - \alpha(\pi_i)\frac{d\pi_i}{dc}\right]\right\}\right] \quad (12.\mathrm{A}.1)$$
According to Lemma 12.3, $d\pi_{i+1}/dc = 0$ if $i \in M_2$. Thus we get (12.20). □
(12.A.1) According to Lemma 12.3, ^ ± i = 0 if z G M2. Thus we get (12.20). • Proof of Theorem 12.4. We prove the unbiasedness of the IPA derivatives by establishing that the unbiasedness Conditions (i) and (ii) are satisfied for the random functions LT(C), HT(C) and RT(C). Condition (i) is in force by Assumptions 10.1-10.3. Regarding Condition (ii), we have the following flow balance equations for SFMN and SFMp respectively: fT fT fT x(T) - z(0) = / u(t)dt - / 5(t)dr - / Jo Jo Jo and x'(T)-x'(0)=
[ u'(t)dt[ S'(t)dt- I i'(t)dt Jo Jo Jo Combining the above equations and recalling the assumption that a/(0) £;( — x(0) — 0, we obtain: fT
/ Jo
fT
fT
Au(t)dt = Arc(T) + / AS(t)dt + / Ay(t)dt Jo Jo
(12.A.2)
According to Lemma 12.7, $\Delta x(T) \ge 0$. According to Lemma 12.8, $\Delta\delta(t) \ge 0$ and $\Delta\gamma(t) \ge 0$. Therefore,
$$\int_0^T \Delta u(t)\,dt \ge 0.$$
Moreover,
$$\int_0^T \Delta\delta(t)\,dt \le \int_0^T \Delta u(t)\,dt, \qquad \int_0^T \Delta\gamma(t)\,dt \le \int_0^T \Delta u(t)\,dt \qquad (12.\mathrm{A}.3)$$
By Lemma 12.7 we have $x'(t) \ge x(t)$, so that Assumption 12.4 gives $u(t, x'(t); c + \Delta c) \le u(t, x(t); c + \Delta c)$. Thus,
$$\Delta u(t) \le u(t, x(t); c + \Delta c) - u(t, x(t); c).$$
According to (12.31), $u(t, x(t); c + \Delta c) - u(t, x(t); c) \le \alpha_{\max} \cdot \Delta c$. Hence, $\Delta u(t) \le \alpha_{\max} \cdot \Delta c$, which gives
$$\int_0^T \Delta u(t)\,dt \le \alpha_{\max} T \cdot \Delta c.$$
Therefore, from (12.A.3) we get
$$\int_0^T \Delta\gamma(t)\,dt \le \int_0^T \Delta u(t)\,dt \le \alpha_{\max} T \cdot \Delta c$$
and
$$\int_0^T \Delta\delta(t)\,dt \le \int_0^T \Delta u(t)\,dt \le \alpha_{\max} T \cdot \Delta c.$$
In other words, both $L_T(c)$ and $H_T(c)$ are Lipschitz continuous. For $R_T(c)$, recall the flow balance equation
$$\bar{\alpha} = L_T(c) + H_T(c) + R_T(c),$$
where
$$\bar{\alpha} = \frac{1}{T}\int_0^T \alpha(t)\,dt$$
is the time average of $\alpha(t)$, independent of $c$. Hence
$$R_T(c) = \bar{\alpha} - L_T(c) - H_T(c)$$
is also Lipschitz continuous. This completes the proof. □
Proof of Lemma 12.1. If $x(t)$ decreases from $\theta$ at time $\eta_n$, this defines the start of an NBP. From (12.11) and (12.12) we must have $c\,\alpha(\eta_n^-) - \beta(\eta_n^-) > 0$ and $c\,\alpha(\eta_n^+) - \beta(\eta_n^+) < 0$. From Assumption 12.3 we know that $c\,\alpha(\eta_n^-) - \beta(\eta_n^-) > \epsilon$ and $c\,\alpha(\eta_n^+) - \beta(\eta_n^+) < -\epsilon$. Recalling Assumption 12.2, we conclude that a jump in $\alpha(t)$ or $\beta(t)$ occurs at time $\eta_n$. Since $\alpha(t)$ and $\beta(t)$ are independent of $c$, Assumption 12.1 implies that there exists a neighborhood of $c$ within which a change of $c$ does not affect $\eta_n$. This implies that $\eta_n$ is locally independent of $c$, and the result follows. □

Proof of Lemma 12.2. The proof is similar to that of the previous lemma, with the NBP now starting as $x(t)$ leaves the boundary $x = 0$. □
Proof of Lemma 12.3. At the end of an $M_2$ interval, $x(t)$ may either increase or decrease from $\phi$. On one hand, $x(t)$ increasing from $\phi$ at time $\pi_i$ defines the start of an $M_3$ interval. Specifically, from (12.11) and (12.12), we have $c\,\alpha(\pi_i^-) \le \beta(\pi_i^-) \le \alpha(\pi_i^-)$ and $c\,\alpha(\pi_i^+) > \beta(\pi_i^+)$. Since $\alpha(t)$ and $\beta(t)$ are independent of $c$, we conclude that an exogenous event occurs at time $\pi_i$. Moreover, from Assumption 12.3 we know that there exists a neighborhood of $c$ within which a change of $c$ does not affect $\pi_i$. This implies that $\pi_i$ is locally independent of $c$. On the other hand, $x(t)$ decreasing from $\phi$ at time $\pi_i$ defines the start of an $M_1$ interval. Specifically, from (12.11), we have $c\,\alpha(\pi_i^-) \le \beta(\pi_i^-) \le \alpha(\pi_i^-)$ and $\alpha(\pi_i^+) < \beta(\pi_i^+)$, which implies that $\pi_i$ is the occurrence time of an exogenous event and therefore locally independent of $c$. The result follows when we combine the above arguments. □

Proof of Lemma 12.4. If $[\pi_i, \pi_{i+1}) \in M_3$, from (12.11) and (12.12) we have
$$\int_{\pi_i}^{\pi_{i+1}} [c\,\alpha(t) - \beta(t)]\,dt = x(\pi_{i+1}; c) - x(\pi_i; c) \qquad (12.\mathrm{A}.4)$$
Note that $x(\pi_i; c)$ and $x(\pi_{i+1}; c)$ can only take values from the set $\{\theta, \phi\}$. Therefore, differentiating with respect to $c$ we obtain:
$$[c\,\alpha(\pi_{i+1}) - \beta(\pi_{i+1})]\frac{d\pi_{i+1}}{dc} - [c\,\alpha(\pi_i) - \beta(\pi_i)]\frac{d\pi_i}{dc} + \int_{\pi_i}^{\pi_{i+1}} \alpha(t)\,dt = 0,$$
which gives
$$\frac{d\pi_{i+1}}{dc} = \frac{[c\,\alpha(\pi_i^+) - \beta(\pi_i^+)]\,\dfrac{d\pi_i}{dc} - \displaystyle\int_{\pi_i}^{\pi_{i+1}} \alpha(t)\,dt}{c\,\alpha(\pi_{i+1}^-) - \beta(\pi_{i+1}^-)}. \qquad \square$$
Proof of Lemma 12.5. If $[\pi_i, \pi_{i+1}) \in M_1$, from (12.11) and (12.12) we obtain:
$$x(\pi_{i+1}; c) - x(\pi_i; c) = \int_{\pi_i}^{\pi_{i+1}} [\alpha(t) - \beta(t)]\,dt.$$
Note that $x(\pi_i; c)$ and $x(\pi_{i+1}; c)$ can only take values from the set $\{0, \phi\}$. Therefore, differentiating with respect to $c$ we obtain:
$$[\alpha(\pi_{i+1}) - \beta(\pi_{i+1})]\frac{d\pi_{i+1}}{dc} - [\alpha(\pi_i) - \beta(\pi_i)]\frac{d\pi_i}{dc} = 0,$$
which gives
$$\frac{d\pi_{i+1}}{dc} = \frac{\alpha(\pi_i^+) - \beta(\pi_i^+)}{\alpha(\pi_{i+1}^-) - \beta(\pi_{i+1}^-)}\,\frac{d\pi_i}{dc}. \qquad \square$$

Proof of Lemma 12.6. Recall that $\zeta_n = \pi_{n, I_n}$ is the end of an NBP and $x(\zeta_n; c) = 0$ or $\theta$. Therefore $[\pi_{n, I_n - 1}, \pi_{n, I_n})$, the last interval in the NBP, is either an $M_1$ or an $M_3$ interval.

Case 1: If it is an $M_1$ interval, according to Lemma 12.5 we obtain:
$$\frac{d\zeta_n}{dc} = A(\pi_{n, I_n - 1})\,\frac{d\pi_{n, I_n - 1}}{dc}$$
Case 2: If it is an $M_3$ interval, according to Lemma 12.4 we obtain:
$$\frac{d\zeta_n}{dc} = A(\pi_{n, I_n - 1})\,\frac{d\pi_{n, I_n - 1}}{dc} - \frac{x(\pi_{n, I_n}) - x(\pi_{n, I_n - 1}) + H(\pi_{n, I_n - 1}, \pi_{n, I_n})}{c\,\alpha(\zeta_n^-) - \beta(\zeta_n^-)}$$
Since $x(\zeta_n; c) = \theta$, the above equation becomes
$$\frac{d\zeta_n}{dc} = A(\pi_{n, I_n - 1})\,\frac{d\pi_{n, I_n - 1}}{dc} - \frac{\theta - x(\pi_{n, I_n - 1}) + H(\pi_{n, I_n - 1}, \pi_{n, I_n})}{c\,\alpha(\zeta_n^-) - \beta(\zeta_n^-)} \qquad (12.\mathrm{A}.5)$$
Moreover, because $[\pi_{n, I_n - 1}, \pi_{n, I_n})$ is an $M_3$ interval, $x(\pi_{n, I_n - 1}; c) = \theta$ or $\phi$. If $x(\pi_{n, I_n - 1}; c) = \theta = x(\pi_{n, I_n}; c)$, the interval $[\pi_{n, I_n - 1}, \pi_{n, I_n})$ forms an NBP itself and, from Lemmas 12.1 and 12.2, it follows that
$$\frac{d\pi_{n, I_n - 1}}{dc} = 0.$$
We then obtain:
$$\frac{d\zeta_n}{dc} = -\frac{H(\pi_{n, I_n - 1}, \pi_{n, I_n})}{c\,\alpha(\zeta_n^-) - \beta(\zeta_n^-)} \qquad (12.\mathrm{A}.6)$$
Similarly, if $x(\pi_{n, I_n - 1}; c) = \phi$, it follows that
$$\frac{d\zeta_n}{dc} = A(\pi_{n, I_n - 1})\,\frac{d\pi_{n, I_n - 1}}{dc} - \frac{\theta - \phi + H(\pi_{n, I_n - 1}, \pi_{n, I_n})}{c\,\alpha(\zeta_n^-) - \beta(\zeta_n^-)} \qquad (12.\mathrm{A}.7)$$
Combining (12.A.5), (12.A.6) and (12.A.7) we get (12.29). □
Proof of Lemma 12.7. The proof has two steps. First we show that, whenever $x'(t) = x(t)$,
$$\frac{dx'(t^+)}{dt} \ge \frac{dx(t^+)}{dt} \qquad (12.\mathrm{A}.8)$$
Before proceeding, we point out that, according to Assumption 12.5, $u(t, x; c)$ is a monotonically nondecreasing function of $c$ for any fixed $t$ and $x$, i.e., $u(t, x; c) \le u(t, x; c + \Delta c)$. Assume $x'(t) = x(t) = x_0$. The value of $x_0$ can be classified as follows:

Case 1. $x_0 = 0$: In view of Assumption 12.5, $u(t, x_0; c) \le u(t, x_0; c + \Delta c)$. Thus, there are three cases to consider regarding the relative values of $u(t, 0; c)$, $u(t, 0; c + \Delta c)$ and $\beta(t)$:

Case 1.1. $u(t, 0; c) \le u(t, 0; c + \Delta c) \le \beta(t)$: The buffer content will remain empty for both sample paths:
$$\frac{dx'(t^+)}{dt} = \frac{dx(t^+)}{dt} = 0.$$

Case 1.2. $u(t, 0; c) \le \beta(t) < u(t, 0; c + \Delta c)$: According to (12.11), the nominal path remains empty while the perturbed path starts to fill, so that $dx'(t^+)/dt > 0 = dx(t^+)/dt$.
Case 1.3. $\beta(t) < u(t, 0; c) \le u(t, 0; c + \Delta c)$: According to (12.11),
$$\frac{dx'(t^+)}{dt} = u(t, 0; c + \Delta c) - \beta(t) \ge u(t, 0; c) - \beta(t) = \frac{dx(t^+)}{dt}.$$

Case 2. $0 < x_0 < \theta$: In this case, (12.11) gives
$$\frac{dx(t^+)}{dt} = u(t, x_0; c) - \beta(t) \quad\text{and}\quad \frac{dx'(t^+)}{dt} = u(t, x_0; c + \Delta c) - \beta(t),$$
which implies
$$\frac{dx'(t^+)}{dt} \ge \frac{dx(t^+)}{dt}$$
according to Assumption 12.5.

Case 3. $x_0 = \theta$: There are three sub-cases to consider:

Case 3.1. $u(t, \theta; c) \le u(t, \theta; c + \Delta c) \le \beta(t)$: According to (12.11) and Assumption 12.5,
$$\frac{dx'(t^+)}{dt} = u(t, \theta; c + \Delta c) - \beta(t) \ge u(t, \theta; c) - \beta(t) = \frac{dx(t^+)}{dt}.$$

Case 3.2. $u(t, \theta; c) \le \beta(t) < u(t, \theta; c + \Delta c)$: According to (12.11), SFM$_P$ will stay at $\theta$, so that
$$\frac{dx'(t^+)}{dt} = 0,$$
while SFM$_N$ will drop from $\theta$, so that
$$\frac{dx(t^+)}{dt} = u(t, \theta; c) - \beta(t) \le 0.$$
Combining the above two equations gives
$$\frac{dx'(t^+)}{dt} \ge \frac{dx(t^+)}{dt}.$$

Case 3.3. $\beta(t) < u(t, \theta; c) \le u(t, \theta; c + \Delta c)$: According to (12.11), both sample paths will stay at $\theta$, i.e.,
$$\frac{dx'(t^+)}{dt} = \frac{dx(t^+)}{dt} = 0.$$

Combining all of the above cases, we conclude that
$$\frac{dx'(t^+)}{dt} \ge \frac{dx(t^+)}{dt}$$
whenever $x'(t) = x(t)$, which implies that SFM$_P$ will never go below SFM$_N$ for any $t \ge r$ when $x(r) = x'(r)$. Moreover, when $t = 0$, $x'(0) = x(0) = 0$. It follows that $\Delta x(t) \ge 0$ for all $t \ge 0$. □
Proof of Lemma 12.8. For $\Delta\delta(t)$, regarding all possible value combinations of $x(t)$ and $x'(t)$ we have the following cases:

Case 1. $x'(t) \ge x(t) > 0$: According to (12.8), $\delta'(t) = \delta(t) = \beta(t)$;
Case 2. $x'(t) > 0$, $x(t) = 0$: According to (12.8), $\delta(t) = \min(u(t), \beta(t)) \le \beta(t) = \delta'(t)$;

Case 3. $x'(t) = 0$, $x(t) > 0$: According to Lemma 12.7, this case is impossible;

Case 4. $x'(t) = 0$, $x(t) = 0$: According to (12.8), $\delta(t) = \min(u(t), \beta(t))$ and $\delta'(t) = \min(u'(t), \beta(t))$. Moreover, Assumption 12.5 gives $u(t) = u(t, 0; c) \le u(t, 0; c + \Delta c) = u'(t)$, so $\delta(t) \le \delta'(t)$.

Combining the above four cases gives $\Delta\delta(t) \ge 0$. Similarly, $\Delta\gamma(t) \ge 0$. □
Proof of Theorem 12.5. We have the following flow balance equations for SFM$_N$ and SFM$_P$ respectively:
$$x(t) - x(0) = \int_0^t u(\tau)\,d\tau - \int_0^t \delta(\tau)\,d\tau - \int_0^t \gamma(\tau)\,d\tau, \quad \text{for all } t \ge 0$$
and
$$x'(t) - x'(0) = \int_0^t u'(\tau)\,d\tau - \int_0^t \delta'(\tau)\,d\tau - \int_0^t \gamma'(\tau)\,d\tau, \quad \text{for all } t \ge 0.$$
Combining the above equations and recalling that $x'(0) = x(0) = 0$, we obtain:
$$\Delta x(t) = \int_0^t \Delta u(\tau)\,d\tau - \int_0^t \Delta\delta(\tau)\,d\tau - \int_0^t \Delta\gamma(\tau)\,d\tau \qquad (12.\mathrm{A}.9)$$
Recalling Lemma 12.8, $\Delta\delta(\tau) \ge 0$ and $\Delta\gamma(\tau) \ge 0$ for all $\tau$, $0 \le \tau \le t$. Then, (12.A.9) implies
$$\Delta x(t) \le \int_0^t \Delta u(\tau)\,d\tau \qquad (12.\mathrm{A}.10)$$
On the other hand,
$$\Delta u(\tau) = u(\tau, x'(\tau); c + \Delta c) - u(\tau, x(\tau); c) \qquad (12.\mathrm{A}.11)$$
Lemma 12.7 gives $x'(\tau) \ge x(\tau)$, from which we obtain
$$u(\tau, x'(\tau); c + \Delta c) \le u(\tau, x(\tau); c + \Delta c)$$
according to Assumption 12.4. Combining the above inequality with (12.A.11) gives
$$\Delta u(\tau) \le u(\tau, x(\tau); c + \Delta c) - u(\tau, x(\tau); c) \qquad (12.\mathrm{A}.12)$$
According to Assumption 12.6,
$$u(\tau, x(\tau); c + \Delta c) - u(\tau, x(\tau); c) \le K\Delta c \qquad (12.\mathrm{A}.13)$$
Combining (12.A.12) and (12.A.13) we get:
$$\Delta u(\tau) \le K\Delta c \qquad (12.\mathrm{A}.14)$$
Thus, from (12.A.10) and (12.A.14) we obtain:
$$\Delta x(t) \le \int_0^t \Delta u(\tau)\,d\tau \le Kt\Delta c \le KT\Delta c \quad \text{for all } t,\ 0 \le t \le T. \qquad \square$$
References

Akella, R. and Kumar, P.R. (1986). Optimal control of production rate in a failure prone manufacturing system. IEEE Transactions on Automatic Control, AC-31:116-126.
Anick, D., Mitra, D., and Sondhi, M. (1982). Stochastic theory of a data-handling system with multiple sources. The Bell System Technical Journal, 61:1871-1894.
Avignon, G.R.D. and Disney, R.L. (1977/78). Queues with instantaneous feedback. Management Science, 24:168-180.
Cassandras, C.G. and Lafortune, S. (1999). Introduction to Discrete Event Systems. Kluwer Academic Publishers.
Cassandras, C.G., Sun, C., Panayiotou, C.G., and Wardi, Y. (2003). Perturbation analysis and control of two-class stochastic fluid models for communication networks. IEEE Transactions on Automatic Control, 48(5):770-782.
Cassandras, C.G., Wardi, Y., Melamed, B., Sun, G., and Panayiotou, C.G. (2002). Perturbation analysis for online control and optimization of stochastic fluid models. IEEE Transactions on Automatic Control, 47(8):1234-1248.
Cruz, R. (1991). A calculus for network delay, Part I: Network elements in isolation. IEEE Transactions on Information Theory.
Floyd, S. and Jacobson, V. (1993). Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1:397-413.
Foley, R.D. and Disney, R.L. (1983). Queues with delayed feedback. Advances in Applied Probability, 15(1):162-182.
Glasserman, P. (1991). Gradient Estimation via Perturbation Analysis. Kluwer Academic Publishers.
Ho, Y. and Cao, X. (1991). Perturbation Analysis of Discrete Event Dynamic Systems. Kluwer Academic Publishers, Boston, MA.
Jacobson, V. (1988). Congestion avoidance and control. Computer Communication Review, 18:314-329.
Kelly, F., Maulloo, A., and Tan, D. (1998). Rate control in communication networks: Shadow price, proportional fairness and stability. Journal of the Operational Research Society.
Kesidis, G., Singh, A., Cheung, D., and Kwok, W. (1996). Feasibility of fluid-driven simulation for ATM networks. Proceedings of IEEE Globecom, 3:2013-2017.
Kleinrock, L. (1975). Queueing Systems, vol. I: Theory. Wiley-Interscience.
Kobayashi, H. and Ren, Q. (1992). A mathematical theory for transient analysis of communications networks. IEICE Transactions on Communications, E75-B:1266-1276.
Kumaran, K. and Mitra, D. (1998). Performance and fluid simulations of a novel shared buffer management system. In: Proceedings of IEEE INFOCOM.
Kushner, H. and Yin, G. (1997). Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York, NY.
Leland, W.E., Taqqu, M.S., Willinger, W., and Wilson, D.V. (1993). On the self-similar nature of ethernet traffic. In: ACM SIGCOMM, pp. 183-193.
Liu, B., Guo, Y., Kurose, J., Towsley, D., and Gong, W. (1999). Fluid simulation of large scale networks: Issues and tradeoffs. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada.
Liu, Y. and Gong, W. (1999). Perturbation analysis for stochastic fluid queueing systems. In: Proceedings of the 38th IEEE Conference on Decision and Control, pp. 4440-4445.
Low, S. (2000). A duality model of TCP and queue management algorithms. In: Proceedings of ITC Specialist Seminar on IP Traffic Measurement, Modeling and Management.
Paxson, V. and Floyd, S. (1995). Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3:226-244.
Perkins, J. and Srikant, R. (1999). The role of queue length information in congestion control and resource pricing. In: Proceedings of the 38th Conference on Decision and Control.
Perkins, J. and Srikant, R. (2001). Failure-prone production systems with uncertain demand. IEEE Transactions on Automatic Control, 46:441-449.
Rubinstein, R.Y. and Shapiro, A. (1993). Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. John Wiley and Sons, New York, NY.
Sun, G., Cassandras, C.G., Wardi, Y., and Panayiotou, C.G. (2003). Perturbation analysis of stochastic flow networks. In: Proceedings of the 42nd IEEE Conference on Decision and Control, pp. 4831-4836.
Takacs, L. (1963). A single-server queue with feedback. Bell System Technical Journal, 42:505-519.
Wardi, Y., Melamed, B., Cassandras, C.G., and Panayiotou, C.G. (2002). IPA gradient estimators in single-node stochastic fluid models. Journal of Optimization Theory and Applications, 115(2):369-406.
Wortman, M.A., Disney, R.L., and Kiessler, P. (1991). The M/GI/1 Bernoulli feedback queue with vacations. Queueing Systems Theory and Applications, 9(4):353-363.
Yan, A. and Gong, W. (1999). Fluid simulation for high-speed networks with flow-based routing. IEEE Transactions on Information Theory, 45:1588-1599.
Yu, H. and Cassandras, C.G. (2003). Perturbation analysis of feedback-controlled stochastic flow systems. In: Proceedings of the 42nd IEEE Conference on Decision and Control, pp. 6277-6282.
Yu, H. and Cassandras, C.G. (2004). Perturbation analysis for production control and optimization of manufacturing systems. Automatica, 40:945-956.
Chapter 13

COMPARING LOCALITY OF REFERENCE - SOME FOLK THEOREMS FOR THE MISS RATE AND THE OUTPUT OF CACHES

Armand M. Makowski
Sarut Vanichpun

Abstract   The performance of demand-driven caching is known to depend on the locality of reference exhibited by the stream of requests made to the cache. In spite of numerous efforts, no consensus has been reached on how to formalize this notion, let alone on how to compare streams of requests on the basis of their locality of reference. We take on this issue with an eye towards validating operational expectations associated with the notion of locality of reference. We focus on two "folk theorems," namely (i) the stronger the locality of reference, the smaller the miss rate of the cache; (ii) good caching is expected to produce an output stream of requests exhibiting less locality of reference than the input stream of requests. We discuss these two folk theorems in the context of a cache operating under a demand-driven replacement policy when document requests are modeled according to the Independent Reference Model (IRM). We propose to measure the strength of locality of reference in a stream of requests through the skewness of its popularity distribution, and to use the notion of majorization to capture this degree of skewness. We show that these folk theorems hold for caches operating under a large class of cache replacement policies, including the optimal policy A0 and the random policy, but may fail under the LRU policy.

1. Introduction
Web caching aims to reduce network traffic, server load and user-perceived retrieval latency by replicating "popular" content on (proxy)
caches that are strategically placed within the network, e.g., Wang (1999) (and references therein). This approach is a natural outgrowth of caching techniques which were originally developed for computer memory and distributed file sharing systems, e.g., Aven et al. (1987); Coffman and Denning (1973); Phalke and Gopinath (1995) (and references therein). However, the exponential growth of the World Wide Web and its specific circumstances are challenging current cache architectures to meet the complementary mandates of speed, scalability and reliability which are central to delivering a satisfactory user experience. Although these challenges have renewed interest in caching in general, some basic issues are still not well understood. Indeed, the performance of any form of caching is determined by a number of factors, chief amongst them the statistical properties of the streams of requests made to the cache. One important such property is the locality of reference present in a stream of requests, whereby "bursts of references are made in the near future to objects referenced in the recent past." The importance of locality for caching was first recognized by Belady (1966) in the context of computer memory, and attempts at characterizing it were made early on by Denning (1968) through the working set model. Recently, a number of studies have shown that streams of requests for Web objects exhibit strong locality of reference [1] (see, e.g., Jin and Bestavros (2000b); Mahanti et al. (2000)). Like the notion of burstiness used in traffic modeling, locality of reference, while endowed with a clear intuitive content, admits no simple definition. Thus, and not surprisingly, in spite of numerous efforts, no consensus has been reached on how to formalize the notion, let alone compare streams of requests on the basis of their locality of reference. [2] To the best of the authors' knowledge, this lack of consensus has precluded the formal derivation of the following "folk theorems":

1. Folk theorem on miss rates - The stronger the locality of reference in the stream of requests, the smaller the miss rate, since the cache ends up being populated by objects with a higher likelihood of access in the near future. Such a property, if true, would confirm the central role played by locality of reference in shaping cache performance. In fact, the very presence of locality of reference in the stream of requests is what makes caching at all possible; and

2. Folk theorem on output streams - Good cache replacement strategies "absorb" locality of reference to a certain extent by producing a stream of misses from the cache (its so-called output) which exhibits less locality of reference than the input stream of requests. In the context of multi-level caching, this reduction property is often perceived as one of the main reasons why caching loses its effectiveness after some level in a hierarchy of caches.

[1] At least in the short timescales.
[2] An exception can be found in a recent paper by Fonseca et al. (2003).

Such folk theorems are expected to hold for demand-driven caching that exploits recency of reference. Interest in establishing them under a specific definition of locality of reference stems from a desire to validate its operational significance. Counterexamples would cast some doubt as to whether the particular definition indeed captures the intuitive meaning of locality of reference. Such a program has been carried out for a number of key notions of traffic engineering: For instance, the convex stochastic orderings were shown to capture the notion of variability, in the process leading to various proofs that "determinism minimizes waiting times," e.g., Baccelli and Makowski (1989). More recently, the theory of multivariate stochastic orderings has been used to formalize the belief that positive correlations lead to larger buffer levels at a discrete-time infinite-capacity multiplexer queue, viz. if the input traffic is larger than its independent version in the supermodular ordering, then the corresponding buffer contents are similarly ordered in the increasing convex ordering. This has been demonstrated for a number of basic traffic models in Vanichpun and Makowski (2002) and Vanichpun and Makowski (2004c). In this chapter we survey and extend recent results obtained by the authors concerning a formal investigation of the folk theorems mentioned earlier, albeit in a simple framework. The results for miss rates and output streams are available in Vanichpun and Makowski (2004a) and Vanichpun and Makowski (2004b), respectively. Most proofs have been omitted, but the interested reader is referred to these papers and to the thesis by Vanichpun (2005) for additional information.
The chapter is organized as follows: We start with a brief discussion of the main contributions to locality of reference in Section 2. The basic model of cache management is given in Section 3. The miss rate and output of a cache are discussed in Sections 4 and 5, respectively. Majorization and the companion notion of Schur-convexity are introduced in Sections 6 and 7, respectively. We obtain the basic comparison results for the output in Section 8. A large class of demand-driven eviction policies called Random On-demand Replacement Algorithms (RORA) is defined in Section 9, and their ergodic properties are investigated in Section 10. The comparison results for the miss rates and outputs under RORAs are given in Sections 11 and 12, respectively. Zipf-like distributions are discussed in Section 13. Comparison results for the miss rate
and output under the LRU policy are collected in Sections 14 and 15, respectively.
2. Modeling and comparing locality of reference
The presentation is developed in the context of demand-driven caching, to be introduced more formally in Section 3: Given a universe of $N$ cacheable documents, the system is composed of a server where a copy of each of these $N$ documents is available, and of a cache of size $M$ ($1 \le M < N$).³ Documents are first requested by the user at the cache. If the requested document has a copy already in cache, this copy is downloaded from the cache by the user. If the requested document is not in cache, a copy is requested instead from the server to be put in the cache. If the cache is already full, then a document already in cache is evicted to make place for the copy of the document just requested.
2.1 Contributions to locality of reference
Our first task consists in identifying the notion of locality of reference to be used here. We begin with the widely accepted observation that the two main contributors to locality of reference are temporal correlations in the streams of requests and the popularity distribution of requested objects. To describe these two sources of locality, we assume the following generic setup which is used throughout: The $N$ cacheable items, or documents, are labeled $i = 1, \dots, N$, and we write $\mathcal{N} := \{1, \dots, N\}$. The successive requests arriving at the cache are modeled by a sequence $\{R_t,\ t = 0, 1, \dots\}$ of $\mathcal{N}$-valued rvs.

1. The popularity of the sequence of requests $\{R_t,\ t = 0, 1, \dots\}$ is defined as the pmf $p = (p(1), \dots, p(N))$ on $\mathcal{N}$ given by

$$p(i) := \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} 1[R_\tau = i] \quad \text{a.s.}, \qquad i = 1, \dots, N \tag{13.1}$$

whenever these limits exist (and they do in most models treated in the literature).

2. Temporal correlations are more delicate to define due to the "categorical" nature of the requests $\{R_t,\ t = 0, 1, \dots\}$. Indeed, it is somewhat meaningless to use the covariance function
³ Typically $M \ll N$.
as a way to capture these temporal correlations, as is traditionally done in other contexts. This is because the rvs $\{R_t,\ t = 0, 1, \dots\}$ take values in a discrete set. We took $\{1, \dots, N\}$ but could have selected any set of $N$ distinct points in an arbitrary space. Thus, the actual values of the rvs $\{R_t,\ t = 0, 1, \dots\}$ are of no consequence, and the focus should instead be on the recurrence patterns exhibited by requests for particular documents over time. The literature contains several metrics to do this, including the inter-reference time of Phalke and Gopinath (1995), the working set size of Denning (1968) and the stack distance, see e.g., Almeida et al. (1996). We focus exclusively on popularity as the measure of locality of reference.⁴ In fact, to isolate its contribution, we deal here with the situation where there are no temporal correlations in the stream of requests, as would be the case under the so-called Independence Reference Model (IRM). More precisely, under the IRM with popularity pmf $p = (p(1), \dots, p(N))$, the successive requests $\{R_t,\ t = 0, 1, \dots\}$ form a sequence of i.i.d. $\mathcal{N}$-valued rvs, each distributed according to the pmf $p$, i.e.,

$$P[R_t = i] = p(i), \qquad i = 1, \dots, N \tag{13.2}$$

for all $t = 0, 1, \dots$, and (13.1) holds with the given pmf $p$ by the Law of Large Numbers. IRMs do display locality of reference even though there are no temporal correlations. This is best appreciated by considering the limiting cases: If $p$ is extremely unbalanced with $p = (1 - \delta, \varepsilon, \dots, \varepsilon)$ (with $\delta = (N-1)\varepsilon$), a reference to document 1 is likely to be followed by a burst of additional references to document 1 provided $(N-1)\varepsilon \ll 1$. It seems natural to deem this situation as one exhibiting very strong locality of reference. The exact opposite conclusion holds if the popularity pmf $p$ were uniform, i.e., $p(1) = \dots = p(N) = \frac{1}{N}$, for then the successive requests $\{R_t,\ t = 0, 1, \dots\}$ form a truly random sequence, in which case there is no locality of reference.
Thus, the skewness of $p$ appears to act as an indicator of the strength of locality of reference present in the stream, under the intuition that the more "balanced" the pmf $p$, the weaker the locality of reference.

⁴ Results related to how temporal correlations affect locality of reference can be found in the thesis by Vanichpun (2005).
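To make the IRM concrete, the following small Python sketch (ours, not from the chapter; names such as `irm_requests` are our own) draws i.i.d. requests according to (13.2) and recovers the popularity pmf empirically via (13.1):

```python
import random
from collections import Counter

def irm_requests(p, t_max, rng):
    """Draw t_max i.i.d. requests from {1, ..., N} with pmf p (the IRM, Eq. (13.2))."""
    return rng.choices(range(1, len(p) + 1), weights=p, k=t_max)

# Unbalanced pmf p = (1 - delta, eps, ..., eps) with delta = (N - 1) * eps.
N, eps = 5, 0.02
p = [1 - (N - 1) * eps] + [eps] * (N - 1)

rng = random.Random(1)
reqs = irm_requests(p, 100_000, rng)

# Empirical popularity per Eq. (13.1): long-run fraction of requests for each i.
counts = Counter(reqs)
p_hat = [counts[i] / len(reqs) for i in range(1, N + 1)]
print(p_hat)  # close to p, by the Law of Large Numbers
```

With this skewed pmf, bursts of repeated references to document 1 are visible directly in `reqs`, matching the discussion of the limiting cases above.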
2.2 Comparing locality of reference via majorization — The big picture
As we restrict ourselves to the class of IRMs, the question naturally arises as to whether popularity pmfs can be compared on the basis of their skewness. More formally, consider two IRMs with popularity pmfs $p$ and $q$ (on $\mathcal{N}$), and let $M(p)$ and $M(q)$ denote their miss rates (13.8) under a given cache replacement policy as defined in Section 4. We seek a way to formally compare the pmfs $p$ and $q$, with the interpretation that if $p$ is less skewed than $q$, then the IRM with popularity pmf $p$ has less locality of reference than the IRM with popularity pmf $q$, and the folk theorem on miss rates holds as

$$M(q) \le M(p). \tag{13.3}$$
In Section 6 we turn to the concept of majorization as a way to characterize such an imbalance in the components of popularity pmfs. Motivated by our earlier discussion, we shall say that the IRM with popularity pmf $p$ has less locality of reference than the IRM with popularity pmf $q$ if $p$ is majorized by $q$, written $p \prec q$. As elegantly demonstrated in the monograph of Marshall and Olkin (1979), this notion has found widespread use in many diverse branches of mathematics and their applications. What is more, comparison results such as (13.3) can now be explored through the rich and structured class of monotone functions associated with majorization, the so-called Schur-convex/concave functions introduced in Section 7. In fact, the comparison (13.3) is essentially a statement concerning the Schur-concavity of a certain functional. Within this framework, if $p^\star$ denotes the popularity pmf of the output from the cache (defined in Section 5), then the folk theorem on the stream of misses takes the form

$$p^\star \prec p. \tag{13.4}$$

Both statements (13.3) and (13.4) were investigated by the authors in the context of a cache operating under a demand-driven replacement policy when document requests are modeled according to the IRM; this chapter presents some of the findings. While the discussion given here is restricted to the class of IRMs, we believe that similar results may hold for more general input models.⁵
⁵ This may not be too much of a limitation given that the IRM is the most basic request model; it is often used for checking various properties, see e.g., Breslau et al. (1999). Moreover, recent results by Jelenkovic and Radovanovic (2003) suggest some form of insensitivity to the statistics of streams of requests. Of course, more work along these lines is needed.
3. Demand-driven caching
We return now to the universe $\mathcal{N} = \{1, \dots, N\}$ of $N$ cacheable documents introduced earlier. The system is composed of a server which holds a copy of each of these $N$ documents, and of a cache of size $M$ ($1 \le M < N$). Documents are first requested by the user at the cache: If the requested document has a copy already in cache (i.e., a hit), this copy is downloaded from the cache by the user. If the requested document is not in cache (i.e., a miss), a copy is requested instead from the server to be put in the cache (which then forwards it to the user). If the cache is already full, then a document already in cache is evicted to make place for the copy of the document just requested. The document selected for eviction is determined through a cache replacement or eviction policy.⁶ We now develop a mathematical framework for demand-driven caching in order to address some of the issues discussed in this chapter. Additional details are available in the monographs by Aven et al. (1987) and by Coffman and Denning (1973). We begin with some notation that will be used repeatedly: Let $A^*(M; \mathcal{N})$ be the collection of all unordered subsets of size $M$ of $\mathcal{N}$, and let $A(M; \mathcal{N})$ be the collection of all ordered sequences of $M$ distinct elements from $\mathcal{N}$. We write $\{i_1, \dots, i_M\}$ (resp. $(i_1, \dots, i_M)$) to denote an element of $A^*(M; \mathcal{N})$ (resp. $A(M; \mathcal{N})$).

3.1 A simple framework
Consecutive user requests are modeled by a sequence of $\mathcal{N}$-valued rvs $\{R_t,\ t = 0, 1, \dots\}$. For simplicity we say that request $R_t$ occurs at time $t = 0, 1, \dots$. Let $S_t$ denote the cache just before time $t$, so that $S_t$ is a subset of $\mathcal{N}$ with at most $M$ elements. The decision to be performed according to the eviction policy in force is the identity $U_t$ of the document in $S_t$ which needs to be evicted in order to make room for the request $R_t$ (if the cache is already full). Demand-driven caching considered here is then characterized by the dynamics⁷

$$S_{t+1} = \begin{cases} S_t & \text{if } R_t \in S_t \\ S_t + R_t & \text{if } R_t \notin S_t,\ |S_t| < M \\ S_t - U_t + R_t & \text{if } R_t \notin S_t,\ |S_t| = M \end{cases} \tag{13.5}$$

⁶ We use the terms interchangeably.
⁷ Here, and throughout, $|A|$ denotes the cardinality of the set $A$.
for all $t = 0, 1, \dots$, where $S_t - U_t + R_t$ denotes the subset of $\{1, \dots, N\}$ obtained from $S_t$ by removing $U_t$ and then adding $R_t$ to it, in that order. These dynamics reflect the following operational assumptions: (i) actions are taken only at the times requests are made, hence the expression demand-driven caching; (ii) a requested document not in cache is always added to the cache if the cache is not full at the time of request; and (iii) eviction is mandatory if the request $R_t$ is not in cache $S_t$ and the cache $S_t$ is full, i.e., $|S_t| = M$.
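The dynamics (13.5) translate almost verbatim into code. The sketch below is our own illustration: the `evict` callable stands in for the policy $\pi$, here instantiated with uniformly random eviction, and requests are drawn uniformly for simplicity.

```python
import random

def cache_step(cache, r, evict, M, rng):
    """One step of the demand-driven dynamics (13.5).

    cache : list of distinct documents currently held (|cache| <= M)
    r     : the request R_t
    evict : callable choosing the victim U_t from a full cache (the policy)
    """
    if r in cache:                 # hit: S_{t+1} = S_t
        return cache
    if len(cache) < M:             # miss, cache not full: S_{t+1} = S_t + R_t
        return cache + [r]
    u = evict(cache, rng)          # miss, cache full: S_{t+1} = S_t - U_t + R_t
    return [d for d in cache if d != u] + [r]

# Random eviction policy: the victim is drawn uniformly from the cache.
random_evict = lambda cache, rng: rng.choice(cache)

rng = random.Random(7)
cache, M, N = [], 3, 6
for _ in range(1000):
    cache = cache_step(cache, rng.randint(1, N), random_evict, M, rng)
print(sorted(cache))
```

Note that once the cache fills, it stays full and holds $M$ distinct documents at all times, in line with assumption (iii).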
3.2 Admissible IRMs and reduced dynamics
Throughout, the stream of requests $\{R_t,\ t = 0, 1, \dots\}$ is modeled according to the standard Independence Reference Model (IRM) with popularity pmf $p = (p(1), \dots, p(N))$. To avoid uninteresting situations, it is always the case that

$$p(i) > 0, \qquad i = 1, \dots, N. \tag{13.6}$$

A pmf $p$ on $\{1, \dots, N\}$ satisfying (13.6) is said to be admissible. Under this non-triviality condition (13.6), every document will eventually be requested by virtue of (13.1). Thus, as we have in mind to study long-term characteristics under demand-driven replacement policies, there is no loss of generality in assuming (as we do from now on) that the cache is full, i.e., for all $t = 0, 1, \dots$, we have $|S_t| = M$ and (13.5) simplifies to

$$S_{t+1} = \begin{cases} S_t & \text{if } R_t \in S_t \\ S_t - U_t + R_t & \text{if } R_t \notin S_t. \end{cases} \tag{13.7}$$

3.3 Cache states and eviction policies
Consider a given eviction policy $\pi$ which determines the decisions $\{U_t,\ t = 0, 1, \dots\}$ used in (13.7). We assume that the cache dynamics can be characterized through the evolution of suitably defined variables $\{\Omega_t,\ t = 0, 1, \dots\}$, where $\Omega_t$ is known as the state of the cache at time $t$. The cache state is specific to the eviction policy and is selected with the following in mind: (i) The set $S_t$ of documents in the cache at time $t$ can be recovered from $\Omega_t$; (ii) the cache state $\Omega_{t+1}$ is fully determined through the knowledge of the triple $(\Omega_t, R_t, U_t)$ in a way that is compatible with the dynamics (13.7); and (iii) the eviction decision $U_t$ at time $t$ can be expressed as a function of the past $(\Omega_0, R_0, U_0, \dots, \Omega_{t-1}, R_{t-1}, U_{t-1}, \Omega_t, R_t)$ (possibly through suitable randomization), i.e., for each $t = 0, 1, \dots$, there exists a mapping $\pi_t$ such that

$$U_t = \pi_t(\Omega_0, R_0, U_0, \dots, \Omega_{t-1}, R_{t-1}, U_{t-1}, \Omega_t, R_t; \Xi_t)$$
where the rv $\Xi_t$ is taken independent of the past $(\Omega_0, R_0, \dots, \Omega_{t-1}, R_{t-1}, \Omega_t, R_t)$. Collectively the mappings $\{\pi_t,\ t = 0, 1, \dots\}$ define the eviction policy $\pi$. We close this section with some examples of eviction policies which have been discussed in the literature, see e.g., the monographs by Aven et al. (1987) and by Coffman and Denning (1973): According to the random policy, when the cache is full, the document to be evicted is selected randomly from the cache according to the uniform distribution. Any permutation $\sigma$ of $\{1, \dots, N\}$ induces an ordering of the documents by considering the documents $\sigma(1), \sigma(2), \dots, \sigma(N)$ as "ranked" in decreasing order. This ranking of the documents allows us to define the eviction policy $A_\sigma$ as follows: When at time $t = 0, 1, \dots$, the cache $S_t$ is full and the requested document $R_t$ is not in the cache, the policy $A_\sigma$ prescribes the eviction of the document $U_t$ given by $U_t = \arg\max\,(\sigma^{-1}(j) : j \in S_t)$. The documents $\sigma(1), \dots, \sigma(M-1)$, once loaded in the cache, will remain there. The optimal policy $A_0$ is the policy $A_{\sigma^\star}$ induced by a permutation $\sigma^\star$ which ranks the documents in decreasing order of popularity, i.e., $p(\sigma^\star(1)) \ge p(\sigma^\star(2)) \ge \dots \ge p(\sigma^\star(N))$. Under the random policy and the policies $A_\sigma$, we can take the cache state to be the (unordered) set of documents in the cache, i.e., the cache state is an element of $A^*(M; \mathcal{N})$ and $\Omega_t = S_t$ for all $t = 0, 1, \dots$. The FIFO policy replaces the document which has been in cache for the longest time, while the LRU policy evicts the least recently requested document already in cache. The definitions of the FIFO and LRU policies require that the cache state be an element of $A(M; \mathcal{N})$, with $\Omega_t$ being a permutation of the elements in $S_t$ for all $t = 0, 1, \dots$.
4. The miss rate of a cache
A standard performance metric to compare various caching policies is the miss rate of the cache. This quantity has the interpretation of being the long-term frequency of the event that the requested document is not in the cache, and therefore determines the effectiveness of a caching policy.
Under the cache replacement policy $\pi$, the miss rate $M_\pi(p)$ is defined as the a.s. limit

$$M_\pi(p) = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} 1[R_\tau \notin S_\tau] \quad \text{a.s.} \tag{13.8}$$

where $S_\tau$ denotes the set of documents in the cache operating under the replacement policy $\pi$ at time $\tau$ when the input to the cache is the request stream $\{R_t,\ t = 0, 1, \dots\}$. Almost sure convergence in (13.8) (and elsewhere) is taken under the probability measure on the sequence of rvs $\{\Omega_t, R_t, U_t,\ t = 0, 1, \dots\}$ induced by the underlying IRM with popularity pmf $p$ through the eviction policy $\pi$. Under most cache replacement policies of interest, the limit (13.8) exists and admits a simple expression under the assumption that the a.s. limit

$$\mu_\pi(s; p) := \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} 1[S_\tau = s] \quad \text{a.s.} \tag{13.9}$$

exists for each element $s$ in $A^*(M; \mathcal{N})$. Although the limits (13.8) and (13.9) are often constants which are independent of the initial cache state $\Omega_0$, this is not always the case, as will be seen in the discussion of RORA policies in Sections 9 and 11.

THEOREM 13.1 Consider an eviction policy $\pi$ such that the limits (13.9) exist under the IRM with popularity pmf $p$. Then, the limit (13.8) exists and is given by

$$M_\pi(p) = \sum_{i=1}^{N} p(i) \sum_{s \in A^*_i(M; \mathcal{N})} \mu_\pi(s; p) \tag{13.10}$$

$$= \sum_{s \in A^*(M; \mathcal{N})} \mu_\pi(s; p) \sum_{i \notin s} p(i) \tag{13.11}$$

where $A^*_i(M; \mathcal{N})$ denotes the set of elements in $A^*(M; \mathcal{N})$ which do not contain $i$, i.e.,

$$A^*_i(M; \mathcal{N}) := \{s = \{i_1, \dots, i_M\} \in A^*(M; \mathcal{N}) : i \notin s\}.$$

Theorem 13.1 is a standard result under the IRM; its proof can also be found in Vanichpun (2005). The existence of the limits (13.9) is a mild assumption which is satisfied under all eviction policies of interest considered here (and in the literature). Indeed, under the IRM with popularity pmf $p$, the cache states $\{\Omega_t,\ t = 0, 1, \dots\}$ typically form a Markov chain over a finite state space, and standard ergodic results readily yield the existence of the limits (13.9). This issue will be briefly discussed in each situation at the appropriate time.
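As a numerical sanity check (ours, not from the chapter), the limit (13.8) can be estimated by simulation and, for the random eviction policy, compared against the closed form $(M+1)E_{M+1,N}(p)/E_{M,N}(p)$ derived in Section 11 as (13.33); the helper `esf` computes the elementary symmetric function (13.19):

```python
import itertools
import math
import random

def esf(M, x):
    """Elementary symmetric function E_{M,N}(x) of Eq. (13.19)."""
    return sum(math.prod(c) for c in itertools.combinations(x, M))

def miss_rate_sim(p, M, t_max, rng):
    """Estimate the miss rate (13.8) under the random eviction policy."""
    N = len(p)
    cache, misses = [], 0
    for _ in range(t_max):
        r = rng.choices(range(N), weights=p, k=1)[0]
        if r not in cache:              # a miss
            misses += 1
            if len(cache) == M:
                cache.remove(rng.choice(cache))
            cache.append(r)
    return misses / t_max

p, M = [0.4, 0.3, 0.15, 0.1, 0.05], 2
est = miss_rate_sim(p, M, 200_000, random.Random(3))
# Exact value (M + 1) E_{M+1,N}(p) / E_{M,N}(p), Eq. (13.33) of Section 11.
exact = (M + 1) * esf(M + 1, p) / esf(M, p)
print(est, exact)
```

The two numbers agree to within Monte Carlo error, which is consistent with Theorem 13.1 applied to the stationary distribution of the random policy.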
Under the IRM $\{R_t,\ t = 0, 1, \dots\}$ with popularity pmf $p$, the policy $A_0$ induced by $p$ has the smallest miss rate among all eviction policies based on past cache states and requests. This can be shown by a standard Dynamic Programming argument once it is recognized that the pair $(S_t, R_t)$ constitutes a natural state for the underlying Markov decision process (MDP) (Aven et al., 1987, p. 122).
5. The output of a cache
Under the demand-driven caching operation (13.7), the output of the cache is the sequence of requests that incur a miss, i.e., those for which the incoming request cannot find the desired document in the cache. More precisely, a miss occurs at time $t$ if $R_t$ is not in $S_t$. Thus, we define recursively the time indices $\{\nu_k,\ k = 0, 1, \dots\}$ by

$$\nu_0 = 0; \qquad \nu_{k+1} := \nu_k + \eta_{k+1}, \quad k = 0, 1, \dots$$

with

$$\eta_{k+1} := \inf\{t = 1, 2, \dots : R_{\nu_k + t} \notin S_{\nu_k + t}\},$$

where we use the convention $\eta_{k+1} = \infty$ if either $\nu_k = \infty$ or if $\nu_k$ is finite but the set of indices entering the definition of $\eta_{k+1}$ is empty. With $\delta$ denoting an element not in $\mathcal{N}$, we define the output process $\{R^\star_k,\ k = 1, 2, \dots\}$ simply as

$$R^\star_k := \begin{cases} R_{\nu_k} & \text{if } \nu_k < \infty \\ \delta & \text{if } \nu_k = \infty \end{cases}$$

for each $k = 1, 2, \dots$. The requests $\{R^\star_k,\ k = 1, 2, \dots\}$ are those requests among $\{R_t,\ t = 0, 1, \dots\}$ which incur a miss and which get forwarded to the server (or to a higher-level cache in a hierarchical caching system). The statistics of the output stream $\{R^\star_k,\ k = 1, 2, \dots\}$ are determined by the statistics of the input stream $\{R_t,\ t = 0, 1, \dots\}$ and by the cache replacement policy $\pi$ in use. We are interested in evaluating the popularity pmf $p^\star = (p^\star(1), \dots, p^\star(N))$ of the output, defined by

$$p^\star(i) := \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} 1[R^\star_k = i] \quad \text{a.s.} \tag{13.12}$$

for each $i = 1, \dots, N$, whenever these limits exist. The remainder of this section is devoted to the existence and form of the limits (13.12).

THEOREM 13.2 Consider an eviction policy $\pi$ such that the limits (13.9) exist under the IRM with popularity pmf $p$. For each $i = 1, \dots, N$, the
limit (13.12) exists and is given by

$$p^\star_\pi(i) = \frac{p(i)\, m_\pi(i; p)}{\sum_{j=1}^{N} p(j)\, m_\pi(j; p)} \tag{13.13}$$

where we have set

$$m_\pi(i; p) := \sum_{s \in A^*_i(M; \mathcal{N})} \mu_\pi(s; p). \tag{13.14}$$

A proof of Theorem 13.2 is given in Vanichpun and Makowski (2004b). Note that the existence of the limits (13.9) implies

$$m_\pi(i; p) = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} 1[i \notin S_\tau] \quad \text{a.s.} \tag{13.15}$$

for each $i = 1, \dots, N$, and $m_\pi(i; p)$ thus represents the fraction of time that document $i$ will not be in the cache. This quantity is determined by the popularity pmf $p$ of the input to the cache and by the eviction policy $\pi$ in use. Inspection of (13.10) and (13.14) reveals that

$$\sum_{i=1}^{N} p(i)\, m_\pi(i; p) = M_\pi(p)$$

and this leads via (13.13) to a simple connection between the miss rate of an eviction policy and the pmf of its output, in the form

$$p^\star_\pi(i) = \frac{p(i)\, m_\pi(i; p)}{M_\pi(p)}, \qquad i = 1, \dots, N. \tag{13.16}$$

Thus, the frequency $p^\star_\pi(i)$ can be viewed as the ratio between the miss rate of the cache when the requested document is $i$ and the overall miss rate of the cache.
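The flattening effect behind the folk theorem (13.4) can be observed directly in simulation. The sketch below (ours; function names are our own) estimates the output popularity $p^\star$ of Eq. (13.12) for the random eviction policy:

```python
import random

def output_popularity_sim(p, M, t_max, rng):
    """Estimate the output popularity p* of Eq. (13.12): the empirical
    popularity of the missed requests, under the random eviction policy."""
    N = len(p)
    cache, miss_counts, n_miss = [], [0] * N, 0
    for _ in range(t_max):
        r = rng.choices(range(N), weights=p, k=1)[0]
        if r not in cache:              # a miss: R_t joins the output stream
            n_miss += 1
            miss_counts[r] += 1
            if len(cache) == M:
                cache.remove(rng.choice(cache))
            cache.append(r)
    return [c / n_miss for c in miss_counts]

p, M = [0.5, 0.25, 0.15, 0.1], 2
p_star = output_popularity_sim(p, M, 200_000, random.Random(5))
print(p_star)  # the largest mass shrinks and the smallest grows: p* is flatter than p
```

Per (13.16), each component of $p^\star$ is $p(i)m_\pi(i;p)/M_\pi(p)$: popular documents (rarely absent from the cache, small $m_\pi(i;p)$) are discounted, and unpopular ones are boosted, which is exactly the balancing expressed by $p^\star \prec p$.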
6. Majorization — A primer
The concept of majorization provides a powerful tool to formalize statements concerning the relative skewness in the components of two vectors, viz., that the components $(x_1, \dots, x_N)$ of the vector $x$ are "less spread out" or "more balanced" than the components $(y_1, \dots, y_N)$ of the vector $y$: For vectors $x$ and $y$ in $\mathbb{R}^N$, we say that $x$ is majorized by $y$, and write $x \prec y$, whenever the conditions

$$\sum_{i=1}^{n} x_{[i]} \le \sum_{i=1}^{n} y_{[i]}, \qquad n = 1, 2, \dots, N-1 \tag{13.17}$$

and

$$\sum_{i=1}^{N} x_i = \sum_{i=1}^{N} y_i \tag{13.18}$$

hold, with $x_{[1]} \ge x_{[2]} \ge \dots \ge x_{[N]}$ and $y_{[1]} \ge y_{[2]} \ge \dots \ge y_{[N]}$ denoting the components of $x$ and $y$ arranged in decreasing order, respectively. We begin with a sufficient condition for majorization which is extracted from the discussion in (Marshall and Olkin, 1979, B.1, p. 129).

PROPOSITION 13.1 Let $x$ and $y$ be distinct elements of $\mathbb{R}^N$ such that (13.18) holds. Whenever $x_1 \ge x_2 \ge \dots \ge x_N$, if there exists some $k = 1, \dots, N-1$ such that $x_i \le y_i$, $i = 1, \dots, k$, and $x_i \ge y_i$, $i = k+1, \dots, N$, then the comparison $x \prec y$ holds.

The following sufficient condition for majorization will be useful in the sequel; it was already announced without proof in (Marshall and Olkin, 1979, B.1.b, p. 129).

THEOREM 13.3 Let $x$ and $y$ be distinct elements of $\mathbb{R}^N$ such that (13.18) holds. Whenever $x_1 \ge x_2 \ge \dots \ge x_N > 0$, and the ratios $\frac{y_i}{x_i}$, $i = 1, \dots, N$, are decreasing in $i$, we have the comparison $x \prec y$.

With any element $x$ of $\mathbb{R}^N$ such that $\sum_{i=1}^{N} x_i \ne 0$, we associate the normalized vector $\bar{x}$ as the element of $\mathbb{R}^N$ defined by

$$\bar{x}_i := \frac{x_i}{\sum_{j=1}^{N} x_j}, \qquad i = 1, \dots, N.$$

With this notation we can present a useful corollary to Theorem 13.3.

COROLLARY 13.1 Let $x$ and $y$ be distinct elements of $\mathbb{R}^N$ such that $\sum_{i=1}^{N} x_i > 0$ and $\sum_{i=1}^{N} y_i > 0$. Whenever $x_1 \ge x_2 \ge \dots \ge x_N > 0$, and the ratios $\frac{y_i}{x_i}$, $i = 1, \dots, N$, are decreasing in $i$, we have the comparison $\bar{x} \prec \bar{y}$.
The following reformulation of Corollary 13.1 is used in the sequel.

LEMMA 13.1 Let $x$ and $y$ be distinct elements of $\mathbb{R}^N$ such that $x_i > 0$ for each $i = 1, \dots, N$ and $\sum_{i=1}^{N} y_i > 0$. If $\frac{y_i}{x_i} \ge \frac{y_j}{x_j}$ whenever $x_i > x_j$ for distinct $i, j = 1, \dots, N$, then the comparison $\bar{x} \prec \bar{y}$ holds.
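The definition (13.17)-(13.18) is straightforward to implement. The helper below (our sketch) will be convenient for checking comparisons such as $p^\star \prec p$ numerically:

```python
def majorized(x, y, tol=1e-12):
    """Check x -< y per (13.17)-(13.18): the partial sums of the decreasingly
    sorted components of x never exceed those of y, and the totals agree."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    if abs(sum(xs) - sum(ys)) > tol:    # condition (13.18)
        return False
    cx = cy = 0.0
    for a, b in zip(xs[:-1], ys[:-1]):  # condition (13.17), n = 1, ..., N-1
        cx, cy = cx + a, cy + b
        if cx > cy + tol:
            return False
    return True

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(majorized(uniform, skewed), majorized(skewed, uniform))  # True False
```

In particular, the uniform pmf is majorized by every pmf on the same set, matching the intuition that it carries the weakest locality of reference.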
7. Schur-convexity
Key to the power of majorization is the companion notion of monotonicity associated with it: An $\mathbb{R}$-valued function $\varphi : A \to \mathbb{R}$ defined on a subset $A$ of $\mathbb{R}^N$ is said to be Schur-convex (resp. Schur-concave) on $A$ if

$$\varphi(x) \le \varphi(y) \quad (\text{resp. } \varphi(x) \ge \varphi(y))$$

whenever $x$ and $y$ are elements in $A$ satisfying $x \prec y$. In other words, Schur-convexity (resp. Schur-concavity) corresponds to monotone increasingness (resp. decreasingness) for majorization (viewed as a preorder on subsets of $\mathbb{R}^N$). Let $\sigma$ denote a permutation of $\{1, \dots, N\}$. With any element $x$ in $\mathbb{R}^N$, we associate the permuted vector $\sigma(x)$ in $\mathbb{R}^N$ through the relation $\sigma(x) = (x_{\sigma(1)}, \dots, x_{\sigma(N)})$. Let $\{\sigma_i,\ i = 1, \dots, N!\}$ be a given enumeration of all the $N!$ permutations of $\{1, \dots, N\}$; this enumeration is held fixed throughout the chapter. A subset $A$ of $\mathbb{R}^N$ is said to be symmetric if for any $x$ in $A$, the element $\sigma_i(x)$ also belongs to $A$ for each $i = 1, \dots, N!$. Moreover, for any subset $A$ of $\mathbb{R}^N$, a mapping $\varphi : A \to \mathbb{R}$ is said to be symmetric if $A$ is symmetric and for any $x$ in $A$, we have $\varphi(\sigma_i(x)) = \varphi(x)$ for each $i = 1, \dots, N!$. If the mapping $\varphi : A \to \mathbb{R}$ is Schur-convex (resp. Schur-concave) with symmetric $A$, then $\varphi$ is necessarily symmetric, since $\sigma_i(x) \prec x \prec \sigma_i(x)$ implies $\varphi(\sigma_i(x)) = \varphi(x)$. For each $M = 1, \dots, N$, the elementary symmetric function $E_{M,N} : \mathbb{R}^N \to \mathbb{R}$ is defined by

$$E_{M,N}(x) := \sum_{\{i_1, \dots, i_M\} \in A^*(M; \mathcal{N})} x_{i_1} \cdots x_{i_M}, \qquad x \in \mathbb{R}^N. \tag{13.19}$$

By convention we write $E_{0,N}(x) = 1$ for all $x$ in $\mathbb{R}^N$. It is well known that the function $E_{M,N}$ is Schur-concave on $\mathbb{R}_+^N$ for each $M = 0, 1, \dots, N$ (Marshall and Olkin, 1979, Prop. F.1, p. 78). The following result is due to Schur (Marshall and Olkin, 1979, F.3, p. 80) and will be key to a number of proofs.
PROPOSITION 13.2 For each $M = 1, \dots, N$, the mapping $\Phi_{M,N} : \mathbb{R}_+^N \to \mathbb{R}$ given by⁸

$$\Phi_{M,N}(x) := \frac{E_{M,N}(x)}{E_{M-1,N}(x)}$$

is increasing,⁹ symmetric and concave, thus increasing and Schur-concave on $\mathbb{R}_+^N$.
With vectors $t$ and $x$ in $\mathbb{R}^N$, we associate the element $t \bullet x$ of $\mathbb{R}^N$ defined by $t \bullet x := (t_1 x_1, \dots, t_N x_N)$. With this notation we can state

PROPOSITION 13.3 Assume the mapping $\psi : \mathbb{R}^{N!} \to \mathbb{R}$ to be concave and the mapping $h : \mathbb{R}_+^N \to \mathbb{R}$ to be increasing, symmetric and concave. For any non-zero vector $t$ in $\mathbb{R}_+^N$, the mapping $\varphi_t : \mathbb{R}_+^N \to \mathbb{R}$ defined by

$$\varphi_t(x) := \psi\big(h(t \bullet \sigma_1(x)), \dots, h(t \bullet \sigma_{N!}(x))\big)$$

is symmetric and concave, thus Schur-concave on $\mathbb{R}_+^N$.
8. Comparing input and output
Recall that we have in mind to compare the strength of locality of reference in two streams of requests through a majorization ordering of their popularity pmfs. The next result constitutes a first step in the process of comparing input and output popularity pmfs.

THEOREM 13.4 Consider an eviction policy $\pi$ such that the limits (13.9) exist under the IRM with popularity pmf $p$. If $m_\pi(i; p) \ge m_\pi(j; p)$ whenever $p(i)\, m_\pi(i; p) \le p(j)\, m_\pi(j; p)$ for distinct $i, j = 1, \dots, N$, then it holds that $p^\star \prec p$, provided $m_\pi(i; p) > 0$ for each $i = 1, \dots, N$.

Proof. This claim is a simple consequence of Lemma 13.1: We take $y = p$ and $x$ given by $x_i = p(i)\, m_\pi(i; p)$, $i = 1, \dots, N$. Thus, we have $\bar{x} = p^\star_\pi$ while $\bar{y} = p$, and the requisite monotonicity assumptions hold. □

The assumptions of Theorem 13.4 ensure that $m_\pi(i; p) \le m_\pi(j; p)$ and $p(j) \le p(i)$ occur simultaneously for distinct $i, j = 1, \dots, N$. This leads to defining a caching algorithm $\pi$ as good if for every admissible pmf $p$, we have $m_\pi(i; p) \le m_\pi(j; p)$ whenever $p(j) \le p(i)$ for distinct $i, j = 1, \dots, N$. Thus, a caching policy which satisfies the assumptions of Theorem 13.4 is necessarily a good policy. However, as we shall see in

⁸ For $x$ in $\mathbb{R}_+^N$ such that $E_{M-1,N}(x) = 0$, we have $E_{M,N}(x) = 0$, and we set $\Phi_{M,N}(x) = 0$ by continuity.
⁹ Here, increasing means increasing in each argument.
the case of the LRU policy in Section 15, this by itself is not sufficient to ensure that the output popularity pmf is more balanced than the input popularity pmf. Repeatedly we shall encounter output pmfs which assume the generic form used in Theorem 13.5 below.

THEOREM 13.5 Let $p$ be an admissible pmf on $\mathcal{N}$, and for each $i = 1, \dots, N$, define the $(N-1)$-dimensional vector $p^{(i)}$ by

$$p^{(i)} := (p(1), \dots, p(i-1), p(i+1), \dots, p(N)).$$

For each $M = 1, 2, \dots, N-1$, the pmf $p^\star_M$ on $\mathcal{N}$ defined by

$$p^\star_M(i) := \frac{p(i)\, E_{M,N-1}(p^{(i)})}{\sum_{j=1}^{N} p(j)\, E_{M,N-1}(p^{(j)})}, \qquad i = 1, \dots, N \tag{13.20}$$

satisfies the comparison $p^\star_M \prec p$. A proof of this theorem builds on Lemma 13.1 and is given in Vanichpun and Makowski (2004b).
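Theorem 13.5 lends itself to a quick numerical check (ours; `p_star_M` implements the generic form (13.20), with weights $p(i)E_{M,N-1}(p^{(i)})$, and `majorized` tests (13.17)-(13.18)):

```python
import itertools
import math

def esf(M, x):
    """Elementary symmetric function E_{M,N}(x) of Eq. (13.19)."""
    return sum(math.prod(c) for c in itertools.combinations(x, M))

def p_star_M(p, M):
    """Generic output pmf (13.20): p*_M(i) proportional to p(i) E_{M,N-1}(p^(i)),
    where p^(i) drops the i-th component of p."""
    w = [p[i] * esf(M, p[:i] + p[i + 1:]) for i in range(len(p))]
    total = sum(w)
    return [v / total for v in w]

def majorized(x, y, tol=1e-12):
    """x -< y in the sense of (13.17)-(13.18)."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    if abs(sum(xs) - sum(ys)) > tol:
        return False
    cx = cy = 0.0
    for a, b in zip(xs[:-1], ys[:-1]):
        cx, cy = cx + a, cy + b
        if cx > cy + tol:
            return False
    return True

p = [0.45, 0.25, 0.2, 0.1]
for M in range(1, len(p)):
    assert majorized(p_star_M(p, M), p)  # Theorem 13.5: p*_M -< p
print(p_star_M(p, 2))
```

For $M = N-1$ the weights all collapse to $\prod_j p(j)$, so $p^\star_{N-1}$ is uniform, the extreme case of the balancing asserted by the theorem.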
9. Random on-demand replacement
We now introduce a large class of demand-driven eviction policies called Random On-demand Replacement Algorithms (RORA). This class of policies generalizes many well-known caching policies, e.g., the random and FIFO policies, as well as the optimal policy $A_0$. Moreover, the Partially Preloaded Random Replacement Algorithms proposed by Gelenbe (1973) form a subclass of RORAs. A RORA policy follows the demand-driven caching rule (13.7) (under the customary assumption that the cache is initially full) and is characterized by an eviction/insertion pmf $r$ which we organize as the $M \times M$ matrix $r = (r_{k\ell})$, i.e., for each $k, \ell = 1, \dots, M$, we have $r_{k\ell} \ge 0$ and $\sum_{k=1}^{M} \sum_{\ell=1}^{M} r_{k\ell} = 1$. Thereafter we refer to the RORA associated with the pmf matrix $r$ as the RORA($r$) policy. We select the cache state $\Omega_t$ at time $t$ to be an element $(i_1, \dots, i_M)$ of $A(M; \mathcal{N})$, with the understanding that document $i_k$, $k = 1, \dots, M$, is in cache position $k$ at time $t$. The RORA($r$) policy implements the following eviction rule: Introduce a sequence of i.i.d. rvs $\{(X_t, Y_t),\ t = 0, 1, \dots\}$ taking values in $\{1, \dots, M\} \times \{1, \dots, M\}$ with common pmf $r$, i.e., for each $t = 0, 1, \dots$, we have

$$P[(X_t, Y_t) = (k, \ell)] = r_{k\ell}, \qquad k, \ell = 1, \dots, M.$$
The sequences of rvs $\{(X_t, Y_t),\ t = 0, 1, \dots\}$ and $\{R_t,\ t = 0, 1, \dots\}$ are assumed mutually independent. With $\Omega_t = (i_1, \dots, i_M)$, the document $U_t$ to be evicted at time $t$ is given by

$$U_t = i_{X_t} \cdot 1[R_t \notin S_t].$$

We have $U_t = 0$ whenever $R_t$ is in $S_t$, in line with the convention that then no replacement occurs and the cache state remains unchanged, i.e., $\Omega_{t+1} = \Omega_t$. Next, if $R_t$ is not in $S_t$ and $(X_t, Y_t) = (k, \ell)$, then $U_t = i_k$ (the document at position $k$ is evicted) and the new document is inserted in the cache at position $\ell$. If $k < \ell$, the documents $i_{k+1}, \dots, i_\ell$ are shifted down to positions $k, k+1, \dots, \ell-1$ (in that order), while if $k > \ell$, the documents $i_\ell, \dots, i_{k-1}$ are shifted up to positions $\ell+1, \dots, k$ (in that order). When $k = \ell$, the new document simply replaces the evicted document at position $k$. A document initially at position $i$ in the cache will never be replaced if

$$r_{k\ell} = 0 \quad \begin{cases} \text{for all } k = 1, \dots, i \text{ and } \ell = i, \dots, M, \text{ and} \\ \text{for all } \ell = 1, \dots, i \text{ and } k = i, \dots, M. \end{cases} \tag{13.21}$$

If we use row $i$ and column $i$ to partition the matrix $r$ into four blocks, then condition (13.21) expresses the fact that the entries in the two off-diagonal corner blocks all vanish (including row $i$ and column $i$). Let $\Sigma_r$ denote the set of cache positions with the property that any document initially put there will never be evicted during the operation of the cache, i.e.,

$$\Sigma_r := \{i = 1, \dots, M : \text{Eqn. (13.21) holds at } i\}. \tag{13.22}$$

Throughout, let $m$ denote the number of elements in $\Sigma_r$, i.e., $m = |\Sigma_r|$. Under the IRM with popularity pmf $p$, the cache states $\{\Omega_t,\ t = 0, 1, \dots\}$ form a Markov chain on the state space $A(M; \mathcal{N})$. The ergodic properties of this chain are determined by whether the set $\Sigma_r$ is empty ($m = 0$) or not ($m \ne 0$). This is discussed in Lemmas 13.2 and 13.3 in the next two sections; proofs are available in Vanichpun (2005). Throughout the discussion we always assume that the cache size $M$ and the number of cacheable documents $N$ satisfy $M + 1 < N$. We do so in order to avoid technical cases of limited interest. Indeed, the results here are still valid for the case $N = M + 1$, but require slightly different arguments. We refer the interested reader to Vanichpun (2005) for details.
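The RORA($r$) transition rule, including the position shifts, can be sketched as follows (our illustration; the FIFO matrix with $r_{1M} = 1$ serves as the example, and requests are drawn uniformly for simplicity):

```python
import random

def rora_step(state, r, rmat, rng):
    """One transition of the cache state under a RORA(r) policy.

    state : tuple (i_1, ..., i_M), document i_k in cache position k
    r     : the request R_t
    rmat  : M x M eviction/insertion pmf, rmat[k][l] = P[(X_t, Y_t) = (k+1, l+1)]
    """
    if r in state:                       # hit: state unchanged
        return state
    M = len(state)
    pairs = [(k, l) for k in range(M) for l in range(M)]
    weights = [rmat[k][l] for k, l in pairs]
    k, l = rng.choices(pairs, weights=weights, k=1)[0]
    s = list(state)
    del s[k]       # evict the document at position k; documents beyond k shift
    s.insert(l, r) # insert the new document at position l, shifting the rest
    return tuple(s)

M, N = 3, 6
fifo = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]   # r_{1M} = 1: the FIFO policy
rng = random.Random(11)
state = tuple(range(1, M + 1))
for _ in range(1000):
    state = rora_step(state, rng.randint(1, N), fifo, rng)
print(state)
```

The delete-then-insert pair reproduces both shift cases of the rule: for $k < \ell$ the intermediate documents move down one position, and for $k > \ell$ they move up one position, with the requested document landing at position $\ell$.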
10. Ergodic properties under RORAs

10.1 Case 1 (m = 0)
The set $\Sigma_r$ is empty ($m = 0$), so that every document in cache is eventually replaced, i.e., for each position $i = 1, \dots, M$, there exists a pair $(k, \ell)$ (possibly depending on $i$) with either $k \le i \le \ell$ or $\ell \le i \le k$ such that $r_{k\ell} > 0$. Here are some well-known policies which fall in this case: The random policy corresponds to the RORA($r$) policy with $r$ given by $r_{kk} = \frac{1}{M}$ for each $k = 1, \dots, M$. The FIFO policy also belongs to RORA, with two possibilities for $r$, namely $r_{1M} = 1$ or $r_{M1} = 1$. The first (resp. second) choice corresponds to the cache state $(i_1, \dots, i_M)$ being loaded from left to right with documents ordered from the oldest to the most recent (resp. from the most recent to the oldest). In this case, the Markov chain $\{\Omega_t,\ t = 0, 1, \dots\}$ is ergodic on the state space $A(M; \mathcal{N})$; its stationary distribution exists and is given in the following lemma.

LEMMA 13.2 Assume the input to be an IRM with popularity pmf $p$. For any RORA($r$) policy with $\Sigma_r$ empty, the cache states $\{\Omega_t,\ t = 0, 1, \dots\}$ form an ergodic Markov chain on the state space $A(M; \mathcal{N})$ with stationary pmf on $A(M; \mathcal{N})$ given by

$$\nu_r(s; p) := \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} 1[\Omega_\tau = s] = C(p)^{-1}\, p(i_1)\, p(i_2) \cdots p(i_M) \quad \text{a.s.} \tag{13.23}$$

for every $s = (i_1, \dots, i_M)$ in $A(M; \mathcal{N})$, with normalizing constant

$$C(p) := \sum_{(i_1, \dots, i_M) \in A(M; \mathcal{N})} p(i_1)\, p(i_2) \cdots p(i_M). \tag{13.24}$$

The stationary pmf is the same for all RORAs in Case 1.
10.2 Case 2 (m ≠ 0)
The set $\Sigma_r$ is not empty ($m \ne 0$), and some documents, once put in cache, will never be replaced during the operation of the cache, i.e., if $\Omega_0 = (i_1, \dots, i_M)$, then for all $t = 1, 2, \dots$, with $\Omega_t = (j_1, \dots, j_M)$, we have

$$j_\ell = i_\ell, \qquad \ell \in \Sigma_r. \tag{13.25}$$

Here are some examples of RORA policies in that category: As pointed out in Section 3.3, any permutation $\sigma$ of $\{1, \dots, N\}$ induces an eviction
policy $A_\sigma$ which evicts the "smallest" document in cache, with documents $\sigma(1), \sigma(2), \dots, \sigma(N)$ "ranked" in decreasing order. The documents $\sigma(1), \dots, \sigma(M-1)$, once loaded in the cache, will remain there, so that the corresponding cache positions belong to $\Sigma_r$. Fix an initial cache state $s_0 = (i_1, \dots, i_M)$ in $A(M; \mathcal{N})$, and let $\Sigma_r(s_0)$ denote the set of documents held at the positions in $\Sigma_r$, i.e.,

$$\Sigma_r(s_0) := \{i_\ell : \ell \in \Sigma_r\}. \tag{13.26}$$

The component of $s_0$ is the subset $A(r, s_0)$ of $A(M; \mathcal{N})$ defined by

$$A(r, s_0) := \{(j_1, \dots, j_M) \in A(M; \mathcal{N}) : j_\ell = i_\ell,\ \ell \in \Sigma_r\}. \tag{13.27}$$
In view of (13.25), once the cache state is in $A(r, s_0)$, it remains there forever. In fact, all the states in the component $A(r, s_0)$ communicate with each other, and this set of states is closed under the motion of the Markov chain. There are $\binom{N-m}{M-m}(M-m)!$ elements in $A(r, s_0)$, and there are $\binom{N}{m}m!$ distinct components which form a partition of $A(M; \mathcal{N})$. As a result, when restricted to $A(r, s_0)$, this Markov chain is irreducible and aperiodic, and its ergodic behavior can be characterized as follows:

LEMMA 13.3 Assume the input to be an IRM with popularity pmf $p$. For any RORA($r$) policy with $|\Sigma_r| = m$ for some $m = 1, \dots, M-1$, and initial cache state $s_0$, the cache states $\{\Omega_t,\ t = 0, 1, \dots\}$ form an ergodic Markov chain on the component $A(r, s_0)$. In particular, the limit

$$\nu_{r,s_0}(s; p) := \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} 1[\Omega_\tau = s] \quad \text{a.s.}$$

always exists for every $s = (i_1, \dots, i_M)$ in $A(M; \mathcal{N})$, and is given by

$$\nu_{r,s_0}(s; p) = \begin{cases} C_r(p, s_0)^{-1} \prod_{\ell \notin \Sigma_r} p(i_\ell) & \text{if } s \in A(r, s_0) \\ 0 & \text{otherwise} \end{cases} \tag{13.28}$$
with normalizing constant

$$C_r(p, s_0) := \sum_{(i_1, \dots, i_M) \in A(r, s_0)} \prod_{\ell \notin \Sigma_r} p(i_\ell). \tag{13.29}$$
11. The miss rate under RORAs

11.1 Case 1 (m = 0)
Fix $s = \{i_1, \dots, i_M\}$ in $A^*(M; \mathcal{N})$, and let $A(s|M; \mathcal{N})$ denote the subset of $A(M; \mathcal{N})$ defined by

$$A(s|M; \mathcal{N}) := \{(j_1, \dots, j_M) \in A(M; \mathcal{N}) : \{j_1, \dots, j_M\} = \{i_1, \dots, i_M\}\}.$$

By Lemma 13.2, the limit (13.9) exists and is given by

$$\mu_r(s; p) = \sum_{(j_1, \dots, j_M) \in A(s|M; \mathcal{N})} \nu_r((j_1, \dots, j_M); p) = C(p)^{-1} M!\; p(i_1)\, p(i_2) \cdots p(i_M) \tag{13.30}$$
with normalizing constant C(p) given by (13.24). The equality at (13.30) follows from the fact that there are M! elements in A(s\M;J\f). Using (13.30) in conjunction with Theorem 13.1, we readily conclude that under the RORA(r) policy of Case 1, the miss rate (13.8) exists as a constant which is independent of the initial cache state so. To acknowledge this fact, we simply denote this limiting constant by Mr(p). Specializing (13.11) leads to
Mr(p) = C{p)-lM\
J2
P(h)---p(iM)
X) P I^{U,...,JM}
p(ii)---p(*M+i)
N(p)
(13-31)
while the normalizing constant C(p) in (13.24) can be simplified as
= M\-EMAP)-
(13-32)
Combining (13.31) and (13.32) we finally get

M_r(p) = (M + 1) · E_{M+1,N}(p) / E_{M,N}(p) = (M + 1) Φ_{M+1,N}(p)    (13.33)

and a straightforward application of Proposition 13.2 yields

THEOREM 13.6 Under any RORA(r) policy with Σ_r empty, for admissible pmfs p and q on 𝒩, it holds that M_r(q) ≤ M_r(p) whenever p ≺ q.
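Both the closed form (13.33) and the monotonicity asserted in Theorem 13.6 are easy to check numerically, reading E_{k,N}(p) as the elementary symmetric polynomial of degree k in p(1), ..., p(N). A minimal Python sketch (helper names are ours):

```python
from itertools import combinations
from math import prod

def esp(p, k):
    """Elementary symmetric polynomial E_{k,N}(p): sum over all k-subsets
    of documents of the product of their probabilities."""
    return sum(prod(c) for c in combinations(p, k))

def miss_rate_case1(p, M):
    """Case-1 RORA miss rate (13.33): (M+1) E_{M+1,N}(p) / E_{M,N}(p)."""
    return (M + 1) * esp(p, M + 1) / esp(p, M)
```

For the uniform pmf on N documents the formula collapses to (N − M)/N, and for a majorized pair p ≺ q the more skewed pmf q yields the smaller miss rate, as Theorem 13.6 predicts.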
11.2    Case 2 (m ≠ 0)

Consider now the RORA(r) policy when the set Σ_r is not empty, say with |Σ_r| = m for some m = 1, ..., M − 1, and let the cache be initially in state s_0 in Λ(M; 𝒩). By Lemma 13.3, for each s = {i_1, ..., i_M} in Λ*(M; 𝒩), the limit (13.9) exists and is given by

μ_{r,s_0}(s; p) = Σ_{(j_1,...,j_M) ∈ Λ(s|r,s_0)} π_{r,s_0}(j_1, ..., j_M; p)    (13.34)

where Λ(s|r, s_0) denotes the subset of Λ(r, s_0) defined by

Λ(s|r, s_0) := {(j_1, ..., j_M) ∈ Λ(r, s_0) : {j_1, ..., j_M} = {i_1, ..., i_M}}.

The set Λ(s|r, s_0) is non-empty if and only if

Σ_r(s_0) ⊆ {i_1, ..., i_M}    (13.35)

so that μ_{r,s_0}(s; p) = 0 whenever this inclusion (13.35) does not hold. With this in mind we define

Λ*(r, s_0) := {s = {i_1, ..., i_M} ∈ Λ*(M; 𝒩) : (13.35) holds at s}.

Going back to (13.28) and (13.29), for each s = {i_1, ..., i_M} in Λ*(r, s_0), we now conclude that

μ_{r,s_0}(s; p) = C_r(p, s_0)^{−1} (M − m)! ∏_{i ∈ s ∖ Σ_r(s_0)} p(i)    (13.36)
where in the last equality we combined the set equality {j_1, ..., j_M} = {i_1, ..., i_M} with (13.35), and then made use of the identity |Λ(s|r, s_0)| = (M − m)!.

Now, using (13.36) in conjunction with Theorem 13.1, we see that under the RORA(r) policy of Case 2 the miss rate (13.8) exists as a constant which depends on the initial cache state s_0. We record this fact in the notation by denoting this limiting constant by M_r(p; s_0). As in Case 1, specializing (13.11) leads to

M_r(p; s_0) = C_r(p, s_0)^{−1} (M − m)! Σ_{{i_1,...,i_M} ∈ Λ*(r,s_0)} ( ∏_{i ∈ {i_1,...,i_M} ∖ Σ_r(s_0)} p(i) ) ( Σ_{j ∉ {i_1,...,i_M}} p(j) )
            = C_r(p, s_0)^{−1} (M − m)! (M − m + 1) E_{M−m+1,N}(t · p)    (13.37)

where the element t in R^N is specified by t_i = 0 for document i in Σ_r(s_0) and t_i = 1 otherwise. Moreover, by the same arguments as in Case 1, we can simplify the normalizing constant C_r(p, s_0) as

C_r(p, s_0) = (M − m)! · E_{M−m,N}(t · p).    (13.38)

It then follows from (13.37) and (13.38) that

M_r(p; s_0) = (M − m + 1) · E_{M−m+1,N}(t · p) / E_{M−m,N}(t · p).    (13.39)
Clearly, the documents in Σ_r(s_0) do not contribute to the miss rate since they never generate a miss once loaded in cache, and this regardless of the order in which they appear in the cache state s_0. This intuitively obvious fact is in agreement with the expression (13.39), from which we see that for any two initial cache states s_0 and s_0′ in Λ(M; 𝒩) with Σ_r(s_0) = Σ_r(s_0′), we have the equality M_r(p; s_0) = M_r(p; s_0′). As a result, we shall find it appropriate to denote this common value by M_{r,Σ_r(s_0)}(p).

For any pmf p on 𝒩, let Σ*(p) denote the set of the m most popular documents according to the pmf p. Equipped with the expression (13.39), we are now ready to establish the result for RORA policies in Case 2.

THEOREM 13.7 Under any RORA(r) policy with |Σ_r| = m for some m = 1, ..., M − 1, for admissible pmfs p and q on 𝒩, it holds that

M_{r,Σ*(q)}(q) ≤ M_{r,Σ*(p)}(p) whenever p ≺ q.    (13.40)
Proof. The desired result will be established if we can show that the miss rate function p → M_{r,Σ_r(s_0)}(p) as given in (13.39) is Schur-concave whenever s_0 is selected so that Σ_r(s_0) = Σ*(p). As we can always relabel the documents, there is no loss of generality in assuming p(1) ≥ p(2) ≥ ... ≥ p(N), whence Σ*(p) = {1, ..., m} and the element t in (13.39) can be specified as t_1 = ... = t_m = 0 and t_{m+1} = ... = t_N = 1. By Proposition 13.2, the mapping y → Φ_{M−m+1,N}(y) is increasing and Schur-concave on R_+^N, and by virtue of the defining property of Σ*(p), we have

M_{r,Σ*(p)}(p) = (M − m + 1) min_{t ∈ T_m} Φ_{M−m+1,N}(t · p),

where T_m denotes the collection of the N_m := (N choose m) elements t of {0,1}^N with exactly m zero components. The mapping h : R^{N_m} → R : y → min(y_1, ..., y_{N_m}) is clearly increasing, symmetric and concave, while the mapping Φ_{M−m+1,N} is concave on R_+^N by Proposition 13.2. Combining these facts with the expression for M_{r,Σ*(p)}(p) obtained above, we conclude by Proposition 13.3 to the Schur-concavity (in the pmf vector) of the miss rate functional (13.39) under the RORA(r) policy when Σ_r(s_0) = Σ*(p). □
12.    The output under RORAs
We now discuss the popularity pmf of the output generated under the RORA policies.
12.1    Case 1 (m = 0)
As we invoke Theorem 13.2, we can make use of the expression (13.30) in the relation (13.14). For each i = 1, ..., N, in the notation of Theorem 13.5, this yields

m_r(i; p) = Σ_{s ∈ Λ*(M;𝒩): i ∉ s} μ_r(s; p) = E_{M,N}(p)^{−1} Σ_{s ∈ Λ*(M;𝒩): i ∉ s} ∏_{j ∈ s} p(j)    (13.41)

where the last equality follows from (13.32). Reporting (13.41) back into (13.13), we conclude that the popularity pmf p*_r of the output produced by the RORA(r) policy in Case 1 is indeed of the form (13.20), and Theorem 13.5 yields

THEOREM 13.8 Under any RORA(r) policy with Σ_r empty, it holds that p*_r ≺ p.
When M = 1, any demand-driven policy π reduces to the policy that evicts the only document in cache if the requested document is not in cache. Specializing the results above, we find that the output pmf p*_π is given by

p*_π(i) = p(i)(1 − p(i)) / Σ_{j=1}^{N} p(j)(1 − p(j)),    i = 1, ..., N,    (13.42)

and Theorem 13.8 immediately leads to

COROLLARY 13.2 With M = 1, under any demand-driven replacement policy π, the popularity pmf p*_π of the output is given by (13.42), and satisfies p*_π ≺ p.
12.2    Case 2 (m ≠ 0)

Assume |Σ_r| = m for some m = 1, ..., M − 1, and let the cache be initially in state s_0. The pmf π on Σ_r(s_0)^c is defined as the conditional pmf induced by p on Σ_r(s_0)^c; it is given by

π(i) = p(i) / Σ_{j ∈ Σ_r(s_0)^c} p(j),    i ∈ Σ_r(s_0)^c.    (13.43)
For all i in Σ_r(s_0), it is clear that m_{r,s_0}(i; p) = 0, while for document i not in Σ_r(s_0), with the expression for μ_{r,s_0}(s; p) given in (13.36), we find

m_{r,s_0}(i; p) = Σ_{s ∈ Λ*(r,s_0): i ∉ s} C_r(p, s_0)^{−1} (M − m)! ∏_{j ∈ s ∖ Σ_r(s_0)} p(j)
               = E_{M−m,N}(t^{(i)} · p) / E_{M−m,N}(t · p)    (13.44)

where the elements t^{(i)} and t of R^N are specified by t_j^{(i)} = t_j = 0 for document j in Σ_r(s_0), t_i^{(i)} = 0, t_i = 1, and t_j^{(i)} = t_j = 1 whenever document j ≠ i is not in Σ_r(s_0). In the second equality we made use of the expression (13.38). Combining (13.44) with (13.13), we immediately get

p*_{r,s_0}(i) = 0 if i ∈ Σ_r(s_0), and
p*_{r,s_0}(i) = π(i) E_{M−m,N−m−1}(π^{(i)}) / ( (M − m + 1) E_{M−m+1,N−m}(π) ) otherwise,    (13.45)

where π^{(i)} denotes the restriction of π to Σ_r(s_0)^c ∖ {i}. Since p*_{r,s_0}(i) = 0 whenever i belongs to Σ_r(s_0), it is more natural to seek a comparison between p*_{r,s_0} and the conditional pmf π.
THEOREM 13.9 Under any RORA(r) policy with |Σ_r| = m for some m = 1, ..., M − 1, it holds that p*_{r,s_0} ≺ π.

Theorem 13.9 is established essentially in the same way as Theorem 13.5: we immediately obtain the desired result upon identifying π and Σ_r(s_0)^c with p and 𝒩 in Theorem 13.5, respectively.
13.    Zipf-like pmfs

It has been observed in a number of studies that the popularity distribution of objects in request streams at Web caches is highly skewed. In Almeida et al. (1996), a good fit was provided by the Zipf distribution according to which the popularity of the ith most popular object is inversely proportional to its rank, namely 1/i. In more recent studies by Breslau et al. (1999), and by Jin and Bestavros (2000a), "Zipf-like" distributions^10 were found more appropriate; see Breslau et al. (1999) (and references therein) for an excellent summary. Such distributions form a one-parameter family. In our set-up, we say that the popularity pmf p of the 𝒩-valued rvs {R_t, t = 0, 1, ...} is Zipf-like with parameter α > 0 if

p_α(i) = 1 / (C_α(N) · i^α),    i = 1, ..., N,    with C_α(N) := Σ_{i=1}^{N} i^{−α}.    (13.46)
The pmf (13.46) will be denoted by p_α. Note that p_α(1) ≥ p_α(2) ≥ ... ≥ p_α(N). The case α = 1 corresponds to the standard Zipf distribution, and the study by Breslau et al. (1999) has found the value of α to be typically in the range 0.64–0.83. Zipf-like pmfs are skewed towards the most popular objects. As α → 0, the Zipf-like pmf approaches the uniform distribution u, while as α → ∞, it degenerates to the pmf (1, 0, ..., 0). Extrapolating between these extreme cases, we expect the parameter α of Zipf-like pmfs (13.46) to measure the strength of skewness, with the larger α, the more skewed the pmf p_α. The next result shows that majorization indeed captures this fact, and so it is warranted to call α the skewness parameter of the Zipf-like pmf.

LEMMA 13.4 For 0 < α < β, it holds that p_α ≺ p_β.
Lemma 13.4 can already be found in (Marshall and Olkin, 1979, B.2.b, p. 130), and is an easy by-product of Lemma 13.1. In the spirit of Lemma 13.4 and the aforementioned folk theorem (13.3), we expect the miss rate of the cache replacement policy to decrease as α increases. This has been shown to be the case using simulations in Gadde et al. (2001). Zipf-like pmfs are used in the discussion of the LRU policy in the next sections.

^10 Such distributions are sometimes called generalized Zipf distributions.
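Lemma 13.4 can also be confirmed numerically by comparing partial sums of the decreasingly ordered pmfs, which is exactly the definition of majorization used here. An illustrative Python sketch (helper names are ours):

```python
def zipf_pmf(N, a):
    """Zipf-like pmf (13.46): p_a(i) proportional to 1/i**a for i = 1, ..., N."""
    w = [i ** (-a) for i in range(1, N + 1)]
    c = sum(w)                      # normalizing constant C_a(N)
    return [x / c for x in w]

def majorized_by(p, q, tol=1e-12):
    """True if p ≺ q: every partial sum of q, with components sorted in
    decreasing order, dominates the corresponding partial sum of p."""
    ps, qs = sorted(p, reverse=True), sorted(q, reverse=True)
    cp = cq = 0.0
    for a, b in zip(ps, qs):
        cp, cq = cp + a, cq + b
        if cq < cp - tol:
            return False
    return True
```

For instance, on N = 100 documents one finds p_0.5 ≺ p_1.0 ≺ p_2.0, in agreement with Lemma 13.4, while the reverse comparisons fail.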
14.    The miss rate under the LRU policy

Under the IRM with admissible popularity pmf p, it is known (Aven et al., 1987, Thm. 9, p. 130; Coffman and Denning, 1973, Thm. 6.5, p. 272) that the LRU cache states {Ω_t, t = 0, 1, ...} form a stationary ergodic Markov chain over the finite state space Λ(M; 𝒩) with stationary distribution given by

π_LRU(s; p) = lim_{t→∞} (1/t) Σ_{τ=0}^{t−1} 1[Ω_τ = s] = ∏_{ℓ=1}^{M} p(i_ℓ) / (1 − p(i_1) − ··· − p(i_{ℓ−1}))    a.s.    (13.47)

for every s = (i_1, ..., i_M) in Λ(M; 𝒩). Consequently, the limit (13.9) exists for each s = {i_1, ..., i_M} in Λ*(M; 𝒩) as

μ_LRU(s; p) = Σ_{(j_1,...,j_M) ∈ Λ(s|M;𝒩)} π_LRU(j_1, ..., j_M; p)    (13.48)

where Λ(s|M; 𝒩) is as defined in Section 11.1. The miss rate of the LRU policy under IRM can then be evaluated from (13.11) as

M_LRU(p) = Σ_{s ∈ Λ*(M;𝒩)} μ_LRU(s; p) Σ_{j ∉ s} p(j).    (13.49)
14.1    A counterexample

Contrary to what transpired with RORA policies, the miss rate under the LRU policy is not Schur-concave in general, and consequently the folk theorem (13.3) does not hold. This is demonstrated through the following example developed for M = 3, N = 4, and the family of pmfs

p(x, y) = (x, 1 − 2y − x, y, y),    0 < y ≤ 1/4,

with x in the interval [1/2 − y, 1 − 3y]. Under these constraints, the components of the pmf p(x, y) are listed in decreasing order, and for any given y it holds that p(x, y) ≺ p(x′, y) whenever x ≤ x′ in the interval [1/2 − y, 1 − 3y]. Therefore, if the miss rate under LRU were indeed a Schur-concave function of the popularity pmf, we would expect the functions x → M_LRU(p(x, y)) to be monotone decreasing in x on the interval [1/2 − y, 1 − 3y]. Figures 13.1 and 13.2 display the numerical values of M_LRU(p(x, y)) as a function of x with y = 0.05 and y = 0.01, respectively; this was done by numerical evaluation of (13.49). In both cases, the miss rate of the LRU policy is not monotone decreasing in x on the range [1/2 − y, 1 − 3y], with the trend becoming more pronounced with decreasing y. In short, the miss rate is not Schur-concave under the LRU policy.

Figure 13.1. The LRU miss rate when M = 3, N = 4, p(3) = p(4) = y = 0.05, p(1) = x and p(2) = 0.9 − x.

Figure 13.2. The LRU miss rate when M = 3, N = 4, p(3) = p(4) = y = 0.01, p(1) = x and p(2) = 0.98 − x.
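For small M and N, the miss rate (13.49) can be evaluated exactly by enumerating the ordered cache states and applying the product form (13.47), which is how curves like those of Figures 13.1 and 13.2 can be reproduced. A Python sketch (the function name is ours):

```python
from itertools import permutations

def lru_miss_rate(p, M):
    """Exact LRU miss rate under the IRM, by direct evaluation of (13.49).

    The stationary probability (13.47) of the ordered state s = (i_1, ..., i_M)
    (most recently used document first) is the product over l = 1, ..., M of
    p(i_l) / (1 - p(i_1) - ... - p(i_{l-1})).
    """
    total = 0.0
    for s in permutations(range(len(p)), M):
        pi, used = 1.0, 0.0
        for i in s:
            pi *= p[i] / (1.0 - used)
            used += p[i]
        total += pi * (1.0 - used)   # miss prob.: next request falls outside s
    return total
```

Tabulating x → lru_miss_rate over the interval [1/2 − y, 1 − 3y] for the family p(x, y) exhibits the non-monotone behaviour reported above; as a sanity check, for the uniform pmf the miss rate is exactly (N − M)/N.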
14.2    LRU with Zipf-like popularity pmfs

While the miss rate is not Schur-concave under the LRU policy, the desired monotonicity (13.3) is nevertheless true in an asymptotic sense when the popularity pmf is restricted to the class of Zipf-like pmfs.

THEOREM 13.10 Assume the input to have a Zipf-like popularity pmf p_α for some α > 0. Then, there exist α* = α*(M, N) > 0 and Δ > 0 such that M_LRU(p_β) < M_LRU(p_α) whenever α* < α and α + Δ < β.
This result is a by-product of the asymptotic equivalence (13.50) established in Vanichpun and Makowski (2004a). We have also carried out simulations of a cache operating under the LRU policy when the input has a Zipf-like popularity pmf p_α.^11 The number of documents is set at N = 1,000 while the cache size is M = 100. The miss rate of the LRU policy is displayed in Figures 13.3 and 13.4 for α small (0 < α < 1) and α large (α > 1), respectively. It appears that the miss rate is indeed decreasing as the skewness parameter α increases across the entire range of α. This suggests that the folk theorem on miss rates probably holds for the LRU policy when the comparison is made within the class of Zipf-like popularity pmfs, hence the following

CONJECTURE 13.1 For arbitrary cache size M and number N of documents, the function α → M_LRU(p_α) is strictly decreasing on [0, ∞).

Figure 13.3. The LRU miss rate when the input has a Zipf-like popularity pmf p_α with α small (0 < α < 1).

Figure 13.4. The LRU miss rate when the input has a Zipf-like popularity pmf p_α with α large (α > 1).
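A cache simulation of the kind used for Figures 13.3 and 13.4 is straightforward to set up. The sketch below (our own code, with an arbitrary seed) estimates the LRU miss rate from a synthetic IRM request stream:

```python
import random

def simulate_lru_miss_rate(p, M, T, seed=1):
    """Monte-Carlo estimate of the LRU miss rate under IRM input with pmf p.

    Draws T i.i.d. requests from p and maintains an LRU cache of size M
    (least recently used document kept at the front of the list)."""
    rng = random.Random(seed)
    requests = rng.choices(range(len(p)), weights=p, k=T)
    cache, misses = [], 0
    for i in requests:
        if i in cache:
            cache.remove(i)          # hit: refresh the recency of document i
        else:
            misses += 1              # miss: fetch document i ...
            if len(cache) == M:
                cache.pop(0)         # ... evicting the LRU document if full
        cache.append(i)              # i is now the most recently used
    return misses / T
```

Plugging in Zipf-like pmfs (13.46) with increasing α and fixed N = 1,000, M = 100 reproduces the decreasing trend of Figures 13.3 and 13.4; for a uniform pmf the estimate settles near (N − M)/N.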
^11 We choose simulations over numerical evaluation of (13.49) because this expression is not suitable for numerical evaluation due to a combinatorial explosion.

15.    The output under the LRU policy

With the expression (13.47) for the stationary distribution of the LRU cache state, it is a simple matter to check for each i = 1, ..., N, that

m_LRU(i; p) = Σ_{s ∈ Λ_i(M;𝒩)} π_LRU(s; p)    (13.51)
where Λ_i(M; 𝒩) denotes the set of elements in Λ(M; 𝒩) which do not contain i, i.e., Λ_i(M; 𝒩) := {s = (i_1, ..., i_M) ∈ Λ(M; 𝒩) : i ∉ s}. Theorem 13.2 then gives the output popularity pmf in the form

p*_LRU(i) = p(i) m_LRU(i; p) / M_LRU(p)    (13.52)

for each i = 1, ..., N, as we make use of (13.16). We begin with a positive result.

LEMMA 13.5 The LRU policy is a good policy.
In what follows, let p*_α denote the popularity pmf of the output induced by an input with Zipf-like popularity pmf p_α (instead of the more cumbersome p*_{LRU,α}).
15.1    Another counterexample
In view of Lemma 13.5, it is tempting to expect that the majorization comparison p*_α ≺ p_α also holds under the LRU policy. This is not the case, as the following example demonstrates: With M = 3 and N = 4 under the Zipf-like popularity pmf (13.46) with α = 3, we have computed the output popularity pmf under the LRU policy using (13.52). The numerical values of both input and output popularity pmfs are presented in Table 13.1.

Table 13.1. The pmfs p_α and p*_α under the LRU policy when the input distribution is Zipf-like with parameter α = 3

i        1        2        3        4
p_α      0.8491   0.1061   0.0314   0.0133
p*_α     0.0118   0.2031   0.3853   0.3998
By the definition of majorization (13.17)-(13.18), the comparison p*_α ≺ p_α requires

min_{i=1,...,N} p_α(i) ≤ min_{i=1,...,N} p*_α(i),    (13.53)

in clear contradiction with Table 13.1, and therefore does not hold. On the other hand, the comparison p_α ≺ p*_α is not valid either, since it calls for the unmet requirement

max_{i=1,...,N} p_α(i) ≤ max_{i=1,...,N} p*_α(i).    (13.54)
In short, p_α and p*_α are not comparable in the majorization ordering. This situation does not represent an isolated incident, as the next theorem shows; its proof is available in Vanichpun and Makowski (2004b).

THEOREM 13.11 Assume the input to have a Zipf-like popularity pmf p_α for some α > 0. If the number of documents N and the cache size M satisfy the condition N < M!, then under the LRU policy, there exists α* = α*(M, N) such that p*_α ≺ p_α does not hold whenever α > α*.
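The failed comparisons (13.53)-(13.54) can be checked mechanically from the values of Table 13.1 by testing the two necessary conditions directly (helper written by us; the tabulated values are rounded, so exact normalization is not expected):

```python
# Input and output pmfs of Table 13.1 (Zipf-like input, alpha = 3).
p_in = [0.8491, 0.1061, 0.0314, 0.0133]
p_out = [0.0118, 0.2031, 0.3853, 0.3998]

def necessary_conditions(p, q):
    """Necessary conditions for the majorization p ≺ q derived from
    (13.17)-(13.18): the largest component may only grow and the smallest
    may only shrink when passing from p to q."""
    return max(p) <= max(q) and min(p) >= min(q)
```

Neither direction holds: min p_in > min p_out violates (13.53) for p_out ≺ p_in, and max p_in > max p_out violates (13.54) for p_in ≺ p_out, so the two pmfs are incomparable.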
15.2    A conjecture

Theorems 13.8 and 13.9 were valid for all values of M and N, and for arbitrary admissible pmfs. While the counterexamples discussed earlier dash our hope to get an analogous result for the LRU policy, the possibility remains, fueled by Corollary 13.2, that the positive result is nevertheless valid in some appropriate range of the parameters M and N. We now explore this issue, still with Zipf-like popularity pmfs (13.46).

CONJECTURE 13.2 Assume that the popularity pmf is the Zipf-like pmf (13.46) with α > 0. For each N = 1, 2, ..., there exists an integer M* = M*(α; N) with 1 ≤ M* ≤ N such that p*_α ≺ p_α under the LRU policy whenever M = 1, ..., M*.

In support of this conjecture, we have carried out simulations of the cache operating under the LRU policy when the input pmf is Zipf-like with parameter α = 0.8, 1 and 2, and with N = 1,000. We find the output popularity pmfs for different values of the cache size, namely M = 10, 50, 100, 500. The resulting output popularity pmfs in the original order of documents are shown in Figure 13.5, while the results after rearranging documents in the decreasing order of their output probabilities are displayed in Figure 13.6.

From Figure 13.6 (a), when α = 0.8, the comparison p*_α ≺ p_α holds for M = 10, 50. Indeed, from their respective plots, we observe that the pmfs p_α and p*_α, when arranged in decreasing order, intersect only once, namely p*_α([i]) ≤ p_α(i), i = 1, ..., k, and p*_α([i]) ≥ p_α(i), i = k + 1, ..., N, for some k = 1, ..., N − 1, where p*_α([1]) ≥ p*_α([2]) ≥ ... ≥ p*_α([N]) are the components of p*_α arranged in decreasing order. This is the sufficient condition for majorization comparison provided in Proposition 13.1. However, for α = 0.8 and M = 100, 500, despite the fact that in Figure 13.6 (a) p*_α looks uniform in the range where the document rank is smaller than M, the comparison p*_α ≺ p_α is invalid since the necessary condition (13.53) does not hold. This violation, min_{i=1,...,N} p*_α(i) < p_α(N), can easily be seen from Figure 13.5 (a) or from the subplot inside Figure 13.6 (a). For α = 1 and 2, by the same arguments, we conclude
Figure 13.5. The LRU output popularity pmf with different cache sizes M when the input has a Zipf-like pmf with (a) α = 0.8, (b) α = 1 and (c) α = 2. Documents are arranged in the original order of the input pmf p_α.

Figure 13.6. The LRU output popularity pmf with different cache sizes M when the input has a Zipf-like pmf with (a) α = 0.8, (b) α = 1 and (c) α = 2. Documents are ranked according to decreasing output probabilities p*_α([i]).
from Figures 13.5 (b)-(c) and from Figures 13.6 (b)-(c) that the comparison p*_α ≺ p_α holds for M = 10 but does not hold for the other cache sizes M = 50, 100, 500. These experimental results agree with Conjecture 13.2, and suggest that the value of M*(α; N) in Conjecture 13.2 decreases as α increases.

Acknowledgments

This material is based upon work supported by the Space and Naval Warfare Systems Center - San Diego under Contract No. N66001-00-C-8063. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Space and Naval Warfare Systems Center - San Diego.
References

Almeida, V., Bestavros, A., Crovella, M., and de Oliveira, A. (1996). Characterizing reference locality in the Web. In: Proceedings of PDIS'96, The IEEE Conference on Parallel and Distributed Information Systems, pp. 92-107, Miami, FL.

Aven, O.I., Coffman, E.G., and Kogan, Y.A. (1987). Stochastic Analysis of Computer Storage. D. Reidel Publishing Company, Dordrecht, Holland.

Baccelli, F. and Makowski, A.M. (1989). Queueing models for systems with synchronization constraints. Proceedings of the IEEE, 77:138-161.

Belady, L.A. (1966). A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5:78-101.

Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. (1999). Web caching and Zipf-like distributions: Evidence and implications. In: Proceedings of IEEE INFOCOM 1999, New York, NY.

Coffman, E. and Denning, P. (1973). Operating Systems Theory. Prentice-Hall, Englewood Cliffs, NJ.

Denning, P.J. (1968). The working set model for program behavior. Communications of the ACM, 11:323-333.

Fonseca, R., Almeida, V., Crovella, M., and Abrahao, B. (2003). On the intrinsic locality of Web reference streams. In: Proceedings of IEEE INFOCOM 2003, San Francisco, CA.

Gadde, S., Chase, J.S., and Rabinovich, M. (2001). Web caching and content distribution: A view from the interior. Computer Communications, 24:222-231.

Gelenbe, E. (1973). A unified approach to the evaluation of a class of replacement algorithms. IEEE Transactions on Computers, 22:611-618.

Jelenkovic, P.R. and Radovanovic, A. (2003). Asymptotic insensitivity of Least-Recently-Used caching to statistical dependency. In: Proceedings of IEEE INFOCOM 2003, San Francisco, CA.

Jin, S. and Bestavros, A. (2000a). GreedyDual* Web caching algorithm: Exploiting the two sources of temporal locality in Web request streams. In: Proceedings of the 5th International Web Caching and Content Delivery Workshop, Lisbon, Portugal.

Jin, S. and Bestavros, A. (2000b). Sources and characteristics of Web temporal locality. In: Proceedings of MASCOTS 2000, The IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, San Francisco, CA.
Mahanti, A., Williamson, C., and Eager, D. (2000). Temporal locality and its impact on Web proxy cache performance. Performance Evaluation, Special Issue on Internet Performance Modelling, 42:187-203.

Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York, NY.

Phalke, V. and Gopinath, B. (1995). An inter-reference gap model for temporal locality in program behavior. In: Proceedings of ACM SIGMETRICS 1995, pp. 291-300, Ottawa, Canada.

Vanichpun, S. (2005). Comparing Strength of Locality of Reference in Web Request Streams. Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Maryland, College Park, MD.

Vanichpun, S. and Makowski, A.M. (2002). The effects of positive correlations on buffer occupancy: Lower bounds via supermodular ordering. In: Proceedings of IEEE INFOCOM 2002, New York, NY.

Vanichpun, S. and Makowski, A.M. (2004a). Comparing strength of locality of reference - Popularity, majorization, and some folk theorems. In: Proceedings of IEEE INFOCOM 2004, Hong Kong.

Vanichpun, S. and Makowski, A.M. (2004b). The output of a cache under the independent reference model - Where did the locality of reference go? In: Proceedings of ACM SIGMETRICS - PERFORMANCE 2004, New York, NY.

Vanichpun, S. and Makowski, A.M. (2004c). When are on-off sources SIS? Conditions and applications. Probability in the Engineering and Informational Sciences, 18:423-443.

Wang, J. (1999). A survey of Web caching schemes for the Internet. ACM Computer Communication Review, 25:36-46.